The video/image synthesis research sector regularly produces new video editing* architectures, and over the past nine months such releases have become even more frequent. That said, most of them represent only incremental advances on the state of the art, since the core challenges are substantial.
However, a new collaboration between China and Japan this week has produced some examples that merit a closer examination of the approach, even if the work is not necessarily a landmark contribution.
In the video clip below (taken from the project site associated with the paper, which, be warned, may tax your browser), we can see that while deepfake capabilities are not among the system's current features, it nonetheless does a plausible and significant job of altering the identity of the young woman in the video.
Click to play. Based on the semantic segmentation mask visualized in the lower left, the original woman (top left) is transformed into a notably different identity, even though the process does not achieve the identity swap indicated in the prompt. Source: https://yxbian23.github.io/project/video-painter/ (note that at the time of writing, this auto-playing, video-packed site has a tendency to crash the browser). Please refer to the source videos (if accessible) for better resolution and detail, or see the project overview video at https://www.youtube.com/watch?v=hyznfsd3a0s
This kind of mask-based editing is well established in static latent diffusion models, using tools such as ControlNet. Maintaining background consistency in video, however, is far more challenging, even when masked areas give the model creative flexibility.
Click to play. A change of species, achieved with the new VideoPainter method. Please refer to the source videos (if accessible) for better resolution and detail, or see the project overview video at https://www.youtube.com/watch?v=hyznfsd3a0s
The authors of the new work considered both Tencent's own BrushNet architecture (which we covered last year) and ControlNet, both of which employ dual-branch architectures capable of isolating foreground and background generation.
However, applying these methods directly to the highly capable Diffusion Transformer (DiT) approach popularized by OpenAI's Sora presents particular challenges, as the authors point out:
'[Directly applying BrushNet and ControlNet architectures] to video DiTs presents several challenges: [firstly, given] the robust generative foundation and heavy model size of video DiTs, replicating the full or half-sized video DiT backbone as a context encoder would be unnecessary and computationally prohibitive.

'[Secondly, unlike] BrushNet's purely convolutional control branch, DiT tokens in masked regions inherently contain background information, owing to global attention, which complicates the distinction between masked and unmasked regions in DiT backbones.

'[Finally,] ControlNet lacks feature injection across all layers, hindering dense background control for inpainting tasks.'
The researchers have therefore developed a plug-and-play approach in the form of a dual-branch framework entitled VideoPainter.
VideoPainter offers a dual-branch video inpainting framework that augments a pre-trained DiT with a lightweight context encoder accounting for only six percent of the backbone's parameters, an approach the authors claim is more efficient than conventional methods.
The model proposes three key innovations: a streamlined two-layer context encoder for efficient background guidance; mask-selective feature integration that separates masked and unmasked tokens; and an inpainting region ID resampling technique that maintains identity consistency across long video sequences.
By freezing both the pre-trained DiT and the context encoder while introducing an ID-Adapter, VideoPainter ensures that inpainting region tokens from previous clips persist throughout a video, reducing flickering and inconsistency.
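A minimal sketch of the general idea, assuming a plain cross-attention formulation; the class and argument names below are illustrative and are not drawn from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDResampleAdapter(nn.Module):
    """Illustrative adapter: tokens inside the inpainting region attend to the
    cached masked-region tokens of the previous clip, keeping identity stable."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim)

    def forward(self, tokens, mask, prev_region_tokens=None):
        # tokens: (B, N, C); mask: (B, N, 1) with 1 = inpainting region
        # prev_region_tokens: (B, M, C) masked-region tokens cached from the previous clip
        if prev_region_tokens is None:
            return tokens  # first clip: nothing to resample from
        q = self.to_q(tokens)
        kv = self.to_kv(prev_region_tokens)
        attended = F.scaled_dot_product_attention(q, kv, kv)
        # Only inpainting-region tokens receive the resampled identity features.
        return tokens + mask * attended
```

After each clip is denoised, the tokens under the mask would be cached and passed in as prev_region_tokens when generating the next clip.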
The framework is designed for plug-and-play compatibility, allowing users to integrate it seamlessly into existing video generation and editing workflows.
To support the work, which uses CogVideoX-5B-I2V as its generative engine, the authors curated what they describe as the largest video inpainting dataset to date. Titled VPData, the collection consists of more than 390,000 clips, for a total video duration of more than 886 hours. They also developed a related benchmark framework entitled VPBench.
Click to play. Examples from the project site showing the segmentations that power the VPData collection and the VPBench test suite. Please refer to the source videos (if accessible) for better resolution and detail, or see the project overview video at https://www.youtube.com/watch?v=hyznfsd3a0s
The new paper is titled VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control, and comes from seven authors across Tencent ARC Lab, The Chinese University of Hong Kong, The University of Tokyo, and the University of Macau.
In addition to the aforementioned project site, the authors have also released a more accessible YouTube overview and a Hugging Face page.
Method
The VPData collection pipeline consists of collection, annotation, splitting, selection, and captioning.
The schema for the dataset construction pipeline. Source: https://arxiv.org/pdf/2503.05639
The source material for the compilation came from Videvo and Pexels, with an initial haul of around 450,000 videos obtained.
Several contributing libraries and methods were used in the preprocessing stage: the Recognize Anything framework provided open-set video tagging, responsible for identifying primary objects; Grounding DINO was used to detect bounding boxes around the identified objects; and the Segment Anything Model 2 (SAM 2) framework refined these coarse selections into high-quality mask segmentations.
To manage scene transitions and ensure consistency in video inpainting, VideoPainter uses PySceneDetect to identify and segment clips at natural breakpoints, avoiding the disruptive shifts caused by tracking the same object from multiple angles. Clips were split into 10-second intervals, and any shorter than six seconds were discarded.
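A minimal sketch of this style of clip preparation, assuming PySceneDetect's ContentDetector with its default threshold; the helper name and chunking logic follow the description above rather than the paper's actual pipeline code:

```python
from scenedetect import detect, ContentDetector

def split_into_clips(video_path: str, chunk_secs: float = 10.0, min_secs: float = 6.0):
    """Detect scene boundaries, then cut each scene into fixed-length chunks,
    discarding any chunk shorter than min_secs."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) FrameTimecodes
    clips = []
    for start, end in scenes:
        t, scene_end = start.get_seconds(), end.get_seconds()
        while t < scene_end:
            clip_end = min(t + chunk_secs, scene_end)
            if clip_end - t >= min_secs:
                clips.append((t, clip_end))
            t = clip_end
    return clips

# Hypothetical usage:
# clips = split_into_clips("sample_video.mp4")
```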
Three filtering criteria were applied during data selection: aesthetic quality, assessed with the LAION aesthetic score predictor; motion intensity, measured via optical flow using RAFT; and content safety, verified with the Stable Diffusion safety checker.
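As an illustration of the motion-intensity check, a hedged sketch using torchvision's RAFT implementation; the threshold value and helper names are assumptions, since the paper does not publish its cut-offs:

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def mean_flow_magnitude(frames: torch.Tensor) -> float:
    """frames: (T, 3, H, W) float tensor in [0, 1]; returns the average optical-flow
    magnitude between consecutive frames."""
    magnitudes = []
    for i in range(len(frames) - 1):
        img1, img2 = preprocess(frames[i:i + 1], frames[i + 1:i + 2])
        flow = raft(img1, img2)[-1]            # final refinement iteration: (1, 2, H, W)
        magnitudes.append(flow.norm(dim=1).mean().item())
    return sum(magnitudes) / max(len(magnitudes), 1)

def has_enough_motion(frames: torch.Tensor, threshold: float = 1.0) -> bool:
    # The threshold is an illustrative value, not the one used for VPData.
    return mean_flow_magnitude(frames) > threshold
```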
One of the major shortcomings of existing video segmentation datasets, the authors note, is a lack of detailed textual annotation, which is essential for guiding generative models.

The researchers emphasize the lack of video captioning in comparable collections.
The VideoPainter data curation process therefore incorporates leading vision-language models, including CogVLM2 and GPT-4o, to generate keyframe-based captions and detailed descriptions of masked regions.
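As an indication of how keyframe captioning with GPT-4o might look, a minimal sketch using the OpenAI Python client; the prompt text and helper name are ours, not taken from the paper's pipeline:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_keyframe(image_path: str) -> str:
    """Ask GPT-4o for a dense caption of a single keyframe (illustrative prompt)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this video keyframe in one dense caption, "
                         "mentioning the main subject and its surroundings."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```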
VideoPainter augments pre-trained DiTs by introducing a custom lightweight context encoder that separates background context extraction from foreground generation, seen in the upper right of the illustrative schema below.
Conceptual schema for VideoPainter. Its context encoder processes noisy latents, downsampled masks, and masked video latents (obtained via the VAE), integrating only background tokens into the pre-trained DiT to avoid ambiguity. The ID Resample Adapter ensures identity consistency by concatenating masked region tokens during training and resampling them from previous clips during inference.
Rather than burdening the backbone with redundant processing, this encoder operates on a streamlined input: a combination of noisy latents, masked video latents (extracted via a variational autoencoder, or VAE), and downsampled masks.

The noisy latents provide generation context, while the masked video latents align with the DiT's existing distribution, with the aim of improving compatibility.
Rather than duplicating large sections of the model, as occurred in earlier works, VideoPainter clones only the first two layers of the DiT. The extracted features are reintroduced into the frozen DiT in a structured, group-wise manner: features from the first cloned layer inform the first half of the model, while features from the second refine the second half.
Additionally, a token-selective mechanism reintegrates only background-related features, preventing confusion between masked and unmasked regions. The authors contend that this allows VideoPainter to maintain high fidelity in background preservation while improving the efficiency of foreground inpainting.
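A condensed sketch of the two-layer context encoder and token-selective injection described above, assuming a generic transformer block interface; the class name, projections, and shapes are illustrative rather than drawn from the released implementation:

```python
import copy
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Illustrative context encoder: clones of the DiT's first two blocks, fed with
    noisy latents, masked-video latents, and a downsampled mask."""
    def __init__(self, dit_blocks: nn.ModuleList, hidden: int):
        super().__init__()
        self.blocks = nn.ModuleList(copy.deepcopy(b) for b in dit_blocks[:2])
        self.proj_in = nn.Linear(2 * hidden + 1, hidden)   # latents + masked latents + mask
        self.proj_out = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(2))

    def forward(self, noisy_latents, masked_latents, mask):
        # noisy_latents / masked_latents: (B, N, hidden); mask: (B, N, 1), 1 = inpainting region
        x = self.proj_in(torch.cat([noisy_latents, masked_latents, mask], dim=-1))
        feats = []
        for block, proj in zip(self.blocks, self.proj_out):
            x = block(x)                           # assumes the block takes hidden states only
            feats.append(proj(x) * (1.0 - mask))   # token selection: keep only background tokens
        return feats   # feats[0] -> first half of the frozen DiT, feats[1] -> second half
```

On the backbone side, feats[0] would be added to the hidden states of blocks in the first half of the frozen DiT, and feats[1] to those in the second half.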
The authors note that their proposed method supports diverse stylization approaches, including the most popular of these, Low Rank Adaptation (LoRA).
Data and Testing
VideoPainter was trained using the CogVideoX-5B-I2V model, along with its text-to-video counterpart. The curated VPData corpus was used at 480x720px, at a learning rate of 1×10⁻⁵.
The ID Resample Adapter was trained for 2,000 steps, and the context encoder for 80,000 steps, using the AdamW optimizer. Training took place in two stages on a formidable 64 NVIDIA V100 GPUs (though the paper does not specify whether these had 16GB or 32GB of VRAM).
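For orientation, the reported setup might be summarised as a configuration along these lines (a hedged sketch: the key names are ours, and anything the paper does not state, such as batch size, is omitted):

```python
# Hedged sketch of the reported training setup; key names are illustrative,
# and values not stated in the paper (e.g. batch size) are left out.
training_config = {
    "backbone": "CogVideoX-5B-I2V",   # frozen during training
    "resolution": (480, 720),
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "stages": [
        {"name": "context_encoder", "steps": 80_000},
        {"name": "id_resample_adapter", "steps": 2_000},
    ],
    "hardware": "64x NVIDIA V100",
}
```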
For benchmarking, DAVIS was used for random masks, and the authors' own VPBench for segmentation-based masks.
The VPBench dataset covers objects, animals, humans, and landscapes across diverse tasks, spanning four actions: add, remove, change, and swap. The collection features 45 six-second videos, and nine videos lasting an average of 30 seconds.
Eight metrics were used in the evaluation. For masked region preservation, the authors used Peak Signal-to-Noise Ratio (PSNR); Learned Perceptual Image Patch Similarity (LPIPS); Structural Similarity Index (SSIM); and Mean Absolute Error (MAE).

For text alignment, the researchers used CLIP similarity both to evaluate the semantic distance between the clip's caption and its actual perceived content, and to evaluate the accuracy of masked regions.
Fréchet Video Distance (FVD) was used to assess the general quality of the output video.
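To make the masked-region metrics concrete, here is a minimal sketch using scikit-image and the lpips package; the per-frame handling and mask treatment are simplifying assumptions rather than the paper's evaluation code:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")  # perceptual distance network

def masked_region_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> dict:
    """pred, gt: (H, W, 3) uint8 frames; mask: (H, W) bool, True = region to score.
    SSIM and LPIPS are computed on the full frame here for simplicity."""
    p, g = pred[mask], gt[mask]                     # flattened masked pixels
    mae = np.abs(p.astype(np.float32) - g.astype(np.float32)).mean() / 255.0
    psnr = peak_signal_noise_ratio(g, p, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)

    # LPIPS expects full images in [-1, 1], shaped (1, 3, H, W).
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_net(to_t(pred), to_t(gt)).item()

    return {"PSNR": psnr, "SSIM": ssim, "MAE": mae, "LPIPS": lp}
```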
For the quantitative comparison round on video inpainting, the authors set the system against the prior approaches ProPainter, COCOCO, and Cog-INP (CogVideoX). The last of these consists of inpainting the first frame of a clip with an image inpainting model, then using an image-to-video (I2V) backbone to propagate the result via a latent blending operation, in line with the method proposed in a 2023 paper from Israel.
Since the project website is not entirely functional at the time of writing, and since the YouTube video associated with the project may not feature every example packed into the project site, it is rather difficult to find video examples that correspond exactly to the results outlined in the paper. We therefore show partial static results featured in the paper, and close the article with some additional video examples drawn from the project site.
Quantitative comparison of VideoPainter vs. ProPainter, COCOCO, and Cog-INP on VPBench (segmentation masks) and DAVIS (random masks). Metrics cover masked region preservation, text alignment, and video quality. Red = best, blue = second best.
Of these quantitative results, the authors comment:
'In the segmentation-based VPBench, ProPainter and COCOCO exhibit the worst performance across most metrics, primarily due to the inability to inpaint fully masked objects and the single-backbone architecture's difficulty in balancing the competing demands of background preservation and foreground generation, respectively.

'In the random-mask benchmark DAVIS, ProPainter shows improvement by leveraging partial background information. Nonetheless, VideoPainter achieves optimal performance across segmentation (standard and long-length) and random masks through its dual-branch architecture, which effectively decouples background preservation from foreground generation.'
The authors then present static examples of the qualitative tests, of which we feature a selection below. In all cases we refer the reader to the project site and YouTube video for better resolution.
A comparison of inpainting methods in prior frameworks.
Click to play. An example we linked from the “Results” video on the project site.
Of the qualitative round for video inpainting, the authors comment:
'VideoPainter consistently shows exceptional results in video coherence, quality, and alignment with text captions. Notably, ProPainter fails to generate fully masked objects, because it depends solely on background pixel propagation instead of generating.

'While COCOCO demonstrates basic functionality, it fails to maintain consistent ID in inpainted regions (inconsistent vessel appearances and abrupt terrain changes), due to its single-backbone architecture attempting to balance background preservation and foreground generation.

'Cog-INP achieves basic inpainting results; however, its blending operation's inability to detect mask boundaries leads to significant artifacts.

'Moreover, VideoPainter can generate coherent videos exceeding one minute in length, while maintaining ID consistency through ID resampling.'
The researchers further tested VideoPainter's ability to work with augmented captions, obtaining improved results in this way, and set the system against UniEdit, DiTCtrl, and ReVideo for video editing.
Video editing results for three previous approaches.
The authors comment:
'In both standard and long videos in VPBench, VideoPainter achieves superior performance, even surpassing the end-to-end ReVideo. This success can be attributed to its dual-branch architecture, which ensures excellent background preservation and foreground generation capabilities, maintaining high fidelity in non-edited regions while ensuring edited regions closely align with the editing instructions, complemented by inpainting region ID resampling, which maintains ID consistency in long videos.'
The paper presents static qualitative examples for this metric, but they are not especially illuminating, and we refer readers instead to the diverse examples spread across the various videos published for this project.
Finally, a human study was conducted, in which thirty users were asked to evaluate fifty randomly selected generations from the VPBench inpainting and editing subsets. The examples highlighted background preservation, prompt alignment, and general video quality.
Results of VideoPainter user study.
The authors state:
'VideoPainter significantly outperforms existing baselines, achieving higher preference rates across all evaluation criteria in both tasks.'
The authors concede, however, that the quality of VideoPainter's generations depends on the base model, which can struggle with complex motion and physics; and they observe that the system also performs poorly when given low-quality masks or mismatched captions.
Conclusion
VideoPainter seems a worthwhile addition to the literature, though, typically of recent solutions, it makes considerable compute demands. Additionally, many of the examples chosen for presentation on the project site fall far short of the best examples; it would therefore be interesting to see this framework pitted against future entries, and against a wider range of prior approaches.
* It is worth mentioning that 'video editing' in this sense does not mean 'assembling diverse clips into a sequence', which is the traditional meaning of the term, but rather using machine learning techniques to directly alter the inner content of existing video clips in some way.
First published Monday, March 10th, 2025