The video/image synthesis research sector regularly outputs video-editing* architectures, and over the last nine months, outings of this nature have become even more frequent. That said, most of them represent only incremental advances on the state of the art, since the core challenges are substantial.
However, a new collaboration between China and Japan this week has produced some examples that merit a closer examination of the approach, even if it isn't necessarily a landmark work.
In the video clip below (from the paper's associated project site, which – be warned – may tax your browser) we see that while the deepfaking capabilities of the system are non-existent in the current configuration, it does a fine job of plausibly and significantly altering the identity of the young woman in the picture, based on a video mask (lower left):
Click to play. Based on the semantic segmentation mask visualized in the lower left, the original (upper left) woman is transformed into a notably different identity, even though this process does not achieve the identity swap indicated in the prompt. Source: https://yxbian23.github.io/project/video-painter/ (be aware that at the time of writing, this autoplaying and video-stuffed site was inclined to crash my browser). Please refer to the source videos, if you can access them, for better resolution and detail, or check out the examples in the project's overview video at https://www.youtube.com/watch?v=HYzNfsD3A0s
Mask-based editing of this kind is well-established in static latent diffusion models, using tools like ControlNet. However, maintaining background consistency in video is far more challenging, even when masked areas give the model creative flexibility, as shown below:
Click to play. A change of species, with the new VideoPainter method. Please refer to the source videos, if you can access them, for better resolution and detail, or check out the examples in the project's overview video at https://www.youtube.com/watch?v=HYzNfsD3A0s
The authors of the new work consider their method in relation both to Tencent's own BrushNet architecture (which we covered last year), and to ControlNet, both of which employ a dual-branch architecture capable of isolating foreground and background generation.
However, applying this method directly to the highly productive Diffusion Transformers (DiT) approach proposed by OpenAI's Sora brings particular challenges, as the authors note:
‘[Directly] applying [the architecture of BrushNet and ControlNet] to video DiTs presents several challenges: [Firstly, given] Video DiT's robust generative foundation and heavy model size, replicating the full/half-giant Video DiT backbone as the context encoder would be unnecessary and computationally prohibitive.
‘[Secondly, unlike] BrushNet's pure convolutional control branch, DiT's tokens in masked regions inherently contain background information due to global attention, complicating the distinction between masked and unmasked regions in DiT backbones.
‘[Finally,] ControlNet lacks feature injection across all layers, hindering dense background control for inpainting tasks.’
Therefore the researchers have developed a plug-and-play approach in the form of a dual-branch framework titled VideoPainter.
VideoPainter offers a dual-branch video inpainting framework that augments pre-trained DiTs with a lightweight context encoder. This encoder accounts for just 6% of the backbone's parameters, which the authors claim makes the approach more efficient than conventional methods.
The model proposes three key innovations: a streamlined two-layer context encoder for efficient background guidance; a mask-selective feature integration system that separates masked and unmasked tokens; and an inpainting region ID resampling technique that maintains identity consistency across long video sequences.
By freezing both the pre-trained DiT and the context encoder while introducing an ID-Adapter, VideoPainter ensures that inpainting region tokens from earlier clips persist throughout a video, reducing flickering and inconsistencies.
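The paper describes this cross-clip mechanism only at a high level; the fragment below is our own rough conceptual sketch of the idea (with assumed tensor shapes and a hypothetical function name, not the authors' implementation), in which the previous clip's inpainting-region tokens are carried forward as extra identity context for the current clip:

    import torch

    def resample_inpaint_tokens(prev_tokens: torch.Tensor,   # (seq_len, dim) tokens of previous clip
                                prev_mask: torch.Tensor,     # (seq_len,) bool, True = inpainting region
                                curr_tokens: torch.Tensor    # (seq_len, dim) tokens of current clip
                                ) -> torch.Tensor:
        """Hypothetical sketch: keep only the previous clip's inpainted-region tokens
        and append them to the current clip's sequence as identity context. A real
        adapter would attend over (resample) these rather than simply concatenating."""
        id_tokens = prev_tokens[prev_mask]                   # (n_masked, dim)
        return torch.cat([curr_tokens, id_tokens], dim=0)    # extended context sequence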
The framework is also designed for plug-and-play compatibility, allowing users to integrate it seamlessly into existing video generation and editing workflows.
To support the work, which uses CogVideo-5B-I2V as its generative engine, the authors curated what they state is the largest video inpainting dataset to date. Titled VPData, the collection consists of more than 390,000 clips, for a total video duration of more than 886 hours. They also developed a related benchmarking framework titled VPBench.
Click to play. From the project site examples, we see the segmentation capabilities powered by the VPData collection and the VPBench test suite. Please refer to the source videos, if you can access them, for better resolution and detail, or check out the examples in the project's overview video at https://www.youtube.com/watch?v=HYzNfsD3A0s
The new work is titled VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control, and comes from seven authors at the Tencent ARC Lab, The Chinese University of Hong Kong, The University of Tokyo, and the University of Macau.
Besides the aforementioned project site, the authors have also released a more accessible YouTube overview, as well as a Hugging Face page.
Method
The data collection pipeline for VPData consists of collection, annotation, splitting, selection and captioning:

Schema for the dataset construction pipeline. Source: https://arxiv.org/pdf/2503.05639
The source collections used for this compilation came from Videvo and Pexels, with an initial haul of around 450,000 videos obtained.
Several contributing libraries and methods comprised the pre-processing stage: the Recognize Anything framework was used to provide open-set video tagging, tasked with identifying primary objects; Grounding DINO was used to detect bounding boxes around the identified objects; and the Segment Anything Model 2 (SAM 2) framework was used to refine these coarse selections into high-quality mask segmentations.
To handle scene transitions and ensure consistency in video inpainting, VideoPainter uses PySceneDetect to identify and segment clips at natural breakpoints, avoiding the disruptive shifts often caused by tracking the same object from multiple angles. The clips were divided into 10-second intervals, with anything shorter than six seconds discarded.
For data selection, three filtering criteria were applied: aesthetic quality, assessed with the LAION Aesthetic Score Predictor; motion strength, measured via optical flow using RAFT; and content safety, verified through Stable Diffusion's Safety Checker.
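The paper does not walk through this stage in code, but the splitting-and-filtering logic is simple enough to sketch. The snippet below is a minimal illustration, assuming PySceneDetect 0.6+ for shot boundaries, with the three scorers (the LAION aesthetic predictor, RAFT-based motion strength, and the Stable Diffusion Safety Checker) supplied by the caller; the thresholds are our own placeholders, not values from the paper:

    from typing import Callable, List, Tuple
    from scenedetect import detect, ContentDetector  # PySceneDetect >= 0.6 assumed

    Clip = Tuple[str, float, float]  # (path, start_sec, end_sec)

    def curate(video_path: str,
               aesthetic_score: Callable[[Clip], float],   # e.g. LAION aesthetic predictor
               motion_strength: Callable[[Clip], float],   # e.g. mean RAFT optical flow
               is_unsafe: Callable[[Clip], bool],          # e.g. SD Safety Checker
               min_len: float = 6.0, max_len: float = 10.0) -> List[Clip]:
        """Split a source video at scene boundaries, chop the scenes into
        10-second intervals, drop anything shorter than 6 seconds, then apply
        the three selection filters (thresholds below are assumptions)."""
        kept: List[Clip] = []
        for start, end in detect(video_path, ContentDetector()):
            s, e = start.get_seconds(), end.get_seconds()
            t = s
            while t < e:
                clip: Clip = (video_path, t, min(t + max_len, e))
                if clip[2] - clip[1] >= min_len:
                    if (aesthetic_score(clip) > 4.5
                            and motion_strength(clip) > 0.5
                            and not is_unsafe(clip)):
                        kept.append(clip)
                t += max_len
        return kept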
One major limitation in existing video segmentation datasets is the lack of detailed textual annotations, which are crucial for guiding generative models:

The researchers emphasize the lack of video captioning in comparable collections.
Therefore the VideoPainter data curation process incorporates diverse leading vision-language models, including CogVLM2 and GPT-4o, to generate keyframe-based captions and detailed descriptions of masked regions.
VideoPainter enhances pre-trained DiTs by introducing a custom lightweight context encoder that separates background context extraction from foreground generation, seen to the upper right of the illustrative schema below:

Conceptual schema for VideoPainter. VideoPainter's context encoder processes noisy latents, downsampled masks, and masked video latents via VAE, integrating only background tokens into the pre-trained DiT to avoid ambiguity. The ID Resample Adapter ensures identity consistency by concatenating masked region tokens during training and resampling them from earlier clips during inference.
Instead of burdening the backbone with redundant processing, this encoder operates on a streamlined input: a combination of noisy latent, masked video latent (extracted via a variational autoencoder, or VAE), and downsampled masks.
The noisy latent provides generation context, and the masked video latent aligns with the DiT's existing distribution, aiming to enhance compatibility.
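In code terms, building that input amounts to one VAE encode, one mask downsample, and a channel-wise concatenation. The sketch below is our own illustration of the idea, with assumed tensor shapes and a caller-supplied vae_encode function; it is not the released implementation:

    import torch
    import torch.nn.functional as F

    def build_context_input(noisy_latent: torch.Tensor,   # (B, C, T, H, W) latent being denoised
                            masked_video: torch.Tensor,   # (B, 3, T, 8H, 8W) background-only pixels
                            mask: torch.Tensor,           # (B, 1, T, 8H, 8W), 1 = inpainting region
                            vae_encode) -> torch.Tensor:
        """Sketch: concatenate the noisy latent, the VAE-encoded masked video,
        and the mask (downsampled to latent resolution) along the channel axis,
        forming the streamlined input of the context encoder."""
        masked_latent = vae_encode(masked_video)           # (B, C, T, H, W), assumed
        mask_small = F.interpolate(mask, size=noisy_latent.shape[-3:], mode='nearest')
        return torch.cat([noisy_latent, masked_latent, mask_small], dim=1)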
Rather than duplicating large sections of the model, which the authors state has occurred in prior works, VideoPainter integrates only the first two layers of the DiT. These extracted features are reintroduced into the frozen DiT in a structured, group-wise manner – early-layer features inform the initial half of the model, while later features refine the second half.
Furthermore, a token-selective mechanism ensures that only background-relevant features are reintegrated, preventing confusion between masked and unmasked regions. This approach, the authors contend, allows VideoPainter to maintain high fidelity in background preservation while improving foreground inpainting efficiency.
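A rough sketch of how such group-wise, background-only injection could be wired up is shown below. The zero-initialised projections, the half-way split point, and the simple additive fusion are all our assumptions for illustration; only the general scheme (two encoder features, two groups of frozen blocks, masked-region features discarded) follows the description above:

    import torch
    import torch.nn as nn

    class GroupWiseInjector(nn.Module):
        """Sketch: features from the two-layer context encoder are added residually
        to a frozen DiT backbone in two groups, and only background (unmasked)
        tokens receive them."""
        def __init__(self, dim: int):
            super().__init__()
            # Zero-initialised projections, in the spirit of ControlNet-style adapters
            self.proj_early = nn.Linear(dim, dim)
            self.proj_late = nn.Linear(dim, dim)
            for proj in (self.proj_early, self.proj_late):
                nn.init.zeros_(proj.weight)
                nn.init.zeros_(proj.bias)

        def forward(self, blocks, hidden, feat_early, feat_late, bg_mask):
            """blocks: list of frozen DiT blocks; hidden: (B, L, D) backbone tokens;
            feat_early / feat_late: (B, L, D) context-encoder features;
            bg_mask: (B, L) bool, True for background (unmasked) tokens."""
            keep = bg_mask.unsqueeze(-1).to(hidden.dtype)   # zero out masked-region features
            half = len(blocks) // 2
            for i, block in enumerate(blocks):
                if i == 0:
                    hidden = hidden + keep * self.proj_early(feat_early)
                elif i == half:
                    hidden = hidden + keep * self.proj_late(feat_late)
                hidden = block(hidden)
            return hidden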
The authors note that the method they propose supports diverse stylization approaches, including the most popular, Low Rank Adaptation (LoRA).
Data and Tests
VideoPainter was trained using the CogVideo-5B-I2V model, along with its text-to-video equivalent. The curated VPData corpus was used at 480x720px, at a learning rate of 1×10⁻⁵.
The ID Resample Adapter was trained for 2,000 steps, and the context encoder for 80,000 steps, both using the AdamW optimizer. The training occurred in two stages using a formidable 64 NVIDIA V100 GPUs (though the paper doesn't specify whether these had 16GB or 32GB of VRAM).
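In optimiser terms the setup is straightforward; the fragment below simply illustrates the stated hyperparameters, with module names, stage ordering and batching details assumed rather than taken from the paper:

    import torch

    def make_optimizer(module: torch.nn.Module, lr: float = 1e-5) -> torch.optim.AdamW:
        """AdamW at the stated learning rate; only the given module is trainable,
        while the DiT backbone stays frozen throughout."""
        return torch.optim.AdamW(module.parameters(), lr=lr)

    # Assumed usage (names illustrative):
    #   make_optimizer(context_encoder)      -> trained for 80,000 steps
    #   make_optimizer(id_resample_adapter)  -> trained for 2,000 steps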
For benchmarking, DAVIS was used for random masks, and the authors' own VPBench for segmentation-based masks.
The VPBench dataset features objects, animals, humans, landscapes and diverse tasks, and covers four actions: add, remove, change, and swap. The collection features 45 six-second videos, and nine videos lasting, on average, 30 seconds.
Eight metrics were utilized for the process. For Masked Region Preservation, the authors used Peak Signal-to-Noise Ratio (PSNR); Learned Perceptual Similarity Metrics (LPIPS); Structural Similarity Index (SSIM); and Mean Absolute Error (MAE).
For text alignment, the researchers used CLIP Similarity both to evaluate the semantic distance between the clip's caption and its actual perceived content, and also to evaluate the accuracy of masked regions.
To assess the general quality of the output videos, Fréchet Video Distance (FVD) was used.
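For readers who want to reproduce the background-preservation side of this evaluation, the sketch below shows one plausible way to compute PSNR, MAE and SSIM over the unmasked region of a single frame, using scikit-image 0.19+; LPIPS, CLIP Similarity and FVD all require learned models and are omitted. The masking conventions here are our assumptions, not necessarily the authors' exact protocol:

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def masked_region_preservation(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> dict:
        """pred/target: (H, W, 3) uint8 frames; mask: (H, W) bool, True = inpainting
        region, so the scores are computed over the background (unmasked) pixels."""
        bg = ~mask
        p = pred[bg].astype(np.float32)
        t = target[bg].astype(np.float32)
        return {
            'PSNR': peak_signal_noise_ratio(t, p, data_range=255),
            'MAE': float(np.abs(p - t).mean() / 255.0),   # normalised to [0, 1] (assumed convention)
            # SSIM is windowed, so it is computed on full frames with the masked
            # region copied from the target (an assumed convention).
            'SSIM': structural_similarity(
                np.where(mask[..., None], target, pred), target,
                channel_axis=-1, data_range=255),
        }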
For a quantitative comparison round for video inpainting, the authors set their system against the prior approaches ProPainter, COCOCO and Cog-Inp (CogVideoX). The test consisted of inpainting the first frame of a clip using image inpainting models, and then using an image-to-video (I2V) backbone to propagate the results with a latent blend operation, in accordance with a method proposed in a 2023 paper from Israel.
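The latent blend referred to here (used by the Cog-Inp baseline, not by VideoPainter itself) can be summarised in a single line: at each denoising step, generated content is kept inside the mask and the noised background latent is pasted back outside it. The sketch below is our paraphrase of that idea, not the baseline's code:

    import torch

    def latent_blend(generated: torch.Tensor,
                     background: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
        """Composite two latents per denoising step: mask is 1 inside the
        inpainting region (keep the generated content) and 0 outside it
        (restore the background latent)."""
        return mask * generated + (1.0 - mask) * background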
Since the project site is not entirely functional at the time of writing, and since the project's associated YouTube video may not feature the entirety of the examples crammed into the project site, it is rather difficult to locate video examples that are very specific to the results outlined in the paper. Therefore we will show partial static results featured in the paper, and close the article with some additional video examples that we managed to extract from the project site.

Quantitative comparison of VideoPainter vs. ProPainter, COCOCO, and Cog-Inp on VPBench (segmentation masks) and DAVIS (random masks). Metrics cover masked region preservation, text alignment, and video quality. Red = best, Blue = second best.
Of these quantitative results, the authors comment:
‘In the segmentation-based VPBench, ProPainter and COCOCO exhibit the worst performance across most metrics, primarily due to the inability to inpaint fully masked objects and the single-backbone architecture's difficulty in balancing the competing background preservation and foreground generation, respectively.
‘In the random mask benchmark DAVIS, ProPainter shows improvement by leveraging partial background information. However, VideoPainter achieves optimal performance across segmentation (standard and long length) and random masks through its dual-branch architecture that effectively decouples background preservation and foreground generation.’
The authors then present static examples of qualitative tests, of which we feature a selection below. In all cases we refer the reader to the project site and YouTube video for better resolution.

A comparison against inpainting methods in prior frameworks.
Click to play. Examples concatenated by us from the ‘results’ videos at the project site.
Regarding this qualitative round for video inpainting, the authors comment:
‘VideoPainter consistently shows exceptional results in video coherence, quality, and alignment with the text caption. Notably, ProPainter fails to generate fully masked objects because it depends only on background pixel propagation instead of generating.
‘While COCOCO demonstrates basic functionality, it fails to maintain consistent ID in inpainted regions (inconsistent vessel appearances and abrupt terrain changes) due to its single-backbone architecture attempting to balance background preservation and foreground generation.
‘Cog-Inp achieves basic inpainting results; however, its blending operation's inability to detect mask boundaries leads to significant artifacts.
‘Moreover, VideoPainter can generate coherent videos exceeding one minute while maintaining ID consistency through our ID resampling.’
The researchers additionally tested VideoPainter's ability to augment captions and obtain improved results by this method, pitting the system against UniEdit, DiTCtrl, and ReVideo.

Video-editing results against three prior approaches.
The authors comment:
‘For both standard and long videos in VPBench, VideoPainter achieves superior performance, even surpassing the end-to-end ReVideo. This success can be attributed to its dual-branch architecture, which ensures excellent background preservation and foreground generation capabilities, maintaining high fidelity in non-edited regions while ensuring edited regions closely align with editing instructions, complemented by inpainting region ID resampling that maintains ID consistency in long videos.’
Though the paper features static qualitative examples for this round, they are unilluminating, and we refer the reader instead to the diverse examples spread across the various videos published for this project.
Finally, a human study was conducted, in which thirty users were asked to evaluate 50 randomly-selected generations from the VPBench and editing subsets. The examples highlighted background preservation, alignment to prompt, and general video quality.

Results from the user study for VideoPainter.
The authors state:
‘VideoPainter significantly outperformed existing baselines, achieving higher preference rates across all evaluation criteria in both tasks.’
They concede, however, that the quality of VideoPainter's generations depends on the base model, which can struggle with complex motion and physics; and they observe that it also performs poorly with low-quality masks or misaligned captions.
Conclusion
VideoPainter seems a worthwhile addition to the literature. Typical of recent solutions, however, it has considerable compute demands. Additionally, many of the examples chosen for presentation at the project site fall very far short of the best examples; it would therefore be interesting to see this framework pitted against future entries, and against a wider range of prior approaches.
* It's worth mentioning that ‘video-editing’ in this sense does not mean ‘assembling diverse clips into a sequence’, which is the traditional meaning of the term, but rather directly altering or indirectly modifying the inner content of existing video clips, using machine learning techniques.
First published Monday, March 10, 2025