A new paper out this week at Arxiv addresses an issue that anyone who has adopted the Hunyuan Video or Wan 2.1 AI video generators will have come across by now: temporal aberrations, where the generative process tends to suddenly speed up, conflate, omit, or otherwise garble crucial moments in a generated video.
Click to play. Some of the temporal glitches familiar to users of the new wave of generative video systems, highlighted in the new paper. To the right, the ameliorating effect of the new FluxFlow approach. Source: https://haroldchen19.github.io/fluxflow/
The video above features excerpts from example test videos at the (rather chaotic) project site for the paper. We can see that the authors' method (shown on the right of each video) improves on some of the more familiar issues.
The first example, featuring two kids playing (generated by CogVideoX), shows the native generation rapidly skipping over several essential micro-movements and speeding the children's activity up to a 'cartoonish' pitch. By contrast, the same dataset and method yield better results with the new pre-processing technique, FluxFlow (seen on the right of the video image below):
Click to play.
In the second example (using NOVA-0.6B), we see that a central movement involving a cat has been in some way corrupted, or significantly under-sampled, at the training stage.
Click to play.
This syndrome, where motion or subjects get 'stuck', is one of the most frequently reported bugbears of HV and Wan across diverse image- and video-synthesis user groups.
Some of these issues relate to video captioning problems in the source dataset. However, the authors of the new work make a compelling case that focusing instead on the temporal qualities of the training data, and addressing the challenges from that perspective, can yield useful results.
As mentioned in a previous article about video captioning, certain sports are especially difficult to distill into their key moments, which means that critical events (such as a slam dunk) do not get the attention they need at training time.
Click to play.
In the example above, the generative system does not know how to get to the next stage of movement, and transits illogically from one pose to the next, changing the attitude and geometry of the player in the process.
These are big movements that got lost in training, but equally vulnerable are far smaller yet pivotal movements, such as the flapping of a butterfly's wings:
Click to play.
Unlike the slam dunk, the flapping of wings is not a 'rare' event but a persistent, monotonous one. However, the movement is so rapid that it is very difficult to establish temporally, resulting in inconsistency during the sampling process.
These are not particularly new issues, but they are receiving greater attention now that powerful generative video models are available for enthusiasts to install locally and to generate with for free.
The communities at Reddit and Discord initially treated these issues as 'user-related'. This is an understandable presumption, since the systems in question are very new and minimally documented. As a result, diverse pundits have suggested a variety of (not always effective) remedies for some of the glitches documented here, such as changing the settings of various components in Hunyuan Video (HV) and Wan 2.1 ComfyUI workflows.
In some cases, both HV and Wan will produce slow motion rather than rapid motion. Suggestions from Reddit and ChatGPT (which largely draws on Reddit) include changing the number of frames in the requested generation, or radically lowering the frame rate*.
These are all desperate measures. The truth, as it currently stands, is that we still do not know the exact cause or the exact remedy for these issues; clearly, torturing the generation settings to work around them (particularly when this degrades output quality, for instance when the frame rate is set too low) is only a short-term stopgap, and it is good to see that the research scene is addressing emerging problems this quickly.
So, besides this week's look at how captioning affects training, let's take a look at the new paper on temporal regularization, and what improvements it might offer the current generative video scene.
The central idea is fairly simple and slight, and there is nothing wrong with that; nonetheless, the paper is somewhat padded to reach the prescribed eight pages, and we will skip over that padding where necessary.
In the native generation from the VideoCrafter framework, the fish remains static, while the FluxFlow-adapted version captures the required changes. Source: https://arxiv.org/pdf/2503.15417
The new work is titled Temporal Regularization Makes Your Video Generator Stronger, and comes from eight researchers across Everlyn AI, Hong Kong University of Science and Technology (HKUST), the University of Central Florida (UCF), and the University of Hong Kong (HKU).
(At the time of writing, the project site accompanying the paper is experiencing some issues.)
FluxFlow
The central idea behind FluxFlow, the authors' new pre-training schema, is to overcome a broad range of problems, such as flickering and temporal inconsistency, by shuffling blocks and groups of blocks in the temporal frame order as the source data is exposed to the training process.
The central idea behind FluxFlow is to move blocks and groups of blocks into unexpected, non-chronological positions, as a form of data augmentation.
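To make this concrete, below is a minimal sketch of what such temporal perturbations might look like as a data-level augmentation step. This is not the authors' code: the function names, the (T, C, H, W) tensor layout, and the ratio/block-size defaults are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of FluxFlow-style temporal perturbation as a data-level
# augmentation. Illustrative only: names, tensor layout (T, C, H, W) and
# default ratios are assumptions, not the authors' implementation.
import torch


def perturb_frames(video: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """Frame-level variant: displace a random subset of frames to new positions."""
    t = video.shape[0]
    k = max(2, int(t * ratio))              # number of frames to displace
    idx = torch.randperm(t)[:k]             # which frames get moved
    out = video.clone()
    out[idx] = video[idx[torch.randperm(k)]]
    return out


def perturb_blocks(video: torch.Tensor, block_size: int = 4) -> torch.Tensor:
    """Block-level variant: split the clip into contiguous blocks and shuffle them."""
    t = video.shape[0]
    n_blocks = t // block_size
    blocks = list(video[: n_blocks * block_size].split(block_size))
    order = torch.randperm(n_blocks)
    tail = video[n_blocks * block_size:]    # any leftover frames stay in place
    return torch.cat([blocks[i] for i in order] + [tail])


if __name__ == "__main__":
    clip = torch.randn(16, 3, 64, 64)       # toy clip: 16 frames of 3x64x64
    print(perturb_frames(clip).shape, perturb_blocks(clip).shape)
```

The frame-level variant only displaces a fraction of individual frames, while the block-level variant reorders whole contiguous chunks; the distinction matters for the results discussed later in this piece.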
The paper explains:
‘[These artifacts stem] from a fundamental limitation: despite leveraging large-scale datasets, current models often rely on simplified temporal patterns in the training data (e.g., fixed walking directions or repetitive frame transitions) rather than learning diverse and plausible temporal dynamics.
“This issue is further exacerbated by the lack of explicit temporal augmentation during training, leaving models prone to overfitting to spurious temporal correlations (e.g., “frame #5 must follow #4”) rather than generalizing across diverse motion scenarios.”
Most video generation models, the authors explain, still borrow too heavily from image synthesis, concentrating on spatial fidelity while all but ignoring the temporal axis. Techniques such as cropping, flipping, and color jittering have helped improve static image quality, but they are not adequate solutions when applied to video, where the illusion of movement depends on consistent transitions between frames.
The resulting problems include flickering textures, jarring cuts between frames, and repetitive or oversimplified motion patterns.
Click to play.
The paper argues that while some models, including Stable Video Diffusion and LlamaGen, compensate with increasingly complex architectures or engineered constraints, these come at a cost in terms of computation and flexibility.
Since temporal data augmentation has already proven useful in video understanding tasks (in frameworks such as FineCLIPER, SeFAR and SVFormer), the authors find it surprising that this tactic is rarely applied in generative contexts.
Disruptive Behavior
The researchers argue that simple, structured disruptions of temporal order during training help the model generalize better to realistic and diverse motion.
‘By training on disordered sequences, the generator learns to recover plausible trajectories, effectively regularizing temporal entropy. FluxFlow bridges the gap between discriminative and generative temporal augmentation, offering a plug-and-play enhancement solution for temporally plausible video generation, while improving overall [quality].
“Unlike existing methods that introduce architectural changes or rely on post-processing, FluxFlow operates directly at the data level, introducing controlled temporal perturbations during training.”
Click to play.
The authors state that frame-level perturbations introduce finer-grained disruptions within a sequence. This kind of disruption is not dissimilar to masking augmentation, where sections of data are randomly blocked out, discouraging the system from overfitting to individual data points and encouraging better generalization.
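As a rough, speculative illustration of that 'plug-and-play, data-level' claim, the sketch below shows where such a perturbation could sit in an ordinary training loop, applied like any other augmentation. The model, loss function, and 50% application probability are placeholders of my own, not details from the paper.

```python
# Speculative sketch: temporal perturbation applied at the data level inside a
# generic training step. `model`, `loss_fn` and the probability are placeholders.
import random
import torch


def temporal_augment(clip: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """Displace a random subset of frames (same idea as the earlier sketch)."""
    t = clip.shape[0]
    idx = torch.randperm(t)[: max(2, int(t * ratio))]
    out = clip.clone()
    out[idx] = clip[idx[torch.randperm(len(idx))]]
    return out


def training_step(model, loss_fn, optimizer, batch: torch.Tensor,
                  augment_prob: float = 0.5) -> float:
    # batch: (B, T, C, H, W) video clips
    if random.random() < augment_prob:
        # Perturb the time axis of each clip, just as a crop or flip would
        # perturb the spatial axes -- no architectural change required.
        batch = torch.stack([temporal_augment(clip) for clip in batch])

    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch)     # placeholder objective; real systems
    loss.backward()                         # use a diffusion or autoregressive loss
    optimizer.step()
    return loss.item()
```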
Tests
The central idea here, owing to its simplicity, does not really run to a full-length paper, but there is nonetheless a test section that we can take a look at.
The authors tested for four queries: improvement in temporal quality while maintaining spatial fidelity; the ability to learn motion/optical flow dynamics; the maintenance of temporal quality in out-of-distribution generations; and sensitivity to key hyperparameters.
The researchers applied FluxFlow to three generative architectures: U-Net-based, in the form of VideoCrafter2; DiT-based, in the form of CogVideoX-2B; and AR-based, in the form of NOVA-0.6B.
For a fair comparison, they fine-tuned each architecture's base model with FluxFlow as an additional training phase, for one epoch, on the OpenVidHD-0.4M dataset.
The models were evaluated against two popular benchmarks: UCF-101 and VBench.
For UCF-101, the Fréchet Video Distance (FVD) and Inception Score (IS) metrics were used. For VBench, the researchers concentrated on temporal quality, frame-wise quality, and overall quality.
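For readers unfamiliar with FVD: it is the Fréchet distance between Gaussian fits of deep video features (conventionally extracted with a pretrained I3D network, which is omitted here) for real versus generated clips, with lower values indicating that the two distributions are closer. The sketch below computes that distance from pre-extracted features; it is a generic illustration, not the authors' evaluation code.

```python
# Generic sketch of the Fréchet distance underlying FVD, computed from
# pre-extracted per-clip features (the I3D feature extractor is omitted).
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """feats_*: (N, D) arrays of per-clip video features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    covmean = sqrtm(cov_r @ cov_g)          # matrix square root of the covariance product
    if np.iscomplexobj(covmean):            # numerical noise can leave tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```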
Quantitative initial evaluation of FluxFlow-Frame. '+ Original' indicates training without FluxFlow, while '+ Num × 1' indicates different FluxFlow-Frame configurations. The best results are shown in bold, and the second best underlined, for each model.
Commenting on these results, the authors state:
‘As evident from the metrics in Tabs. 1 and 2 (i.e., FVD, Subject, Flicker, Motion, and Dynamic) and from the qualitative results (image below), both FluxFlow-Frame and FluxFlow-Block significantly improve temporal quality.
“For example, the tail-chasing motion in VC2 and the surfer riding a wave in CVX become noticeably more fluid with FluxFlow. Importantly, these temporal improvements are achieved without sacrificing spatial fidelity, as evidenced by the sharp details of water splashes, smoke trails, and wave textures, together with the spatial and overall fidelity metrics.”
Below is a selection from the qualitative results referred to by the authors (see the original paper for the full results, and for better resolution):
A selection from the qualitative results.
The paper suggests that while both frame-level and block-level perturbations improve temporal quality, frame-level methods tend to perform better. This is attributed to their finer granularity, which allows more precise temporal adjustments. Block-level perturbations, by contrast, risk introducing noise, because the spatial and temporal patterns inside a block are tightly coupled, which reduces the technique's effectiveness.
Conclusion
This paper, together with the ByteDance/Tsinghua captioning collaboration released this week, makes it clear that the apparent shortcomings of the new generation of generative video models may not be due to user error, institutional failings, or funding limitations.
Until recently, the results from freely available, downloadable generative video systems were so compromised that no major remedial effort emerged from the enthusiast community (not least because the issues were fundamental rather than trivial).
Now that we are so much closer to the long-predicted age of purely AI-generated photorealistic video output, it is clear that both the research and the casual communities are taking a deeper and more productive interest in solving the remaining problems; with luck, these will not prove to be intractable obstacles.
* Note that Wan's native frame rate is only 16fps; in regard to my own issue, the forums suggest lowering the frame rate to something as low as 12fps, and then using FlowFrames or another AI-based re-flowing system to interpolate the gaps between such a sparse number of frames.
First released on Friday, March 21, 2025