Note: The project page for this work contains thirty-three autoplaying high-resolution videos, totaling around half a gigabyte, which made my system unstable when the page loaded. For this reason, I don’t link directly to it; readers who wish to can find the URL via the paper’s summary or PDF.
One of the main objectives of current video synthesis research is to generate a full AI-driven video performance from a single image. This week, a new paper from ByteDance Intelligent Creation presents what may be the most comprehensive system of its kind so far, enabling full-body and semi-body animation that combines expressive facial detail with accurate large-scale movement, while improving on identity consistency.
In the example below, we see a performance derived from a single image (top right), driven by an actor (top left). The rendering is remarkably flexible and dexterous, with none of the usual problems around creating large movements or ‘hallucinating’ occluded areas (i.e., parts of clothing or of the face that must be guessed at, or invented, because they are not visible in the single source photo).
Audio content. Click to play. The performance draws on multiple cues at once, including lip-sync, which is usually the preserve of dedicated auxiliary systems. This is a reduced-quality version of the clip from the source site (see the note at the start of the article, which applies to all the other embedded videos here).
As each clip progresses, some residual challenges around identity persistence can be seen, but this is the first system I have seen that largely (though not always) maintains identity over such durations without the use of a LoRA.
Audio content. Click to play. A further example from the DreamActor project.
The new system, titled DreamActor, uses a three-part hybrid control system, with dedicated handling of facial expression, head rotation, and core skeleton design, so that no single facet of the AI-driven performance improves at the expense of the others.
Below, you can see one of these facets, head rotation, in action. The colored balls at the corners of each thumbnail on the right represent a kind of virtual gimbal that defines head orientation independently of facial movement and expression.
Click to play. The multicolored balls visualized here represent the axis of rotation of the avatar’s head, while the facial expression is driven by a separate module, informed by the actor’s performance (seen at lower left).
One of the most interesting features of the project is its ability to derive lip-sync movements directly from audio, though this capability is not covered in the paper’s tests. It is a feature that works remarkably well even without a driving actor video.
The researchers tested the system against the leading incumbents in this pursuit, reporting that DreamActor achieves better quantitative results than rivals such as the much-lauded Runway Act-One and LivePortrait.
Quantitative results are not necessarily an empirical gold standard, since researchers can choose their own criteria; however, the accompanying qualitative tests seem to support the authors’ conclusions.
Unfortunately, the system is not intended for public release, so the only value the community can potentially draw from the work is to replicate the methodology outlined in the paper (as happened, to notable effect, with the similarly closed-source Google DreamBooth in 2022).
The paper states:
“Human image animation carries social risks, such as being misused to make fake videos. The proposed technology could be used to create fake videos of people, but existing detection tools (DeMamba, Dormant) can identify these fakes.

“To reduce these risks, clear ethical guidelines and responsible usage guidelines are required, strictly restricting access to the core model and code in order to prevent misuse.”
Naturally, this type of ethical consideration is also convenient from a commercial standpoint, since it provides a rationale for API-only access to the model, which can then be monetized. ByteDance has already taken this route in 2025, making a comparable system available only through paid credits on its Dreamina website, and that seems the likely outcome here too, since DreamActor could make for an even more powerful product. What remains to be seen is the extent to which the open-source community can build on the principles, insofar as they are explained in the paper.
The new paper is titled DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance, and comes from six researchers at ByteDance.
Method
The DreamActor system proposed in the paper aims to generate human animation from a reference image and a driving video, using a Diffusion Transformer (DiT) framework adapted to latent space (with a clear Stable Diffusion flavor, though the work only cites the landmark 2022 latent diffusion release publication).
Rather than relying on external modules to handle reference conditioning, the authors fuse appearance and motion features directly within the DiT backbone, allowing interactions across space and time through attention.
The new system’s schema: DreamActor encodes pose, facial motion, and appearance into separate latents, combining them with noised video latents produced by a 3D VAE. These signals are fused within the Diffusion Transformer using self- and cross-attention, with weights shared across branches. The model is supervised by comparing its denoised output against clean video latents. Source: https://arxiv.org/pdf/2504.01724
To do this, the model uses a pretrained 3D variational autoencoder (VAE) to encode both the input video and the reference image. The resulting latents are patchified, concatenated, fed into the DiT, and processed jointly.
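As a very rough illustration of this fusion-by-concatenation idea (a minimal sketch under assumed tensor shapes and module names, not the authors’ code), the following shows reference-image tokens and noised video tokens interacting inside a single self-attention block:

```python
import torch
import torch.nn as nn

dim, heads = 512, 8  # illustrative sizes

class JointFusionBlock(nn.Module):
    """Toy DiT-style block: reference and video tokens share one attention context."""
    def __init__(self):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, ref_tokens, video_tokens):
        # Concatenate along the token axis so self-attention spans both sources.
        x = torch.cat([ref_tokens, video_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        # Only the video portion is denoised; the reference tokens act as context.
        return x[:, ref_tokens.shape[1]:]

ref = torch.randn(1, 256, dim)    # patchified reference-image latents
vid = torch.randn(1, 1024, dim)   # patchified noised video latents
print(JointFusionBlock()(ref, vid).shape)  # torch.Size([1, 1024, 512])
```

Because the reference tokens sit in the same attention context as the video tokens, appearance information can reach every spatio-temporal position without a separate reference-injection network.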
This architecture departs from the common practice of attaching a secondary network for reference injection, which was the approach taken by the two influential Animate Anyone projects, among others.
Instead, DreamActor builds the fusion into the main model itself, simplifying the design while enhancing the flow of information between appearance and motion cues. The model is then trained using flow matching rather than a standard diffusion objective (flow matching trains the network to directly predict the velocity field between data and noise, sidestepping the score estimation of conventional diffusion models).
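For readers unfamiliar with flow matching, the sketch below illustrates the training objective in its simplest form; the toy network, shapes, and the assumption of a linear noise-to-data path are mine, not details taken from the paper:

```python
import torch
import torch.nn as nn

def flow_matching_loss(model, x0, cond):
    """x0: clean latents; cond: conditioning tokens. Assumes a linear noise-to-data path."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))   # per-sample time in [0, 1)
    x_t = (1.0 - t) * noise + t * x0                       # point on the straight path
    v_target = x0 - noise                                  # constant velocity of that path
    v_pred = model(x_t, t.flatten(), cond)                 # hypothetical model signature
    return torch.mean((v_pred - v_target) ** 2)

class ToyVelocityNet(nn.Module):
    """Stand-in network; a real DiT would also attend to the conditioning tokens."""
    def __init__(self, d=16):
        super().__init__()
        self.net = nn.Linear(d + 1, d)
    def forward(self, x_t, t, cond):                       # cond ignored in this toy
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

loss = flow_matching_loss(ToyVelocityNet(), torch.randn(4, 16), torch.randn(4, 8))
loss.backward()
```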
Hybrid Motion Guidance
The hybrid motion guidance that informs the neural rendering combines pose tokens derived from a 3D body skeleton and a head sphere; implicit facial representations extracted by a pretrained face encoder; and reference appearance tokens sampled from the source image.
These elements are integrated within the Diffusion Transformer through distinct attention mechanisms, allowing the system to coordinate global movement, facial expression, and visual identity throughout the generation process.
In the first of these cases, rather than relying on facial landmarks, DreamActor uses implicit facial representations to guide expression generation, decoupling identity from expression and allowing finer control over facial dynamics, while head pose is handled separately.
To create these representations, the pipeline first detects and crops the face region in each frame of the driving video, resizing it to 224x224px. The cropped faces are processed by a face motion encoder pretrained with PD-FGC, followed by an MLP layer.
PD-FGC, as employed in DreamActor, generates a talking head from a reference image with disentangled control of lip-sync (from audio), head pose, eye movement, and expression (from separate videos), allowing accurate and independent manipulation of each. Source: https://arxiv.org/pdf/2211.14506
The result is a sequence of face motion tokens, which are injected into the Diffusion Transformer via a cross-attention layer.
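A schematic sketch of this face-motion branch is shown below. The encoder here is a trivial stand-in for the PD-FGC-pretrained face motion encoder, and all module names and dimensions are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class FaceMotionBranch(nn.Module):
    def __init__(self, motion_dim=256, token_dim=512):
        super().__init__()
        # Trivial stand-in for a PD-FGC-style pretrained face motion encoder.
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, motion_dim))
        self.mlp = nn.Sequential(nn.Linear(motion_dim, token_dim), nn.GELU(),
                                 nn.Linear(token_dim, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, 8, batch_first=True)

    def forward(self, face_crops, video_tokens):
        # face_crops: (batch, frames, 3, 224, 224) cropped faces from the driving video.
        b, f = face_crops.shape[:2]
        feats = self.face_encoder(face_crops.flatten(0, 1)).view(b, f, -1)
        motion_tokens = self.mlp(feats)                          # one token per frame
        # Video tokens attend to the face-motion tokens (cross-attention injection).
        fused, _ = self.cross_attn(video_tokens, motion_tokens, motion_tokens)
        return video_tokens + fused

crops = torch.randn(1, 8, 3, 224, 224)
vid_tokens = torch.randn(1, 1024, 512)
print(FaceMotionBranch()(crops, vid_tokens).shape)  # torch.Size([1, 1024, 512])
```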
The same framework also supports an audio-driven variant, in which a dedicated encoder is trained to map speech features directly to facial motion tokens. This enables the generation of facial animation with synchronized lip movements without any driving video at all.
Audio content. Click to play. Lip-sync derived purely from audio, with no reference to a driving actor. The only other input is the static photo seen at top right.
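Conceptually, the audio-driven variant only needs to swap the video-derived face features for audio-derived ones that land in the same token space. The fragment below is a guess at the general shape of such a mapping, with invented dimensions; the paper does not publish this component:

```python
import torch
import torch.nn as nn

audio_feat_dim, token_dim = 80, 512     # e.g. mel bins -> face-motion token width (assumed)

# Maps a sequence of speech features to tokens in the same space as the
# video-derived face-motion tokens, so the rest of the pipeline is unchanged.
audio_to_motion = nn.Sequential(
    nn.Linear(audio_feat_dim, token_dim), nn.GELU(),
    nn.Linear(token_dim, token_dim))

mel = torch.randn(1, 120, audio_feat_dim)   # 120 audio frames
motion_tokens = audio_to_motion(mel)        # drop-in replacement for the face tokens
print(motion_tokens.shape)                  # torch.Size([1, 120, 512])
```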
Second, to control head pose independently of facial expression, the system introduces a 3D head sphere representation (seen in the video embedded earlier in this article). This separates facial dynamics from global head movement, increasing precision and flexibility during animation.
The head sphere is generated by extracting 3D facial parameters such as rotation and camera pose from the driving video, using the FaceVerse tracking method.
Schema for the FaceVerse project. Source: https://www.liuyebin.com/faceverse/faceverse.html
These parameters are used to render a colored sphere projected onto the 2D image plane, spatially aligned with the driving head. The size of the sphere matches the reference head, and its color reflects head orientation. This abstraction reduces the difficulty of learning 3D head motion, and helps preserve the stylized or exaggerated head shapes of characters drawn from animation.
Visualization of control spheres affecting head orientation.
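A crude approximation of how such a control sphere might be rasterized is given below; the color mapping, shading, and function signature are illustrative choices rather than the authors’ renderer:

```python
import numpy as np

def render_head_sphere(h, w, center, radius, yaw, pitch, roll):
    """Return an (h, w, 3) float image containing a single shaded colour sphere."""
    img = np.zeros((h, w, 3), dtype=np.float32)
    # Map each rotation angle (radians) into [0, 1] and use it as a colour channel.
    color = (np.array([yaw, pitch, roll]) / np.pi + 1.0) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (xx - center[0]) ** 2 + (yy - center[1]) ** 2
    mask = dist2 <= radius ** 2
    # Simple spherical shading so the disc still reads as a ball.
    shade = np.sqrt(np.clip(1.0 - dist2 / (radius ** 2 + 1e-6), 0.0, 1.0))
    img[mask] = color * shade[mask, None]
    return img

# Sphere sized and positioned to match a tracked head, with colour encoding its rotation.
sphere = render_head_sphere(640, 960, center=(480, 200), radius=60,
                            yaw=0.3, pitch=-0.1, roll=0.0)
print(sphere.shape)
```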
Finally, to guide whole-body movement, the system uses a 3D body skeleton with adaptive bone-length normalization. Body and hand parameters are estimated using 4DHumans and the hand-focused HaMeR, both of which operate on the SMPL-X body model.
SMPL-X fits a parametric mesh to the entire human body in an image, jointly estimating pose and expression, so that the mesh can be used as a volumetric guide for pose-aware manipulation. Source: https://arxiv.org/pdf/1904.05866
From these outputs, key joints are selected, projected into 2D, and connected into a line-based skeleton map. Unlike methods such as Champ, which render full-body meshes, this approach avoids imposing predefined shape priors and relies solely on skeletal structure, encouraging the model to infer body shape and appearance directly from the reference image, reducing bias toward fixed body types, and improving generalization across a range of poses and builds.
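The sketch below shows the general recipe of projecting 3D joints into a line-based skeleton map, using a toy bone topology and a simple pinhole camera; in the actual pipeline the joints come from 4DHumans and HaMeR, not from random numbers:

```python
import numpy as np
import cv2  # OpenCV, used only to draw the limb lines

BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7)]  # toy topology

def skeleton_map(joints_3d, focal, h, w):
    """joints_3d: (J, 3) camera-space joints; returns an (h, w, 3) uint8 skeleton image."""
    z = np.clip(joints_3d[:, 2:3], 1e-3, None)
    pts = joints_3d[:, :2] / z * focal + np.array([w / 2, h / 2])   # pinhole projection
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for a, b in BONES:
        pa = (int(pts[a][0]), int(pts[a][1]))
        pb = (int(pts[b][0]), int(pts[b][1]))
        cv2.line(canvas, pa, pb, (0, 255, 0), 3)                    # lines only, no mesh
    return canvas

joints = np.random.randn(8, 3) * 0.3 + np.array([0.0, 0.0, 2.0])   # 8 toy joints ~2m away
print(skeleton_map(joints, focal=500.0, h=640, w=960).shape)
```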
During training, the 3D body skeleton is concatenated with the head sphere and passed through a pose encoder. The resulting features are combined with the noised video latents to produce the tokens used by the Diffusion Transformer.
At inference, the system accounts for skeletal differences between subjects by normalizing bone lengths. The pretrained SeedEdit image-editing model converts both the reference and driving images into a standard canonical configuration, and RTMPose is then used to extract skeletal proportions, which are used to adjust the driving skeleton to match the anatomy of the reference subject.
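Bone-length normalization of this kind can be pictured as rescaling each driving bone to the reference subject’s proportions while preserving its direction. The snippet below is a simplified, assumption-laden version of that adjustment (a short kinematic chain, 2D joints), not the paper’s implementation:

```python
import numpy as np

BONES = [(0, 1), (1, 2), (2, 3)]   # toy chain, ordered parent before child

def normalize_bone_lengths(drive_joints, ref_joints, bones=BONES):
    """Rescale a 2D driving skeleton (J, 2) to the reference subject's bone lengths."""
    out = drive_joints.astype(np.float32).copy()
    for parent, child in bones:
        direction = drive_joints[child] - drive_joints[parent]
        direction = direction / (np.linalg.norm(direction) + 1e-8)   # keep the pose direction
        ref_len = np.linalg.norm(ref_joints[child] - ref_joints[parent])
        out[child] = out[parent] + direction * ref_len               # impose the reference length
    return out

drive = np.array([[0, 0], [0, 50], [0, 100], [0, 130]], dtype=np.float32)
ref = np.array([[0, 0], [0, 40], [0, 75], [0, 95]], dtype=np.float32)
print(normalize_bone_lengths(drive, ref))
```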
An overview of the inference pipeline. Pseudo-references may optionally be generated to enrich appearance cues, while hybrid control signals (implicit facial motion, plus explicit pose from the head sphere and body skeleton) are extracted from the driving video. These are then fed into the DiT model to generate the animated output, with facial motion decoupled from body pose, and with audio available as an alternative driver.
Appearance Guidance
To increase appearance fidelity, especially in occluded or rarely-visible regions, the system supplements the primary reference image with pseudo-references sampled from the input video.
Click to play. The system anticipates the need for occluded areas to be rendered accurately and consistently, which is roughly the nearest this kind of project can currently come to the bitmap-texture approach of traditional CGI.
These additional frames are selected for pose diversity using RTMPose, and filtered using CLIP-based similarity to maintain consistency with the subject’s identity.
All reference frames (primary and pseudo) are encoded by the same visual encoder and fused through a self-attention mechanism, allowing the model to access complementary appearance cues. This setup improves coverage of details such as profile views and limb textures. Pseudo-references are always used during training, and optionally at inference.
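The selection logic for pseudo-references can be imagined roughly as follows; the embedding functions are stand-ins for RTMPose keypoints and CLIP features, and the thresholds are invented for the sketch:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_pseudo_references(pose_embs, id_embs, ref_id_emb,
                             min_pose_dist=0.3, min_id_sim=0.8, max_refs=4):
    """Pick frames whose identity matches the reference but whose poses differ from each other."""
    chosen = []
    for i, (pose, ident) in enumerate(zip(pose_embs, id_embs)):
        if cosine(ident, ref_id_emb) < min_id_sim:
            continue   # drop frames that drift from the subject's identity
        if all(1.0 - cosine(pose, pose_embs[j]) >= min_pose_dist for j in chosen):
            chosen.append(i)   # keep only poses not already covered
        if len(chosen) == max_refs:
            break
    return chosen

rng = np.random.default_rng(0)
poses = rng.normal(size=(20, 32))                             # stand-in pose embeddings
ids = rng.normal(size=(20, 64)) * 0.05 + rng.normal(size=64)  # mostly one identity
print(select_pseudo_references(poses, ids, ids.mean(axis=0)))
```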
Training
DreamActor was trained in three stages, to gradually introduce complexity and improve stability.
In the first stage, only the 3D body skeleton and the 3D head sphere were used as control signals, with the implicit facial representations left out. This allowed the base video generation model, initialized from MM-DiT, to adapt to human animation without being overwhelmed by fine-grained control.
In the second stage, the implicit facial representation was added, with all other parameters frozen. At this point, only the face motion encoder and the facial attention layers were trained, allowing the model to learn expressive facial detail in isolation.
In the final stage, all parameters were unfrozen for joint optimization of overall appearance, pose, and facial dynamics.
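In code, such a staged schedule usually amounts to toggling which parameter groups receive gradients. The helper below is a hypothetical illustration (the parameter-name patterns are invented), not the authors’ training script:

```python
import torch.nn as nn

def set_stage(model, stage):
    """Toggle trainable parameter groups for a three-stage schedule (names are invented)."""
    for name, p in model.named_parameters():
        if stage == 1:      # body skeleton + head sphere only: face branch stays frozen
            p.requires_grad = "face_motion" not in name and "face_attn" not in name
        elif stage == 2:    # train only the face motion encoder and face attention layers
            p.requires_grad = "face_motion" in name or "face_attn" in name
        else:               # stage 3: unfreeze everything for joint optimization
            p.requires_grad = True

toy = nn.ModuleDict({"backbone": nn.Linear(4, 4),
                     "face_motion_encoder": nn.Linear(4, 4),
                     "face_attn": nn.Linear(4, 4)})
set_stage(toy, 2)
print([n for n, p in toy.named_parameters() if p.requires_grad])
```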
Data and Testing
For the tests, the model was initialized from a pretrained image-to-video DiT checkpoint† and trained across the three stages: 20,000 steps for each of the first two stages, and 30,000 steps for the third.
To improve generalization across varying durations and resolutions, video clips were randomly sampled at lengths of 25 to 121 frames, and resized to 960x640px while preserving aspect ratio.
Training was performed on eight (China-market) NVIDIA H20 GPUs, each with 96GB of VRAM, using the AdamW optimizer at a learning rate of 5e-6.
At inference, each video segment contained 73 frames. To maintain consistency between segments, the final latent from one segment was reused as the initial latent of the next, framing the task as sequential image-to-video generation.
Classifier-free guidance was applied with a weight of 2.5 for both the reference image and the motion control signals.
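Putting the last two points together, the inference loop can be sketched as below: a placeholder model and one-step ‘sampler’ stand in for the real DiT and its solver, the latent layout is assumed to be (batch, channels, frames, height, width), and only the segment chaining and the 2.5 guidance weight are taken from the paper:

```python
import torch

CFG_SCALE = 2.5          # guidance weight reported in the paper
SEGMENT_FRAMES = 73      # frames per segment at inference

def guided_velocity(model, x_t, t, cond):
    # Classifier-free guidance: extrapolate away from the unconditional prediction.
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, None)
    return v_uncond + CFG_SCALE * (v_cond - v_uncond)

def generate_segments(model, sampler, cond_per_segment, latent_shape):
    prev_last, outputs = None, []
    for cond in cond_per_segment:
        init = torch.randn(latent_shape)
        if prev_last is not None:
            init[:, :, 0] = prev_last        # seed with the previous segment's final latent
        seg = sampler(lambda x, t: guided_velocity(model, x, t, cond), init)
        outputs.append(seg)
        prev_last = seg[:, :, -1]            # carry the last frame latent forward
    return torch.cat(outputs, dim=2)         # concatenate segments along the time axis

# Toy demo: a zero-velocity model and a single-step "sampler".
toy_model = lambda x, t, c: torch.zeros_like(x)
toy_sampler = lambda fn, init: init + fn(init, 0.5)
video = generate_segments(toy_model, toy_sampler, [None, None], (1, 4, SEGMENT_FRAMES, 8, 8))
print(video.shape)   # torch.Size([1, 4, 146, 8, 8])
```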
The authors constructed a training dataset (its sources are not disclosed in the paper) comprising 500 hours of video drawn from a variety of domains, featuring instances of dance, sport, film, and public speaking. The dataset was designed to capture a wide range of human movement and expression, with an even distribution between full-body and half-body shots.
To improve facial synthesis quality, the NeRSemble dataset was incorporated into the data preparation process.
An example from the NeRSemble dataset, used to augment the data for DreamActor. Source: https://www.youtube.com/watch?v=a-oawqbzldu
For evaluation, the researchers used their dataset as a benchmark to assess generalization across a variety of scenarios.
Model performance was measured using standard metrics from prior work: Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Peak Signal-to-Noise Ratio (PSNR) for frame-level quality, with Fréchet Video Distance (FVD) used to assess temporal consistency and overall video fidelity.
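For reference, the frame-level metrics can be reproduced with common open-source tooling, as in the sketch below (FVD needs a pretrained video network and is omitted); this is generic measurement code, not the authors’ evaluation harness:

```python
import numpy as np
import torch
import lpips                                   # pip install lpips (downloads AlexNet weights)
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

pred = np.random.rand(256, 256, 3).astype(np.float32)    # generated frame in [0, 1]
target = np.random.rand(256, 256, 3).astype(np.float32)  # ground-truth frame

psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
ssim = structural_similarity(target, pred, channel_axis=2, data_range=1.0)

loss_fn = lpips.LPIPS(net='alex')
to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None] * 2 - 1   # NCHW in [-1, 1]
lpips_score = loss_fn(to_t(pred), to_t(target)).item()

print(f"PSNR {psnr:.2f}  SSIM {ssim:.3f}  LPIPS {lpips_score:.3f}")
```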
The authors conducted experiments on both body animation and portrait animation tasks, all using a single (target) reference image.
For body animation, DreamActor-M1 was compared to Animate Anyone, Champ, MimicMotion, and DisPose.
Quantitative comparison with rival frameworks.
While the PDF provides static images for visual comparison, one of the videos on the project site may highlight the differences more clearly.
Audio content. Click to play. A visual comparison against the challenger frameworks. The driving video can be seen at top left, and the authors’ conclusion that DreamActor produces the best results seems reasonable.
For the portrait animation tests, the model was rated against LivePortrait, X-Portrait, SkyReels-A1, and Act-One.
Quantitative comparison of portrait animations.
The authors note that the method wins quantitative tests and argue that it is also qualitatively superior.
Audio content. Click to play. An example of a comparison of portrait animations.
Perhaps the third and final clip shown in the video above displays a less convincing lip-sync than some of the rival frameworks, though the general quality is very high.
Conclusion
By anticipating the need for textures that are implied, but not actually present, in the single target image driving these recreations, ByteDance has addressed one of the biggest challenges facing diffusion-based video generation: consistent, persistent texture. Once such an approach is perfected, the next logical step would be to create some kind of reference atlas from an initial generated clip, which could then be applied to later generations in order to maintain appearance without a LoRA.
Such an approach is effectively an external reference, but that is no different from the texture mapping of traditional CGI techniques; and the realism and plausibility obtained are far superior to what the older methods can achieve.
That said, the most impressive aspect of DreamActor is its combination of three guidance systems, which bridges, in a creative way, the traditional gap between face-focused and body-focused human synthesis.
It remains to be seen whether any of these core principles can be leveraged in more accessible offerings; as it stands, DreamActor seems destined to provide yet another synthesis-as-a-service, given its usage restrictions and the impracticality of extensive experimentation with an architecture of commercial scale.
* My substitution of hyperlinks for the authors’ inline citations.
† As mentioned earlier, it is not clear exactly which Stable Diffusion-flavored model was used in this project.
First released on Friday, April 4, 2025