HunyuanCustom Brings Single-Image Video Deepfakes With Audio and Lip Sync

May 9, 2025 · 26 Min Read

This article covers the new release of the multimodal Hunyuan Video world model, 'HunyuanCustom'. The sheer breadth of the new paper, combined with several issues affecting many of the sample videos on the project page*, restricts this piece to more general coverage than usual, and to limited reproduction of the enormous volume of video material accompanying the release (many of the videos required significant re-editing and processing to improve the readability of their layout).

Please also note that the paper refers to the API-based generative system Kling as 'Keling'. For clarity, I refer to 'Kling' throughout.

Tencent is in the process of releasing a new version of its Hunyuan Video model, named HunyuanCustom. The new release is apparently capable of making Hunyuan LoRA models redundant, by allowing users to create 'deepfake'-style video customizations from a single image:

Click to play. Prompt: "A man is listening to music and cooking snail noodles in the kitchen." The new method is compared against both closed-source and open-source approaches, including Kling, a key contender in this space. Source: https://hunyuancustom.github.io/ (warning: CPU/memory-intensive site!)

The left-most column of the video above shows the single source image supplied to HunyuanCustom, followed in the second column by the new system's interpretation of the prompt. The remaining columns show the results from various proprietary and FOSS systems: Vidu; Pika; Hailuo; and the Wan-based SkyReels-A2.

The video below shows renderings of three scenarios central to this release: person + object; single-character emulation; and virtual try-on (person + clothes):

Click to play. Three examples edited from the material at the Hunyuan Video supplementary site.

There are several things to notice in these examples, most of them relating to the system's reliance on a single source image, rather than on multiple images of the same subject.

In the first clip, the man remains essentially facing the camera. He dips his head down and to the side by no more than 20-25 degrees of rotation; beyond that inclination, the system would have to start guessing what he looks like in profile, which is difficult, and probably impossible to extrapolate accurately from a single frontal image.

In the second example, the little girl is smiling in the rendered video just as she is in the single static source image. Again, with only this one image as reference, HunyuanCustom has to make a relatively uninformed guess at what her 'resting face' looks like. Additionally, her face does not deviate from its camera-facing stance any more than in the previous example ('man eating potato chips').

In the final example, we see that because the source material (the woman, and the clothing she is prompted to wear) consists of incomplete images, the rendering has cropped the scenario to fit.

The point is that although the new system can handle multiple images (such as person + potato chips, or person + clothes), it apparently does not allow for multiple angles or alternative views of a single character, which would let it accommodate a wider variety of expressions and unusual angles. To that extent, the system may therefore struggle to displace the burgeoning ecosystem of LoRA models that has grown up around HunyuanVideo since its release last December; these help HunyuanVideo to produce consistent characters from any angle, and with any expression represented in the training dataset (typically 20-60 images).

Wired for sound

For audio, HunyuanCustom leverages the LatentSync system (which is notoriously difficult for hobbyists to set up and get good results from) to produce lip movements that match the audio and text supplied by the user.

Click to play. Various examples of lip-sync from the HunyuanCustom supplementary site, edited together (with audio).

There are no English-language examples at the time of writing, but those available seem rather good, all the more so if the method of creating them proves easy to install and accessible.

Editing an existing video

The new system offers very impressive results for video-to-video (V2V, or Vid2Vid) editing, in which a segment of an existing (real) video is masked and intelligently replaced by a subject supplied in a single reference image. Below is an example from the supplementary materials site.

Click to play. Only the central object is targeted, but everything around it is also altered to some extent in the HunyuanCustom Vid2Vid pass.

As we can see, and as is standard in a Vid2Vid scenario, the entire video is altered to some degree by the process, though the targeted area (here, the toy) is affected the most. Presumably, pipelines could be developed to perform such transformations under a garbage-matte approach that leaves the majority of the video content identical to the original. This is what Adobe Firefly does under the hood, and it does it quite well; but it remains a poorly served process in the FOSS generative scene.

That said, most of the alternative examples provided do a better job of targeting these integrations, as we can see in the assembled compilation below:

Click to play. Diverse examples of content replacement using Vid2Vid in HunyuanCustom, showing notable respect for the surrounding material.

A new start?

This initiative is a development of the Hunyuan Video project, not a hard pivot away from its development stream. The project's enhancements are introduced as discrete architectural insertions rather than sweeping structural changes, with the aim of allowing the model to maintain identity fidelity across frames without relying on subject-specific fine-tuning, as LoRA and textual-inversion approaches do.

To be clear, then, HunyuanCustom is not trained from scratch, but is rather a fine-tune of the December 2024 HunyuanVideo foundation model.

Those who have developed HunyuanVideo LoRAs may wonder whether they will still work with this new edition, or whether they will need to reinvent the LoRA wheel if they want more customization features than are built into this new release.

In general, a heavily fine-tuned release of a hyperscale model alters the model weights enough that LoRAs made for the earlier model will not work properly, or at all, with the newer one.

Sometimes, however, a fine-tune's popularity challenges its origins: one example of a fine-tune becoming an effective fork, with a dedicated ecosystem and followers of its own, is the Pony Diffusion tuning of Stable Diffusion XL (SDXL). Pony now has over 592,000 downloads on the ever-changing CivitAI domain, with a vast range of LoRAs that use Pony (not SDXL) as the base model and that require Pony at inference time.

Release

The project page for the new paper (titled HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation) features a link to a GitHub site that, as I write, appears to be functional and to include all the code and weights needed for a local implementation, together with a proposed timeline (on which the only significant outstanding item is ComfyUI integration).

At the time of writing, the project's Hugging Face presence is a 404. There is, however, an API-based version that allows you to demo the system, as long as you can provide a WeChat scan code.

I have rarely seen such an elaborate and extensive use of so wide a variety of projects in a single assembly as is evident in HunyuanCustom.

Two models are announced on the GitHub page: a 720px1280px version requiring 80GB of peak GPU memory, and a 512px896px version requiring 60GB of peak GPU memory.

The repository states: 'The minimum GPU memory required is 24GB for 720px1280px129f, but it will be very slow. We recommend using a GPU with 80GB of memory for better generation quality,' and reiterates that the system has so far only been tested on Linux.

Since its official release, the earlier Hunyuan Video model has been quantized down to sizes that can run on less than 24GB of VRAM, and it seems reasonable to assume that the new model will likewise be adapted by the community into more consumer-friendly forms, and that it will also quickly be adapted for use on Windows systems.

Due to time constraints and the overwhelming amount of information accompanying this release, we can only take a broader look rather than examining the release in depth. Nonetheless, let's pop the hood on HunyuanCustom a little.

A Look at the Paper

Apparently compliant with the GDPR framework, HunyuanCustom's data pipeline incorporates both synthetic and open-source video datasets, including OpenHumanVid, covering eight core categories: humans, animals, plants, landscapes, vehicles, objects, architecture, and anime.

An overview of the diverse contributing packages in HunyuanCustom's data-construction pipeline, from the release paper. Source: https://arxiv.org/pdf/2505.04512

Initial filtering begins with PySceneDetect, which segments videos into single-shot clips. TextBPN-Plus-Plus is then used to remove videos containing excessive on-screen text, subtitles, watermarks, or logos.

To address inconsistencies in resolution and duration, clips are standardized to five seconds in length and resized to 512 or 720 pixels on the short side. Aesthetic filtering is handled with Koala-36M, using a custom threshold applied to the bespoke dataset curated by the new paper's researchers.
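
As a concrete illustration of this preprocessing stage, the sketch below shows how shot segmentation and short-side resizing might be scripted with PySceneDetect and OpenCV. The detector threshold and target sizes here are illustrative assumptions, not the values used by the Hunyuan team.

```python
# Minimal sketch of shot segmentation and short-side resizing, assuming
# PySceneDetect and OpenCV; the threshold and sizes are illustrative only.
import cv2
from scenedetect import detect, ContentDetector

def split_into_shots(video_path: str):
    """Return (start_seconds, end_seconds) pairs for detected single-shot clips."""
    scenes = detect(video_path, ContentDetector(threshold=27.0))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]

def resize_short_side(frame, target: int = 512):
    """Resize a frame so its short side equals `target`, preserving aspect ratio."""
    h, w = frame.shape[:2]
    scale = target / min(h, w)
    return cv2.resize(frame, (round(w * scale), round(h * scale)))
```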

The subject-extraction process combines the Qwen7B Large Language Model (LLM), the YOLO11X object-recognition framework, and the popular InsightFace architecture to identify and validate human identities.
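
A rough sketch of how such a human-subject validation step could look, pairing an Ultralytics YOLO11X detector with InsightFace embeddings, is shown below; the model names, thresholds, and wiring between them are my own assumptions, not the paper's code.

```python
# Hedged sketch: detect people with YOLO11x, then extract face embeddings with
# InsightFace for identity validation. Purely illustrative of the described step.
import cv2
from ultralytics import YOLO
from insightface.app import FaceAnalysis

person_detector = YOLO("yolo11x.pt")           # COCO class 0 = 'person'
face_app = FaceAnalysis(name="buffalo_l")      # face detection + ArcFace-style embeddings
face_app.prepare(ctx_id=0, det_size=(640, 640))

def validate_human_subject(image_path: str):
    img = cv2.imread(image_path)
    persons = [b for b in person_detector(img)[0].boxes if int(b.cls) == 0]
    faces = face_app.get(img)                  # each face exposes .normed_embedding
    return len(persons) > 0 and len(faces) > 0, faces
```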

For non-human subjects, QwenVL and Grounded SAM 2 are used to extract the relevant bounding boxes, which are discarded if they are too small.

An example of semantic segmentation with Grounded SAM 2, one of the projects used in HunyuanCustom. Source: https://github.com/idea-research/grounded-sam-2

Multi-subject extraction uses Florence2 for bounding-box annotation and Grounded SAM 2 for segmentation, followed by clustering and temporal segmentation of the training frames.

The processed clips are further enriched with annotation, using a proprietary structured-labeling system developed by the Hunyuan team that provides layered metadata such as descriptions and camera-motion cues.

A mask-augmentation strategy, including conversion of masks to bounding boxes, was applied during training to reduce overfitting and to ensure that the model adapts to a variety of object shapes.
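
The paper's coverage here does not give implementation details for this augmentation, but converting a segmentation mask into its enclosing bounding box is straightforward; a minimal illustrative sketch:

```python
# Minimal sketch: replace a binary segmentation mask with a filled bounding-box
# mask of the same shape, as one form of the mask augmentation described above.
import numpy as np

def mask_to_bbox_mask(mask: np.ndarray) -> np.ndarray:
    ys, xs = np.where(mask > 0)
    box = np.zeros_like(mask)
    if ys.size:                                # leave empty masks unchanged
        box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return box
```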

Audio data was synchronized using the aforementioned LatentSync, with clips discarded if their synchronization score fell below a minimum threshold.

The blind image-quality assessment framework HyperIQA was used to exclude videos scoring below 40 (on HyperIQA's bespoke scale). Valid audio tracks were then processed with Whisper to extract features for downstream tasks.
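
The paper's coverage does not specify which Whisper outputs are retained; a minimal sketch of extracting both encoder features and a transcript from a validated audio track, assuming the openai-whisper package, might look like this:

```python
# Hedged sketch of Whisper-based feature extraction for a validated audio clip.
import whisper

model = whisper.load_model("base")

def extract_audio_features(path: str):
    """Return frame-level encoder features and a transcript for one audio track."""
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    features = model.encoder(mel.unsqueeze(0))     # (1, n_frames, d_model)
    text = model.transcribe(path)["text"]
    return features, text
```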

The authors incorporate the LLaVA language-assistant model during the annotation phase, and emphasize the central position this framework occupies in HunyuanCustom: LLaVA generates image captions, helps to align visual content with text prompts, and supports the construction of coherent training signals across modalities.

The HunyuanCustom framework supports identity-consistent video generation, conditioned on text, image, audio, and video inputs.

By leveraging LLaVA's vision-language alignment capabilities, the pipeline gains an additional layer of semantic consistency between visual elements and their textual descriptions.

Custom Video

To enable video generation based on a reference image and a prompt, the two modules centered around LLaVA were created, first by adapting HunyuanVideo's input structure so that it could accept an image alongside text.

This involves formatting the prompt in a way that either embeds the image directly or tags it with a short identity description, with a separator token used to stop the image embedding from overwhelming the prompt content.
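
The exact template and token names are not given in the coverage here, but the idea can be sketched as follows; the <image> and <SEP> tokens are hypothetical placeholders, not HunyuanCustom's own vocabulary.

```python
# Purely illustrative prompt templating: the reference image is tagged with a
# short identity description and kept apart from the scene prompt by a
# separator token. Token names are hypothetical.
IMAGE_TOKEN = "<image>"
SEPARATOR = "<SEP>"

def build_prompt(identity_desc: str, scene_prompt: str) -> str:
    return f"{IMAGE_TOKEN} {identity_desc} {SEPARATOR} {scene_prompt}"

# e.g. build_prompt("a young woman with short dark hair",
#                   "cooking noodles in a bright kitchen")
```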

Because LLaVA's visual encoder tends to compress or discard fine-grained spatial details when aligning image and text features (especially when translating a single reference image into a general semantic embedding), an identity enhancement module is incorporated. Since nearly all video latent diffusion models have some difficulty maintaining an identity without a LoRA, even across a five-second clip, the performance of this module in community testing could prove significant.

In any case, the reference image is resized and encoded using the causal 3D-VAE from the original HunyuanVideo model, and its latent is concatenated with the video latents along the temporal axis, with a spatial offset applied to prevent the image from being reproduced directly in the output.
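
At the tensor level, that conditioning step might be sketched roughly as below; the shapes and the size of the spatial shift are assumptions for illustration, not the model's actual values.

```python
# Hedged sketch: append the VAE-encoded reference image to the video latents
# along the temporal axis, with a spatial shift so the image is not copied
# verbatim into the output. Shapes and shift amount are illustrative.
import torch

def attach_reference(video_latents: torch.Tensor, image_latent: torch.Tensor, shift: int = 4):
    """video_latents: (B, C, T, H, W); image_latent: (B, C, 1, H, W)."""
    shifted = torch.roll(image_latent, shifts=(shift, shift), dims=(-2, -1))
    return torch.cat([shifted, video_latents], dim=2)   # prepend along the time axis
```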

The model was trained using flow matching, with noise samples drawn from a logit-normal distribution, and the network trained to recover the correct video from these noised latents. LLaVA and the video generator were fine-tuned together, so that the image and the prompt can guide the output more fluently while keeping the subject's identity consistent.
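
For readers unfamiliar with the training objective, a generic flow-matching step with logit-normal timestep sampling looks roughly like the sketch below; the denoiser call and tensor shapes are placeholders, not HunyuanCustom's actual code.

```python
# Generic flow-matching training step with logit-normal timestep sampling.
# Illustrative only: `model` and the latent shapes are placeholders.
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond):
    """x1: clean video latents (B, C, T, H, W); cond: text/image conditioning."""
    b = x1.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x1.device)).view(b, 1, 1, 1, 1)  # logit-normal
    x0 = torch.randn_like(x1)                  # pure noise endpoint
    xt = (1 - t) * x0 + t * x1                 # point on the straight-line path
    target_velocity = x1 - x0                  # velocity the network should predict
    pred = model(xt, t.flatten(), cond)
    return F.mse_loss(pred, target_velocity)
```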

For multi-subject prompts, each image-text pair is embedded separately and assigned a distinct temporal position, allowing identities to be distinguished and supporting the generation of scenes involving multiple interacting subjects.

Sound and Vision

HunyuanCustom handles audio/speech-driven generation using both user-supplied audio and a text prompt, allowing characters to speak within scenes that reflect the described setting.

To support this, an identity-disentangled AudioNet module introduces audio features without disrupting the identity signals embedded from the reference image and prompt. These features are aligned with the compressed video timeline, divided into frame-level segments, and injected using a spatial cross-attention mechanism that keeps each frame isolated, preserving subject consistency and avoiding temporal interference.

A second, temporal injection module provides finer control over timing and motion. Working in parallel with AudioNet, it maps audio features to specific regions of the latent sequence and converts them into per-token motion offsets using a multi-layer perceptron (MLP), allowing gestures and facial movements to follow the rhythm and emphasis of the spoken audio.
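
A highly simplified sketch of how these two audio pathways could be combined (per-frame cross-attention plus an MLP producing per-token motion offsets) is given below; module names, dimensions, and wiring are my own illustrative assumptions rather than the actual HunyuanCustom modules.

```python
# Hedged sketch of frame-isolated audio injection: per-frame spatial
# cross-attention on audio features, plus an MLP yielding per-token motion offsets.
import torch
import torch.nn as nn

class AudioInjection(nn.Module):
    def __init__(self, dim: int = 1024, audio_dim: int = 768):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.motion_mlp = nn.Sequential(nn.Linear(audio_dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, video_tokens, audio_feats):
        """video_tokens: (B, T, N, D); audio_feats: (B, T, A), aligned per frame."""
        B, T, N, D = video_tokens.shape
        q = video_tokens.reshape(B * T, N, D)                    # isolate each frame
        kv = self.audio_proj(audio_feats).reshape(B * T, 1, D)
        attended, _ = self.attn(q, kv, kv)                       # spatial cross-attention
        offsets = self.motion_mlp(audio_feats).unsqueeze(2)      # per-token motion offsets
        return video_tokens + attended.reshape(B, T, N, D) + offsets
```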

HunyuanCustom also allows subjects in existing videos to be edited directly, replacing or inserting people and objects into a scene without rebuilding the entire clip from scratch. This can be useful for tasks that involve altering appearance or motion in a targeted way.

Click to play. Further examples from the supplementary site.

To make subject replacement in existing videos efficient, the new system avoids the resource-intensive approach of recent methods such as the currently popular VACE, or of those that merge entire video sequences together. This keeps the process relatively lightweight while still allowing external video content to guide the output.

A small neural network handles the alignment between the clean input video and the noisy latents used in generation. The system tests two ways of injecting this information: merging the two sets of features before compressing them again; or adding the features frame by frame. The authors found that the second method works better, avoiding quality loss without changing the computational load.
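
A schematic sketch of those two injection variants, with a small 1x1x1 convolution standing in for the alignment network, might look like this; everything here is an illustrative stand-in rather than the paper's architecture.

```python
# Hedged sketch of the two conditioning variants: concatenate the aligned
# reference-video features with the noisy latents and re-compress, or simply
# add them frame by frame (the variant reported to work better).
import torch
import torch.nn as nn

class VideoConditioner(nn.Module):
    def __init__(self, channels: int = 16, mode: str = "add"):
        super().__init__()
        self.mode = mode
        self.align = nn.Conv3d(channels, channels, kernel_size=1)       # small alignment net
        self.merge = nn.Conv3d(2 * channels, channels, kernel_size=1)   # used only for 'concat'

    def forward(self, noisy_latents, clean_video_latents):
        """Both tensors: (B, C, T, H, W)."""
        aligned = self.align(clean_video_latents)
        if self.mode == "concat":
            return self.merge(torch.cat([noisy_latents, aligned], dim=1))
        return noisy_latents + aligned                                  # per-frame addition
```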

Data and Testing

The metrics used in testing were: identity consistency, via ArcFace, which extracts face embeddings from both the reference image and each frame of the generated video and calculates the average cosine similarity between them; subject similarity, by sending YOLO11X segments to DINO 2 for comparison; CLIP-B, for text-video alignment, measuring the similarity between the prompt and the generated video; CLIP-B again, to calculate the similarity between each frame and both its neighboring frame and the first frame, as a measure of temporal consistency; and dynamic degree, as defined in VBench.
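
For reference, the Face-Sim style measure reduces to an average cosine similarity between a reference embedding and per-frame embeddings; a minimal sketch follows (the embedding calls themselves are assumed to come from an ArcFace/InsightFace model, as above).

```python
# Minimal sketch of the identity-consistency metric: mean cosine similarity
# between a reference face embedding and the embedding of each generated frame.
import numpy as np

def identity_consistency(ref_embedding: np.ndarray, frame_embeddings) -> float:
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    sims = [float(ref @ (f / np.linalg.norm(f))) for f in frame_embeddings]
    return float(np.mean(sims))
```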

As seen earlier, the closed-source baseline competitors were Hailuo; Vidu 2.0; Kling (1.6); and Pika. The competing FOSS frameworks were VACE and SkyReels-A2.

Performance evaluation across leading video-customization methods compared to HunyuanCustom, covering ID consistency (Face-Sim), subject similarity (DINO-Sim), text-video alignment (CLIP-B-T), temporal consistency (Temp-Consis), and motion intensity (DD). Best and second-best results are shown in bold and underlined, respectively.

Of these results, the authors state:

‘Our [HunyuanCustom] achieves the best ID consistency and subject consistency. It also achieves comparable results in prompt-following and temporal consistency. [Hailuo] achieves the best CLIP score because it can follow text instructions well with only ID consistency, at the expense of the consistency of non-human subjects (the worst DINO-Sim). In terms of dynamic degree, [Vidu] and [VACE] perform poorly, which may be due to the small size of the model.’

Though the project site is saturated with comparison videos (whose layout seems designed for website aesthetics rather than easy comparison), it does not currently feature a video equivalent of the static results crammed into the PDF for the initial qualitative tests. Though I include those results here, I encourage readers to take a closer look at the videos on the project site.

A comparison of object-centered video customization, from the paper. Viewers are (as ever) referred to the source PDF for better resolution, though in this case the videos at the project site may be the more illuminating resource.

Here the authors comment:

‘[Vidu], [SkyReels A2] and our method achieve relatively good results in prompt alignment and subject consistency, but our video quality is better than that of Vidu and SkyReels, thanks to the good video-generation performance of our base model, i.e., [HunyuanVideo-13B].

‘Among commercial products, although [Kling] has good video quality, the first frame of the video suffers from a copy-paste [problem], and sometimes the subject moves too fast and [blurs], leading to a poor viewing experience.’

The authors further comment that Pika performs poorly in terms of temporal consistency, introducing subtitle artifacts (an effect of poor data curation, in which text elements from video clips have been allowed to pollute the core concepts).

Hailuo, they state, maintains facial identity but fails to preserve full-body consistency. Among the open-source methods, the researchers contend that VACE cannot maintain identity consistency, whereas HunyuanCustom, they claim, produces videos with strong identity preservation while maintaining quality and diversity.

Next, the system was tested on multi-subject video customization, against the same contenders. As in the previous example, the flattened PDF results are no print equivalent of the videos available at the project site, but they are unique among the results presented.

Comparisons for multi-subject video customization. Please see the PDF for better detail and resolution.

The paper states:

‘[Pika] can generate the specified subjects but exhibits instability in the video frames, with the man disappearing in one scenario, and the woman failing to open a door as prompted. [Vidu] and [VACE] partially capture human identity but lose significant details of non-human objects, indicating a limitation in representing non-human subjects.

‘[SkyReels A2] experiences severe frame instability, with noticeable changes in the chips and numerous artifacts in the right-hand scenario.

‘In contrast, our HunyuanCustom effectively captures both human and non-human subject identities, generates videos that adhere to the specified prompts, and maintains high visual quality and stability.’

A further experiment concerned 'virtual human advertisement', wherein the frameworks were tasked with integrating a product with a person.

Examples of neural 'product placement' from the qualitative testing round. Please see the PDF for better detail and resolution.

Of this round, the authors state:

‘[These results] demonstrate that HunyuanCustom effectively maintains the identity of the human while preserving the details of the target product, including the text on it.

‘In addition, the interaction between the human and the product appears natural, and the video adheres closely to the given prompt, highlighting the substantial potential of HunyuanCustom in generating advertisement videos.’

One area where video results would have been very helpful is the qualitative round for audio-driven subject customization, in which the characters speak the supplied audio within a scene and posture described in the text prompt.

Partial results given for the audio round; in this case, video results would have been preferable. Only the top half of the PDF figure is reproduced here, since it is large and difficult to accommodate in this article. Please refer to the source PDF for better detail and resolution.

The authors contend:

‘Previous audio-driven human animation methods take a human image and audio as input, where the human posture, attire, and environment remain consistent with the given image, and cannot generate videos in other gestures and environments, which may [limit their] application.

‘… [Our] HunyuanCustom enables audio-driven human customization, where the character speaks the corresponding audio in a scene and posture described by the text, allowing for more flexible and controllable audio-driven human animation.’

Further testing (please see the PDF for full details) included a round pitting the new system against VACE and Kling 1.6 for video subject replacement.

Tests for video-to-video subject replacement. Please refer to the source PDF for better detail and resolution.

Of this, the last test presented in the new paper, the researchers comment:

‘VACE suffers from boundary artifacts due to its strict adherence to the input masks, resulting in unnatural subject shapes and disrupted continuity of motion. [Kling], by contrast, exhibits a copy-paste effect, in which subjects appear to be overlaid directly onto the video, resulting in poor integration with the background.

‘In comparison, HunyuanCustom effectively avoids boundary artifacts, achieves seamless integration with the video background, and maintains strong identity preservation, delivering superior performance in video-editing tasks.’

Conclusion

This is a fascinating release, not least because it addresses something the ever-restless hobbyist scene has been complaining about more of late: the lack of lip-sync, without which the increased realism achievable in systems such as Hunyuan Video and Wan 2.1 lacks a dimension of authenticity.

Though the layout of nearly all the comparison video examples on the project site makes it rather difficult to weigh HunyuanCustom's capabilities against those of prior contenders, it should be noted that very few projects in the video-synthesis space have the courage to pit themselves in tests against Kling; Tencent appears to have made rather impressive headway against this incumbent.

* The problem being that some of the videos are so wide, short, and high-resolution that they will not play in standard video players such as VLC or Windows Media Player, showing only a black screen.

First released on Thursday, May 8, 2025
