If you want to insert yourself into popular image or video generation tools, but you are not famous enough for the foundation model to recognize you, you will need to train a Low-Rank Adaptation (LoRA) model on your own collection of photos. Once created, this personalized LoRA model allows the generative model to include your identity in future output.
This is commonly referred to as customization, a strand of image and video synthesis research that emerged a few months after the advent of Stable Diffusion in the summer of 2022, when Google Research's DreamBooth project offered a multi-gigabyte customization approach in a closed-source schema that was soon adapted by enthusiasts and released to the community.
LoRA models followed quickly, offering easier training and far smaller file sizes, and soon came to dominate the customization scene for Stable Diffusion and its successors, at little or no cost in quality.
Rinse and repeat
The problem, as mentioned, is that every time a new base model comes out, a new generation of LoRAs must be trained against it, which represents considerable friction for LoRA producers.
Zero-shot customization has therefore become a strong strand in the recent literature. In this scenario, instead of curating a dataset and training your own sub-model, you simply supply one or more photos of the subject at generation time, and the system interprets those inputs into a blended output.
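As a concrete illustration of this adapter-style workflow (not the new paper's method), below is a rough sketch of zero-shot identity injection using the publicly released IP-Adapter weights for SDXL via the Diffusers library; the reference image path is a placeholder:

```python
# Hedged sketch: zero-shot identity injection via a public IP-Adapter for SDXL.
# No per-subject training is performed; the reference photo is supplied at generation time.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers the result

reference = load_image("my_face.jpg")  # placeholder path to the subject photo
image = pipe("a portrait in the style of an oil painting",
             ip_adapter_image=reference).images[0]
image.save("zero_shot_portrait.png")
```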
Below we see that, besides face-swapping, a system of this kind (here using PuLID) can also incorporate identity values into style transfer:
Examples of facial ID transfer using the PuLID system. Source: https://github.com/tothebeginning/pulid?tab=readme-ov-file
Replacing a labor-intensive and fragile system like LoRA with a generic adapter is a great (and popular) idea, but it is also a challenging one: the extreme attention to detail and coverage obtained in the LoRA training process is very difficult to imitate in a one-shot, IP-Adapter-style model, which has to match LoRA's level of detail and flexibility without the prior advantage of analyzing a comprehensive set of identity images.
HyperLoRA
With this in mind, an interesting new paper from ByteDance proposes a system for generating actual LoRA weights on the fly, which is currently unique among zero-shot solutions.
Left, the input image; to the right, a flexible range of outputs based on the source image, effectively producing deepfakes of actors Anthony Hopkins and Anne Hathaway. Source: https://arxiv.org/pdf/2503.16944
The paper states:
‘Adapter-based techniques such as IP-Adapter freeze the parameters of the underlying model and allow zero-shot inference through a plug-in architecture, but they often exhibit a lack of naturalness and authenticity that cannot be overlooked in portrait synthesis tasks.
‘[We] introduce a parameter-efficient adaptive generation method, namely HyperLoRA, which uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme.
‘Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.’
Most usefully, the system as trained can be used with existing ControlNets, allowing for a high level of specificity in generation:
Timothée Chalamet makes an unexpectedly cheerful appearance in The Shining (1980), based on three input photos supplied to HyperLoRA, with a ControlNet mask defining the output (in coordination with the text prompt).
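HyperLoRA's own pipeline is not public, but the general pattern of pairing a personalized LoRA with a ControlNet can be sketched with Diffusers; the LoRA file and conditioning image below are hypothetical placeholders, and this is a generic illustration rather than the paper's workflow:

```python
# Hedged sketch: a personalised LoRA combined with a depth ControlNet in Diffusers,
# so the conditioning image fixes the layout while the LoRA supplies the identity.
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("my_identity_lora.safetensors")  # hypothetical personal LoRA

depth_map = load_image("corridor_depth.png")  # hypothetical conditioning image
image = pipe("a man standing in a hotel corridor, 1980s film still",
             image=depth_map, controlnet_conditioning_scale=0.8).images[0]
```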
As to whether the new system will be made available to end users, ByteDance has a reasonable record in this regard, having released the very powerful LatentSync lip-sync framework, and having only just released the InfiniteYou framework.
Negatively, the paper gives no indication of an intent to release, and the training resources needed to recreate the work are so exorbitant that it would be difficult for the enthusiast community to reproduce it (as it did with DreamBooth).
The new paper is titled HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis, and comes from seven researchers across ByteDance and ByteDance's dedicated Intelligent Creation department.
Method
The new method uses the Stable Diffusion XL (SDXL) latent diffusion model (LDM) as its foundation, though the principles appear applicable to diffusion models in general (the training demands – see below – may make it difficult to apply to generative video models, however).
HyperLoRA's training process is divided into three stages, each designed to isolate and preserve specific information in the learned weights. The purpose of this ring-fencing procedure is to prevent identity-relevant features from being contaminated by irrelevant elements such as clothing and background.
The HyperLoRA conceptual schema. The model is split into a 'Hyper ID-LoRA' for identity features and a 'Hyper Base-LoRA' for background and clothing, a separation that reduces feature leakage. During training, the SDXL base and encoders are frozen and only the HyperLoRA modules are updated; at inference, only the ID-LoRA is needed to generate personalized images.
The first stage focuses entirely on learning the 'Base-LoRA' (lower left in the schema image above), which captures identity-irrelevant details.
To enforce this separation, the researchers deliberately blurred the faces in the training images, allowing the model to latch onto things such as background, lighting, and pose, but not identity. This 'warm-up' stage acts as a filter, removing low-level distractions before identity-specific learning begins.
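The coverage here does not specify how the blurring was implemented, but the idea can be sketched in a few lines of Python; the Haar-cascade detector and kernel size below are illustrative assumptions rather than the authors' choices:

```python
# Hedged sketch of the stage-one 'warm-up' idea: blur faces so the Base-LoRA can only
# learn background, lighting and pose, not identity.
import cv2

img = cv2.imread("train_image.jpg")  # placeholder path to a training image
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = detector.detectMultiScale(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 1.1, 5)

for (x, y, w, h) in faces:
    roi = img[y:y + h, x:x + w]
    img[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)  # destroy identity cues

cv2.imwrite("train_image_blurred.jpg", img)
```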
In the second stage, the 'ID-LoRA' (upper left in the schema image above) is introduced. Here facial identity is encoded via two parallel paths: a CLIP Vision Transformer (ViT) for structural features, and the InsightFace AntelopeV2 encoder for more abstract identity representations.
A Transitional Approach
The CLIP features help the model converge quickly but risk overfitting, while the AntelopeV2 embeddings are more stable, if slower to train. The system therefore begins by relying heavily on CLIP and gradually phases in the AntelopeV2 embeddings, to avoid instability.
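The exact schedule is not reproduced here, but shifting reliance from one encoder to the other can be expressed as a simple ramp; the linear form and starting weight below are assumptions for illustration only:

```python
# Hedged sketch: a weighting schedule that starts CLIP-heavy and hands over to the
# AntelopeV2 ID embedding as training progresses.
def encoder_weights(step: int, total_steps: int, clip_start: float = 0.9):
    """Return (clip_weight, id_weight), which always sum to 1.0."""
    progress = min(step / total_steps, 1.0)
    clip_w = clip_start * (1.0 - progress)   # CLIP influence decays toward zero
    return clip_w, 1.0 - clip_w              # ID embedding gradually takes over

# e.g. step 0 of 10,000 -> (0.9, 0.1); the final step -> (0.0, 1.0)
```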
In the final stage, the CLIP-guided attention layers are frozen entirely. Only the attention modules tied to the AntelopeV2 embeddings continue training, allowing the model to refine identity preservation without degrading the fidelity or generality of previously learned components.
This phased structure is, in effect, an attempt at disentanglement: identity and non-identity features are first separated and then refined independently. It is a methodical response to the usual failure modes of personalization: identity drift, limited editability, and overfitting to incidental features.
While you’re weighed
Once the CLIP ViT and AntelopeV2 encoders have extracted both structural and identity-specific features from a given portrait, the resulting features are passed through a Perceiver Resampler (derived from the IP-Adapter project mentioned above) – a Transformer-based module that maps the features to a compact set of coefficients.
Two separate resamplers are used: one to generate the Base-LoRA weights (encoding background and non-identity elements), and another to generate the ID-LoRA weights (focused on facial identity).
Schema for the HyperLoRA network.
The output coefficients are then linearly combined with a set of learned LoRA basis matrices to produce complete LoRA weights, with no need to fine-tune the base model.
This approach allows the system to generate personalized weights entirely on the fly, using nothing more than image encoders and a lightweight projection, while exploiting LoRA's ability to modify the behavior of the base model directly.
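In code terms, this amounts to a per-image coefficient vector selecting a mixture from a bank of learned low-rank matrices. The sketch below is one illustrative reading of that mechanism; the shapes, names and softmax normalization are assumptions, not the paper's implementation:

```python
# Hedged sketch: turning resampler coefficients into concrete LoRA (A, B) matrices
# by linearly combining a learned basis, then forming the low-rank weight update.
import torch

num_basis, rank, d_in, d_out = 32, 8, 1280, 1280

# Stand-ins for parameters learned once during HyperLoRA training.
basis_A = torch.randn(num_basis, rank, d_in)
basis_B = torch.randn(num_basis, d_out, rank)

def make_lora(coeffs: torch.Tensor):
    """coeffs: (num_basis,) vector produced per image by the resampler."""
    A = torch.einsum("k,kri->ri", coeffs, basis_A)   # (rank, d_in)
    B = torch.einsum("k,kor->or", coeffs, basis_B)   # (d_out, rank)
    return A, B

coeffs = torch.softmax(torch.randn(num_basis), dim=0)  # placeholder resampler output
A, B = make_lora(coeffs)
delta_W = B @ A   # low-rank update added to the frozen base weight at inference
```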
Data and Testing
To train HyperLoRA, the researchers used a subset of 4.4 million face images drawn from the LAION dataset, best known as the data source for the original 2022 Stable Diffusion models.
InsightFace was used to exclude non-portrait faces and images containing multiple faces, after which the images were annotated with the BLIP-2 captioning system.
In terms of data augmentation, the images were randomly cropped around the face, but always remained centered on the facial region.
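A rough approximation of this preprocessing, using the public InsightFace and BLIP-2 packages, might look as follows; the single-face filter, model checkpoints and function name are assumptions, not the authors' code:

```python
# Hedged sketch: keep only single-face portraits (InsightFace) and caption the
# survivors with BLIP-2, approximating the data preparation described in the paper.
import cv2
from PIL import Image
from insightface.app import FaceAnalysis
from transformers import Blip2Processor, Blip2ForConditionalGeneration

detector = FaceAnalysis(name="antelopev2")   # requires the AntelopeV2 model pack
detector.prepare(ctx_id=0)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def keep_and_caption(path: str):
    """Return a caption for single-face portraits, or None to discard the image."""
    faces = detector.get(cv2.imread(path))
    if len(faces) != 1:                      # drop images with zero or multiple faces
        return None
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)
```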
The LoRA ranks also had to fit within the memory available in the training setup: the ID-LoRA rank was set to 8 and the Base-LoRA rank to 4, and eight-step gradient accumulation was used to simulate a larger batch size than the hardware could actually accommodate.
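Gradient accumulation itself is a standard trick; a minimal, self-contained sketch of the eight-step version described (with a dummy model standing in for the actual training modules) looks like this:

```python
# Hedged sketch: 8-step gradient accumulation, simulating an 8x larger effective batch.
import torch
from torch import nn

model = nn.Linear(16, 1)                          # placeholder for the trainable modules
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]  # dummy batches

accum_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so grads average
    loss.backward()                                           # accumulate gradients
    if (i + 1) % accum_steps == 0:
        optimizer.step()                                      # one update per 8 batches
        optimizer.zero_grad()
```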
The researchers trained the Base-LoRA, ID-LoRA (CLIP), and ID-LoRA (ID embedding) modules sequentially, for 20K, 15K, and 55K iterations respectively. During ID-LoRA training, three conditioning scenarios were sampled with probabilities of 0.9, 0.05, and 0.05.
The system was implemented in PyTorch and Diffusers, and the complete training process ran for around ten days on 16 NVIDIA A100 GPUs*.
ComfyUI Tests
The authors built workflows on the ComfyUI synthesis platform and compared HyperLoRA to three rival methods: the aforementioned IP-Adapter, in the form of the IP-Adapter-FaceID-Portrait framework; InstantID; and the aforementioned PuLID. Consistent seeds, prompts, and sampling methods were used across all frameworks.
The authors note that adapter-based (as opposed to LoRA-based) methods generally require lower Classifier-Free Guidance (CFG) scales, whereas LoRA (including HyperLoRA) is more tolerant in this regard.
So, for a fair comparison, the researchers used the open-source SDXL fine-tuned checkpoint LEOSAM's HelloWorld throughout the tests. The Unsplash-50 image dataset was used for quantitative testing.
Metrics
To benchmark fidelity, the authors measured facial similarity using cosine distance between CLIP image embeddings (CLIP-I), and separately between identity embeddings (ID Sim) extracted via CurricularFace, a model not used during training.
Each method produced four high-resolution headshots per identity in the test set, and the results were averaged.
Editability was assessed in two ways: by comparing CLIP-I scores between outputs generated with and without the identity modules (to see how much the identity constraints altered the image); and by measuring CLIP image-text alignment (CLIP-T) across ten prompt variations covering hairstyles, accessories, clothing, and backgrounds.
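For readers who want a feel for these scores, a hedged sketch of CLIP-I and CLIP-T with the Transformers CLIP implementation follows; the checkpoint choice is an assumption, and the CurricularFace-based ID-similarity metric is omitted:

```python
# Hedged sketch: CLIP-I (image-image cosine similarity) and CLIP-T (image-text
# alignment) using an off-the-shelf CLIP checkpoint as a stand-in.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_i(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between two image embeddings."""
    feats = model.get_image_features(
        **processor(images=[img_a, img_b], return_tensors="pt"))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

def clip_t(img: Image.Image, prompt: str) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    i = model.get_image_features(**processor(images=img, return_tensors="pt"))
    t = model.get_text_features(
        **processor(text=prompt, return_tensors="pt", padding=True))
    i = i / i.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    return float((i @ t.T).squeeze())
```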
The authors also included the Arc2Face foundation model in the comparison, as a baseline trained on fixed captions and cropped facial regions.
For HyperLoRA, two variants were tested: one using only the ID-LoRA module, and one using both the ID-LoRA and the Base-LoRA, with the latter weighted at 0.4. Adding the Base-LoRA improved fidelity, but slightly constrained editability.
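HyperLoRA itself has not been released, but the effect of down-weighting a secondary LoRA can be sketched with Diffusers' multi-adapter API; the adapter files and names below are hypothetical placeholders:

```python
# Hedged sketch: combining an identity LoRA at full strength with a secondary
# 'base' LoRA down-weighted to 0.4, mirroring the weighting described in the paper.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("id_lora.safetensors", adapter_name="id")      # hypothetical file
pipe.load_lora_weights("base_lora.safetensors", adapter_name="base")  # hypothetical file
pipe.set_adapters(["id", "base"], adapter_weights=[1.0, 0.4])

image = pipe("a studio portrait, soft lighting").images[0]
```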
Results of the first quantitative comparison.
Of the quantitative tests, the authors comment:
‘Base-LoRA helps to improve fidelity but limits editability. Although our design decouples image features into different LoRAs, it is hard to avoid them leaking into each other. Thus, the weight of the Base-LoRA can be adjusted to suit different application scenarios.
‘Our HyperLoRA (full and ID versions) achieves the best and second-best face fidelity respectively, while InstantID shows an advantage in face ID similarity, but with lower face fidelity.
‘Face ID similarity is more abstract, while face fidelity reflects more detail, so both metrics need to be considered together when assessing fidelity.’
The qualitative tests bring to the fore the various trade-offs involved in the essential proposition (note that there is no space here to reproduce all of the qualitative-result images; please refer to the source paper for further examples).
Qualitative comparison. From top to bottom, the prompts used were “white shirts” and “wolf ears” (see paper for additional examples).
Here the authors comment:
‘The skin of portraits generated by IP-Adapter and InstantID exhibits obviously generated textures.
‘This is a common flaw of adapter-based methods. PuLID improves on this problem by weakening its intrusion into the base model, outperforming IP-Adapter and InstantID, but it still suffers from blurriness and a lack of detail.
‘In contrast, LoRA directly modifies the weights of the base model rather than introducing extra attention modules, and usually generates highly detailed and photorealistic images.’
The authors contend that because HyperLoRA modifies the weights of the base model directly, rather than relying on external attention modules, it retains the nonlinear capacity of traditional LoRA-based approaches, conferring an advantage in fidelity and allowing better capture of subtle details such as pupil color.
In the qualitative comparisons, the paper argues that HyperLoRA's layouts were more coherent and better aligned with the prompts, similar to those produced by PuLID, and notably stronger than those of InstantID or IP-Adapter, which at times failed to follow the prompt or produced unnatural compositions.
A further example of ControlNet-guided generation using HyperLoRA.
Conclusion
The constant stream of one-shot customization systems over the past 18 months has, by now, taken on a quality of desperation. Very few of the offerings have made a notable advance on the state of the art, and those that have nudged it forward tend to come with exorbitant training demands and/or extremely complex or resource-intensive inference requirements.
HyperLoRA's own training regime is as exorbitant as that of many similar recent entries, but at least it ends with a model capable of handling ad hoc customization out of the box.
It should be noted, from the paper's supplementary material, that HyperLoRA's inference speed is better than IP-Adapter's but worse than that of the other two prior methods, and that these figures are based on an NVIDIA V100 GPU, which is not typical consumer hardware (though the newer 'domestic' NVIDIA GPUs can match or exceed the V100's maximum 32GB of VRAM).
Inference speeds of competing methods, in milliseconds.
It is fair to say that, from a practical standpoint, zero-shot customization remains an open problem, since HyperLoRA's steep hardware requirements are arguably at odds with the difficulty of settling on any single foundation model likely to remain current over the long term.
* Representing 640GB or 1280GB of VRAM, depending on which A100 model was used (this is not specified)
First published on Monday, March 24, 2025