A new collaboration between the University of California Merced and Adobe offers an advance on the state of the art in human image completion – the much-studied task of 'de-obscuring' the occluded or hidden parts of images of people, for purposes such as virtual try-on, animation, and photo editing, among others.
Besides repairing corrupted images, or altering them at a user's whim, human image completion systems such as CompleteMe can impose novel clothing (via an ancillary reference image, as in the middle column of these two examples). These examples are taken from the extensive supplementary PDF for the new paper. Source: https://liagm.github.io/completeme/pdf/supp.pdf
The new approach, titled CompleteMe: Reference-based Human Image Completion, uses supplementary input images to 'suggest' to the system what content should replace the hidden or missing sections of a human depiction (hence its applicability to fashion-based try-on frameworks):
The CompleteMe system can conform reference content to the obscured or occluded parts of a human image.
The new system uses a dual U-Net architecture and a Region-Focused Attention (RFA) block that marshals resources to the pertinent area of each image-restoration instance.
The researchers also offer a new and challenging benchmark designed to assess reference-based completion tasks (since the advance belongs to an existing and ongoing research strand in computer vision, albeit one that has until now lacked a benchmark schema of this kind).
In tests, which included a user study, the new method came out ahead on most metrics, and ahead overall. In certain cases, rival methods were completely foxed by the reference-based approach:
From the supplementary material: the AnyDoor method has particular difficulty deciding how to interpret a reference image.
The paper states:
“Extensive experiments on our benchmark demonstrate that CompleteMe outperforms state-of-the-art methods, both reference-based and non-reference-based, in terms of quantitative metrics, qualitative results and user studies.
“Particularly in challenging scenarios involving complex poses, intricate clothing patterns, and distinctive accessories, our model consistently achieves superior visual fidelity and semantic coherence.”
Sadly, the project's GitHub presence contains no code, nor any promise of it; and the initiative, which also has a modest project page, appears to be framed as a proprietary architecture.
A further example of the new system's subjectively superior performance over prior methods. More details are given later in the article.
Method
The CompleteMe framework is underpinned by a Reference U-Net, which handles the integration of the ancillary material into the process, and a cohesive U-Net, which accommodates the wider range of processes needed to obtain the final result, as illustrated in the conceptual schema below:
The conceptual schema for CompleteMe. Source: https://arxiv.org/pdf/2504.20042
The system begins by encoding the masked input image into a latent representation. At the same time, the Reference U-Net processes multiple reference images, each depicting different body regions, to extract detailed spatial features.
These features pass through the Region-Focused Attention blocks embedded in the 'complete' U-Net, where they are selectively masked using corresponding region masks, ensuring that the model attends only to relevant areas of the reference images.
The masked features are then integrated with global CLIP-derived semantic features via decoupled cross-attention, allowing the model to reconstruct missing content with both fine detail and semantic coherence.
To enhance realism and robustness, the input masking process combines random grid-based occlusion with human body-shape masks, each applied with equal probability, increasing the complexity of the missing regions that the model must complete.
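A minimal sketch of such a hybrid masking scheme is shown below, assuming a NumPy implementation; the grid cell size, the cap of thirty occlusions, and the helper names are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

def random_grid_mask(h, w, cell=32, max_cells=30, rng=None):
    """Occlude between 1 and `max_cells` randomly placed square grid cells."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.float32)
    for _ in range(int(rng.integers(1, max_cells + 1))):
        y = int(rng.integers(0, max(1, h - cell)))
        x = int(rng.integers(0, max(1, w - cell)))
        mask[y:y + cell, x:x + cell] = 1.0          # 1 marks a region to be completed
    return mask

def make_training_mask(h, w, body_shape_mask, rng=None):
    """With equal probability, use random grid occlusion or a human body-shape mask."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        return random_grid_mask(h, w, rng=rng)
    # e.g. a silhouette produced by an off-the-shelf human parser
    return body_shape_mask.astype(np.float32)
```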
Reference Only
Prior reference-based image inpainting methods have typically relied on semantic-level encoders. Projects of this kind include CLIP itself, and DINOv2, both of which extract global features from reference images, but often lose the fine spatial details needed for accurate identity preservation.
From the release paper for the older DINOv2 approach, which is included in the comparison tests for the new study: the colored overlays show the first three principal components of a Principal Component Analysis (PCA) applied to the image patches within each column. Despite differences in pose, style or rendering, corresponding regions (wings, limbs, wheels, etc.) are consistently matched, indicating the model's capacity to learn part-based structure without supervision. Source: https://arxiv.org/pdf/2304.07193
CompleteMe addresses this through a specialized Reference U-Net initialized from Stable Diffusion 1.5, but operating without the diffusion noise step*.
Each reference image, covering a different body region, is encoded into detailed latent features through this U-Net. Global semantic features are also extracted separately, using CLIP, and both sets of features are cached for efficient use during attention-based integration. The system can thus accommodate multiple reference inputs flexibly, while preserving fine-grained appearance information.
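In practical terms, this amounts to a single noise-free forward pass of the Reference U-Net per reference image, plus a CLIP image-encoder pass, with both outputs cached for later attention. The sketch below assumes off-the-shelf Hugging Face Diffusers and Transformers components, and uses forward hooks to stand in for however the authors actually harvest the spatial features; it is not the paper's code:

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda"
# Hypothetical stand-ins for the paper's Reference U-Net and CLIP encoder.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae").to(device)
ref_unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet").to(device)
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").to(device)

feature_cache = {}  # spatial features captured from the reference U-Net's self-attention layers

def _capture(name):
    def hook(module, args, output):
        feature_cache.setdefault(name, []).append(output.detach())
    return hook

for name, module in ref_unet.named_modules():
    if name.endswith("attn1"):      # self-attention inside each transformer block
        module.register_forward_hook(_capture(name))

@torch.no_grad()
def encode_reference(ref_image_tensor, ref_image_pil):
    """One noise-free pass: cache spatial features and a global CLIP embedding."""
    feature_cache.clear()
    latents = vae.encode(ref_image_tensor.to(device)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    dummy_text = torch.zeros(1, 77, 768, device=device)        # placeholder conditioning
    ref_unet(latents, timestep=0, encoder_hidden_states=dummy_text)   # hooks fill feature_cache
    clip_in = clip_proc(images=ref_image_pil, return_tensors="pt").pixel_values.to(device)
    global_emb = clip_vision(clip_in).image_embeds
    return dict(feature_cache), global_emb
```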
Orchestration
The cohesive U-Net manages the final stages of the completion process. Adapted from the inpainting variant of Stable Diffusion 1.5, it takes the masked source image in latent form as input, together with the detailed spatial features drawn from the reference images, and the global semantic features extracted by the CLIP encoder.
These varied inputs are brought together through the RFA blocks, which play a critical role in focusing the model's attention on the most relevant regions of the reference material.
Before entering the attention mechanism, the reference features are explicitly masked to remove unrelated regions, and then concatenated with the latent representation of the source image, ensuring that attention is directed as accurately as possible.
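The paper releases no code for the RFA block, but the two operations described – masking the reference features and concatenating them with the source latents before attention – can be roughed out as below; the tensor shapes, the use of standard multi-head attention, and the module structure are assumptions:

```python
import torch
import torch.nn as nn

class RegionFocusedAttentionSketch(nn.Module):
    """Rough sketch of a region-focused attention step, as described in the article."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, source_tokens, ref_tokens, region_mask):
        # region_mask: (B, N_ref), 1 where a reference token belongs to the relevant body region.
        ref_tokens = ref_tokens * region_mask.unsqueeze(-1)        # explicitly zero out unrelated regions
        tokens = torch.cat([source_tokens, ref_tokens], dim=1)     # concatenate with source latents
        out, _ = self.attn(query=source_tokens, key=tokens, value=tokens)
        return out

# usage sketch with arbitrary dimensions
B, N_src, N_ref, D = 2, 1024, 1024, 320
block = RegionFocusedAttentionSketch(D)
out = block(torch.randn(B, N_src, D), torch.randn(B, N_ref, D), torch.ones(B, N_ref))
print(out.shape)   # torch.Size([2, 1024, 320])
```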
To enhance this integration, CompleteMe incorporates a decoupled cross-attention mechanism adapted from the IP-Adapter framework:
Partially incorporated into CompleteMe, IP-Adapter is one of the most successful and frequently-leveraged projects to emerge from the turbulent last three years of development in latent diffusion model architectures. Source: https://ip-adapter.github.io/
This allows the model to process spatially detailed visual features and broader semantic context through separate attention streams, which are later combined, resulting in a coherent reconstruction that, the authors contend, preserves both identity and fine detail.
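IP-Adapter's decoupled cross-attention runs a second attention operation over image features and sums the result with the text-conditioned stream; a generic, minimal rendering of that idea (not CompleteMe's actual implementation) might look like this:

```python
import torch
import torch.nn as nn

class DecoupledCrossAttentionSketch(nn.Module):
    """IP-Adapter-style decoupled cross-attention: separate streams for text and image context."""
    def __init__(self, dim, heads=8, scale=1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = scale                                  # weighting of the image stream

    def forward(self, hidden_states, text_context, image_context):
        text_out, _ = self.text_attn(hidden_states, text_context, text_context)
        image_out, _ = self.image_attn(hidden_states, image_context, image_context)
        # The two results are summed, keeping semantic (text) and visual (image) guidance separate.
        return text_out + self.scale * image_out

# usage sketch: U-Net tokens attend to 77 text tokens and 4 image tokens of the same width
block = DecoupledCrossAttentionSketch(dim=768)
out = block(torch.randn(1, 1024, 768), torch.randn(1, 77, 768), torch.randn(1, 4, 768))
print(out.shape)   # torch.Size([1, 1024, 768])
```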
Benchmark
In the absence of an apposite dataset for reference-based human completion, the researchers proposed their own. The (unnamed) benchmark was constructed by curating select image pairs from the WPose dataset devised for Adobe Research's 2023 UniHuman project.
Examples of poses from Adobe Research's 2023 UniHuman project. Source: https://github.com/adobe-research/unihuman?tab=readme-ov-file#data-prep
The researchers manually drew source masks to indicate the inpainting regions, eventually obtaining 417 triplet groups, each comprising a source image, mask, and reference image.
Two examples of the triplet groups, initially derived from the reference WPose dataset and extensively curated by the researchers of the new paper.
The authors used the LLaVA Large Language Model (LLM) to generate text prompts describing the source images.
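Captioning of this kind can be reproduced with off-the-shelf tooling; the sketch below uses the Hugging Face Transformers port of LLaVA 1.5, with an assumed instruction rather than the authors' actual prompt:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"            # community port of LLaVA 1.5
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("source_image.jpg")           # hypothetical source image path
# Assumed instruction; the paper does not publish its exact prompt.
prompt = "USER: <image>\nDescribe the person's clothing and appearance in one sentence. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```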
The metrics used were broader than usual. Besides the customary Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS, in this case used to assess the masked regions), the researchers used DINO for similarity scores; DreamSim for evaluation of the generated result; and CLIP.
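Most of these metrics can be computed with common open-source packages; the snippet below, covering PSNR, SSIM, LPIPS and a CLIP image-to-image similarity, is a general illustration rather than the authors' evaluation code (DINO and DreamSim similarities follow the same embed-and-compare pattern, with their respective backbones):

```python
import torch
import lpips
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# pred / target: (N, 3, H, W) tensors in [0, 1]; masks would be applied beforehand for masked LPIPS.
psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips_fn = lpips.LPIPS(net="alex")               # expects inputs scaled to [-1, 1]

def low_level_metrics(pred, target):
    return {
        "PSNR": psnr(pred, target).item(),
        "SSIM": ssim(pred, target).item(),
        "LPIPS": lpips_fn(pred * 2 - 1, target * 2 - 1).mean().item(),
    }

# CLIP-I style score: cosine similarity between CLIP image embeddings of output and reference.
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_image_similarity(pil_output, pil_reference):
    pixels = clip_proc(images=[pil_output, pil_reference], return_tensors="pt").pixel_values
    feats = clip_vision(pixels).image_embeds
    return torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
```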
Data and Testing
To test the work, the authors used both the default Stable Diffusion V1.5 model and the Stable Diffusion 1.5 inpainting model. The system's image encoder used the CLIP Vision model, together with projection layers – modest neural networks that reshape or align the CLIP output to the internal feature dimensions used by the model.
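A projection layer of this kind is typically just a linear map from the CLIP embedding to a handful of tokens at the U-Net's context width, in the manner of IP-Adapter's image projection; the sketch below is an assumed, generic version rather than CompleteMe's own module:

```python
import torch
import torch.nn as nn

class ClipProjectionSketch(nn.Module):
    """Illustrative projection: maps a CLIP image embedding to N tokens at the U-Net's context width."""
    def __init__(self, clip_dim=1024, context_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, context_dim * num_tokens)
        self.norm = nn.LayerNorm(context_dim)
        self.num_tokens, self.context_dim = num_tokens, context_dim

    def forward(self, image_embeds):                       # (B, clip_dim)
        x = self.proj(image_embeds)
        x = x.reshape(-1, self.num_tokens, self.context_dim)
        return self.norm(x)                                # (B, num_tokens, context_dim)

print(ClipProjectionSketch()(torch.randn(2, 1024)).shape)  # torch.Size([2, 4, 768])
```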
Training took place for 30,000 iterations across eight NVIDIA A100 GPUs†, supervised by Mean Squared Error (MSE) loss, with a batch size of 64 and a learning rate of 2×10⁻⁵. Various elements were randomly dropped throughout training, to prevent the system from overfitting on the data.
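A highly simplified view of that regime, with a toy stand-in model, synthetic data, and an assumed dropout probability for the conditioning inputs, might look like the loop below; it is illustrative only, not the paper's training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in model; the real system is the dual U-Net architecture described above.
class ToyCompletionModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim * 3, dim)
    def forward(self, masked_latents, ref_feats, clip_emb):
        return self.net(torch.cat([masked_latents, ref_feats, clip_emb], dim=-1))

dim, max_steps, batch_size, p_drop = 16, 100, 64, 0.1     # p_drop is an assumed value
data = TensorDataset(torch.randn(6400, dim), torch.randn(6400, dim),
                     torch.randn(6400, dim), torch.randn(6400, dim))
loader = DataLoader(data, batch_size=batch_size, shuffle=True)

model = ToyCompletionModel(dim)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate as reported

model.train()
step = 0
for masked_latents, ref_feats, clip_emb, target in loader:
    if step >= max_steps:                                  # the paper trains for 30,000 iterations
        break
    if torch.rand(1).item() < p_drop:                      # randomly drop conditioning to curb overfitting
        ref_feats, clip_emb = torch.zeros_like(ref_feats), torch.zeros_like(clip_emb)
    pred = model(masked_latents, ref_feats, clip_emb)
    loss = F.mse_loss(pred, target)                        # MSE supervision, as reported
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    step += 1
```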
The dataset was adapted from the Parts to Whole dataset, which is itself based on the DeepFashion-MultiModal dataset.
An example from the Parts to Whole dataset, used in the development of the curated data for CompleteMe. Source: https://huanngzh.github.io/parts2whole/
The authors state:
‘To meet our requirements, we [reconstructed] the training pairs using occluded images with multiple reference images that capture various aspects of human appearance, along with short text labels.
“Each training sample includes six appearance categories: upper-body clothing, lower-body clothing, whole-body clothing, hair or headwear, face, and shoes. For the masking strategy, we apply 50% random grid masking between 1 and 30 times, while for the other 50% we use a human body shape mask to increase masking complexity.
“After this construction pipeline, we obtained 40,000 image pairs for training.”
The rival prior non-reference methods tested were Large Occluded Human image Completion (LOHC) and the plug-and-play image inpainting model BrushNet; the reference-based models tested were Paint-by-Example; AnyDoor; LeftRefill; and MimicBrush.
The authors began with a quantitative comparison using the previously described metrics:
Results of the initial quantitative comparison.
Regarding the quantitative evaluation, the authors note that CompleteMe achieves the highest scores on most perceptual metrics, including CLIP-I, DINO, DreamSim, and LPIPS, which aim to capture semantic alignment and appearance fidelity between the output and the reference image.
However, the model does not outperform all baselines across the board: notably, BrushNet scores best on CLIP-T, while LeftRefill leads on SSIM and PSNR, and is slightly ahead on CLIP-I.
CompleteMe shows consistently strong results overall, but the performance differences are sometimes modest, with certain metrics still led by competing prior methods. Perhaps not unfairly, the authors frame these results as evidence of balanced strength across both structural and perceptual dimensions.
The illustrations of the qualitative tests conducted for the study are far too numerous to reproduce here, and we refer the reader not only to the source paper, but also to the extensive supplementary PDF, which contains many additional qualitative examples.
We highlight the main qualitative examples presented in the main paper, together with a selection of additional cases drawn from the supplementary image pool mentioned earlier in this article:
Initial qualitative results presented in the main paper. Please refer to the source paper for better resolution.
Of the qualitative results above, the authors comment:
‘Given masked inputs, these non-reference methods generate plausible content for the masked regions using image priors or text prompts.
“However, as shown in the red box, they cannot reproduce certain details such as tattoos or unique clothing patterns, as they do not have a reference image to guide the reconstruction of the same information.”
The second comparison, shown below, focuses on the four reference-based methods Paint-by-Example, AnyDoor, LeftRefill, and MimicBrush. Here, only one reference image and a text prompt were provided.
Qualitative comparison with the reference-based methods. CompleteMe generates more realistic completions and better preserves specific details from the reference image. The red boxes highlight areas of particular interest.
The authors state:
‘Given masked human images and reference images, other methods can generate plausible content, but in many cases they fail to accurately preserve contextual information from the reference.
“In some cases, they generate irrelevant content or incorrectly map corresponding parts from the reference image. In contrast, our method effectively completes the masked region by accurately preserving identical information and correctly mapping the corresponding parts of the human body from the reference image.”
To assess how well the models align with human perception, the authors conducted a user study involving 15 annotators and 2,895 sample pairs. Each pair compared a CompleteMe output against one of the four reference-based baselines: Paint-by-Example, AnyDoor, LeftRefill, or MimicBrush.
The annotators assessed each result both for the visual quality of the completed region, and for the extent to which it preserved identity features from the reference. Evaluated for overall quality and for identity, CompleteMe obtained the better results:
Results of the user study.
Conclusion
If anything, the qualitative results of this study risk being undermined by their sheer volume, a hazard of in-depth research in this relatively niche, but passionately-pursued area of neural image editing.
However, a little extra attention, and some zooming-in on the original PDF, makes clear how much better the system adapts the reference material to the occluded area than prior methods do, in nearly all cases.
Readers are strongly encouraged to examine carefully the initially confusing, if not overwhelming, avalanche of results presented in the supplementary material.
* It is interesting to note that the now badly-outmoded V1.5 release remains a researchers' favorite – partly for reasons of like-on-like legacy testing, but also because it is arguably the least censored of all the Stable Diffusion iterations, and does not share the censorious hobbling of the FOSS Flux releases.
† No VRAM specifications are given – this would be either 40GB or 80GB per card.
First released on Tuesday, April 29th, 2025