Adobe’s Firefly latent diffusion model (LDM) is undoubtedly one of the most capable systems of its kind available today, but Photoshop users who have tried out its generative capabilities will have noticed that it cannot really edit an existing image – instead, it completely replaces the user-selected area with imagery based on the user’s text prompt (though it is skilled at integrating the resulting section into the context of the image).
The current beta version of Photoshop does at least allow a reference image to be incorporated as a partial image prompt, bringing to Adobe’s flagship product a type of functionality that Stable Diffusion users have enjoyed for over two years thanks to third-party frameworks such as ControlNet.
The current beta of Adobe Photoshop allows reference images to be used when generating new content inside a selection, but at this point it is a hit-and-miss affair.
This illustrates an open problem in image synthesis research: getting diffusion models to edit existing images without performing a full-scale ‘reimagining’ of user-specified selections.
This diffusion-based inpainting follows the user’s prompt, but completely reinvents the source subject matter, taking no account of the original image except when blending the new generation into its surroundings. Source: https://arxiv.org/pdf/2502.20376
This kind of problem arises because LDMs generate images through iterative denoising, where each stage of the process is conditioned on the user-supplied text prompt. The prompt content is converted into embedding tokens, and with a large-scale model such as Stable Diffusion or Flux containing hundreds of thousands (or millions) of embeddings that roughly match the prompt, the process calculates a desired conditional distribution. Each denoising step is then a step towards this ‘conditional distribution target’.
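To make that concrete, here is a minimal sketch of how a prompt becomes the conditioning signal, using the CLIP text encoder that Stable Diffusion v1.x relies on. The prompt and the commented denoising call are illustrative assumptions, not details from the paper under discussion.

```python
# A minimal sketch: the text is tokenized and embedded once, and those embeddings
# steer every denoising step. The CLIP checkpoint below is the text encoder used
# by Stable Diffusion v1.x; the prompt is illustrative.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a tabby cat wearing a red bow tie"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state  # shape (1, 77, 768)

# Inside the diffusion loop, each noise prediction is conditioned on these embeddings,
# e.g. (hypothetical call):  eps = unet(x_t, t, encoder_hidden_states=prompt_embeds)
# so every denoising step nudges x_t towards the conditional distribution target.
```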
That is fine for text-to-image generation – a scenario in which the user is effectively ‘hoping for the best’, since there is no way to predict exactly what the generation will look like.
Instead, many people want to use an LDM’s powerful generative capabilities to edit existing images, and this requires a balancing act between fidelity and flexibility.
When an image is projected into the model’s latent space by methods such as DDIM inversion, the goal is to recover the original as closely as possible while still allowing meaningful edits. The problem is that the more accurately the image is reconstructed, the more tightly the model adheres to the original structure, making major modifications difficult.
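As an illustration of what inversion involves, the sketch below implements the deterministic DDIM update run backwards (image to noise) and forwards again (noise to image). The noise-prediction function `eps_model(x, t, cond)` and the cumulative schedule `alpha_bar` are placeholders standing in for a real diffusion backbone; this is a conceptual sketch, not the paper’s implementation.

```python
import torch

def ddim_invert(x0, cond, eps_model, alpha_bar, steps):
    """Run the DDIM update in reverse to estimate the starting noise for x0.
    `steps` is an increasing list of integer timesteps; `alpha_bar` is a 1-D tensor."""
    x = x0
    for i in range(len(steps) - 1):
        t, t_next = steps[i], steps[i + 1]           # t_next is the *noisier* level
        eps = eps_model(x, t, cond)                   # predicted noise at the current level
        x0_pred = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t_next].sqrt() * x0_pred + (1 - alpha_bar[t_next]).sqrt() * eps
    return x                                          # approximate x_T for this image

def ddim_reconstruct(xT, cond, eps_model, alpha_bar, steps):
    """Replay the same trajectory forwards; a faithful model/condition pair
    should land close to the original image."""
    x = xT
    for i in range(len(steps) - 1, 0, -1):
        t, t_prev = steps[i], steps[i - 1]
        eps = eps_model(x, t, cond)
        x0_pred = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t_prev].sqrt() * x0_pred + (1 - alpha_bar[t_prev]).sqrt() * eps
    return x
```

The tension described above shows up directly here: the more faithfully the inverted noise reproduces the original when replayed, the less room there is to steer the replay towards a different result.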
In common with many other diffusion-based image-editing frameworks proposed in recent years, the ReNoise architecture struggles to actually change the appearance of the image, with only the faintest suggestion of a bow tie appearing at the base of the cat’s throat.
On the other hand, if the process prioritizes editability, the model loosens its grip on the original and makes changes easier to introduce, but at the expense of overall consistency with the source image:
The mission is accomplished, but for most AI-based image-editing frameworks this amounts to transformation rather than tweaking.
Since this is a problem that even Adobe’s considerable resources are struggling to solve, it is reasonable to consider the challenge a notable one – and one that may not admit of simple solutions, if any.
Tight Inversion
Therefore the examples in a new paper released this week caught my attention. This work offers a notable improvement on the current state of the art in this area, demonstrating that subtle and refined edits can be applied to images projected into the model’s latent space without overwhelming the original content of the source image.
With Tight Inversion applied to an existing inversion method, the source selection is attended to in far greater detail, and the transformation accommodates the original material instead of overwriting it.
LDM enthusiasts and practitioners may recognize this type of result, since it can already be produced through complex workflows using external systems such as ControlNet and IP-Adapter.
In fact, the new method – dubbed Tight Inversion – does use IP-Adapter, along with a dedicated face-based model for human depictions.
An example of edits applied to source material, from the original 2023 IP-Adapter paper. Source: https://arxiv.org/pdf/2308.06721
The signal achievement of Tight Inversion, therefore, is to condense a complex technique into a single drop-in, plug-in modality that can be applied to existing systems, including many of the most popular LDM distributions.
Naturally, this means that systems using Tight Inversion (TI), like the auxiliary approaches it draws on, leverage the source image as a conditioning factor for the edited version, rather than relying solely on an accurate text prompt.
A further example of Tight Inversion’s ability to apply truly blended edits to source material.
The authors concede that their approach is not free of the perennial tension between fidelity and editability in diffusion-based image editing, but they report state-of-the-art results when injecting TI into existing systems.
The new paper is titled Tight Inversion: Image-Conditioned Inversion for Real Image Editing, and comes from five researchers at Tel Aviv University and Snap Research.
Method
Initially, the researchers used a large language model (LLM) to generate a set of diverse text prompts, from which images were then generated. The aforementioned DDIM inversion was then applied to each image under three text conditions: the text prompt used to generate the image; a shortened version of that prompt; and a null (empty) prompt.
With the inverted noise returned from these processes, the images were then regenerated under the same conditions, with no classifier-free guidance (CFG).
DDIM inversion scores across different metrics under different prompt settings.
As the graph above shows, scores across diverse metrics improve as the text length increases. The metrics used were Peak Signal-to-Noise Ratio (PSNR); L2 distance; Structural Similarity Index (SSIM); and Learned Perceptual Image Patch Similarity (LPIPS).
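For readers who want to reproduce this sort of comparison, the sketch below computes the four metrics for a pair of images. The library choices (scikit-image and the lpips package) and the data ranges are my assumptions, not details taken from the paper.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # learned perceptual metric (LPIPS)

def reconstruction_scores(original: np.ndarray, reconstruction: np.ndarray) -> dict:
    """Compare two HxWx3 float images in [0, 1] with the four metrics named above."""
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1  # NCHW, [-1, 1]
    return {
        "PSNR": peak_signal_noise_ratio(original, reconstruction, data_range=1.0),
        "L2": float(np.mean((original - reconstruction) ** 2)),
        "SSIM": structural_similarity(original, reconstruction, channel_axis=-1, data_range=1.0),
        "LPIPS": float(lpips_fn(to_t(original), to_t(reconstruction))),
    }

# In the experiment described above, each image would be inverted and regenerated three
# times (full prompt, shortened prompt, null prompt, all without CFG) and scored like:
# scores = reconstruction_scores(source_image, reconstructed_image)
```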
Image Awareness
Effectively, Tight Inversion changes how the host diffusion model edits real images, by conditioning the inversion process on the image itself rather than relying solely on text.
Typically, inverting an image into the noise space of a diffusion model involves estimating the starting noise that would reconstruct the input when denoised. Standard methods use a text prompt to guide this process, but an imperfect prompt can lead to errors, losing details or altering structures.
Tight Inversion instead uses IP-Adapter to feed visual information into the model, so that images are reconstructed more accurately: the source image is converted into conditioning tokens, which are projected into the inversion pipeline.
These parameters are adjustable: increasing the influence of the source image makes the reconstruction nearly perfect, while reducing it allows for more creative changes. This makes TI useful both for subtle modifications, such as changing the color of a shirt, and for more significant edits, such as replacing an object.
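This adjustable image-conditioning strength maps naturally onto the IP-Adapter scale exposed by the diffusers library. The sketch below shows that dial in an ordinary generation pipeline; note that Tight Inversion applies the conditioning during inversion itself, which diffusers does not provide out of the box, so this is only an approximation of the idea. Model IDs, the input image and the scale values are illustrative.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

# The "plus" IP-Adapter variants need the ViT-H image encoder loaded explicitly.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)

source_image = load_image("cat.png")  # hypothetical source image

# A high scale pins the output to the source image; a low scale frees the edit.
for scale in (1.0, 0.6, 0.3):
    pipe.set_ip_adapter_scale(scale)
    out = pipe(
        prompt="a cat wearing a red bow tie",
        ip_adapter_image=source_image,
        num_inference_steps=50,
        guidance_scale=7.5,
    ).images[0]
    out.save(f"edit_scale_{scale}.png")
```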
The authors state:
‘Note that Tight Inversion can be easily integrated with previous inversion methods (e.g., edit-friendly DDPM, ReNoise) by switching the native diffusion model with its IP-Adapter-augmented counterpart, [and] Tight Inversion consistently improves such methods in terms of both reconstruction and editability.’
Data and Testing
The researchers evaluated TI on its ability to both reconstruct and edit real-world source images. All experiments used Stable Diffusion XL with the DDIM scheduler, as outlined in the original Stable Diffusion paper, and all tests used 50 denoising steps at a default guidance scale of 7.5.
For image conditioning, IP-Adapter-Plus SDXL ViT-H was used. A few tests used SDXL Turbo with the Euler scheduler, and experiments were also conducted with Flux.1-dev, where inversion for the latter was performed with RF-Inversion over 28 steps.
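For orientation, the reported SDXL configuration corresponds roughly to the following in diffusers, reusing the `pipe` object from the earlier sketch. This is only a rough mirror of the test setup, since the paper’s evaluation code is not reproduced here.

```python
from diffusers import DDIMScheduler

# Approximate mirror of the reported SDXL setup: DDIM scheduler, 50 denoising steps,
# classifier-free guidance scale 7.5. (The SDXL Turbo runs instead use an Euler
# scheduler, and the Flux.1-dev runs use RF-Inversion over 28 steps; neither is shown.)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
test_settings = dict(num_inference_steps=50, guidance_scale=7.5)
```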
The dedicated face-based model was used only for depictions of human faces, since that is the domain it was trained to handle. It is noteworthy that a specialized subsystem is employed for this one prompt type; this suggests that our abiding interest in generating human faces cannot be served solely by the broad weights of an underlying model such as Stable Diffusion.
Reconstruction tests were performed with both qualitative and quantitative assessments. The image below shows qualitative examples for DDIM inversion.
Qualitative results for DDIM inversion. Each row shows a highly detailed image alongside its reconstructed versions, with progressively more accurate conditioning used during inversion and denoising. The more accurate the conditioning, the better the reconstruction quality; the far-right column, which uses the original image itself as the condition, achieves the highest fidelity. CFG was not used at any stage. Please refer to the source document for better resolution and detail.
The paper states:
‘These examples highlight that conditioning the inversion process on the image significantly improves reconstruction in highly detailed areas.
‘In particular, in the third example (image below), our method successfully reconstructs the tattoo on the back of the right boxer. In addition, the boxer’s leg pose is more accurately preserved, so the leg tattoo remains visible.’
Further qualitative results for DDIM inversion. Descriptive conditioning improves DDIM inversion, with image conditioning outperforming text conditioning, particularly for complex images.
The authors also tested Tight Inversion as a drop-in module for existing systems, pitting the modified versions against their baseline performance.
The three systems tested were the aforementioned DDIM inversion and RF-Inversion, together with ReNoise, which shares several authors with the paper under discussion here. Since it is trivial to obtain near-100% reconstruction with edit-friendly DDPM, the researchers focused solely on editability for that method.
(The qualitative results images are formatted in a way that is difficult to reproduce here, so please refer to the source PDF for fuller coverage and better resolution, although some selections appear below.)
Left, qualitative reconstruction results for Tight Inversion with SDXL. Right, reconstruction with Flux. The layout of these results in the published work makes them difficult to reproduce here, so please refer to the source PDF for a true impression of the differences obtained.
Here the authors comment:
‘As can be seen, integrating Tight Inversion with existing methods consistently improves reconstruction. [For example] our method accurately reconstructs the railing in the example on the left, and the man with the blue shirt in the example on the right (Figure 5 of the paper).’
The authors also tested the system quantitatively. In line with prior works, they used the MS-COCO validation set, and note that the results (shown below) show improved reconstruction across all metrics for all methods.
Comparing the performance metrics of the systems with and without Tight Inversion.
Next, the authors tested the system’s editing capabilities, pitting it against the baseline versions of the prior approaches Prompt-to-Prompt; edit-friendly DDPM; LEDITS++; and RF-Inversion.
Below is a selection of the paper’s qualitative results for SDXL and Flux (for further examples, readers are referred to the rather compressed layouts of the original paper).
Selections from the extensive (and rather confusingly laid out) qualitative results spread throughout the paper. Readers are referred to the source PDF for better resolution and meaningful clarity.
The authors argue that Tight Inversion is consistently superior to existing inversion techniques because it strikes a better balance between reconstruction and editability. Standard methods such as DDIM inversion and ReNoise can recover an image well, the paper states, but often struggle to maintain fine details once edits are applied.
In contrast, Tight Inversion leverages image conditioning to anchor the model’s output closer to the original, preventing unwanted distortions. The authors contend that even when competing approaches produce reconstructions that appear accurate, the introduction of edits often leads to artifacts and structural inconsistencies, and that Tight Inversion mitigates these issues.
Finally, quantitative results were obtained by evaluating Tight Inversion on the MagicBrush benchmark against DDIM inversion and LEDITS++, with alignment measured by CLIP similarity.
A quantitative comparison of Tight Inversion on the MagicBrush benchmark.
The authors conclude:
‘In both graphs, the trade-off between image preservation and adherence to the target edit is apparent (observable). Tight Inversion offers better control over this trade-off, and better preserves the input image while aligning with the edit (prompt).
‘Note that a CLIP similarity above 0.3 between the image and the text prompt indicates a plausible alignment between the image and the prompt.’
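For reference, the kind of image–text CLIP similarity score cited above can be computed as in the sketch below. The choice of CLIP checkpoint is mine for illustration; the paper’s exact CLIP variant is not specified here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# e.g. clip_similarity(edited_image, "a cat wearing a bow tie") > 0.3 would indicate
# a plausible alignment between the edited image and the edit prompt.
```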
Conclusion
Though it does not represent a ‘breakthrough’ for one of the most vexing challenges in LDM-based image synthesis, Tight Inversion consolidates a number of burdensome ancillary approaches into a unified method of AI-based image editing.
The results presented show that the tension between editability and fidelity is not eliminated by this method, but it is notably diminished. Considering that the central challenge this work addresses may ultimately prove intractable on its own terms (rather than by looking beyond LDM-based architectures for future systems), Tight Inversion represents a welcome incremental improvement to the state of the art.
First published Friday, February 28th, 2025