A new study from Russia proposes an unconventional method for detecting unrealistic AI-generated images: rather than trying to improve the accuracy of large vision-language models (LVLMs), the approach intentionally exploits their tendency to hallucinate.
The method extracts multiple "atomic facts" about an image using an LVLM, and then applies natural language inference (NLI) to systematically measure the contradictions among these statements.
Two images from the WHOOPS! dataset, alongside statements automatically generated by the LVLM. The image on the left is realistic and elicits consistent descriptions, while the unusual image on the right causes the model to hallucinate, producing inconsistent or incorrect descriptions. Source: https://arxiv.org/pdf/2503.15948
Asked to assess the realism of the second image, the LVLM can see that something is wrong, since the depicted camel has three humps, which is unknown in nature.
However, the LVLM initially conflates more than two humps with more than two animals, since this is the only way you would ever see three humps in a 'camel picture'. It then goes on to hallucinate something even more improbable than three humps (i.e., 'two heads'), and never details the very thing that appears to have triggered its suspicion.
The researchers found that LVLMs can perform this kind of assessment natively, on par with (or better than) models fine-tuned for the task. Since fine-tuning is complicated, expensive, and rather brittle in terms of downstream applicability, the discovery of a native use for hallucination, one of the biggest roadblocks in the current AI revolution, is a refreshing twist on the general trends in the literature.
Open evaluation
The authors argue that the significance of the approach is that it can be deployed with open-source frameworks. Advanced, heavily funded models such as ChatGPT may (the paper concedes) offer better results on this task, but arguably the real value of the literature for most of us (and especially for enthusiasts and the VFX community) lies in the possibility of incorporating and developing new breakthroughs in local implementations; conversely, anything consigned to a proprietary commercial API system is subject to withdrawal, arbitrary price increases, and censorship policies that are more likely to reflect corporate concerns than the user's needs and responsibilities.
The new paper is titled Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts, and comes from five researchers at the Skolkovo Institute of Science and Technology (Skoltech), the Moscow Institute of Physics and Technology, and the Russian companies MTS AI and AIRI. The work has an accompanying GitHub page.
Method
The authors use the dataset from the Israeli/US WHOOPS! project:
Examples of impossible images from the WHOOPS! dataset. It is notable how these images assemble plausible elements, and that their impossibility must be inferred from the conjunction of these incompatible facets. Source: https://whoops-benchmark.github.io/
The dataset consists of 500 synthetic images and over 10,874 annotations, and is specifically designed to test AI models' commonsense reasoning and compositional understanding. It was created in collaboration with designers tasked with generating challenging images via text-to-image systems such as Midjourney and the DALL-E series.
More examples from the WHOOPS! dataset. Source: https://huggingface.co/datasets/nlphuji/whoops
The new approach works in three stages: first, an LVLM (specifically LLaVA-v1.6-Mistral-7B) is prompted to generate multiple simple statements, called 'atomic facts', describing an image. These statements are generated using diverse beam search, ensuring variability in the outputs.
Diverse beam search produces a wider variety of caption options by optimizing for a diversity-augmented objective. Source: https://arxiv.org/pdf/1610.02424
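For readers who want to experiment, the fact-generation stage can be approximated with off-the-shelf tooling. The sketch below uses the public llava-hf/llava-v1.6-mistral-7b-hf checkpoint via Hugging Face Transformers; the prompt wording and the decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed public checkpoint corresponding to LLaVA-v1.6-Mistral-7B.
MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("camel.jpg")  # hypothetical test image
# Hypothetical prompt; the paper's wording may differ.
prompt = "[INST] <image>\nDescribe the image with one short factual statement. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Diverse beam search: beams are split into groups, and a diversity penalty
# discourages the groups from producing near-identical statements.
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    num_beams=10,
    num_beam_groups=5,
    diversity_penalty=1.0,
    num_return_sequences=10,
    do_sample=False,
)

# Strip the echoed prompt and keep only the generated statements.
facts = [
    processor.decode(seq, skip_special_tokens=True).split("[/INST]")[-1].strip()
    for seq in outputs
]
print(facts)  # multiple candidate 'atomic facts' for the image
```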
Each generated statement is then systematically compared with every other statement using a natural language inference model, which assigns scores reflecting whether each pair of statements is in entailment, contradiction, or a neutral relation.
Contradictions indicate hallucinations or unrealistic elements within the image.
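This pairwise comparison could be approximated with a standard NLI cross-encoder, as in the sketch below; the checkpoint matches the nli-deberta-v3-large model mentioned later in the results, but the example facts and the softmax normalization are illustrative assumptions rather than the paper's own code.

```python
from itertools import permutations

import numpy as np
from sentence_transformers import CrossEncoder

# NLI cross-encoder; per its model card the three outputs correspond to
# (contradiction, entailment, neutral).
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large")

# Hypothetical atomic facts from the previous step.
facts = [
    "A camel is standing in the desert.",
    "The camel has three humps.",
    "There are two camels in the picture.",
]

# Compare every ordered (premise, hypothesis) pair of statements.
pairs = list(permutations(facts, 2))
logits = np.asarray(nli_model.predict(pairs))

# Numerically stable softmax over the three classes for per-pair probabilities.
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
contradiction = probs[:, 0]
entailment = probs[:, 1]

for (premise, hypothesis), c in zip(pairs, contradiction):
    print(f"contradiction={c:.2f}  '{premise}' vs '{hypothesis}'")
```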
The schema for the detection pipeline.
Finally, the method aggregates these pairwise NLI scores into a single 'realism score' that quantifies the overall consistency of the generated statements.
The researchers investigated various aggregation methods, with clustering-based approaches working best. The authors applied the k-means clustering algorithm to separate the individual NLI scores into two clusters, and the centroid of the lower-valued cluster was selected as the final metric.
Using two clusters directly reflects the binary nature of the classification task, i.e., distinguishing realistic from unrealistic images. The logic is similar to simply choosing the lowest score overall; however, clustering allows the metric to represent the average contradiction across multiple facts, rather than relying on a single outlier.
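A minimal sketch of this aggregation step, assuming per-pair scores in which contradictions push the value down (for instance, entailment probability minus contradiction probability, an assumed convention), might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans


def realism_score(pair_scores):
    """Aggregate pairwise NLI scores into a single realism metric.

    `pair_scores` is assumed to be lower for contradictory statement pairs.
    The scores are split into two clusters with k-means and the centroid of
    the lower-valued cluster is returned, so the metric reflects the average
    of the inconsistencies rather than a single outlier.
    """
    x = np.asarray(pair_scores, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
    return float(km.cluster_centers_.min())


# Hypothetical example: the facts mostly agree, except for a few
# contradictory pairs that drag the lower cluster down.
print(realism_score([0.9, 0.8, 0.85, -0.7, -0.6, 0.75]))
```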
Data and Testing
The researchers tested the system against the WHOOPS! baseline benchmark, using rotating test splits (i.e., cross-validation). The baselines tested were BLIP2 FlanT5-XL and BLIP2 FlanT5-XXL fine-tuned on the splits, alongside models evaluated in a zero-shot format (i.e., with no additional training).
For an instruction-following baseline, the authors prompted the LVLMs with the phrase 'Is this unusual? Please explain briefly with a short sentence', which prior research had found effective for spotting unrealistic images.
The models evaluated were LLaVA 1.6 Mistral 7B, LLaVA 1.6 Vicuna 13B, and InstructBLIP in two sizes (7 and 13 billion parameters).
The testing procedure centered on 102 pairs of realistic and unrealistic ('weird') images. Each pair consisted of one normal image and one commonsense-defying counterpart.
Three human annotators labeled the images, reaching 92% agreement, indicating a strong human consensus about what constitutes 'weirdness'. The accuracy of the evaluation method was measured by its ability to correctly distinguish between realistic and unrealistic images.
The system was evaluated using three-fold cross-validation, with the data randomly shuffled under a fixed seed. The authors adjusted the weights for entailment scores (logically agreeing statements) and contradiction scores (logically conflicting statements) during training, while the 'neutral' scores were fixed at zero. The final accuracy was calculated as the average across all test splits.
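As a rough illustration of that protocol (with invented weights, and the assumption that the realistic image in each pair should receive the higher score), the weighted combination and pairwise accuracy might be sketched as:

```python
import numpy as np


def combined_score(entailment, contradiction, w_ent, w_con):
    """Weighted combination of per-pair NLI scores; the neutral class is
    given zero weight, as described in the article. The weights themselves
    would be tuned on the training folds."""
    return w_ent * np.mean(entailment) + w_con * np.mean(contradiction)


def pairwise_accuracy(image_pairs, score_fn):
    """`image_pairs` holds (realistic, weird) items; a pair counts as
    correct when the realistic image receives the higher realism score."""
    hits = [score_fn(real) > score_fn(weird) for real, weird in image_pairs]
    return float(np.mean(hits))
```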
Comparison of different NLI models and aggregation methods on a subset of five generated facts, measured by accuracy.
Regarding the first results above, the paper states:
"The clustering-based ('clust') method stands out as one of the best-performing approaches. This implies that aggregating all contradiction scores is crucial, rather than focusing only on extreme values. In addition, the largest NLI model (nli-deberta-v3-large) outperforms all others across all aggregation methods, suggesting that it captures the essence of the problem more effectively."
The authors find that the optimal weights consistently favor contradiction over entailment, indicating that contradictions are more informative for distinguishing unrealistic images. Their method outperforms all other zero-shot methods tested, and comes close to the performance of the fine-tuned BLIP2 model.
Performance of different approaches on the WHOOPS! benchmark. Fine-tuned (ft) methods appear at the top, with zero-shot (zs) methods listed below. Model size indicates the number of parameters, and accuracy is used as the evaluation metric.
The authors also note, somewhat unexpectedly, that InstructBLIP performed better than comparable LLaVA models when given the same prompt. While acknowledging GPT-4o's superior accuracy, the paper emphasizes the authors' preference for demonstrating practical, open-source solutions, and it seems reasonable for them to claim novelty in explicitly exploiting hallucinations as a diagnostic tool.
Conclusion
That said, the authors acknowledge the project's debt to the 2024 FaithScore outing, a collaboration between the University of Texas at Dallas and Johns Hopkins University.
A diagram of how FaithScore evaluation works. First, descriptive statements within an answer generated by an LVLM are identified. Next, these statements are broken down into individual atomic facts. Finally, the atomic facts are compared against the input image to check their accuracy. Underlined text emphasizes objective descriptive content, while blue text indicates hallucinated statements, allowing FaithScore to deliver an interpretable measure of factual accuracy. Source: https://arxiv.org/pdf/2311.01477
FaithScore measures the faithfulness of LVLM-generated descriptions by verifying their consistency with the image content, whereas the new paper's method explicitly exploits LVLM hallucinations to detect unrealistic images, using natural language inference over the contradictions among the generated facts.
The new work depends, naturally, on the eccentricities and hallucinatory tendencies of current language models. Should model development ever yield an entirely non-hallucinating model, even the general principles of the new work would no longer be applicable; for now, however, that remains a distant prospect.
First released on Tuesday, March 25th, 2025