Large vision-language models (LVLMs) are useful aids in interpreting some of the more inscrutable or challenging submissions in the computer vision literature; but there is one area in which they are hamstrung: the example videos that accompany new papers*.
This is an important aspect of a submission, since scientific papers often aim to generate excitement through persuasive text and/or visuals.
However, for projects where video is integral, an author must either show actual video output or risk having the work dismissed; and it is in these demonstrations that the gap between bold claims and real-world performance is most often revealed.
I Read the Book, but I Didn't See the Movie
Most of the popular API-based large language models (LLMs) and large vision-language models (LVLMs) will not engage in direct analysis of video content in any way, qualitative or otherwise. Instead, they can only analyze related transcripts, and perhaps comment threads and other strictly text-based supplementary materials.
The varied objections of GPT-4o, Google Gemini, and Perplexity when asked to directly analyze video, without recourse to transcripts or other text-based sources.
However, an LLM may gloss over or deny its inability to actually watch a video, unless the matter is pressed.
Asked to provide a subjective evaluation of videos associated with a new research paper, ChatGPT-4o eventually concedes that it cannot actually view video directly.
Though models such as ChatGPT-4o are multimodal, and can at least analyze individual photos (such as frames extracted from a video, see image above), even this approach has issues. Firstly, there is scant evidence to lend credibility to an LLM's qualitative opinions.

Secondly, many, if not most, of the problems with generated video are temporal in nature, and these aspects are entirely lost in a frame grab; so the examination of individual frames serves little purpose.

Finally, an LLM can only base a (once again, presumed) 'value judgment' on text-based knowledge absorbed during training, for instance in regard to deepfake imagery or art history. In such cases the trained domain knowledge allows the LLM to correlate the visual qualities of an analyzed image with learned embeddings rooted in human insight:
The FakeVLM project offers targeted deepfake detection via a specialized multimodal vision-language model. Source: https://arxiv.org/pdf/2503.14905
This is not to say that an LLM cannot obtain information directly from a video. For instance, with the aid of an adjunct AI system such as YOLO, an LLM can identify objects in a video, or could do so directly, if trained for an above-average level of multimodal functionality.
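As a rough illustration of that adjunct route, the sketch below samples frames from a video and turns detector output into plain-text labels that a text-only LLM could then reason over. It assumes the open-source ultralytics and opencv-python packages; the weights file and video path are illustrative placeholders, not anything referenced above.

```python
# Minimal sketch: extracting object labels from video frames with an off-the-shelf
# YOLO detector, so that a text-only LLM can reason over the resulting labels.
# Assumes the 'ultralytics' and 'opencv-python' packages are installed; the model
# weights file and video path are illustrative placeholders.
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")           # small pretrained detection model
cap = cv2.VideoCapture("demo_video.mp4")

labels_per_frame = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:             # sample roughly one frame per second at 30 fps
        result = detector(frame, verbose=False)[0]
        names = [result.names[int(c)] for c in result.boxes.cls]
        labels_per_frame.append((frame_idx, names))
    frame_idx += 1
cap.release()

# The label lists can now be serialized and passed to an LLM as plain text.
print(labels_per_frame[:3])
```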
But the only way an LLM could subjectively evaluate a video (i.e., 'That doesn't look real to me') is by applying metrics based on loss functions that are known to reflect human opinion well, or else by being informed directly by human opinion.
Loss functions are mathematical tools used during training to measure how far a model's predictions are from the correct answers. They provide the feedback that guides the model's learning: the greater the error, the higher the loss. As training progresses, the model adjusts its parameters to reduce this loss, gradually improving its ability to make accurate predictions.

Loss functions serve a dual role: as tuning algorithms that steer a model's training, and as metrics for evaluating the output of AI models (such as evaluating how photorealistic the content from a generative video model appears).
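As a minimal, generic illustration of that dual role (not tied to any particular model or metric discussed here), the same mean squared error function can act as a training signal or as a crude evaluation score:

```python
# Minimal illustration of a loss function: mean squared error (MSE).
# The same quantity can be minimized during training, or reported afterwards
# as a rough measure of how far outputs sit from reference values.
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Average squared difference between predictions and targets."""
    return float(np.mean((predictions - targets) ** 2))

preds = np.array([0.9, 0.2, 0.7])
targets = np.array([1.0, 0.0, 1.0])
print(mse_loss(preds, targets))  # larger errors produce a larger loss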
Conditional Vision
One of the most popular such metrics/loss functions is Fréchet Inception Distance (FID), which evaluates the quality of generated images by measuring the similarity between their distribution (which here means 'how images are spread out or clustered by visual features') and that of real images.

Specifically, FID uses the (often-criticized) Inception V3 classification network to extract features from both sets of images, and then calculates the statistical difference between the two sets using their means and covariances. A lower FID score indicates that the generated images are more similar to real images, implying better visual quality and diversity.
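In schematic form, and assuming that Inception V3 features have already been extracted from both image sets elsewhere, the distance calculation itself reduces to a comparison of means and covariances. The following is a sketch of the standard formula, not the reference implementation:

```python
# Sketch of the Fréchet Inception Distance calculation, assuming that 'real_feats'
# and 'gen_feats' are arrays of Inception V3 features (one row per image) that
# have already been extracted elsewhere. Not the reference implementation.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```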
However, FID is essentially, and arguably unduly, self-referential in its approach. To address this, the later Conditional Fréchet Distance (CFD, 2021) approach differs from FID in that it compares generated images to real images while also scoring how well both sets match an additional condition, such as an (inevitably subjective) class label or input image.

In this way, CFD accounts not only for overall realism and diversity in themselves, but also for how accurately each image meets its intended conditions.
Examples of output from the 2021 CFD approach. Source: https://github.com/michael-soloveitchik/cfid/
CFD follows the recent trend of baking qualitative human interpretation into loss functions and metric algorithms. Though such human-centered approaches ensure that the resulting algorithm will not be 'soulless' or merely mechanical, they present a number of problems at the same time: the possibility of bias; the burden of updating the algorithm in line with new practices, which removes the possibility of a consistent comparison standard across projects over a span of years; and budgetary limits (fewer human contributors make the determinations more idiosyncratic, while a greater number can make useful updates prohibitively expensive).
cFreD
This brings us to a new paper from the US, which proposes Conditional Fréchet Distance (cFreD), a new take on CFD designed to better reflect human preferences by evaluating both visual quality and text-image alignment.
Partial results from the new paper: image rankings (1-9) assigned by various metrics for the prompt 'living room with sofa and laptop computer'. Green highlights the model rated highest by humans (FLUX.1-dev), purple the lowest-rated (SDv1.5). Only cFreD matches the human ranking; please refer to the source paper for complete results, which there is no room to reproduce here. Source: https://arxiv.org/pdf/2503.21721
The authors argue that existing methods for evaluating text-to-image synthesis, such as Inception Score (IS) and FID, align poorly with human judgment because they measure only image quality, without considering how well an image matches its prompt:
'For example, consider a dataset with two images: one of a dog and one of a cat, each paired with its corresponding prompt. A perfect text-to-image model that mistakenly swaps these mappings (i.e., generating a cat for the dog prompt, and vice versa) would achieve near-zero FID despite the misalignment with the intended prompts, since the overall distribution of cats and dogs is preserved.'
The authors contend that cFreD addresses this by assessing both image quality and the conditioning on the input text, improving correlation with human preferences.
Tests conducted for the paper show that the authors' proposed metric, cFreD, consistently achieves higher alignment with human preferences than FID, FDDINOv2, CLIPScore, and CMMD across three benchmark datasets (PartiPrompts, HPDv2, and COCO).
Concepts and Methods
The authors note that the current gold standard for evaluating text-to-image models involves gathering human preference data through crowd-sourced comparisons, similar to the methods used for large language models (such as the LMSys Arena).
For example, PartiPrompts Arena uses 1,600 English prompts, presenting participants with pairs of images from different models and asking them to select the image they prefer.
Similarly, the Text-to-Image Arena Leaderboard employs user comparisons of model outputs to generate rankings via ELO scores. However, collecting this kind of human evaluation data is expensive and slow, which has led some platforms, such as the PartiPrompts Arena, to stop updating altogether.
The Artificial Analysis Image Arena Leaderboard, ranking the current estimated leaders in generative visual AI. Source: https://artificialanalysis.ai/text-to-image/arena?tab=leaderboard
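For readers unfamiliar with ELO-style ranking, the core update after a single pairwise comparison looks roughly like the sketch below. The K-factor and starting ratings are illustrative assumptions, not the leaderboard's actual parameters.

```python
# Schematic ELO update for one pairwise image comparison. The K-factor and the
# initial ratings used in the example are illustrative assumptions only.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: model A (rated 1000) beats model B (rated 1050) in one comparison.
print(elo_update(1000.0, 1050.0, a_wins=True))
```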
Although alternative metrics exist that have been trained on historical human preference data, their effectiveness in assessing future models remains uncertain, since human preferences evolve continuously. As a result, automated metrics such as FID, CLIPScore, and the authors' proposed cFreD seem likely to remain important assessment tools.
The authors assume that both the real and the generated images, conditioned on a prompt, follow Gaussian distributions, each defined by conditional means and covariances. cFreD measures the expected Fréchet distance across prompts between these conditional distributions. This can be formulated either directly in terms of the conditional statistics, or by combining unconditional statistics with cross-covariances involving the prompt.

By incorporating the prompt in this way, cFreD can assess both the realism of the images and their consistency with the specified text.
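Read literally, the description above amounts to the ordinary Fréchet distance between two Gaussians, taken in expectation over prompts. The display below is a paraphrase of that description in our own notation, not a formula copied from the paper:

```latex
% Sketch of the quantity described above: the Fréchet distance between the real
% and generated conditional Gaussians, averaged over prompts c. Notation is ours,
% not reproduced verbatim from the paper.
\mathrm{cFreD} \;=\; \mathbb{E}_{c \sim p(c)}\Big[
    \lVert \mu_r(c) - \mu_g(c) \rVert_2^{2}
    \;+\; \operatorname{Tr}\!\Big( \Sigma_r(c) + \Sigma_g(c)
    - 2\big( \Sigma_r(c)\,\Sigma_g(c) \big)^{1/2} \Big)
\Big]
```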
Data and Testing
To assess how closely cFreD correlates with human preferences, the authors used image rankings from multiple models responding to the same prompts. Their evaluation drew on two sources: the Human Preference Dataset v2 (HPDv2) test set, which includes nine generated images and one COCO ground-truth image per prompt; and the aforementioned PartiPrompts Arena, which contains the output of four models across 1,600 prompts.
The authors compiled the scattered Arena data points into a single dataset; in cases where a real image was not ranked highest in the human evaluations, the highest human-rated image was used as the reference instead.
To test more recent models, the authors sampled 1,000 prompts from the COCO training and validation sets, ensuring no overlap with HPDv2, and generated images using nine models from the Arena Leaderboard. The original COCO images served as references in this part of the assessment.
The cFreD approach was assessed against four statistical metrics: FID; FDDINOv2; CLIPScore; and CMMD. It was also evaluated against four metrics trained on human preference data: Aesthetic Score; ImageReward; HPSv2; and MPS.
The authors evaluated correlation with human judgment from both a ranking and a scoring perspective: for each metric, scores were reported for every model, and rankings were computed and checked for consistency against the human evaluation results.
Previous work on human preference learning has measured performance using per-item rank accuracy, which calculates the ranking accuracy of each image-text pair before averaging the results.

To evaluate cFreD, the authors instead used a global rank accuracy, which assesses ranking performance over the dataset as a whole. For the statistical metrics, they derived rankings directly from raw scores; for the metrics trained on human preferences, they first averaged the rankings assigned to each model across all samples, and then determined the final rankings from those averages.
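A hedged sketch of what such a check might look like in code is given below, treating rank accuracy as the fraction of model pairs that a metric orders the same way humans do (a reading consistent with the later note that values below 0.5 indicate mostly inverted pairs). The model names and scores are purely hypothetical, and the correlation shown is a generic rank correlation, not necessarily the statistic used in the paper.

```python
# Sketch of a global pairwise rank-accuracy check: the fraction of model pairs
# that a metric orders the same way as the human ranking. Illustrative reading
# of the evaluation described above, not the paper's actual code.
from itertools import combinations
from scipy.stats import spearmanr

def rank_accuracy(metric_scores: dict, human_scores: dict, higher_is_better=True) -> float:
    models = list(metric_scores)
    agree = total = 0
    for a, b in combinations(models, 2):
        m_diff = metric_scores[a] - metric_scores[b]
        h_diff = human_scores[a] - human_scores[b]
        if not higher_is_better:
            m_diff = -m_diff
        if m_diff == 0 or h_diff == 0:
            continue                       # skip ties
        agree += (m_diff > 0) == (h_diff > 0)
        total += 1
    return agree / total if total else float("nan")

# Hypothetical scores for three models (illustrative numbers only).
human  = {"model_a": 1200.0, "model_b": 1100.0, "model_c": 1000.0}   # e.g. ELO
metric = {"model_a": 0.91,   "model_b": 0.85,   "model_c": 0.88}

print(rank_accuracy(metric, human))        # pairwise agreement with human order
rho, _ = spearmanr(list(metric.values()), list(human.values()))
print(rho)                                 # rank correlation between the two orderings
```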
The ten frameworks used in the first test were: GLIDE; COCO; FuseDream; DALL-E 2; VQGAN+CLIP; CogView2; Stable Diffusion V1.4; VQ-Diffusion; Stable Diffusion V2.0; and LAFITE.
Model rankings and scores on the HPDv2 test set, using statistical metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, HPSv2, and MPS). The best results are shown in bold, with the second best underlined.
Of the initial results, the authors comment:
'cFreD achieves the highest alignment with human preferences, reaching a correlation of 0.97. Among the statistical metrics, cFreD attains the highest correlation and is comparable to HPSv2 (0.94), a model explicitly trained on human preferences. Since HPSv2 was trained on the HPSv2 training set, which includes four of the models in the test set, and employed the same annotators, it inherently encodes the specific human preference biases of that setting.

'In contrast, cFreD achieves comparable or superior correlation with human evaluations without any training on human preferences.

'These results demonstrate that cFreD provides more reliable rankings across diverse models than standard automated metrics and metrics explicitly trained on human preference data.'
Of all the metrics evaluated, cFreD achieved the highest rank accuracy (91.1%), supporting the authors' claim of strong alignment with human judgment.

HPSv2 followed at 88.9%, while FID and FDDINOv2 produced competitive scores of 86.7%. Although the metrics trained on human preference data generally aligned well with human evaluations, cFreD proved to be the most robust and reliable overall.
Below are the results of the second round of tests, conducted on the PartiPrompts Arena using SDXL; Kandinsky 2; Würstchen; and Karlo v1.0.
Model rankings and scores for PartiPrompts, using statistical metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, and MPS). The best results are shown in bold, with the second best underlined.
Here, the paper states:
'Among the statistical metrics, cFreD achieves the highest correlation with human evaluations (0.73), with FID and FDDINOv2 both reaching a correlation of 0.70. In contrast, CLIPScore shows a very low correlation (0.12) with human judgment.

'In the human preference-trained category, HPSv2 has the strongest alignment, achieving the highest correlation (0.83), followed by ImageReward (0.81) and MPS (0.65). These results highlight that while cFreD is a robust automatic metric, HPSv2 is the most effective at capturing human evaluation trends in the PartiPrompts Arena.'
Finally, the authors ran evaluations on the COCO dataset, using nine contemporary text-to-image models: FLUX.1 (dev); Playground v2.5; Janus Pro; and the Stable Diffusion variants SDv3.5-L Turbo, 3.5-L, 3-M, SDXL, 2.1, and 1.5.
Human preference rankings were drawn from the Text-to-Image Arena Leaderboard, expressed as ELO scores.
Model rankings on randomly sampled COCO prompts, using automatic metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, HPSv2, and MPS). A rank accuracy below 0.5 indicates more discordant than concordant pairs. The best results are shown in bold, with the second best underlined.
Regarding this round, the researchers stated:
'Among the statistical metrics (FID, FDDINOv2, CLIP, CMMD, and the proposed cFreD), only cFreD demonstrates a strong correlation with human preference, achieving a correlation of 0.33 and a non-trivial rank accuracy of 66.67%. This result positions cFreD as the third most aligned metric overall, surpassed only by the human preference-trained metrics ImageReward, HPSv2, and MPS.

'Notably, all the other statistical metrics showed considerably weaker alignment with the ELO rankings and, as a result, produced inverted rankings, with a rank accuracy below 0.5.

'These findings highlight that cFreD is sensitive to both visual fidelity and prompt consistency, reinforcing its value as a practical, training-free alternative for benchmarking text-to-image generation.'
The authors also tested Inception V3 as a backbone, in deference to its ubiquity in the literature, and found that while Inception V3 performs reasonably well, it is outmatched by transformer-based backbones such as DINOv2-L/14 and ViT-L/16†, which aligned more consistently with human rankings.
A win-rate chart showing how often each image backbone's rankings matched the true human-derived rankings on the COCO dataset.
Conclusion
While human-in-the-loop solutions remain the optimal approach to developing metrics and loss functions, it seems clear that the scale and frequency of updates required by such schemes will continue to make them impractical; perhaps until extensive public participation in evaluations is broadly incentivized, or enforced, as has happened with CAPTCHAs.
The authors' new system removes much of the manual involvement demanded by many current approaches, but it still depends on alignment with human judgment; the legitimacy of cFreD therefore remains rooted in human preference data (obviously, without such benchmarks, the claim that cFreD reflects human-like evaluation would be unsupportable).
It may also be misguided to enshrine the current criteria for 'realism' in generative output into a metric for the long term, since the definition of that concept is currently under assault from each new wave of generative AI systems, and is set for frequent and significant revision.
* At this point I would normally include an illustrative example video, perhaps from a recent academic submission; but that would be mean-spirited. Anyone who has spent more than 10-15 minutes trawling through Arxiv's generative AI output will already have come across supplementary videos of this kind.
† A total of 46 image backbone models were used in the experiments, though not all of them appear in the graphed results; see the paper's appendix for the full list of those featured in the tables and figures.
First published Tuesday, April 1st, 2025