Technology

Large language models remember the datasets used to test them

May 17, 2025 14 Min Read

When we rely on AI to suggest what to watch, read, or buy, new research shows that some systems base these results on memory rather than learned skill: instead of learning to make useful suggestions, the models often recall items from the very dataset used to evaluate them, inflating their apparent performance and producing recommendations that may be stale or poorly matched to the user.

In machine learning, a test split is used to check whether a trained model has learned to solve problems that are similar to, but not identical with, the material it was trained on.

So if a new AI ‘dog-breed recognition’ model is trained on a dataset of 100,000 dog photos, it will typically use an 80/20 split: 80,000 photos to train the model, and 20,000 photos held back as material for testing the finished model.
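
As a rough illustration of this convention, the split might be produced along these lines (a minimal sketch using scikit-learn; the file name and label column are hypothetical):

```python
# Minimal sketch of an 80/20 train/test split (illustrative only).
# Assumes a hypothetical CSV index of photo paths and breed labels.
import pandas as pd
from sklearn.model_selection import train_test_split

photos = pd.read_csv("dog_photos.csv")   # hypothetical index of 100,000 photos
train, test = train_test_split(
    photos,
    test_size=0.2,              # hold back 20,000 photos for evaluation
    random_state=42,            # fixed seed so the split is reproducible
    stratify=photos["breed"],   # keep breed proportions similar in both splits
)
print(len(train), len(test))    # e.g. 80000, 20000
```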

If a model’s training data inadvertently comes to include that ‘held-back’ 20% test split, the model already knows the answers and will breeze through the tests, since it has effectively seen 100% of the domain data. Naturally, this does not reflect how the model will later perform on new, ‘live’ data in a production context.

Movie spoilers

The problem of AI ‘cheating on its exams’ has grown in step with the scale of the models themselves. Because today’s systems are trained on vast, indiscriminately web-scraped corpora such as Common Crawl, the chance that benchmark datasets (i.e., the held-back 20%) will slip into the training mix is no longer an edge case but practically the default, a syndrome known as data contamination. And at this scale, the kind of manual curation that could catch such errors is logistically impossible.

This is the case investigated in a new paper from the Politecnico di Bari in Italy, where the researchers focus on the outsized role of MovieLens-1M, a single movie-recommendation dataset.

This particular dataset is so widely used in the testing of recommender systems that its presence in a model’s memory can render such tests pointless: what appears to be intelligence may in fact be simple recall, and what appears to be an intuitive recommendation skill may just be a statistical echo of earlier exposure.

The authors state:

‘Our findings reveal that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories. Notably, a simple prompt allows GPT-4o to recover nearly 80% of the movie titles in the dataset.

“None of the models examined is entirely free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets. We observed a similar trend when retrieving user attributes and interaction histories.”

The short new paper is titled Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M, and comes from six Politecnico di Bari researchers. A pipeline to replicate their work is available on GitHub.


Method

To establish whether the models in question were genuinely learning or simply recalling, the researchers began by defining what memorization means in this context, and by testing whether a model could retrieve specific pieces of the MovieLens-1M dataset when prompted in the right way.

If a model, shown a movie’s ID number, could produce its title and genre, that counted as memorization of the item. If it could generate details about a user (age, occupation, postal code, and so on) from the user ID alone, that counted as memorization of the user. And if it could reproduce a user’s next film rating from a known sequence of earlier ones, it was taken as evidence that the model might be recalling specific interaction data rather than learning general patterns.

Each of these forms of recall was tested using carefully written prompts, crafted to nudge the model without providing it with new information. The more accurate the responses, the more likely it was that the model had already encountered that data during training.
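
As a rough sketch of the idea (the exact prompt wording and the hit criterion here are illustrative, not the paper’s):

```python
# Illustrative memorization probes built around the MovieLens-1M record formats.
def item_prompt(movie_id: int) -> str:
    # Item memorization: can the model complete title and genres from the ID alone?
    return f"Complete this MovieLens-1M record in the format MovieID::Title::Genres\n{movie_id}::"

def user_prompt(user_id: int) -> str:
    # User memorization: can the model produce gender, age, occupation and zip code?
    return f"Complete this MovieLens-1M record in the format UserID::Gender::Age::Occupation::Zip\n{user_id}::"

def is_item_hit(completion: str, true_title: str) -> bool:
    # Count a recall as successful if the completion contains the true title.
    return true_title.lower() in completion.lower()
```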

Zero-shot prompts from the evaluation protocol used in the new paper. Source: https://arxiv.org/pdf/2505.10212

Data and Testing

To establish which datasets mattered, the authors surveyed recent papers from two of the field’s main conferences, ACM RecSys 2024 and ACM SIGIR 2024, and found MovieLens-1M cited in at least one in five submissions. This was less a surprising result than a confirmation of the dataset’s dominance, since earlier surveys have reached similar conclusions.

MovieLens-1M consists of three files: Movies.dat, which lists films by ID, title, and genre; Users.dat, which maps user IDs to basic biographical fields; and Ratings.dat, which records who rated what.
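
For reference, the three files use ‘::’-separated records and can be loaded along these lines (a minimal sketch; the column names are the standard MovieLens-1M fields):

```python
# Minimal sketch for loading the three MovieLens-1M files,
# which use '::' as the field separator.
import pandas as pd

movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["movie_id", "title", "genres"], encoding="latin-1")
users = pd.read_csv("users.dat", sep="::", engine="python",
                    names=["user_id", "gender", "age", "occupation", "zip_code"])
ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["user_id", "movie_id", "rating", "timestamp"])

print(len(movies), len(users), len(ratings))  # ~3,900 movies, 6,040 users, ~1,000,000 ratings
```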

To investigate whether this data had been memorized by large language models, the researchers turned to prompting techniques first introduced in the paper Extracting Training Data from Large Language Models, and later adapted in the follow-up work A Bag of Tricks for Training Data Extraction from Language Models.

The method is direct: pose questions that mirror the dataset’s format and check whether the model answers correctly. Zero-shot, chain-of-thought, and few-shot prompts were all tested, and the last method, in which the model is shown a few worked examples, proved the most effective. Even if more elaborate approaches might yield higher recall, this was judged sufficient to reveal what had been memorized.

Few-shot prompts used to test whether a model can reproduce specific MovieLens-1M values when queried in a minimal context.

To measure memorization, the researchers defined three forms of recall: item, user, and interaction. These tests checked whether the model could retrieve a film’s title from its ID, generate user details from a user ID, and predict a user’s next rating from previous ones. Each was scored with a coverage metric* reflecting how much of the dataset could be reconstructed through prompting.
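
A coverage score of this kind might be computed roughly as follows (an illustrative sketch, not the paper’s code; `query_model` is a stand-in for whichever prompt-and-call routine is being tested):

```python
# Illustrative coverage computation: the share of dataset entries the
# model reproduces correctly when prompted with their IDs.
def item_coverage(movies, query_model) -> float:
    hits = 0
    for row in movies.itertuples():
        completion = query_model(row.movie_id)        # model's completion for this ID
        if row.title.lower() in completion.lower():   # a hit if the true title appears
            hits += 1
    return hits / len(movies)

# e.g. a return value of 0.80 means 80% of titles were recovered from their IDs
```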


The models tested were GPT-4o, GPT-4o mini, GPT-3.5 Turbo, Llama-3.3 70B, Llama-3.2 3B, Llama-3.2 1B, Llama-3.1 405B, Llama-3.1 70B, and Llama-3.1 8B. Temperature was set to zero, top_p to one, and both frequency and presence penalties were disabled; fixed random seeds ensured consistent output across runs.
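
For the GPT models, that decoding setup corresponds roughly to the following call (a sketch of the configuration only; the prompt content and seed value are placeholders):

```python
# Sketch of the deterministic decoding configuration described above,
# using the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": item_prompt(1)}],  # probe from the earlier sketch
    temperature=0,          # no sampling randomness
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    seed=42,                # fixed seed for best-effort reproducibility
)
print(response.choices[0].message.content)
```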

Percentage of MovieLens-1M entries retrieved from Movies.dat, Users.dat, and Ratings.dat, with models grouped by version and sorted by parameter count.

To gauge how deeply MovieLens-1M had been absorbed, the researchers prompted each model for exact entries from the dataset’s three files: Movies.dat, Users.dat, and Ratings.dat.

The results of this first test, shown above, reveal sharp differences not only between the GPT and Llama families but also across model sizes: GPT-4o and GPT-3.5 Turbo retrieve large portions of the dataset with ease, while most of the open-source models recall only a fraction of the same material, suggesting uneven exposure to this benchmark.

These are not small margins: across all three files, the strongest models did not merely outperform the weaker ones, they recalled whole portions of MovieLens-1M.

For GPT-4o, coverage was high enough to suggest that a non-trivial share of the dataset had been directly memorized.

The authors state:

‘Our findings reveal that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories.

“Notably, a simple prompt enables GPT-4o to recover almost 80% of MovieID::Title records. None of the models examined is entirely free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets.

“We observed a similar trend when retrieving user attributes and interaction histories.”

The authors then tested the impact of memorization on recommendation tasks by prompting each model to act as a recommender system, and benchmarking its output against seven standard methods: UserKNN, ItemKNN, BPRMF, EASER, LightGCN, MostPop, and Random.

The MovieLens-1M dataset was split 80/20 into training and test sets using a leave-one-out sampling strategy to simulate real usage. The metrics used were Hit Rate (HR@n) and nDCG@n.
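
Since leave-one-out evaluation leaves each test user with a single held-out item, both metrics take a simple form (a minimal sketch, not the authors’ evaluation code):

```python
import math

def hit_rate_at_n(ranked_lists, held_out, n=10):
    # Fraction of users whose single held-out item appears in their top-n list.
    hits = sum(1 for user, items in ranked_lists.items() if held_out[user] in items[:n])
    return hits / len(ranked_lists)

def ndcg_at_n(ranked_lists, held_out, n=10):
    # With one relevant item per user, the ideal DCG is 1, so nDCG reduces to
    # 1 / log2(rank + 1) when the item is ranked within the top n, else 0.
    total = 0.0
    for user, items in ranked_lists.items():
        if held_out[user] in items[:n]:
            rank = items[:n].index(held_out[user]) + 1   # 1-based rank
            total += 1.0 / math.log2(rank + 1)
    return total / len(ranked_lists)
```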

Recommendation accuracy for standard baselines and LLM-based methods. Models are grouped by family and ordered by parameter count, with bold type indicating the highest score within each group.

Here, several of the large language models outperform the traditional baselines on every metric: GPT-4o establishes a wide lead in every column, and even mid-sized models such as GPT-3.5 Turbo and Llama-3.1 405B consistently surpass baseline methods such as BPRMF and LightGCN.

Among the Llama variants, performance varies widely, but Llama-3.2 3B stands out with the best HR@1 in its group.

The authors argue that these results suggest memorized data can translate into measurable advantages in recommendation-style prompting, particularly for the strongest models.


In further observations, the researchers continue:

“While the recommendation performance appears impressive, comparing Table 2 with Table 1 reveals an interesting pattern. Within each group, the models with higher memorization also demonstrate superior performance on the recommendation task.

‘For example, GPT-4o outperforms GPT-4o mini, and Llama-3.1 405B outperforms Llama-3.1 70B and 8B.

“These results highlight that evaluating LLMs on datasets leaked into their training data may yield performance driven by memorization rather than generalization.”

Regarding the effect of model scale, the authors observed a clear correlation between size, memorization, and recommendation performance: larger models not only retained more of the MovieLens-1M dataset but also performed more strongly on the downstream task.

For example, Llama-3.1 405B showed an average memorization rate of 12.9%, while Llama-3.1 8B retained only 5.82%. This nearly 55% reduction in recall corresponded to a 54.23% drop in nDCG and a 47.36% drop in HR across the evaluation cutoffs.
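
The relative drop in recall quoted here follows directly from those two figures:

```python
# Quick check of the "nearly 55%" reduction in memorization.
mem_405b, mem_8b = 12.9, 5.82        # average memorization rates (%) reported above
reduction = (mem_405b - mem_8b) / mem_405b
print(f"{reduction:.1%}")            # ~54.9%
```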

The pattern held throughout: where memorization decreased, so did apparent performance.

‘These findings suggest that increasing the model scale leads to greater memorization of the dataset, resulting in improved performance.

“Consequently, larger models offer better recommendation performance, but they also present risks related to potential leakage of training data.”

The final test examined whether memorization mirrors the popularity bias baked into MovieLens-1M. Items were grouped by interaction frequency, and the chart below shows that the larger models consistently favored the most popular entries.

Item coverage by model across three popularity tiers: the most popular top 20% of items, the moderately popular middle band, and the least-interacted-with bottom 20%.

GPT-4o retrieved 89.06% of the most popular items but only 63.97% of the least popular ones, while GPT-4o mini and the smaller Llama models showed much lower coverage across all bands. The researchers say this trend suggests that memorization not only grows with model scale but also amplifies the existing imbalances in the training data.
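
Grouping items into popularity tiers and measuring per-tier coverage might be sketched as follows (illustrative only; `recalled_ids` stands for the set of item IDs a model reproduced correctly):

```python
# Sketch: bucket items by interaction frequency, then measure the share
# of each popularity tier that appears among the model's recalled items.
def tier_coverage(ratings, recalled_ids):
    counts = ratings["movie_id"].value_counts()   # interactions per item, most popular first
    ranked = counts.index.tolist()
    cut = len(ranked) // 5
    tiers = {
        "top 20%": ranked[:cut],
        "middle 60%": ranked[cut:-cut],
        "bottom 20%": ranked[-cut:],
    }
    return {name: sum(1 for m in ids if m in recalled_ids) / len(ids)
            for name, ids in tiers.items()}
```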

They continue:

‘Our findings reveal a pronounced popularity bias in LLMs, with the top 20% of popular items being far easier to retrieve than the bottom 20%.

“This trend highlights the influence of the training data distribution, in which popular films are overrepresented, leading to their disproportionate memorization by the model.”

Conclusion

The dilemma is not a novel one: as training sets grow, the prospect of curating them recedes at a matching rate. MovieLens-1M has slipped into these vast corpora without oversight, and it is probably only one among many such datasets.

The problem repeats at every scale and resists automation. Any solution demands not just effort but human judgment, the slow, fallible kind that machines cannot supply; on this point, the new paper offers no way forward.

* A coverage metric in this context is the percentage of a dataset that a language model can reproduce when asked the right kind of question. If the model is prompted with a movie ID and responds with the correct title and genre, that counts as a successful recall. The total number of successful recalls is divided by the total number of entries in the dataset to produce a coverage score: for example, if the model correctly returns the information for 800 out of 1,000 items, its coverage is 80%.

First released on Friday, May 16th, 2025
