A recent paper from LG AI Research suggests that the "open" datasets used to train AI models may offer a false sense of security, finding that nearly four out of five AI datasets labeled "commercially usable" in fact contain hidden legal risks.
Such risks range from the inclusion of undisclosed copyrighted material to restrictive licensing terms buried deep in a dataset's dependencies. If the paper's findings are accurate, companies relying on public datasets may need to rethink their current AI pipelines, or risk legal exposure downstream.
The researchers propose a radical and potentially controversial solution: AI-based compliance agents capable of scanning and auditing dataset histories at a speed and scale beyond human reviewers.
The paper states:
‘This paper argues that the legal risk of AI training datasets cannot be determined solely by reviewing surface-level license terms; a thorough, end-to-end analysis of dataset redistribution is essential for ensuring compliance.
“Since such analysis exceeds human capabilities due to its complexity and scale, AI agents can bridge this gap by conducting it with greater speed and accuracy. Without automation, critical legal risks remain largely unexamined, jeopardizing ethical AI development and regulatory compliance.
“We urge the AI research community to recognize end-to-end legal analysis as a fundamental requirement and to adopt AI-driven approaches as a viable path toward scalable dataset compliance.”
The researchers' automated system examined 2,852 popular datasets that appeared commercially usable under their individual licenses, and found that only 605 of them (about 21%) were actually legally safe for commercialization once all their components and dependencies were traced.
The new paper is titled Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing, and comes from eight researchers at LG AI Research.
Rights and Wrongs
The authors highlight the challenges faced by companies advancing AI development amid an increasingly uncertain legal landscape, as the formerly academic "fair use" mindset around dataset training gives way to a fractured environment where legal protections are unclear and safe harbor is no longer guaranteed.
As one recent publication pointed out, companies are becoming increasingly defensive about the sources of their training data. As author Adam Buick comments*:
‘[Whereas] OpenAI disclosed the main sources of data for GPT-3, the paper introducing GPT-4 revealed only that the data on which the model had been trained was a mixture of "publicly available data (such as internet data) and data licensed from third-party providers".
“The motivations behind this move away from transparency have not been articulated in any detail by AI developers.
“OpenAI justified its decision not to release further details about GPT-4 on the basis of concerns regarding ‘the competitive landscape and the safety implications of large-scale models’.”
Transparency can be a disingenuous term, or simply a mistaken one. For instance, Adobe's flagship Firefly generative model was trained on stock data that Adobe had the rights to use, ostensibly reassuring customers about the legality of using the system. Subsequently, some evidence emerged that the Firefly dataset had been "enriched" with potentially copyrighted data from other platforms.
As covered earlier this week, there are growing initiatives designed to ensure license compliance for datasets, including datasets composed solely of YouTube videos carrying flexible Creative Commons licenses.
The problem is that, as the new research shows, the licenses themselves may be erroneous, or granted in error.
Finding open source datasets
Developing an evaluation system such as the authors' NEXUS is difficult when the legal context is constantly shifting; therefore, the paper states, the NEXUS data compliance framework is based on "various precedents and legal grounds at this point in time".
NEXUS uses an AI-driven agent called AutoCompliance for automated data compliance. AutoCompliance consists of three key modules: a navigation module for web exploration; a question-answering (QA) module for information extraction; and a scoring module for legal risk assessment.
AutoCompliance begins with a user-supplied web page, from which it extracts key details, searches for related resources, identifies license terms and dependencies, and assigns legal risk scores. Source: https://arxiv.org/pdf/2503.02784
These modules feature fine-tuned AI models, including the EXAONE-3.5-32B-Instruct model, trained on synthetic and human-labeled data. AutoCompliance also uses a database to cache results, improving efficiency.
AutoCompliance starts with a user-supplied dataset URL, treating it as the root entity; it then searches for that entity's license terms and dependencies and recursively traces linked datasets to build a license dependency graph. Once all connections are mapped, it calculates a compliance score and assigns a risk classification.
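The paper does not publish an implementation, but the recursive tracing it describes can be sketched roughly as follows. This is a minimal illustration only: the Entity class and the helpers find_license_terms and find_dependencies are hypothetical stand-ins for the QA and navigation modules, not the authors' actual code.

```python
# Minimal sketch of recursive license-dependency tracing, loosely modeled on the
# pipeline described in the paper. The helpers below are hypothetical placeholders
# for the QA and navigation modules.
from dataclasses import dataclass, field


@dataclass
class Entity:
    url: str                                          # dataset / model / software page
    license_terms: dict = field(default_factory=dict)
    dependencies: list = field(default_factory=list)  # child Entity objects


def find_license_terms(url: str) -> dict:
    """Placeholder for the QA module: extract license terms from the page."""
    return {}


def find_dependencies(url: str) -> list:
    """Placeholder for the navigation module: discover linked datasets/models."""
    return []


def trace_entity(url: str, cache: dict) -> Entity:
    """Recursively build a license-dependency graph rooted at the given URL."""
    if url in cache:                    # cached results avoid re-crawling shared dependencies
        return cache[url]

    entity = Entity(url=url)
    cache[url] = entity
    entity.license_terms = find_license_terms(url)
    for dep_url in find_dependencies(url):
        entity.dependencies.append(trace_entity(dep_url, cache))
    return entity
```

Starting from the root URL, the result is a graph in which every node carries its own license terms, which a scoring stage can then traverse.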
The data compliance framework outlined in the new work identifies a range of entity types† involved in the data lifecycle: datasets, which form the core input for AI training; data processing software and AI models, which are used for data transformation and utilization; and platform service providers, which facilitate data handling.
By taking these varied entities and their interdependencies into account, the system assesses legal risk holistically, covering the broader ecosystem of components involved in AI development rather than evaluating a dataset's license in isolation.
Data compliance assessment evaluates legal risk across the complete data lifecycle, assigning scores based on dataset details against 14 criteria, classifying individual entities, and aggregating risk across dependencies.
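Continuing the sketch above (and re-using its Entity class), one simple way such aggregation could work is shown below. The paper defines the actual 14 criteria and scoring rules, so the per-entity scoring here is purely a placeholder, and the "worst score in the subtree propagates upward" rule is an assumption made for illustration.

```python
# Illustrative only: per-entity scoring against the paper's 14 criteria is replaced
# by a placeholder, and risk is assumed to propagate upward by taking the worst
# (lowest) score found anywhere in the dependency subtree.
def score_from_criteria(license_terms: dict) -> float:
    """Placeholder for per-entity scoring, e.g. 1.0 = safe, 0.0 = high risk."""
    return 1.0 if license_terms.get("commercial_use_allowed") else 0.0


def aggregate_score(entity: Entity) -> float:
    """An entity is only as safe as its riskiest dependency (assumed rule)."""
    own = score_from_criteria(entity.license_terms)
    return min([own] + [aggregate_score(dep) for dep in entity.dependencies])
```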
Training and Metrics
To build the test set, the authors extracted the URLs of the top 1,000 most-downloaded datasets on Hugging Face and randomly subsampled 216 of them.
The EXAONE model was fine-tuned on the authors' custom dataset, with the navigation and question-answering modules trained on synthetic data and the scoring module trained on human-labeled data.
Ground-truth labels were created by five legal experts, each trained for at least 31 hours on similar tasks. These human experts manually identified dependencies and license terms for the 216 test cases and refined their findings through discussion.
The trained, human-calibrated AutoCompliance system was then tested against ChatGPT-4o and Perplexity Pro, and proved notably better at discovering dependencies and license terms.
Accuracy in identifying dependencies and license terms for the 216 evaluation datasets.
The paper states:
‘AutoCompliance significantly outperforms all other agents and human experts, achieving an accuracy of 81.04% and 95.83% in each task. In contrast, both ChatGPT-4o and Perplexity Pro show relatively low accuracy for the Source and License tasks, respectively.
“These results highlight the superior performance of AutoCompliance, demonstrating its effectiveness in handling both tasks with remarkable accuracy, while also indicating a substantial performance gap between AI-based models and human experts in these domains.”
In terms of efficiency, the AutoCompliance approach took just 53.1 seconds to run, in contrast to 2,418 seconds for an equivalent human evaluation of the same tasks.
Furthermore, the evaluation run cost $0.29 USD, compared to $207 USD for the human experts. It should be noted, however, that this figure is based on renting a GCP a2-megagpu-16gpu node monthly at a rate of $14,225 per month, meaning that this kind of cost-efficiency is related chiefly to large-scale operations.
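The $0.29 figure is consistent with simply prorating the monthly rental over the 53.1-second runtime. A rough back-of-the-envelope check (assuming a 30-day month, which is my assumption rather than the paper's stated method) is shown below.

```python
# Back-of-the-envelope check of the per-run cost, assuming a 30-day month.
monthly_rate_usd = 14_225            # GCP a2-megagpu-16gpu monthly rental
seconds_per_month = 30 * 24 * 3600   # 2,592,000 seconds
run_seconds = 53.1                   # reported AutoCompliance runtime

cost_per_run = monthly_rate_usd / seconds_per_month * run_seconds
print(f"${cost_per_run:.2f}")        # prints $0.29
```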
Dataset Survey
For the analysis, the researchers selected 3,612 datasets, combining the 3,000 most-downloaded datasets from Hugging Face with 612 datasets from the 2023 Data Provenance Initiative.
The paper states:
‘Starting from the 3,612 target datasets, we identified a total of 17,429 unique entities, of which 13,817 entities appeared as the target entities' direct or indirect dependencies.
“For our empirical analysis, we consider an entity and its license dependency graph to have a single-layered structure if the entity has no dependencies, and a multi-layered structure if it has one or more dependencies.
“Of the 3,612 target datasets, 2,086 (57.8%) had multi-layered structures, while the remaining 1,526 (42.2%) had single-layered structures with no dependencies.”
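In terms of the earlier sketch, this distinction is simply a property of the traced graph; the following helper (illustrative only, using the hypothetical Entity class from above) captures the quoted definition.

```python
# Continuing the earlier sketch: classify a traced entity's dependency graph as
# single-layered (no dependencies) or multi-layered (one or more dependencies).
def structure_type(entity: Entity) -> str:
    return "multi-layered" if entity.dependencies else "single-layered"
```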
Copyright-protected datasets may only be redistributed with legal authority, which may derive from a license, an exception under copyright law, or contractual terms. Improper redistribution can lead to legal consequences, including copyright infringement and breach of contract, so clearly identifying non-compliant redistribution is essential.
Redistribution violations found under the paper's criterion 4.4 for data compliance.
In the study, 9,905 cases of non-compliant dataset redistribution fell into two categories: 83.5% were explicitly prohibited under the licensing terms, making the redistribution a clear legal violation; the remaining 16.5% involved datasets with conflicting license conditions, where redistribution was theoretically permitted but the required conditions were not met, creating downstream legal risk.
The authors acknowledge that the risk criteria proposed in NEXUS are not universal and may vary by jurisdiction and AI application, and that future improvements should focus on adapting to changing global regulations while refining AI-driven legal review.
Conclusion
This is a dense, almost prolix paper, but it addresses perhaps the biggest retarding factor in AI's current industry adoption.
Under the DMCA, violations can incur heavy fines on a per-case basis. Where violations can run into the millions, as the researchers found here, the potential legal liability is truly significant.
Furthermore, companies that can be shown to have benefited from upstream data cannot, as is customary, claim ignorance as an excuse, at least in the influential US market. Nor do they currently have any realistic tools with which to penetrate the labyrinthine implications buried in ostensibly open-source dataset licensing agreements.
The problem in developing a system such as NEXUS is that it would be challenging enough to calibrate it on a state-by-state basis within the US, or on a nation-by-nation basis within the EU; the prospect of creating a truly global framework (a kind of "Interpol for dataset provenance") is undermined not only by the conflicting motivations of the diverse governments involved, but by the fact that both these governments and the current state of their laws in this regard are constantly changing.
* My substitution of hyperlinks for the author's citations.
† Six types are prescribed in the paper, but the last two are not defined.
First published on Friday, March 7, 2025