Among the jobs AI may eliminate in the near future is data science itself, if research from Meta engineers bears fruit.
They’ve created a new method, called Autodata, that builds high-quality training and evaluation data through iteration and a two-loop process that evaluates and modifies the very methods doing the work.
An initial test with an agent showed better reasoning compared to today’s synthetic dataset creation methods. Meta-optimization boosted these results further.
The end result: higher-quality model training by front-loading “inference-time compute,” essentially letting the agent burn more GPU cycles during data creation to produce a more potent training set.
“Overall, this direction has the potential to change how we build AI data,” the researchers enthuse, in a blog post describing their work.
Unfortunately, much like other attempts to automate away jobs, these innovations still require a human in the loop, if only to check the results.
Recursive Self-Improvement and Automated Data Synthesis
The workflow of these agents is pretty much based on what an actual human data scientist does today: Create a dataset from sources, inspect data for fidelity, measure performance, generate insights, repeat until the results are deemed satisfactory.
These days, optimizing LLM performance is largely done by the models themselves, by creating a set of synthetic data that can be tested. Synthetic data is especially useful for identifying edge cases and long-tail scenarios. But it still requires humans to tag the results.
Meta first created the Self-Instruct agent to create synthetic data with “zero or few-shot prompting,” with no human intervention required.
“An agent acting as a data scientist is tasked with the act of constructing and curating data, performing the actions a human data scientist would in order to create high-quality data,” the researchers write.
Using an LLM, an agent calls a set of methods to ground itself on documents and other sources to reduce hallucination and increase diversity. A chain-of-thought reasoning method constructs complex scenarios that can be tested to ensure the agent has contextualized the data correctly. Data quality is maintained through the usual filtering, evolution and refinement methods.
Autodata’s secret sauce for improving the results is a second agentic loop, an outer loop that inspects the work of the first, “inner” loop to identify ways to improve the workflow.
Two Solvers at Battle Inside the Agent
LLM agents inspecting their own results is nothing new. It resembles what happens each time a data scientist asks an agent to generate questions about a body of material in order to test the agent’s knowledge of it, with a grading key (a “rubric” in the educational field) to judge the answers. Autodata automates this process into a self-running loop.
The main agent tests its results through four sub-agents. A challenger agent, which can use a separate LLM from the main orchestrating agent, creates training data using a detailed prompt from the main agent. The challenger feeds that data to two solvers, one weak and one strong.
The weak solver is based on a slim model that is not expected to answer the question. The strong solver represents the state of the art of current AI. A verifier method judges the results.
If the weak solver can solve the test, it means that there is nothing new for the target model to learn. The sweet spot is for the training data to stump the weak solver but be solved by the strong solver.
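The acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's code: the names `SolverResult` and `is_useful_example` are hypothetical, and the verifier's verdicts are assumed to arrive as plain booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolverResult:
    """Verifier verdicts for one candidate example (hypothetical structure)."""
    weak_solved: bool    # did the slim model answer correctly?
    strong_solved: bool  # did the state-of-the-art model answer correctly?


def is_useful_example(result: SolverResult) -> bool:
    """Keep an example only if it stumps the weak solver but not the strong one."""
    return (not result.weak_solved) and result.strong_solved


candidates = {
    "too_easy": SolverResult(weak_solved=True, strong_solved=True),
    "sweet_spot": SolverResult(weak_solved=False, strong_solved=True),
    "unsolved": SolverResult(weak_solved=False, strong_solved=False),
}

kept = [name for name, result in candidates.items() if is_useful_example(result)]
print(kept)  # ['sweet_spot']
```

Only the example that separates the two solvers survives; the other two tell the target model nothing it can learn from.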
If the example hits that sweet spot, the entire chain of thought is captured (in JSON) as a success, which can then be used as a data point to train the target model via reinforcement learning. The outer loop also identifies failures in the harness methods that can be corrected or improved.
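To make the JSON capture concrete, here is a small sketch of what one stored record might look like. The blog post does not publish a schema, so every field name here (`source_id`, `question`, `chain_of_thought`, `answer`, `status`) is an assumption, and the values are placeholders.

```python
import json


def make_training_record(source_id, question, chain_of_thought, answer):
    """Bundle one accepted example into a JSON-serializable dict.

    All field names are hypothetical; the real Autodata schema is not public.
    """
    return {
        "source_id": source_id,
        "question": question,
        "chain_of_thought": chain_of_thought,
        "answer": answer,
        "status": "success",
    }


record = make_training_record(
    source_id="paper-0001",
    question="placeholder question",
    chain_of_thought=["placeholder step 1", "placeholder step 2"],
    answer="placeholder answer",
)

line = json.dumps(record)    # one JSON object per line (JSON Lines style)
restored = json.loads(line)  # round-trips back to the same dict
```

Writing one object per line keeps the dataset appendable, which suits a loop that emits successes one at a time.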
This is how the overall system improves itself, according to the researchers. In effect, the process is one of evolutionary programming, where the results improve the code itself.
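The researchers' outer loop is not described in code, but the general idea of keeping only changes that improve a score can be sketched with a greedy accept-if-better loop. This is a deliberately simple stand-in, not Autodata's algorithm: the `passes` setting, the toy `evaluate` function and the mutation in `propose` are all invented for illustration.

```python
import random


def evaluate(harness: dict) -> float:
    """Toy stand-in for a validation pass rate; peaks when passes == 4."""
    return 1.0 - abs(harness["passes"] - 4) / 10


def propose(harness: dict, rng: random.Random) -> dict:
    """Return a mutated copy of the harness settings (hypothetical mutation)."""
    variant = dict(harness)
    variant["passes"] = max(1, harness["passes"] + rng.choice([-1, 1]))
    return variant


def outer_loop(harness: dict, iterations: int, seed: int = 0):
    """Try one change per iteration and keep it only if the score improves."""
    rng = random.Random(seed)
    best_score = evaluate(harness)
    accepted = 0
    for _ in range(iterations):
        candidate = propose(harness, rng)
        score = evaluate(candidate)
        if score > best_score:
            harness, best_score = candidate, score
            accepted += 1
    return harness, best_score, accepted


final_harness, final_score, accepted = outer_loop({"passes": 1}, iterations=50)
print(final_harness, final_score, accepted)
```

In the real system the "mutation" would be an agent editing prompts or harness code, and the score would come from a held-out validation set rather than a formula.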
Agent Heal Thyself
In a trial run, Meta processed over 10,000 papers to generate 2,100 successful results. The validation pass rate jumped from 12.8% to 42.4% using 126 accepted iterations out of 233 total.
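The figures reported above imply a few ratios worth spelling out. The short calculation below uses only the numbers quoted in this article; because the paper count is given as "over 10,000", the per-paper yield is an upper bound.

```python
papers_processed = 10_000    # "over 10,000", so treat as a lower bound
successful_results = 2_100
accepted_iterations = 126
total_iterations = 233
pass_rate_before = 12.8      # percent
pass_rate_after = 42.4       # percent

yield_per_paper = successful_results / papers_processed    # at most 0.21
acceptance_rate = accepted_iterations / total_iterations   # about 0.54
gain_points = pass_rate_after - pass_rate_before           # about 29.6 points
relative_gain = pass_rate_after / pass_rate_before         # about 3.3x

print(f"Yield per paper: at most {yield_per_paper:.0%}")
print(f"Iterations accepted: {acceptance_rate:.1%}")
print(f"Pass rate gain: {gain_points:.1f} points ({relative_gain:.1f}x)")
```

In other words, roughly one paper in five produced a usable example, and a little over half of the attempted iterations were kept.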
Not that the agents haven’t tried to hack these tests. Some modified the weak solver’s prompt to make it even weaker, a problem “which we have partially addressed, but have plans to investigate stronger safeguards,” the researchers wrote. Testing needs to be improved, they noted, so that it better validates truly generalizable reasoning, not just specific experimental numbers from a paper.
There is also a problem with power usage, a looming issue for data center-hungry AI providers. The researchers don’t address this inference tax in any significant detail in their blog post, though hopefully they will in the soon-to-be-posted arXiv paper.
A typical agent may generate the requested work in a single pass. In contrast, Autodata inspects each sample paper multiple times, requiring more GPU computation than a single run. Generating 2,100 insights by reading 10,000 papers three to five times each adds up to a lot of GPU cycles. Presumably, the better training set would save energy down the road in the training phase.
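A rough count shows the scale of that extra work. The sketch below counts paper reads only, using the figures quoted above; it says nothing about GPU-hours or energy, which the researchers have not published.

```python
papers = 10_000                  # "over 10,000" papers in the trial
accepted_insights = 2_100        # successful results reported
passes_low, passes_high = 3, 5   # "three to five times each"

reads_low = papers * passes_low      # 30,000 paper reads
reads_high = papers * passes_high    # 50,000 paper reads

reads_per_insight_low = reads_low / accepted_insights     # about 14.3
reads_per_insight_high = reads_high / accepted_insights   # about 23.8

print(f"{reads_low:,} to {reads_high:,} paper reads in total")
print(f"{reads_per_insight_low:.1f} to {reads_per_insight_high:.1f} reads per accepted insight")
```

By this count, each accepted insight costs somewhere between 14 and 24 paper reads, against one read per output for a single-pass agent.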
A Test-Driven Development Future for AI
These Meta researchers are not alone in trying to let agents improve themselves. Agent0 is a research effort to let agents self-evolve without human-labeled data. Researchers from the University of North Carolina at Chapel Hill, Salesforce and Stanford University are behind this effort.
In web page analysis, an effort led by the University of Pennsylvania looked at how agents could generate new capabilities through synthetic training. And last month, Meta itself published work on a separate effort, Hyperagents, which details a way for agents to repeatedly generate and then evaluate self-modified variants.
Meta is encouraging others to try this set of methods with other tasks and models. Ultimately, Autodata can be used with any kind of data (math, code, general instruction).
As for the fate of data scientists, their jobs are safe…for now.
The researchers admit that “removing humans completely from the loop is unlikely to be desirable in current full model training pipelines.” So there will be work, for the time being, checking the work of the AI.

