Have we reached a tipping point where it is now more efficient to let agents, rather than humans, refine AI models? AI researcher Andrej Karpathy thinks so, based on the startling results of a recent experiment he ran.

He gave an autonomous AI autoresearch agent control of his LLM training harness, llm.c, to find heretofore undiscovered model optimizations. Autonomously, the agent ran experiments, then used the results to plan the next set of experiments.

Karpathy was surprised that the script worked so well “on top of what I thought was already a fairly manually well-tuned project.”

The agent was given a metric and a training script. It could make changes to the model’s training code (“train.py”), and then train the model on data for no more than five minutes. If the resulting metric improved from the change, the agent committed that code, then used it as the new baseline to repeat the process.
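The loop described above amounts to a greedy accept-if-better search. A minimal sketch of that idea follows; it is not Karpathy's code. The `propose` and `evaluate` callables are hypothetical stand-ins for the agent editing train.py and for a time-boxed training run, and the toy loss function and settings are invented purely for illustration.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    baseline: dict,
    propose: Callable[[dict, random.Random], dict],
    evaluate: Callable[[dict], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[dict, float, int]:
    """Greedy accept-if-better loop; a lower metric is better (e.g. validation loss)."""
    rng = random.Random(seed)
    best, best_loss = dict(baseline), evaluate(baseline)
    accepted = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)  # stand-in for the agent changing the training code
        loss = evaluate(candidate)      # stand-in for a short, time-boxed training run
        if loss < best_loss:            # keep ("commit") the change only if the metric improved
            best, best_loss = candidate, loss
            accepted += 1
    return best, best_loss, accepted


# Toy stand-ins (assumptions): a quadratic "loss" over two made-up settings.
def toy_evaluate(cfg: dict) -> float:
    return (cfg["lr"] - 0.3) ** 2 + (cfg["wd"] - 0.1) ** 2


def toy_propose(cfg: dict, rng: random.Random) -> dict:
    new = dict(cfg)                       # never mutate the current baseline
    key = rng.choice(sorted(new))         # pick one setting to nudge
    new[key] += rng.uniform(-0.05, 0.05)
    return new


start = {"lr": 1.0, "wd": 1.0}
best, best_loss, accepted = hill_climb(start, toy_propose, toy_evaluate, 200)
print(f"accepted {accepted} of 200 changes, loss {toy_evaluate(start):.3f} -> {best_loss:.3f}")
```

Because each accepted change becomes the new baseline, a rejected experiment costs only its run time, which is presumably why a short per-run cap matters.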

In this automated approach, the agent found 20 ways to decrease the errors (“validation loss”) on a depth-12 model.

He then applied the results to a larger, depth-24 model, one competing in the “Time to GPT-2” leaderboard (a competition to match the performance specs of 2019’s GPT-2 model). The new changes resulted in an ~11% boost in training efficiency.

“You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales,” he wrote.
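One way to picture the promotion step is as a filter applied at each model size. The sketch below assumes a hypothetical `score` function that returns a validation loss for an idea at a given depth; the idea names, the loss numbers, and the 50% keep rate are all made up for illustration and do not come from Karpathy's project.

```python
from typing import Callable, Dict, List, Sequence


def promote(
    ideas: Sequence[str],
    depths: Sequence[int],
    score: Callable[[str, int], float],
    keep_fraction: float = 0.5,
) -> Dict[int, List[str]]:
    """Score ideas at each depth in turn, passing only the best fraction to the next depth."""
    survivors = list(ideas)
    history: Dict[int, List[str]] = {}
    for depth in depths:
        # Lower score is better, as with validation loss.
        ranked = sorted(survivors, key=lambda idea: score(idea, depth))
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
        history[depth] = list(survivors)
    return history


# Toy stand-in (assumption): each idea has a fixed gain that shrinks at larger depth.
toy_gain = {"idea_a": 0.05, "idea_b": 0.01, "idea_c": 0.03, "idea_d": 0.00}


def toy_score(idea: str, depth: int) -> float:
    return 3.0 - toy_gain[idea] * (12 / depth)


history = promote(list(toy_gain), depths=[12, 24], score=toy_score)
print(history)
```

The expensive large-model runs are spent only on ideas that already survived the cheaper small-model runs.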

Autotuning Models

Karpathy is one of the most influential AI thinkers these days. He was one of the founding members of OpenAI. He was also senior director of AI for Tesla from 2017 to 2022, directing the company toward a data-first strategy. He now runs Eureka Labs, an AI company he founded.

Karpathy has fine-tuned models by hand for well over a decade. 

It’s grueling work. “You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc.,” he explained.

The way the script automated this process was “wild,” he said. In its two-day run, it conducted over 700 experiments. The code “is now a self-modifying binary that has grown beyond human comprehension,” he wrote on the project’s GitHub page.

The agent was not doing groundbreaking research, but like a supercharged autocorrect, it found subtle errors even the diligent Karpathy had missed. It revealed that Karpathy messed up AdamW betas, failed to tune the network weight decay schedule, and didn’t apply regularization to value embeddings, among other gaffes. 


The next step, for Karpathy, is to run multiple agents at once, a current area of research for him. 

This, he feels, is where all the AI frontier labs are heading. Instead of one file to modify, they will need to optimize thousands. “It’s the final boss battle,” he predicted, surmising that humans no longer need to be in the loop when fine-tuning AI models. The machines can take it from here.

“One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of ‘group meeting,’” Karpathy wrote on the project page. “That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies.”

The job of humans? Contribute at the edges.

Autoresearch That Optimizes Weights

Karpathy’s experiment caught the eye of Shopify CEO Tobi Lutke, who tested it to refine an internal model. The result was a model with 0.8 billion parameters that outperformed one twice its size. “OK this thing is totally insane,” he enthused on X.

What is significant about Karpathy’s approach is that it “modifies training code, architectures, and hyperparameters to produce a better model,” wrote Google DeepMind developer relations staff engineer Philipp Schmid in a blog post. Previous approaches were limited to “frozen models and APIs where you cannot change weights.”

Nonetheless, commenters in the r/singularity group on Reddit were a bit more cautious in their evaluations. One reader noted that the agent may start optimizing for the leaderboard, which would skew results.

Another observer pointed out that there is no evidence that this approach would lead to qualitatively novel architectures, and that it may be limited to “just tweaking knobs in a predefined search space.”

Karpathy believes the implications extend beyond model building. Any metric that can be evaluated and improved could benefit from this approach. “The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement,” he wrote on X.