Human memory is a living, evolving thing, always soaking up new information and testing it against reality. AI agents, should they wish to mimic human learning, will need to do the same.

Today, LLMs are largely frozen once they are trained. Updating them with new information via backpropagation is expensive, and anything an agent learns along the way is lost when the session ends.

A group of 17 researchers from multiple universities has created an agent framework, called Memento-Skills, whose agents learn new skills through experience as they run.

Such an agent-designing agent “autonomously constructs, adapts, and improves task-specific agents through experience,” according to the paper, which the researchers plan to present at the International Conference on Machine Learning (ICML) 2026 in July.

These agents work on a “reflective” read-write loop. When a new prompt arrives, a skill router picks the best agent for the job. The agent evaluates its own success at completing the task and, if it finds the result lacking, attempts to find new ways to solve the problem. If a derived skill is successful, it is run through a unit test and added to the skill library.

If an appropriate agent is not found, the router can create a new agent for the task.
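The routing step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Memento-Skills implementation: the `Skill` class, the `route` function, the 0.2 threshold and the word-overlap score (a crude stand-in for real semantic matching) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    utility: float = 0.0  # hypothetical credit score


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a crude stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def route(prompt: str, library: list, threshold: float = 0.2) -> Skill:
    """Return the best-matching skill, or create a new one if nothing fits."""
    best = max(library, key=lambda s: overlap(prompt, s.description), default=None)
    if best is not None and overlap(prompt, best.description) >= threshold:
        return best
    # No skill is close enough: register a new one for this kind of task.
    new_skill = Skill(name=f"skill_{len(library) + 1}", description=prompt)
    library.append(new_skill)
    return new_skill


library = [
    Skill("web_lookup", "search the web for a fact"),
    Skill("file_summary", "summarize a text file"),
]
print(route("search the web for the capital of France", library).name)
print(route("convert celsius to fahrenheit", library).name)
```

The first prompt shares enough words with an existing skill to be routed to it. The second matches nothing, so the sketch adds a new entry to the library.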

“Through iterative skill generation and refinement, the system progressively improves its own capabilities,” the researchers write.

This approach produced a 26.2% relative improvement on the General AI Assistants (GAIA) benchmark and a 116.2% gain on Humanity’s Last Exam (HLE).

Getting Agents to Think

The current state of the art at the frontier AI labs is to make agents as reliable as possible. Anthropic just launched managed agents, an offering that doesn’t make agents smarter but rather strengthens the underlying infrastructure, so that if a containerized agent fails it can be replaced with another.

The researchers behind Memento-Skills give each agent a cache, a space in memory, in which to write “skill memories.” This is not just a log, but an active skill library. If a task succeeds, the utility score, or credit score, of the skill that handled it is increased.

You might think that such a skill library would be a monolith of if…else statements. But the skills are declarative, not imperative. A skill doesn’t tell the LLM exactly how to do the task; it just instructs the LLM to find the best way to do it, via a semantic search through the given toolset.

Should a tool fail at its job, the orchestrator’s failure-attribution selector reduces the skill’s credit score and examines the execution trace to identify the issue behind the failure. Did the problem come from a tool, or from the planning or reasoning stages?
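A rough sketch of that bookkeeping follows. The `Step` record, the stage names, the fixed step size and the "first failed step is the culprit" rule are assumptions made for illustration. The paper's actual attribution logic is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    stage: str  # "planning", "reasoning" or "tool" (hypothetical labels)
    ok: bool
    note: str = ""


def attribute_failure(trace: list) -> Optional[str]:
    """Return the stage of the first failed step, or None if all passed."""
    for step in trace:
        if not step.ok:
            return step.stage
    return None


def update_credit(score: float, succeeded: bool, step_size: float = 1.0) -> float:
    """Raise the score on success, lower it on failure (illustrative rule)."""
    return score + step_size if succeeded else score - step_size


trace = [
    Step("planning", True),
    Step("tool", False, "web search returned no results"),
    Step("reasoning", True),
]
culprit = attribute_failure(trace)
score = update_credit(3.0, succeeded=culprit is None)
print(culprit, score)
```

Here the trace points at the tool stage, and the skill's score drops by one step.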

The orchestrator’s “skill rewriter” then adds constraints and alternative strategies to the skill. Any new skills produced by the orchestrator must go through a unit-test gate (using synthetic data) to ensure they work well with the other skills.

In effect, Memento-Skills prescribes an entire continuous integration (CI) loop for each skill. This approach is called the Automatic Unit-Test Gate.
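The idea of a test gate can be shown with a minimal sketch. For simplicity it models a skill as a plain Python callable, which is simpler than the declarative skills the researchers describe. The function names and the synthetic cases are hypothetical.

```python
from typing import Callable


def passes_gate(skill: Callable[[str], str], cases: list) -> bool:
    """True only if the skill returns the expected output for every case."""
    for given, expected in cases:
        try:
            if skill(given) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def admit(name: str, skill: Callable[[str], str], cases: list, library: dict) -> bool:
    """Add the skill to the library only if it clears the gate."""
    if passes_gate(skill, cases):
        library[name] = skill
        return True
    return False


library = {}
# Synthetic (input, expected output) pairs standing in for generated test data.
cases = [("hello", "HELLO"), ("Memento", "MEMENTO")]

print(admit("shout", str.upper, cases, library))
print(admit("broken_shout", str.title, cases, library))
print(sorted(library))
```

Only the candidate that passes every synthetic case is admitted. The failing one never reaches the library, which is the property a CI gate is meant to guarantee.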

Starting with 5 skills and some GAIA test cases, Memento-Agent was able to produce a library of 41 skills. With HLE, Memento-Agent was able to spin up 235 skills (a testament to HLE’s wider embedding space).

There are still many ways agents can get things wrong, as an unrelated paper from Microsoft on building verifiers summarized. Phantom criteria, cascading errors, and evidence hallucination can all corrupt a skill library. Malicious prompt injection could be another issue, though presumably the unit tests would catch it before it takes effect.

And what sorts of skills will these agents produce? What about problems that require radical approaches? Agent-generated content doesn’t take large intuitive leaps. Instead, the LLM refines what is already known to work.

“Humans are essential for discovering the core structural principles,” the Microsoft researchers wrote. “AI is better at the fine-grained tuning that extracts the remaining performance once those principles exist.”

The researchers admit that Memento-Skills can take additional processing time, at least initially, as building and verifying a new skill takes time on top of the task itself. Over the long term, they argue, Memento agents would be faster than standard agents, which pay a reasoning tax for every new task.

Putting Memento-Skills into Action

The authors’ implementation of Memento-Skills is available on GitHub.

In a local environment, running Memento-Skills requires a sandbox and a folder to store the results. By default, it uses SQLite to store the vectors for the skill library and, of course, you’ll need an API key for an LLM to provide the reasoning.

Out of the box, the agents have only the barest of tools: web search, terminal support and file I/O. From there you can ask questions from a command line, and the more tasks you assign, the larger the skill base grows.