The goalposts for large language models (LLMs) are shifting, as users are running more extended agentic tasks these days.

China’s top AI frontier lab Z.AI (formerly Zhipu AI) has launched the latest version of its open-source LLM, called the General Language Model (GLM), with this user base in mind.

According to the company, GLM-5.1 has been optimized for “long-horizon tasks,” or looped work that can run up to eight hours for a single project, from early-stage planning to production-grade results.

The shift in focus will enable the model to take on “higher-value tasks” such as system building and performance optimization, according to the company. Z.AI worked to improve general intelligence, real-world coding, and complex task execution of its model.

This shift in emphasis reflects the emergence of long-running agentic tasks, in contrast to the initial use of AI by people who just needed a single answer or a simple task completed.

“Previous models—including GLM-5—tend to exhaust their repertoire early: they apply familiar techniques for quick initial gains, then plateau. Giving them more time doesn’t help,” the Z.AI blog explained.

The model breaks complex problems into simpler tasks, evaluates the results, then finds and fixes bottlenecks. It is comfortable with ambiguous problems and stays on task even as it revises its strategy through repeated iterations and thousands of tool calls.

“The longer it runs, the better the result,” the blog boasted.

GLM-5.1 on the Bench

The company asserts that GLM-5.1 closely matches the capabilities of Anthropic’s Claude Opus 4.6.

On three benchmarks (SWE-Bench Pro, Terminal-Bench 2.0 and NL2Repo) measuring overall capability and coding performance, GLM-5.1 earned an aggregate score of 54.9, beating Gemini 3.1, Qwen 3.6-Plus and MiniMax M2.7. It fell behind GPT 5.4 (which scored 58) and Opus 4.6 (57.5).

The context window is limited to 200,000 tokens, with a maximum output of 128,000 tokens.

“This capability is not simply about having a longer context window,” the model overview stated. “It requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error, and enabling truly autonomous execution for complex engineering tasks.”

In particular, the model can easily slide into an “experiment–analyze–optimize” loop for long tasks. Rather than producing a single result, it can “run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement.”
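The loop described here can be illustrated with a toy optimizer. What follows is only a minimal sketch, not Z.AI's implementation: the `benchmark` function, the `batch_size` and `threads` settings, and the step sizes are all hypothetical stand-ins for whatever a real agent would measure and adjust.

```python
import random


def benchmark(params):
    """Hypothetical stand-in for a real benchmark run; higher is better."""
    # Toy objective that peaks at batch_size=64, threads=8.
    return 1000 - (params["batch_size"] - 64) ** 2 - 10 * (params["threads"] - 8) ** 2


def propose(params, rng):
    """Change one setting at random (the 'adjust strategies' step)."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] = max(1, candidate[key] + rng.choice([-4, -1, 1, 4]))
    return candidate


def optimize(start, iterations, seed=0):
    """Run the experiment-analyze-optimize loop for a fixed number of iterations."""
    rng = random.Random(seed)
    best, best_score = dict(start), benchmark(start)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best, rng)   # experiment: try a variation
        score = benchmark(candidate)     # analyze: measure it
        if score > best_score:           # optimize: keep only improvements
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


if __name__ == "__main__":
    best, score, history = optimize({"batch_size": 16, "threads": 1}, iterations=300)
    print(best, score)
```

Because the loop keeps a change only when the measured score improves, the recorded history never goes down. A real agent would replace the random proposals with reasoning about where the bottleneck is.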

Iteration is Key

The company claimed that GLM-5.1 built a Linux desktop system from scratch within eight hours. This task flummoxes most models, which may return only the barest skeleton of a distribution. GLM-5.1, however, was able to review its own output to look for missing features, broken interactions and places where the product could be improved. The result was a richer desktop, according to Z.AI.

Long-term optimization also helped GLM-5.1 do well on KernelBench, which evaluates how effectively a model can rewrite a sample of PyTorch code to run faster while maintaining identical outputs. On such a job, most models produce quick gains and then taper off with repeated iterations, but GLM-5.1 kept innovating longer than any other model except Opus 4.6.
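The idea of "faster code, same outputs" can be sketched in a few lines. This is not KernelBench's actual harness, which works on PyTorch and GPU kernels. It is a plain-Python illustration in which the `reference`, `candidate` and `evaluate` functions are hypothetical names. A rewrite gets credit for speed only after it matches the reference on every input.

```python
import time


def reference(values):
    """Straightforward baseline: sum of squares with an explicit loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def candidate(values):
    """Proposed rewrite that must return exactly the same result."""
    return sum(v * v for v in values)


def best_time(fn, arg, repeats):
    """Fastest of several timed runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate(ref_fn, cand_fn, inputs, repeats=5):
    """Return (correct, speedup); speedup is 0.0 when any output differs."""
    correct = all(ref_fn(x) == cand_fn(x) for x in inputs)
    if not correct:
        return False, 0.0
    ref_t = sum(best_time(ref_fn, x, repeats) for x in inputs)
    cand_t = sum(best_time(cand_fn, x, repeats) for x in inputs)
    return True, ref_t / max(cand_t, 1e-12)  # guard against a zero timing


if __name__ == "__main__":
    inputs = [list(range(n)) for n in (1_000, 10_000, 100_000)]
    print(evaluate(reference, candidate, inputs))
```

Checking correctness before timing matters because a fast rewrite that changes the output is worth nothing. The speedup ratio itself will vary from machine to machine.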

It iterated longer on the Vector DB Bench as well. This coding benchmark measures a model’s effectiveness at building a vector database, given a Rust skeleton and a set of APIs.

Typically, the test must be finished within 50 iterations.

Z.AI removed the 50-iteration limit to find out if the model would continue to find new ways to optimize. It continued to improve the database through more than 600 iterations and 6,000 tool calls. The results were six times better than the best result in a 50-turn session.

More iterations force the model to look for new approaches to solving a thorny problem.

With Vector DB, the model would periodically shift strategies when the improvements started leveling off. Around iteration 90, the model shifted from full-corpus scanning to inverted file (IVF) cluster probing, a more efficient search technique. At iteration 240, it introduced a two-stage pipeline.
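To show why that switch matters, here is a minimal sketch of the two search strategies. It is written in Python rather than the benchmark's Rust and is not the code GLM-5.1 produced. The function names, the random-sample centroids (a real index would typically train them with k-means) and the `nprobe` parameter are illustrative assumptions.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def full_scan(vectors, query):
    """Baseline: compare the query against every stored vector."""
    return min(range(len(vectors)), key=lambda i: dist2(vectors[i], query))


def build_ivf(vectors, n_clusters, seed=0):
    """Pick centroids from the data and group each vector under its nearest one."""
    rng = random.Random(seed)
    centroids = [vectors[i] for i in rng.sample(range(len(vectors)), n_clusters)]
    lists = [[] for _ in range(n_clusters)]
    for i, v in enumerate(vectors):
        nearest = min(range(n_clusters), key=lambda c: dist2(centroids[c], v))
        lists[nearest].append(i)
    return centroids, lists


def ivf_search(vectors, centroids, lists, query, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist2(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    best = min(candidates, key=lambda i: dist2(vectors[i], query))
    return best, len(candidates)


if __name__ == "__main__":
    rng = random.Random(1)
    vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
    query = [rng.random() for _ in range(8)]
    centroids, lists = build_ivf(vectors, n_clusters=20)
    print("full scan:", full_scan(vectors, query))
    print("IVF (index, vectors scanned):", ivf_search(vectors, centroids, lists, query, nprobe=3))
```

The trade-off is that probing fewer clusters scans fewer vectors but can miss the true nearest neighbor. Probing every cluster gives the same answer as the full scan.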

“The optimization trajectory shows a characteristic staircase pattern: periods of incremental tuning within a fixed strategy, punctuated by structural changes that shift the performance frontier,” the blog stated.

YouTube Reviewers Tackle GLM-5.1

Matthew Miller, founder and CEO of vibe-coding UI platform BridgeMind, tested GLM-5.1 on some front-end design work, such as designing a game, using his company’s benchmarks. He found the design work to be sleeker and more functional than the work GPT-5.4 produced.

To test the claims of GLM-5.1’s long-horizon capabilities, Povilas Korop, who heads the YouTube channel AI Coder Daily, set the LLM on a task to build a checklist application from a single prompt, using the Laravel PHP framework.

GLM-5.1 successfully completed the task, including testing, though it took about 20 minutes. Apparently, the LLM was not trained on the React Flux and Livewire front-end frameworks, which slowed its roll somewhat.

In contrast, Claude Opus 4.6 was able to complete the job within six minutes, and delivered a more polished product. Korop also fed the GLM code into Opus for review, which found no major errors but did find areas where GLM could have optimized better.

In Korop’s view, GLM-5.1 did deliver a “first draft” of the application, but was “slower” and had “some hiccups on the way.” Nonetheless, the test showed that the model “can run long tasks,” he said.

Korop ran the model through the OpenRouter model aggregator and gateway. The task cost him US$2.15, he estimated.

Z.AI also hosts its version of the model as a subscription service, starting at $27 per quarter. The company made the new model available on April 7.

BridgeMind’s Miller did notice a potential problem, however: the GLM-5.1 context window is limited to 200,000 tokens, which, according to Miller, is woefully insufficient for the long-running vibe-coding work that GLM-5.1 was built for.