The Breakthrough: Open Source > Claude on Coding

SWE-Bench Pro is not a trivia test. It presents models with real GitHub issues from popular open-source repositories and asks them to produce working patches. The benchmark requires reading codebases, understanding context across multiple files, writing correct code, and passing the project's existing test suite. Scoring 58.4% on SWE-Bench Pro means GLM-5.1 successfully resolves nearly three out of five real-world software engineering tasks — a result that would have seemed impossible for any open model six months ago.
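The pass criterion can be made concrete. In SWE-Bench-style harnesses, a patch counts as resolved only if the tests the issue was failing now pass and the previously passing tests did not regress; the sketch below assumes that convention (SWE-Bench Pro's exact harness may differ) and uses hypothetical task data.

```python
# Minimal sketch of a SWE-Bench-style scoring rule. Hypothetical data only;
# the real harness applies the patch in a container and runs the project's
# own test suite.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """A task is resolved only if every targeted failing test now passes
    and no previously passing test regressed."""
    return all(results.get(t, False) for t in fail_to_pass + pass_to_pass)

def resolve_rate(tasks: list[dict]) -> float:
    """Fraction of tasks whose patch fully resolves the issue."""
    resolved = sum(
        is_resolved(t["results"], t["fail_to_pass"], t["pass_to_pass"])
        for t in tasks
    )
    return resolved / len(tasks)

tasks = [
    # Patch fixes the issue and keeps the old test green: resolved.
    {"results": {"test_fix": True, "test_old": True},
     "fail_to_pass": ["test_fix"], "pass_to_pass": ["test_old"]},
    # Patch fixes the issue but breaks an existing test: not resolved.
    {"results": {"test_fix": True, "test_old": False},
     "fail_to_pass": ["test_fix"], "pass_to_pass": ["test_old"]},
]
print(resolve_rate(tasks))  # 0.5
```

A 58.4% score means this all-or-nothing criterion was met on nearly three of every five tasks; partial fixes that break existing tests earn nothing.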

The previous open-source leader on SWE-Bench Pro was DeepSeek V3.2 at 49.8%. GLM-5.1 leapfrogs it by 8.6 percentage points and, more importantly, edges past Claude Opus 4.6 (57.3%) and GPT-5 (54.1%). According to LLMCheck benchmarks, this is the first time an MIT-licensed model has topped a major agentic coding benchmark over every proprietary frontier model.

Why this matters: SWE-Bench Pro is the closest benchmark we have to measuring real-world coding agent performance. An open-source model leading it means the capability gap between open and closed models has effectively closed for agentic software engineering tasks.

Architecture Deep Dive

GLM-5.1 is built on a mixture-of-experts (MoE) architecture at a scale that dwarfs anything else in the open-source ecosystem.

Perhaps the most notable detail is the training infrastructure. GLM-5.1 was trained on 100,000 Huawei Ascend 910B chips with zero NVIDIA hardware involved. Z.ai developed custom training frameworks to work around the Ascend ecosystem's relative immaturity compared to CUDA. The fact that a model trained entirely outside the NVIDIA stack can top SWE-Bench Pro challenges the assumption that frontier AI requires NVIDIA GPUs.

GLM-5.1 succeeds the original GLM-5, which scored 38 on the LLMCheck leaderboard. The jump to 58 represents a 53% improvement in a single generation, driven primarily by the expanded expert count, longer context, and what Z.ai describes as improved agentic training pipelines with reinforcement learning from code execution feedback.
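Z.ai has not published details of that pipeline, but the general idea of reinforcement learning from code execution feedback is simple to illustrate: execute each candidate solution and reward it by the fraction of tests it passes. The toy sketch below is only that general idea, not Z.ai's method; the `solution` entry-point name and the test data are invented for the example.

```python
# Toy illustration of an execution-based reward signal (NOT Z.ai's actual
# training pipeline): run model-generated code, score it by tests passed.

def execution_reward(candidate_src: str,
                     tests: list[tuple[tuple, object]]) -> float:
    """Reward = fraction of (args, expected) checks the candidate passes."""
    ns: dict = {}
    try:
        exec(candidate_src, ns)   # compile and define the candidate
        fn = ns["solution"]       # assumed entry-point name
    except Exception:
        return 0.0                # code that doesn't even run earns nothing
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass                  # runtime errors count as failed tests
    return passed / len(tests)

good = "def solution(a, b):\n    return a + b\n"
buggy = "def solution(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((5, 5), 10), ((0, 0), 0)]
# good scores 1.0; buggy passes only the (0, 0) case
print(execution_reward(good, tests), execution_reward(buggy, tests))
```

In a real pipeline the execution would be sandboxed and the reward fed into a policy-gradient update; the point here is only that the reward comes from running the code, not from grading its text.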

Benchmark Results

According to LLMCheck benchmarks, here is how GLM-5.1 stacks up against the current frontier models on agentic and coding-specific evaluations:

| Benchmark | GLM-5.1 | Claude Opus 4.6 | GPT-5 | DeepSeek V3.2 |
|---|---|---|---|---|
| SWE-Bench Pro | 58.4% | 57.3% | 54.1% | 49.8% |
| NL2Repo | 42.7% | 39.1% | 37.5% | 34.2% |
| Terminal-Bench 2.0 | 63.5% | 61.8% | 58.3% | 52.1% |
| CyberGym | 68.7% | 65.2% | 62.9% | 55.4% |
| LLMCheck Score | 58 | N/A (proprietary) | N/A (proprietary) | 72 |
| License | MIT | Proprietary | Proprietary | MIT |
| Runs locally on a Mac? | No (~390 GB) | No (API only) | No (API only) | No (~380 GB) |

Key takeaway: GLM-5.1 leads every agentic coding benchmark in this comparison. However, none of these frontier-scale models can run locally on a Mac. For local inference, smaller open models like Gemma 4 31B and Qwen 3.5 35B remain the practical choices.

What This Means for Mac Users

Let's be direct: you cannot run GLM-5.1 on any Mac that exists today. At INT4 quantization, the model requires approximately 390 GB of VRAM; even an M4 Ultra with 192 GB of unified memory falls nearly 200 GB short. This is a server-class model.
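That ~390 GB figure is easy to sanity-check with a back-of-the-envelope formula: weight memory ≈ parameter count × bits per weight ÷ 8, before any KV cache or activation overhead. The parameter count below is a hypothetical round number chosen only to reproduce the article's estimate, since the exact figure isn't given here.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint in decimal GB. Ignores KV cache,
    activations, and quantization scales/zero-points, which add more."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical total parameter count (~780B) picked to match the ~390 GB
# INT4 figure quoted above; the real number may differ.
print(round(weight_memory_gb(780e9, 4)))   # ~390 GB at INT4
print(round(weight_memory_gb(780e9, 16)))  # ~1560 GB at FP16
```

The same formula explains why 30B-class models fit on a Mac: at 4 bits, 31B parameters need only about 16 GB of weights.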

That said, GLM-5.1 still matters to the Mac LLM community, even if only as a signal of where open-source capability is headed.

How to Use GLM-5.1 Today

With local inference off the table, using GLM-5.1 today means either calling a hosted API or self-hosting the MIT-licensed weights on server-class hardware.

The Verdict

GLM-5.1 is a landmark release. It proves that open-source models can match and exceed the best proprietary systems on the most demanding agentic coding benchmarks. The MIT license makes it the most permissively licensed frontier-class model ever released.

What GLM-5.1 gets right

- Top SWE-Bench Pro score ever (58.4%)
- MIT license with no restrictions
- Dominant across agentic coding benchmarks
- 200K context window with 131K max output
- Proves frontier AI can be built outside the NVIDIA ecosystem

Where it falls short

- Cannot run locally on any Mac or consumer hardware; requires ~390 GB of VRAM at INT4
- LLMCheck Score of only 58, due to zero speed and accessibility points
- General-purpose chat quality lags behind Claude and GPT-5 on non-coding tasks

According to LLMCheck analysis, GLM-5.1 earns a score of 58: strong capability points driven by its benchmark dominance and a full 10 for its MIT license, but zero points for speed and accessibility because no Mac can run it. For developers who need the absolute best agentic coding model and are willing to use an API, GLM-5.1 is now the top open-source option. For local Mac inference, the models that matter most are still the ones that fit in your RAM.