Why this is a landmark
Open-weights models have beaten closed models on individual benchmarks before. Qwen has topped MMLU. DeepSeek has led AIME math. Llama has matched GPT on trivia. But none of those wins touched the one benchmark that actually predicts whether a model can do an engineer’s job: resolving real, multi-file GitHub issues autonomously. On that benchmark — SWE-Bench Pro — the closed frontier had never been beaten by an open model. Until July 8, 2026.
GLM 5.2, released by Zhipu AI (the Beijing lab also known as Z.ai), is the model that broke through. It scores 68.5% on SWE-Bench Pro, the first open-weights model ever to pass 68 and the first to sit above both GPT-5 and Claude Opus 4.6 on agentic coding. And it does this under the MIT license — not a custom “open-ish” license with a monthly-active-user cap or a no-compete clause, but the genuinely permissive MIT license that lets anyone download, fine-tune, self-host, and commercialize the weights with zero restrictions.
The landmark claim: GLM 5.2 is the first open-weights model in history to beat the closed frontier — GPT-5 and Claude Opus 4.6 — on SWE-Bench Pro, the most realistic public benchmark for autonomous coding agents. The gap between “open” and “state of the art” on real engineering work is now, for the first time, negative.
What makes this more than a leaderboard footnote is the combination. No earlier model has been simultaneously (a) the best in the world at agentic coding, (b) fully open under MIT, and (c) available in a Mac-runnable variant. According to LLMCheck benchmarks, this is the single most consequential open-source release of 2026: the moment a self-hostable model stopped being a compromise and started being the frontier.
The headline benchmark: SWE-Bench Pro 68.5%
SWE-Bench Pro is the hardened successor to SWE-Bench Verified. Where Verified measured single-patch correctness on a curated set, Pro tests the full agentic loop — clone the repository, read the issue, locate the relevant files, edit across multiple modules, run the test suite, read the failures, and iterate until the patch passes. It is contamination-resistant (the issues post-date most training cutoffs) and it is the closest thing the field has to a measure of “can this model replace a junior engineer on a well-scoped ticket?”
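The edit, test, read-the-failure, retry cycle described above can be sketched in a few lines. This is a toy illustration, not the SWE-Bench Pro harness: the repository is a plain dictionary, and `run_agent_loop`, `toy_model`, and `toy_tests` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Repo = Dict[str, str]  # file path -> contents (stand-in for a cloned repository)


@dataclass
class AgentRun:
    resolved: bool
    attempts: int
    failures: List[str] = field(default_factory=list)


def run_agent_loop(
    repo: Repo,
    issue: str,
    propose_edits: Callable[[Repo, str, List[str]], Repo],
    run_tests: Callable[[Repo], Tuple[bool, str]],
    max_attempts: int = 5,
) -> AgentRun:
    """Edit, test, read the failure, retry, until tests pass or the budget runs out."""
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # The model sees the repo, the issue text, and every failure log so far.
        edits = propose_edits(repo, issue, failures)
        candidate = {**repo, **edits}  # edits may touch several files at once
        passed, log = run_tests(candidate)
        if passed:
            repo.update(edits)  # keep the patch that passed
            return AgentRun(resolved=True, attempts=attempt, failures=failures)
        failures.append(log)
    return AgentRun(resolved=False, attempts=max_attempts, failures=failures)


# Toy stand-ins so the sketch runs end to end (hypothetical, not a real model).
def toy_model(repo: Repo, issue: str, failures: List[str]) -> Repo:
    # First attempt patches one file; after seeing a failure it patches both.
    if not failures:
        return {"api.py": "LIMIT = 10"}
    return {"api.py": "LIMIT = 10", "config.py": "RATE_LIMIT_ENABLED = True"}


def toy_tests(repo: Repo) -> Tuple[bool, str]:
    ok = "LIMIT" in repo.get("api.py", "") and "True" in repo.get("config.py", "")
    return ok, "" if ok else "test_rate_limit failed: limiter not enabled in config.py"


result = run_agent_loop(
    {"api.py": "", "config.py": "RATE_LIMIT_ENABLED = False"},
    "Add a rate limiter to the API layer",
    toy_model,
    toy_tests,
)
print(result.resolved, result.attempts)
```

The toy model needs two attempts because its first patch misses the second file, which is the multi-file failure mode the benchmark is designed to expose.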
Here is where GLM 5.2 lands against the field as of July 11, 2026:
| Rank | Model | Type | SWE-Bench Pro |
|---|---|---|---|
| 1 | GLM 5.2 | Open (MIT) | 68.5% |
| 2 | GPT-5 | Closed | 67.0% |
| 3 | DeepSeek R3 | Open | 66.0% |
| 4 | Claude Opus 4.6 | Closed | 65.1% |
| 5 | Kimi K3 | Open | 63.0% |
| 6 | Qwen 4.1 | Open | 62.0% |
The 1.5-point margin over GPT-5 is small but directional. A lead of that size is within the noise band of a single benchmark run, and we are not claiming GLM 5.2 is decisively “better than GPT-5” at everything. What we are claiming, and what the data supports, is that the open frontier has reached parity-or-better with the closed frontier on the hardest practical coding benchmark, and that has never happened before. The historical pattern was open models trailing by 5–10 points. That gap is now gone.
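To make the "noise band" point concrete, here is a rough binomial estimate of how much a pass rate wobbles from sampling alone. The task count `N_TASKS` is an assumption chosen for illustration, not the real size of the benchmark, and the calculation treats the two scores as independent samples, which is conservative when both models are run on the same tasks.

```python
import math


def standard_error(score: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks independent tasks."""
    return math.sqrt(score * (1.0 - score) / n_tasks)


def gap_in_standard_errors(a: float, b: float, n_tasks: int) -> float:
    """How many combined standard errors separate two scores on the same task count."""
    combined = math.sqrt(
        standard_error(a, n_tasks) ** 2 + standard_error(b, n_tasks) ** 2
    )
    return (a - b) / combined


N_TASKS = 1000  # ASSUMPTION: illustrative task count, not the real benchmark size

se_glm = standard_error(0.685, N_TASKS)
z = gap_in_standard_errors(0.685, 0.670, N_TASKS)
print(f"standard error: {se_glm * 100:.2f} points, gap: {z:.2f} standard errors")
```

Under that assumed task count, one standard error is about 1.5 points and the GLM 5.2 lead over GPT-5 is under one combined standard error, which is why the lead reads as directional rather than decisive.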
Two of the top three are now open. Read the table again: positions one, three, five, and six are open-weights models. Two of the top three coding models in the world, GLM 5.2 and DeepSeek R3, are things you can download. The closed labs no longer own the coding-agent podium, and the implications for every company currently paying per-token for a coding API are immediate.
Why coding is the bellwether: SWE-Bench Pro is hard to game. It requires planning, tool use, long-context code comprehension, and error recovery in a single loop — the same capabilities that drive agentic performance across domains. A model that tops SWE-Bench Pro is a model that can drive autonomous workflows generally, which is why this number, more than any MMLU score, is the one to watch.
Architecture deep-dive: 744B/40B MoE
GLM 5.2 is a sparse mixture-of-experts model: 744 billion total parameters, of which only about 40 billion activate per token. That ratio — roughly 5.4% of the network firing on any given forward pass — is what makes a model this large tractable to serve at all. The dense-equivalent quality comes from the full 744B parameter pool; the inference cost tracks the 40B active slice.
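The sparsity arithmetic is easy to check. The sketch below uses only the 744B and 40B figures from the paragraph above; the "2 FLOPs per active parameter per token" figure is a common rule of thumb that ignores attention cost, so treat the ratio as an order-of-magnitude guide.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the parameters that take part in one token's forward pass."""
    return active_params_b / total_params_b


def flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params_b * 1e9


frac = active_fraction(744, 40)
# Compare against a hypothetical dense model with all 744B parameters active.
dense_vs_sparse = flops_per_token(744) / flops_per_token(40)

print(f"active fraction: {frac * 100:.1f}%")
print(f"dense 744B would cost about {dense_vs_sparse:.1f}x more compute per token")
```

Memory is a separate matter: all 744B parameters still have to be resident, which is why the model remains server-class even though per-token compute tracks the 40B slice.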
Hybrid reasoning is the capability unlock. Like the strongest 2026 models, GLM 5.2 ships with a hybrid reasoning mode that lets it decide, per query, whether to answer directly or to spend tokens on an explicit chain of thought before responding. On easy queries it stays fast; on SWE-Bench-style multi-step problems it switches into extended reasoning automatically. This is the mechanism behind the agentic-coding score — the model plans before it edits, then reflects on test failures before it retries.
The training story is geopolitically notable. GLM 5.2 was trained on 22 trillion tokens on a domestic accelerator cluster built on Huawei Ascend silicon — not Nvidia. This matters beyond benchmarks: it demonstrates that a frontier-class model can now be trained end-to-end outside the Nvidia ecosystem. For the open-weights movement, it means the supply chain for state-of-the-art models is diversifying, which makes the open frontier more resilient to any single hardware bottleneck. The knowledge cutoff is June 2026, making GLM 5.2 one of the most up-to-date models available at launch.
256K context is workhorse-sized, not record-setting. A 256K context window comfortably holds a large codebase, a long document set, or an extended agentic trajectory. It is not the 1M-token frontier that Qwen and Gemma chase, but for the coding-agent use case GLM 5.2 is built for, 256K is more than enough to hold a repository’s relevant files plus a full reasoning trace.
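A quick way to sanity-check whether a set of files fits in a 256K window is a character-count estimate. The four-characters-per-token ratio and the 32,000 tokens reserved for the reasoning trace are assumptions for illustration; a real tokenizer will give different counts, and `fits_in_context` is a hypothetical helper.

```python
CONTEXT_WINDOW = 262_144  # 256K tokens
CHARS_PER_TOKEN = 4       # ASSUMPTION: rough heuristic, real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate using ceiling division on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(files: dict, reserved_for_reasoning: int = 32_000) -> tuple:
    """Return (fits, tokens_used, token_budget) for a path -> contents mapping."""
    used = sum(estimate_tokens(body) for body in files.values())
    budget = CONTEXT_WINDOW - reserved_for_reasoning
    return used <= budget, used, budget


# Two stand-in files of 400,000 and 200,000 characters.
sample = {"service.py": "x" * 400_000, "handlers.py": "y" * 200_000}
check = fits_in_context(sample)
print(check)
```

Roughly 600,000 characters of source fits with room to spare under these assumptions, which matches the claim that 256K holds a repository's relevant files plus a reasoning trace.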
Full benchmark suite vs GPT-5 & Claude Opus 4.6
SWE-Bench Pro is the headline, but it is not the whole picture. Here is GLM 5.2 across the standard 2026 benchmark suite, measured against the two closed models it most directly challenges. According to LLMCheck benchmarks, the pattern is clear: GLM 5.2 leads on coding and competition math, and trades within a point or two everywhere else.
| Benchmark | GLM 5.2 | GPT-5 | Claude Opus 4.6 |
|---|---|---|---|
| SWE-Bench Pro | 68.5% | 67.0% | 65.1% |
| SWE-Verified | 84% | 83% | 83% |
| MMLU | 92% | 92% | 91% |
| HumanEval | 97% | 96% | 96% |
| AIME 2026 | 94% | 93% | 90% |
| GPQA-Diamond | 89% | 89% | 88% |
| Arena ELO | ~1565 | ~1580 | ~1572 |
| License | MIT (open) | Closed | Closed |
Where GLM 5.2 leads: agentic coding (SWE-Bench Pro), code generation (HumanEval 97%), and competition math (AIME 94%). These are the capability axes that matter most for autonomous agents and technical work, and GLM 5.2 holds the top spot on all three, though its HumanEval and AIME leads over GPT-5 are a single point each.
Where it ties: MMLU (92%), GPQA-Diamond (89%), and SWE-Verified (84%) are statistical dead heats with the closed frontier — the differences are inside the margin of error of a single evaluation run.
Where it trails: Arena ELO. GLM 5.2’s ~1565 sits just behind GPT-5 (~1580) and Claude Opus 4.6 (~1572). Arena ELO measures human-preferred general chat — tone, helpfulness, formatting, refusal calibration — and the closed labs still hold a small edge there. If your use case is consumer chat, the closed models remain marginally preferred; if your use case is code and reasoning, GLM 5.2 is at or beyond the frontier.
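The win, tie, and trail reading can be reproduced directly from the table. This sketch copies the table values and computes GLM 5.2's margin over the better of the two closed models on each row; `margin_vs_best_closed` is a hypothetical helper name.

```python
# (GLM 5.2, GPT-5, Claude Opus 4.6), copied from the table above
scores = {
    "SWE-Bench Pro": (68.5, 67.0, 65.1),
    "SWE-Verified": (84, 83, 83),
    "MMLU": (92, 92, 91),
    "HumanEval": (97, 96, 96),
    "AIME 2026": (94, 93, 90),
    "GPQA-Diamond": (89, 89, 88),
    "Arena ELO": (1565, 1580, 1572),
}


def margin_vs_best_closed(glm: float, gpt5: float, opus: float) -> float:
    """GLM 5.2's lead (positive) or deficit (negative) against the stronger closed model."""
    return round(glm - max(gpt5, opus), 1)


margins = {name: margin_vs_best_closed(*row) for name, row in scores.items()}
for name, margin in margins.items():
    print(f"{name:14s} {margin:+}")
```

Seen this way, the only margin above one point is SWE-Bench Pro, and the only deficit is Arena ELO.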
The license story — MIT vs the closed frontier
The benchmark numbers get the headlines, but the MIT license is what actually changes the industry. GPT-5 and Claude Opus 4.6 are accessible only through paid APIs. You rent intelligence by the token, you cannot inspect the weights, you cannot fine-tune on your private data without sending it to a vendor, and your unit economics are permanently tied to someone else’s pricing.
GLM 5.2 inverts all of that. MIT is the most permissive mainstream license in existence. It grants:
- Zero API cost. Self-host the weights and your marginal cost per token is electricity and amortized hardware — not a per-million-token invoice that scales with usage. For a product doing billions of tokens a month, this is the difference between a viable margin and an impossible one.
- Full self-hosting. Run it in your own VPC, on-prem, or air-gapped. Your prompts and your users’ data never leave your infrastructure — a hard requirement for healthcare, finance, defense, and any regulated vertical.
- Unrestricted fine-tuning. Adapt the weights to your domain, your codebase, your tone. With MIT you own the derivative outright; there is no vendor approval, no clause about competing models, no MAU ceiling.
- Redistribution. Ship the weights inside your product, bundle a fine-tune, or republish a quantization. The only obligation is preserving the copyright and license notice.
Compare this to the “open-ish” licenses that dominated 2025: Llama’s 700M-MAU cap, Gemma’s prohibited-use list, the various “custom” licenses with no-compete clauses. GLM 5.2 has none of that. It is the most capable model on earth at agentic coding, and it is also one of the most permissively licensed. For a startup deciding whether to build on a closed API or a self-hosted open model, July 2026 is the first month where the open option is not a downgrade.
The startup math: A coding-agent product that would cost $40,000/month in GPT-5 API fees at scale can run on rented or owned GLM 5.2 inference for a fraction of that — with full data control and the freedom to fine-tune. MIT licensing turns the frontier from an operating expense into a capital decision.
Can you run GLM 5.2 on a Mac?
Here is the honest answer, because this is a Mac-focused site and the temptation is to over-promise: the full GLM 5.2 does not run on a normal Mac. It is a server-class model. At Q4 quantization the weights occupy roughly 390 GB, more unified memory than any shipping Mac has, with the sole exception of an M3 Ultra-class 512 GB machine. Even there, making room for context would push you toward an extreme Q2 quantization that materially degrades quality.
Full GLM 5.2 is built for an 8×H100 node or a GPU cluster, served with vLLM or SGLang. That is the right home for it. Trying to cram it onto a single Mac is a science project, not a workflow.
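The memory figures follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a back-of-envelope estimate that counts weights only. Real quantized files also store scales and keep some tensors at higher precision, which is consistent with the roughly 390 GB quoted above being a little more than the nominal 4-bit figure.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def effective_bits(size_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a given on-disk or in-memory size."""
    return size_gb * 8 / params_billion


nominal_q4 = weight_gb(744, 4)     # idealized 4-bit weights
bf16 = weight_gb(744, 16)          # unquantized 16-bit weights
implied = effective_bits(390, 744) # what the quoted 390 GB implies per weight

print(f"nominal Q4: {nominal_q4:.0f} GB, BF16: {bf16:.0f} GB")
print(f"390 GB implies about {implied:.1f} bits per weight")
```

The same function makes it easy to see why 16-bit weights are out of reach for a single 8×H100 node and why a 4-bit build is the practical serving target.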
The Mac answer is GLM 5.2 Air. Zhipu shipped a smaller sibling specifically for the self-host-on-a-workstation use case: a 106-billion-parameter MoE with 12 billion active parameters, same MIT license, same 256K context. It is frontier-adjacent rather than frontier-topping, and crucially it fits on a 64 GB Mac. Here is how Air performs across the Apple Silicon tier:
| Mac | Unified RAM | Runtime | Speed |
|---|---|---|---|
| M5 Max | 64 GB | Ollama (Q4) | ~30 tok/s |
| M5 Max | 128 GB | MLX | ~34 tok/s |
| M4 Ultra | 192 GB | MLX | ~38 tok/s |
A 64 GB Mac — an M5 Max MacBook Pro or a base Mac Studio — is the realistic entry point. At ~30 tok/s, GLM 5.2 Air is fast enough for interactive coding assistance and agent loops, and it leaves enough headroom for a useful context window. Step up to 128 GB or an M4 Ultra and both speed and usable context improve.
GLM 5.2 Air — the best frontier-adjacent model you can self-host
GLM 5.2 Air deserves its own analysis, because for the overwhelming majority of LLMCheck readers it — not the 744B flagship — is the model you will actually run. The question is whether a 106B/12B model that you can host on a Mac Studio is good enough to matter. The answer is yes.
58% on SWE-Bench Pro is the number that matters. To put it in context: 58% places GLM 5.2 Air ahead of where the closed frontier sat in late 2025, and ahead of most other models you can self-host today. It is roughly 10 points behind the full GLM 5.2 (68.5%) and 9 points behind GPT-5 (67.0%), but it is a 106B model running on a single Mac, not an 8×H100 node. According to LLMCheck benchmarks, that is the best agentic-coding score of any model that fits comfortably on consumer-accessible Apple Silicon.
The MoE structure is why it is fast. With only 12B parameters active per token, GLM 5.2 Air runs far quicker than a 106B dense model would. That sparsity is what delivers ~30–38 tok/s on Apple Silicon — comfortably interactive for a coding agent that reads files, proposes edits, and reacts to test output.
Installation is a single command:
```bash
# Ollama: the one-line Mac install (~60 GB download)
ollama run glm5.2:air

# MLX: fastest on Apple Silicon (128 GB+ recommended)
pip install mlx-lm
mlx_lm.generate --model mlx-community/GLM-5.2-Air-106B-A12B-4bit \
  --prompt "Read this repo and add a rate-limiter to the API layer"

# LM Studio: search "GLM 5.2 Air" in the Discover tab
```

For a developer who wants a self-hosted coding assistant that approaches the closed frontier, runs entirely on their own Mac, costs nothing per token, and carries no licensing risk, GLM 5.2 Air is the default recommendation as of July 2026. It is the model that finally makes “self-host your coding agent” a serious answer rather than a hobbyist one.
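Once the Ollama model is pulled, a script can call it over Ollama's local HTTP API instead of the interactive prompt. This is a minimal sketch using only the standard library. It assumes Ollama is running on its default port and that the `glm5.2:air` tag from the install command is available; `build_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "glm5.2:air"  # tag taken from the install command in this article


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": MODEL_TAG, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask("Explain what a rate limiter does")` returns the completion as a string. Nothing leaves the machine, which is the point of the self-hosted path.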
GLM 5.2 vs the open field
GLM 5.2 did not arrive into an empty field. Three other open-weights models define the July 2026 frontier, and positioning GLM 5.2 against them clarifies exactly what it is for.
vs DeepSeek R3. The closest rival. R3 scores 66% on SWE-Bench Pro to GLM 5.2’s 68.5%, and both are permissive server-class MoE models. DeepSeek R3 remains exceptional at pure math and reasoning and has a slightly leaner active-parameter profile. GLM 5.2 takes the agentic-coding crown; R3 stays competitive on cost-per-reasoning. For a deeper head-to-head, see our dedicated GLM 5.2 vs DeepSeek R3 comparison.
vs Qwen 4.1. Qwen 4.1 scores 62% on SWE-Bench Pro, a strong number, and Qwen’s smaller variants still own the Mac mid-tier on a quality-per-gigabyte basis. But on raw agentic-coding capability GLM 5.2 has moved ahead, and GLM 5.2 Air, at 58%, sits within four points of Qwen 4.1. Qwen remains the better choice when you need a model that runs on 8–24 GB Macs; GLM 5.2 is the better choice when you have the hardware to run a true frontier model.
vs Llama 5 405B. Meta’s largest dense open model competes on general capability and multimodality, but its license carries the 700M-MAU cap and its dense architecture makes it slower to serve than GLM 5.2’s sparse MoE at comparable quality. On agentic coding specifically, GLM 5.2 is ahead, MIT-licensed, and cheaper to run per token thanks to sparsity. For pure open-weights coding capability with no license strings, GLM 5.2 is the clear pick.
How to deploy GLM 5.2
There are three realistic deployment paths depending on which variant you need and what hardware you have.
Path 1 — Full GLM 5.2 on your own server
For the 744B flagship you need a GPU node with roughly 390 GB+ of memory at Q4 — an 8×H100 (640 GB) or 8×H200 box. Serve it with vLLM or SGLang for production throughput; Ollama works for a quick test if your hardware can hold it.
```bash
# Ollama (requires a 390 GB+ machine)
ollama run glm5.2

# vLLM (recommended for production serving)
# The checkpoint you serve must be quantized to fit the node (about 390 GB at Q4, as sized above)
pip install vllm
vllm serve zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --max-model-len 262144
```

Path 2 — GLM 5.2 Air on a Mac
For local Mac use, GLM 5.2 Air on a 64 GB+ machine is the answer. Ollama is the simplest; MLX is the fastest on Apple Silicon and is worth it if you have 128 GB or more.
```bash
# Simplest Mac path
ollama run glm5.2:air

# Fastest Mac path (MLX, 128 GB+)
mlx_lm.generate --model mlx-community/GLM-5.2-Air-106B-A12B-4bit \
  --prompt "Refactor this module and add tests"
```

Path 3 — Rent the full model in the cloud
If you want full GLM 5.2 quality without owning a cluster, rent an 8×H100 node by the hour from any major GPU cloud. As of July 2026 that runs in the low single digits of dollars per H100-hour, so a full 8-GPU node is on the order of $15–25/hour — cost-effective for batch jobs, evaluation runs, or bursty agent workloads, and you tear it down when you are done. Because the weights are MIT-licensed, there is no per-token fee layered on top; you pay only for the compute.
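The rental math is worth running against the $40,000/month API figure from the licensing section. The sketch below uses only the article's own numbers. It assumes a 30-day month and, importantly, that one 8-GPU node could carry the same workload as that API bill, which depends entirely on your traffic and is not something these numbers establish.

```python
HOURS_PER_MONTH = 24 * 30  # ASSUMPTION: 30-day month, node running around the clock


def monthly_node_cost(dollars_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping one rented node up for the given number of hours."""
    return dollars_per_hour * hours


def break_even_hours(api_bill: float, dollars_per_hour: float) -> float:
    """Node-hours you can rent before spending as much as the API bill."""
    return api_bill / dollars_per_hour


API_BILL = 40_000  # monthly API figure quoted earlier in the article

low = monthly_node_cost(15)
high = monthly_node_cost(25)
print(f"always-on node: ${low:,.0f} to ${high:,.0f} per month")
print(f"break-even at $25/hour: {break_even_hours(API_BILL, 25):,.0f} node-hours")
```

An always-on node lands between roughly a quarter and a half of the API bill under these assumptions, and bursty workloads that tear the node down between jobs pay less.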
The verdict
GLM 5.2 is the most capable open-weights model ever released. It is the first to beat the closed frontier — GPT-5 and Claude Opus 4.6 — on SWE-Bench Pro, the benchmark that best predicts real agentic-coding capability, and it does so under the MIT license, the most permissive terms the industry offers. That combination of capability and openness has never existed simultaneously until now.
The honest caveat is hardware. The full 744B model is server-class and will not run on any normal Mac. But Zhipu solved that with GLM 5.2 Air — a 106B/12B MoE that delivers 58% on SWE-Bench Pro (still frontier-adjacent) on a 64 GB Mac at ~30 tok/s, with the same MIT license and the same one-line install. For the self-hosting developer, Air is the practical takeaway: a coding model that approaches the frontier, runs on your own Mac, costs nothing per token, and carries no license strings.
The bigger story is what GLM 5.2 represents. The era where “open” meant “good enough, but behind” is over. According to LLMCheck benchmarks, as of July 2026 the best agentic-coding model on earth is one you can download. That is the inflection the open-weights movement has been building toward for three years, and GLM 5.2 is the model that crossed the line.