MODEL REVIEW · July 11, 2026 · 17 min read

GLM 5.2: The Open Model That Beats GPT-5 and Claude on SWE-Bench

GLM 5.2 from Zhipu AI is the first open-weights model to cross 68% on SWE-Bench Pro (68.5%), beating Claude Opus 4.6 and GPT-5 on agentic coding. It is MIT licensed and built as a 744B-A40B mixture-of-experts. The full model is server-class, but a 106B GLM 5.2 Air runs on a 64 GB Mac at ~30 tok/s.

For three years the story of open-weights AI was “close, but behind.” On July 8, 2026, that story ended. Zhipu AI released GLM 5.2 — a 744-billion-parameter mixture-of-experts model under the MIT license — and it became the first open model in history to top the closed frontier on a real agentic-coding benchmark. According to LLMCheck benchmarks, GLM 5.2 scores 68.5% on SWE-Bench Pro, ahead of GPT-5 (67.0%) and Claude Opus 4.6 (65.1%). This is the definitive technical breakdown: what GLM 5.2 is, why the SWE-Bench Pro number matters, how its MIT license rewrites the economics for startups, and how to actually run it — including GLM 5.2 Air, the 106B sibling that fits on a Mac Studio.

Why this is a landmark

Open-weights models have beaten closed models on individual benchmarks before. Qwen has topped MMLU. DeepSeek has led AIME math. Llama has matched GPT on trivia. But none of those wins touched the one benchmark that actually predicts whether a model can do an engineer’s job: resolving real, multi-file GitHub issues autonomously. On that benchmark — SWE-Bench Pro — the closed frontier had never been beaten by an open model. Until July 8, 2026.

GLM 5.2, released by Zhipu AI (the Beijing lab also known as Z.ai), is the model that broke through. It scores 68.5% on SWE-Bench Pro, the first open-weights model ever to pass 68 and the first to sit above both GPT-5 and Claude Opus 4.6 on agentic coding. And it does this under the MIT license — not a custom “open-ish” license with a monthly-active-user cap or a no-compete clause, but the genuinely permissive MIT license that lets anyone download, fine-tune, self-host, and commercialize the weights with zero restrictions.

The landmark claim: GLM 5.2 is the first open-weights model in history to beat the closed frontier — GPT-5 and Claude Opus 4.6 — on SWE-Bench Pro, the most realistic public benchmark for autonomous coding agents. The gap between “open” and “state of the art” on real engineering work is now, for the first time, negative.

What makes this more than a leaderboard footnote is the combination. A model that is simultaneously (a) the best in the world at agentic coding, (b) fully open under MIT, and (c) available in a Mac-runnable variant has never existed at once. According to LLMCheck benchmarks, this is the single most consequential open-source release of 2026 — the moment a self-hostable model stopped being a compromise and started being the frontier.

The headline benchmark: SWE-Bench Pro 68.5%

SWE-Bench Pro is the hardened successor to SWE-Bench Verified. Where Verified measured single-patch correctness on a curated set, Pro tests the full agentic loop — clone the repository, read the issue, locate the relevant files, edit across multiple modules, run the test suite, read the failures, and iterate until the patch passes. It is contamination-resistant (the issues post-date most training cutoffs) and it is the closest thing the field has to a measure of “can this model replace a junior engineer on a well-scoped ticket?”

Here is where GLM 5.2 lands against the field as of July 11, 2026:

SWE-Bench Pro — GLM 5.2 vs the closed and open frontier, July 11, 2026.
Rank	Model	Type	SWE-Bench Pro
1	GLM 5.2	Open (MIT)	68.5%
2	GPT-5	Closed	67.0%
3	DeepSeek R3	Open	66.0%
4	Claude Opus 4.6	Closed	65.1%
5	Kimi K3	Open	63.0%
6	Qwen 4.1	Open	62.0%

The 1.5-point margin over GPT-5 is small but directional. A 1.5-percentage-point lead is within the noise band of a single benchmark run, and we are not claiming GLM 5.2 is decisively “better than GPT-5” at everything. What we are claiming — and what the data supports — is that the open frontier has reached parity-or-better with the closed frontier on the hardest practical coding benchmark, and that has never happened before. The historical pattern was open models trailing by 5–10 points. That gap is now gone.

The top three are now open. Read the table again: positions one, three, and five are open-weights models. Two of the top three coding models in the world — GLM 5.2 and DeepSeek R3 — are things you can download. The closed labs no longer own the coding-agent podium, and the implications for every company currently paying per-token for a coding API are immediate.

Why coding is the bellwether: SWE-Bench Pro is hard to game. It requires planning, tool use, long-context code comprehension, and error recovery in a single loop — the same capabilities that drive agentic performance across domains. A model that tops SWE-Bench Pro is a model that can drive autonomous workflows generally, which is why this number, more than any MMLU score, is the one to watch.

Architecture deep-dive: 744B/40B MoE

GLM 5.2 is a sparse mixture-of-experts model: 744 billion total parameters, of which only about 40 billion activate per token. That ratio — roughly 5.4% of the network firing on any given forward pass — is what makes a model this large tractable to serve at all. The dense-equivalent quality comes from the full 744B parameter pool; the inference cost tracks the 40B active slice.

Architecture

Total params744B

Active params40B (5.4%)

TypeSparse MoE

ReasoningHybrid mode

Context256K

Training tokens22T

Provenance

LabZhipu AI (Z.ai)

ReleasedJul 8, 2026

LicenseMIT

AcceleratorHuawei Ascend

Knowledge cutoffJun 2026

Commercial useUnrestricted

Hybrid reasoning is the capability unlock. Like the strongest 2026 models, GLM 5.2 ships with a hybrid reasoning mode that lets it decide, per query, whether to answer directly or to spend tokens on an explicit chain of thought before responding. On easy queries it stays fast; on SWE-Bench-style multi-step problems it switches into extended reasoning automatically. This is the mechanism behind the agentic-coding score — the model plans before it edits, then reflects on test failures before it retries.

The training story is geopolitically notable. GLM 5.2 was trained on 22 trillion tokens on a domestic accelerator cluster built on Huawei Ascend silicon — not Nvidia. This matters beyond benchmarks: it demonstrates that a frontier-class model can now be trained end-to-end outside the Nvidia ecosystem. For the open-weights movement, it means the supply chain for state-of-the-art models is diversifying, which makes the open frontier more resilient to any single hardware bottleneck. The knowledge cutoff is June 2026, making GLM 5.2 one of the most up-to-date models available at launch.

256K context is workhorse-sized, not record-setting. A 256K context window comfortably holds a large codebase, a long document set, or an extended agentic trajectory. It is not the 1M-token frontier that Qwen and Gemma chase, but for the coding-agent use case GLM 5.2 is built for, 256K is more than enough to hold a repository’s relevant files plus a full reasoning trace.

Full benchmark suite vs GPT-5 & Claude Opus 4.6

SWE-Bench Pro is the headline, but it is not the whole picture. Here is GLM 5.2 across the standard 2026 benchmark suite, measured against the two closed models it most directly challenges. According to LLMCheck benchmarks, the pattern is clear: GLM 5.2 leads on coding and competition math, and trades within a point or two everywhere else.

GLM 5.2 vs GPT-5 vs Claude Opus 4.6 — full suite, July 11, 2026. Green = best in row.
Benchmark	GLM 5.2	GPT-5	Claude Opus 4.6
SWE-Bench Pro	68.5%	67.0%	65.1%
SWE-Verified	84%	83%	83%
MMLU	92%	92%	91%
HumanEval	97%	96%	96%
AIME 2026	94%	93%	90%
GPQA-Diamond	89%	89%	88%
Arena ELO	~1565	~1580	~1572
License	MIT (open)	Closed	Closed

Where GLM 5.2 wins clearly: agentic coding (SWE-Bench Pro), code generation (HumanEval 97%), and competition math (AIME 94%). These are the capability axes that matter most for autonomous agents and technical work, and GLM 5.2 holds the top spot on all three.

Where it ties: MMLU (92%), GPQA-Diamond (89%), and SWE-Verified (84%) are statistical dead heats with the closed frontier — the differences are inside the margin of error of a single evaluation run.

Where it trails: Arena ELO. GLM 5.2’s ~1565 sits just behind GPT-5 (~1580) and Claude Opus 4.6 (~1572). Arena ELO measures human-preferred general chat — tone, helpfulness, formatting, refusal calibration — and the closed labs still hold a small edge there. If your use case is consumer chat, the closed models remain marginally preferred; if your use case is code and reasoning, GLM 5.2 is at or beyond the frontier.

The license story — MIT vs the closed frontier

The benchmark numbers get the headlines, but the MIT license is what actually changes the industry. GPT-5 and Claude Opus 4.6 are accessible only through paid APIs. You rent intelligence by the token, you cannot inspect the weights, you cannot fine-tune on your private data without sending it to a vendor, and your unit economics are permanently tied to someone else’s pricing.

GLM 5.2 inverts all of that. MIT is the most permissive mainstream license in existence. It grants:

Zero API cost. Self-host the weights and your marginal cost per token is electricity and amortized hardware — not a per-million-token invoice that scales with usage. For a product doing billions of tokens a month, this is the difference between a viable margin and an impossible one.
Full self-hosting. Run it in your own VPC, on-prem, or air-gapped. Your prompts and your users’ data never leave your infrastructure — a hard requirement for healthcare, finance, defense, and any regulated vertical.
Unrestricted fine-tuning. Adapt the weights to your domain, your codebase, your tone. With MIT you own the derivative outright; there is no vendor approval, no clause about competing models, no MAU ceiling.
Redistribution. Ship the weights inside your product, bundle a fine-tune, or republish a quantization. The only obligation is preserving the copyright and license notice.

Compare this to the “open-ish” licenses that dominated 2025: Llama’s 700M-MAU cap, Gemma’s prohibited-use list, the various “custom” licenses with no-compete clauses. GLM 5.2 has none of that. It is the most capable model on earth at agentic coding, and it is also one of the most permissively licensed. For a startup deciding whether to build on a closed API or a self-hosted open model, July 2026 is the first month where the open option is not a downgrade.

The startup math: A coding-agent product that would cost $40,000/month in GPT-5 API fees at scale can run on rented or owned GLM 5.2 inference for a fraction of that — with full data control and the freedom to fine-tune. MIT licensing turns the frontier from an operating expense into a capital decision.

Can you run GLM 5.2 on a Mac?

Here is the honest answer, because this is a Mac-focused site and the temptation is to over-promise: the full GLM 5.2 does not run on a normal Mac. It is a server-class model. At Q4 quantization the weights occupy roughly 390 GB — that is more unified memory than any shipping Mac has, with the sole exception of an M3 Ultra-class 512 GB machine, and even there you would be forced into an extreme Q2 quantization that materially degrades quality and leaves almost no room for context.

Full GLM 5.2 is built for an 8×H100 node or a GPU cluster, served with vLLM or SGLang. That is the right home for it. Trying to cram it onto a single Mac is a science project, not a workflow.

The Mac answer is GLM 5.2 Air. Zhipu shipped a smaller sibling specifically for the self-host-on-a-workstation use case: a 106-billion-parameter MoE with 12 billion active parameters, same MIT license, same 256K context. It is frontier-adjacent rather than frontier-topping, and crucially it fits on a 64 GB Mac. Here is how Air performs across the Apple Silicon tier:

GLM 5.2 Air on Apple Silicon — Q4 / MLX, July 11, 2026.
Mac	Unified RAM	Runtime	Speed
M5 Max	64 GB	Ollama (Q4)	~30 tok/s
M5 Max	128 GB	MLX	~34 tok/s
M4 Ultra	192 GB	MLX	~38 tok/s

A 64 GB Mac — an M5 Max MacBook Pro or a base Mac Studio — is the realistic entry point. At ~30 tok/s, GLM 5.2 Air is fast enough for interactive coding assistance and agent loops, and it leaves enough headroom for a useful context window. Step up to 128 GB or an M4 Ultra and both speed and usable context improve.

GLM 5.2 Air — the best frontier-adjacent model you can self-host

GLM 5.2 Air deserves its own analysis, because for the overwhelming majority of LLMCheck readers it — not the 744B flagship — is the model you will actually run. The question is whether a 106B/12B model that you can host on a Mac Studio is good enough to matter. The answer is yes.

GLM 5.2 Air

Total params106B

Active params12B

TypeSparse MoE

Context256K

LicenseMIT

SWE-Bench Pro58%

Mac Fit

Min RAM64 GB

M5 Max 64 GB~30 tok/s

M5 Max 128 GB~34 tok/s

M4 Ultra 192 GB~38 tok/s

Installollama run glm5.2:air

Hosted weightsHugging Face

58% on SWE-Bench Pro is the number that matters. To put it in context: 58% places GLM 5.2 Air ahead of where the closed frontier sat in late 2025, and ahead of most other models you can self-host today. It is roughly 10 points behind the full GLM 5.2 (68.5%) and a few points behind GPT-5 — but it is a 106B model running on a single Mac, not an 8×H100 node. According to LLMCheck benchmarks, that is the best agentic-coding score of any model that fits comfortably on consumer-accessible Apple Silicon.

The MoE structure is why it is fast. With only 12B parameters active per token, GLM 5.2 Air runs far quicker than a 106B dense model would. That sparsity is what delivers ~30–38 tok/s on Apple Silicon — comfortably interactive for a coding agent that reads files, proposes edits, and reacts to test output.

Installation is a single command:

# Ollama — the one-line Mac install (~60 GB download)
ollama run glm5.2:air

# MLX — fastest on Apple Silicon (128 GB+ recommended)
pip install mlx-lm
mlx_lm.generate --model mlx-community/GLM-5.2-Air-106B-A12B-4bit \
  --prompt "Read this repo and add a rate-limiter to the API layer"

# LM Studio: search "GLM 5.2 Air" in the Discover tab

For a developer who wants a self-hosted coding assistant that approaches the closed frontier, runs entirely on their own Mac, costs nothing per token, and carries no licensing risk, GLM 5.2 Air is the default recommendation as of July 2026. It is the model that finally makes “self-host your coding agent” a serious answer rather than a hobbyist one.

GLM 5.2 vs the open field

GLM 5.2 did not arrive into an empty field. Three other open-weights models define the July 2026 frontier, and positioning GLM 5.2 against them clarifies exactly what it is for.

vs DeepSeek R3. The closest rival. R3 scores 66% on SWE-Bench Pro to GLM 5.2’s 68.5%, and both are permissive server-class MoE models. DeepSeek R3 remains exceptional at pure math and reasoning and has a slightly leaner active-parameter profile. GLM 5.2 takes the agentic-coding crown; R3 stays competitive on cost-per-reasoning. For a deeper head-to-head, see our dedicated GLM 5.2 vs DeepSeek R3 comparison.

vs Qwen 4.1. Qwen 4.1 scores 62% on SWE-Bench Pro — a strong number, and Qwen’s smaller variants still own the Mac mid-tier on a quality-per-gigabyte basis. But on raw agentic-coding capability, GLM 5.2 (and even GLM 5.2 Air, at 58%, in the same neighborhood) has moved ahead. Qwen remains the better choice when you need a model that runs on 8–24 GB Macs; GLM 5.2 is the better choice when you have the hardware to run a true frontier model.

vs Llama 5 405B. Meta’s largest dense open model competes on general capability and multimodality, but its license carries the 700M-MAU cap and its dense architecture makes it slower to serve than GLM 5.2’s sparse MoE at comparable quality. On agentic coding specifically, GLM 5.2 is ahead, MIT-licensed, and cheaper to run per token thanks to sparsity. For pure open-weights coding capability with no license strings, GLM 5.2 is the clear pick.

How to deploy GLM 5.2

There are three realistic deployment paths depending on which variant you need and what hardware you have.

Path 1 — Full GLM 5.2 on your own server

For the 744B flagship you need a GPU node with roughly 390 GB+ of memory at Q4 — an 8×H100 (640 GB) or 8×H200 box. Serve it with vLLM or SGLang for production throughput; Ollama works for a quick test if your hardware can hold it.

# Ollama (requires a 390 GB+ machine)
ollama run glm5.2

# vLLM (recommended for production serving)
pip install vllm
vllm serve zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --max-model-len 262144

Path 2 — GLM 5.2 Air on a Mac

For local Mac use, GLM 5.2 Air on a 64 GB+ machine is the answer. Ollama is the simplest; MLX is the fastest on Apple Silicon and is worth it if you have 128 GB or more.

# Simplest Mac path
ollama run glm5.2:air

# Fastest Mac path (MLX, 128 GB+)
mlx_lm.generate --model mlx-community/GLM-5.2-Air-106B-A12B-4bit \
  --prompt "Refactor this module and add tests"

Path 3 — Rent the full model in the cloud

If you want full GLM 5.2 quality without owning a cluster, rent an 8×H100 node by the hour from any major GPU cloud. As of July 2026 that runs in the low single digits of dollars per H100-hour, so a full 8-GPU node is on the order of $15–25/hour — cost-effective for batch jobs, evaluation runs, or bursty agent workloads, and you tear it down when you are done. Because the weights are MIT-licensed, there is no per-token fee layered on top; you pay only for the compute.

The verdict

GLM 5.2 is the most capable open-weights model ever released. It is the first to beat the closed frontier — GPT-5 and Claude Opus 4.6 — on SWE-Bench Pro, the benchmark that best predicts real agentic-coding capability, and it does so under the MIT license, the most permissive terms the industry offers. That combination of capability and openness has never existed simultaneously until now.

The honest caveat is hardware. The full 744B model is server-class and will not run on any normal Mac. But Zhipu solved that with GLM 5.2 Air — a 106B/12B MoE that delivers 58% on SWE-Bench Pro (still frontier-adjacent) on a 64 GB Mac at ~30 tok/s, with the same MIT license and the same one-line install. For the self-hosting developer, Air is the practical takeaway: a coding model that approaches the frontier, runs on your own Mac, costs nothing per token, and carries no license strings.

The bigger story is what GLM 5.2 represents. The era where “open” meant “good enough, but behind” is over. According to LLMCheck benchmarks, as of July 2026 the best agentic-coding model on earth is one you can download. That is the inflection the open-weights movement has been building toward for three years, and GLM 5.2 is the model that crossed the line.

LLMCheck Research Team

We benchmark local AI models on real Apple Silicon hardware. Our database covers 79+ open and closed models with standardized tok/s measurements using Ollama, LM Studio, MLX, and vLLM.

Frequently Asked Questions

What is GLM 5.2?

GLM 5.2 is a large open-weights language model released by Zhipu AI (Z.ai) on July 8, 2026 under the MIT license. It is a 744-billion-parameter mixture-of-experts model with 40 billion active parameters, a 256K context window, and a hybrid reasoning mode. According to LLMCheck benchmarks, it is the first open-weights model to cross 68% on SWE-Bench Pro, scoring 68.5% and beating both Claude Opus 4.6 and GPT-5 on that agentic-coding benchmark.

Is GLM 5.2 really better than GPT-5?

On SWE-Bench Pro, yes. GLM 5.2 scores 68.5% versus GPT-5's 67.0% and Claude Opus 4.6's 65.1%, making it the strongest model on that agentic-coding benchmark as of July 2026. On other benchmarks the picture is closer: GLM 5.2 leads on HumanEval (97%) and AIME (94%), ties closely on MMLU (92%) and GPQA-Diamond (89%), and its Arena ELO of ~1565 sits just behind the closed frontier on general chat. It is not uniformly better than GPT-5, but on real-world coding agents it is now the model to beat — and it is open.

What is SWE-Bench Pro and why does 68.5% matter?

SWE-Bench Pro is the hardened, contamination-resistant successor to SWE-Bench Verified. It measures whether a model can resolve real GitHub issues end-to-end — reading a repository, editing multiple files, running tests, and iterating until the patch passes. It is the closest public proxy for autonomous-coding-agent capability. 68.5% matters because it is the first time an open-weights model has topped the closed frontier on this benchmark: GLM 5.2 beats Claude Opus 4.6 (65.1%) and GPT-5 (67.0%). According to LLMCheck benchmarks, that crossover is the most significant open-source milestone of 2026.

Can I run GLM 5.2 on a Mac?

Not the full model. Full GLM 5.2 is server-class — roughly 390 GB at Q4 — and realistically needs a GPU cluster or 8×H100. The only way to run it on a single Mac is an extreme Q2 quantization on M3 Ultra-class 512 GB hardware, which is impractical for most users. For Macs, use GLM 5.2 Air: a 106B/12B MoE that runs on a 64 GB Mac at about 30 tok/s and installs with one command, ollama run glm5.2:air.

What is GLM 5.2 Air?

GLM 5.2 Air is the Mac-runnable sibling of full GLM 5.2 — a 106-billion-parameter mixture-of-experts model with 12 billion active parameters, MIT licensed, with a 256K context window. It scores 58% on SWE-Bench Pro, which is still frontier-adjacent and stronger than most open models you can self-host. According to LLMCheck benchmarks it runs at ~30 tok/s on an M5 Max 64 GB, ~34 tok/s on an M5 Max 128 GB via MLX, and ~38 tok/s on an M4 Ultra 192 GB. Install it with ollama run glm5.2:air.

Is GLM 5.2 free for commercial use?

Yes. Both GLM 5.2 and GLM 5.2 Air ship under the MIT license, which is one of the most permissive licenses available. You can use the weights commercially, self-host them, fine-tune them, redistribute them, and build products on them with no monthly-active-user cap, no field-of-use restrictions, and no per-token API cost. The only requirement is preserving the copyright and license notice.

GLM 5.2 vs DeepSeek R3 — which is better?

On agentic coding, GLM 5.2 leads: 68.5% on SWE-Bench Pro versus DeepSeek R3's 66%. Both are MIT-adjacent permissive open-weights MoE models in the server-class tier. DeepSeek R3 remains extremely strong on math and pure reasoning, and its routing is slightly more efficient per active parameter. For autonomous coding agents and tool use, GLM 5.2 is the current leader; for the lowest-cost frontier reasoning, the two are close enough that license and deployment footprint should decide it.

How much hardware do I need for full GLM 5.2?

Full GLM 5.2 is server-class. At Q4 the weights occupy roughly 390 GB, so a single 8×H100 (640 GB) or 8×H200 node is the practical minimum for production inference, typically served with vLLM or SGLang. You can rent this from any major GPU cloud for a few dollars per hour. On Apple Silicon, only an M3 Ultra-class 512 GB machine can hold it, and only at an extreme Q2 quantization that degrades quality. For Mac use, GLM 5.2 Air on a 64 GB machine is the correct answer.

Is GLM 5.2 censored?

Like every frontier model, GLM 5.2 ships with safety alignment, and the hosted Z.ai API applies additional content moderation consistent with its jurisdiction. However, because the weights are released under MIT, the local model can be used, fine-tuned, and re-aligned freely — self-hosting removes any server-side moderation layer entirely. In practice, for technical, coding, and general-knowledge work the local weights behave like any other open model. Sensitive political topics may reflect the alignment of the training data.

Where can I download GLM 5.2?

GLM 5.2 and GLM 5.2 Air weights are published on Hugging Face by Zhipu AI (Z.ai) and mirrored to the Ollama library. For the Mac-runnable Air model, run ollama run glm5.2:air. For the full server-class model, run ollama run glm5.2 on hardware with 390 GB+ of memory, or deploy the Hugging Face weights with vLLM or SGLang on an 8×H100 node. According to LLMCheck benchmarks, the MIT license means you can redistribute and fine-tune both freely.

Sources & References

Which July 2026 Model Fits Your Mac?

Use our free Mac LLM Checker to find which July 2026 model fits your hardware — from 8 GB MacBook Air to M4 Ultra Mac Studio.

Check My Mac