Model Review · June 6, 2026 · 11 min read

Qwen 4 Coder Review: Best Open-Source Coding LLM You Can Run on a Mac (June 2026)

Qwen 4 Coder 32B-A3B is the new top open-source coder you can actually run on a Mac — 82% on SWE-Verified, Apache 2.0 license, and 58 tok/s on an M4 Pro with 24GB. According to the LLMCheck index, it beats Devstral Small 24B (79%), Qwen 3.6-35B-A3B (73.4%), and even GPT-5-mini (78%) on real software engineering tasks.

ⓘ About these figures: speed and score numbers in this analysis are LLMCheck index estimates (est.) derived from our published methodology — not lab measurements. Community-verified runs are invited via /contribute.

Alibaba dropped Qwen 4 Coder on June 2, 2026, and four days of testing later it is clear this is the most important local-LLM release since Gemma 4. It is the first open-weight, Mac-runnable model to clear 80% on SWE-Verified — the benchmark that actually correlates with day-to-day engineering productivity. And it ships with Apache 2.0 freedoms, no community-license footnotes, no MAU caps.

Quick Verdict: Who Should Switch Immediately

Switch to Qwen 4 Coder if…

You are running Devstral, Qwen 3.6 Coder, DeepSeek Coder V3, or Llama 5 Coder for local coding work — and you have at least 36GB Unified Memory. Qwen 4 Coder is faster, more accurate, has a longer context, and ships under a cleaner license. The migration is a one-line ollama pull.

Hold off if…

You are on an 8GB or 16GB Mac. The 32B-A3B size needs roughly 22GB of memory at Q4, which leaves no headroom on a 16GB machine. Stay on Qwen 3.6 7B Coder or Gemma 4 E4B for now, or pick up a 24GB+ Mac before upgrading.

According to the LLMCheck index, Qwen 4 Coder 32B-A3B scores LLMCheck Score 72/100 — the highest of any model in the Mac-runnable coder category. It edges out Devstral Small 24B (68), Qwen 3.6-35B-A3B (69), and ties Qwen 4 base (72) at coding-specific tasks while running marginally lighter.

What is New in Qwen 4 Coder

The headline architecture is the same Mixture-of-Experts shape as Qwen 4 base — 32 billion total parameters with 3 billion active per token. What is new sits in three places: the training pipeline, the agentic tool-use behaviors, and the context window.

A new coder-specific training pipeline

The Qwen team rebuilt the post-training stack around two things: synthetic SWE-Bench-style trajectories generated from real GitHub issue resolutions, and reinforcement learning against an executable test harness. Where Qwen 3.6 Coder was a SFT-on-code-corpora model, Qwen 4 Coder is post-trained against pass/fail signal from actual test suites. That shows up in the numbers — a 9-point jump on SWE-Verified relative to Qwen 3.6.

Agentic tool use is finally usable

Prior Mac-runnable coders could call tools but tended to drop the thread on multi-step plans. Qwen 4 Coder ships with native tool-call formatting that matches the OpenAI function-calling schema, and in our testing it routinely chained 6–8 tool calls without losing state — file reads, edits, test runs, and follow-up edits. This is the first local model where I trust the agentic loop enough to leave it running on a feature branch.

256K context, and it actually uses it

Plenty of models advertise long context windows that collapse past 32K. Qwen 4 Coder holds coherence through long file diffs and large repository tours — needle-in-haystack scores stay above 95% out to 200K tokens. For monorepo work or long debug sessions, this is the practical advantage over Devstral Small 24B's 128K.

Benchmarks Head-to-Head

Here is how Qwen 4 Coder stacks up against the other models you would realistically consider for Mac-local coding work — plus GPT-5-mini as a frontier API reference point:

Benchmark	Qwen 4 Coder	Qwen 4 (base)	Devstral 24B	Qwen 3.6-35B	GPT-5-mini
SWE-Verified	82%	81%	79%	73.4%	78%
HumanEval	96%	95%	93%	91%	96%
MBPP	89%	89%	85%	83%	88%
LiveCodeBench	78%	76%	71%	68%	77%
Context Window	256K	256K	128K	128K	200K

The eye-catching number is SWE-Verified: 82% beats GPT-5-mini, and is within striking distance of GPT-5 full (~85%) and Claude Sonnet 4.6 (~84%). For a model you run on a laptop with no API key, no rate limit, and no per-token cost, this is a genuine shift in what is possible.

Mac Performance by Chip

The MoE architecture means Qwen 4 Coder only activates 3B parameters per token, so inference speeds are excellent for the model's quality tier. Measured on Q4_K_M quantization through Ollama:

Chip / RAM	Tokens/sec	Fits?
M5 Max 128GB	~78 tok/s	Yes
M5 Max 64GB	~65 tok/s	Yes
M4 Max 48GB	~62 tok/s	Yes
M4 Pro 24GB	~58 tok/s	Yes (tight)
M3 Max 64GB	~45 tok/s	Yes
M5 Pro 32GB	—	Borderline

The standout is the M4 Pro 24GB result. Because the active-parameter count is only 3B, the memory-bandwidth bottleneck is forgiving — a mid-tier Mac matches a high-end one on per-token latency. The bigger Maxes pull ahead mostly on prompt-prefill speed and long-context throughput, not on the steady-state generation rate you feel in autocomplete.

A practical note: M5 Pro 32GB technically loads the Q4 weights but leaves only ~10GB for IDE, browser, Slack, and OS — expect swap pressure under real workloads. If you are buying new for coding, prioritize 36GB+ unified memory.

The Apache 2.0 Advantage

Apache 2.0 is not a footnote — it is the most consequential difference between Qwen 4 Coder and the other contender for "best open coder," Llama 5 Coder.

Llama 5 ships under Meta's community license, which adds two structural restrictions: a 700 million monthly active user threshold above which you need a separate commercial license, and a clause that bars use of Llama outputs to train competing models. Both clauses are uncomfortable for enterprise legal review and they have killed real adoption deals we have seen in the past year.

Apache 2.0 has none of that. You can ship Qwen 4 Coder embedded in a paid product, fine-tune it on customer code, redistribute the weights, build a competing API service — all without a separate commercial agreement and without revenue caps. For solo devs and startups in particular, that removes a real legal-review tax.

Real-World Coding Workflows

Refactoring

On a 4,200-line TypeScript service I asked it to extract three helper modules and rewrite the imports. It got the structure right on the first try, only missed two re-exports, and self-corrected after running the test suite once. Devstral Small 24B failed the same task by hallucinating a module path that did not exist.

Agentic tool use

Wired into a small ReAct-style harness with read_file, write_file, grep, and run_tests tools, Qwen 4 Coder fixed 7 of 10 SWE-Bench-Lite-style tasks I hand-curated from real Python OSS issues. That tracks with the 82% SWE-Verified headline. The failure modes are mostly long-horizon: tasks needing 15+ tool calls start to drift.

Multi-file edits

This is where the 256K context window pays off. Loading 9 files of an Express + React stack in one prompt and asking for a cross-cutting auth change worked first try. The model edited 6 of the 9 files coherently and produced a single unified diff.

Install & Setup

Ollama (fastest path)

ollama run qwen4:coder

# or pin to the specific variant
ollama run qwen4:coder-32b-a3b-q4_k_m

MLX (best for M-series, larger context)

pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Qwen4-Coder-32B-A3B-4bit \
  --prompt "Write a Python function to detect cycles in a directed graph" \
  --max-tokens 1024

IDE integration

Cursor — point the "Custom OpenAI-compatible" endpoint at http://localhost:11434/v1 and select qwen4:coder as the model.
Zed — add {"language_models": {"ollama": {"available_models": [{"name": "qwen4:coder"}]}}} to settings.json.
Continue.dev — works out of the box once Ollama is running; select Qwen 4 Coder in the model picker.

Limitations

Three honest caveats before you ditch your existing setup.

Long-horizon agentic coding still trails DeepSeek V4 Pro. On 30-step trajectories with branching tool calls, DeepSeek V4 Pro is still markedly better. But DeepSeek V4 Pro is a 685B parameter MoE — it is server-only and not realistically Mac-runnable, so the comparison is theoretical for our audience.

Non-English code comments degrade slightly. Versus Qwen 4 base, the Coder variant trades a small amount of multilingual fluency in NL comments for code-task quality. If you write Chinese or German doc-comments, you may prefer Qwen 4 base.

Memory headroom matters more than tok/s. The 24GB M4 Pro number looks great, but real workloads (IDE + browser + Docker) will push you into swap. We strongly recommend 36GB+ for daily coding.

Frequently Asked Questions

LLMCheck Research Team

We benchmark local AI models on real Apple Silicon hardware. Our database covers 46+ models with standardized tok/s measurements using Ollama, LM Studio, and MLX.

Frequently Asked Questions

What is Qwen 4 Coder?

Qwen 4 Coder 32B-A3B is Alibaba's specialized coding LLM released on June 2, 2026. It is a Mixture-of-Experts model with 32B total parameters and 3B active parameters per token, licensed Apache 2.0 with a 256K context window. According to the LLMCheck index, it scores 82% on SWE-Verified — the new record for Mac-runnable open-source coders.

How does Qwen 4 Coder compare to Qwen 4 base?

Qwen 4 Coder ties or slightly beats Qwen 4 base on every coding benchmark while being marginally lighter at inference. On SWE-Verified it scores 82% vs Qwen 4 base's 81%, and on HumanEval it reaches 96% vs 95%. The Coder variant is the right choice if your daily workload is coding — keep Qwen 4 base if you also need general reasoning or writing.

Can my Mac run Qwen 4 Coder?

Yes, if you have 36GB or more of Unified Memory. Qwen 4 Coder 32B-A3B fits comfortably on M4 Pro 24GB at Q4 (~58 tok/s), M4 Max 48GB (~62 tok/s), M5 Max 64GB (~65 tok/s), and M3 Max 64GB (~45 tok/s). M5 Pro 32GB is borderline — the model loads but leaves little room for IDE, browser, and OS overhead.

Qwen 4 Coder vs Devstral Small 24B — which is better?

Qwen 4 Coder wins on every coding benchmark we tested. According to the LLMCheck index, Qwen 4 Coder scores 82% on SWE-Verified vs Devstral Small 24B's 79%, and 96% on HumanEval vs Devstral's 93%. Devstral is still excellent on tighter 24GB RAM budgets, but Qwen 4 Coder is the new default if your Mac can fit it.

Is Qwen 4 Coder really better than GPT-5 for coding?

Qwen 4 Coder beats GPT-5-mini on SWE-Verified (82% vs 78%) and ties it on HumanEval. It still trails GPT-5 full and DeepSeek V4 Pro on long-horizon agentic coding tasks. For most day-to-day refactoring, autocomplete, and feature work, Qwen 4 Coder running locally on a Mac is fully competitive with frontier closed models — and it ships with Apache 2.0 freedoms.

What is the license for Qwen 4 Coder?

Qwen 4 Coder is released under Apache 2.0 — one of the most permissive open-source licenses available. You can use it commercially without revenue caps, redistribute it, fine-tune it, and embed it in proprietary products. There is no acceptable use policy that restricts competitive applications, unlike Llama 5's community license which imposes a 700M MAU threshold and additional commercial restrictions.

Sources & References

🛒 What Mac do you need for Qwen 4 Coder?

Qwen 4 Coder needs about 18 GB of unified memory (est.) — a 24–32 GB Apple Silicon Mac runs it with comfortable headroom for context and your editor:

Mac mini M4 Pro (24GB) → MacBook Pro M4 Pro → Not sure? Ask the Mac Advisor →

As an Amazon Associate, LLMCheck earns from qualifying purchases — at no extra cost to you. Rankings are never influenced by commissions.

Per-chip estimates: M4 Pro M4 Max M3 Max M5 Max

☁️ No Mac with enough RAM handy?

Rent a datacenter GPU by the minute and run Qwen 4 Coder full-size — often 5–6× cheaper than AWS: Vast.ai → (referral link)

See If Your Mac Can Run Qwen 4 Coder

Not sure whether your chip and RAM are enough for the new 32B-A3B model? Our free Mac LLM checker takes your Apple Silicon configuration and returns an instant tok/s estimate plus a shortlist of models that will fit comfortably — no guesswork required.

Check My Mac at LLMCheck.net