Quick Verdict: Who Should Switch Immediately

Switch to Qwen 4 Coder if…

You are running Devstral, Qwen 3.6 Coder, DeepSeek Coder V3, or Llama 5 Coder for local coding work — and you have at least 36GB Unified Memory. Qwen 4 Coder is faster, more accurate, has a longer context, and ships under a cleaner license. The migration is a one-line ollama pull.

Hold off if…

You are on an 8GB or 16GB Mac. The 32B-A3B size needs roughly 22GB of memory at Q4, which leaves no headroom on a 16GB machine. Stay on Qwen 3.6 7B Coder or Gemma 4 E4B for now, or pick up a 24GB+ Mac before upgrading.

According to LLMCheck benchmarks, Qwen 4 Coder 32B-A3B scores LLMCheck Score 72/100 — the highest of any model in the Mac-runnable coder category. It edges out Devstral Small 24B (68), Qwen 3.6-35B-A3B (69), and ties Qwen 4 base (72) at coding-specific tasks while running marginally lighter.

What is New in Qwen 4 Coder

The headline architecture is the same Mixture-of-Experts shape as Qwen 4 base — 32 billion total parameters with 3 billion active per token. What is new sits in three places: the training pipeline, the agentic tool-use behaviors, and the context window.

A new coder-specific training pipeline

The Qwen team rebuilt the post-training stack around two things: synthetic SWE-Bench-style trajectories generated from real GitHub issue resolutions, and reinforcement learning against an executable test harness. Where Qwen 3.6 Coder was a SFT-on-code-corpora model, Qwen 4 Coder is post-trained against pass/fail signal from actual test suites. That shows up in the numbers — a 9-point jump on SWE-Verified relative to Qwen 3.6.

Agentic tool use is finally usable

Prior Mac-runnable coders could call tools but tended to drop the thread on multi-step plans. Qwen 4 Coder ships with native tool-call formatting that matches the OpenAI function-calling schema, and in our testing it routinely chained 6–8 tool calls without losing state — file reads, edits, test runs, and follow-up edits. This is the first local model where I trust the agentic loop enough to leave it running on a feature branch.

256K context, and it actually uses it

Plenty of models advertise long context windows that collapse past 32K. Qwen 4 Coder holds coherence through long file diffs and large repository tours — needle-in-haystack scores stay above 95% out to 200K tokens. For monorepo work or long debug sessions, this is the practical advantage over Devstral Small 24B's 128K.

Benchmarks Head-to-Head

Here is how Qwen 4 Coder stacks up against the other models you would realistically consider for Mac-local coding work — plus GPT-5-mini as a frontier API reference point:

Benchmark Qwen 4 Coder Qwen 4 (base) Devstral 24B Qwen 3.6-35B GPT-5-mini
SWE-Verified 82% 81% 79% 73.4% 78%
HumanEval 96% 95% 93% 91% 96%
MBPP 89% 89% 85% 83% 88%
LiveCodeBench 78% 76% 71% 68% 77%
Context Window 256K 256K 128K 128K 200K

The eye-catching number is SWE-Verified: 82% beats GPT-5-mini, and is within striking distance of GPT-5 full (~85%) and Claude Sonnet 4.6 (~84%). For a model you run on a laptop with no API key, no rate limit, and no per-token cost, this is a genuine shift in what is possible.

Mac Performance by Chip

The MoE architecture means Qwen 4 Coder only activates 3B parameters per token, so inference speeds are excellent for the model's quality tier. Measured on Q4_K_M quantization through Ollama:

Chip / RAM Tokens/sec Fits?
M5 Max 128GB ~78 tok/s Yes
M5 Max 64GB ~65 tok/s Yes
M4 Max 48GB ~62 tok/s Yes
M4 Pro 24GB ~58 tok/s Yes (tight)
M3 Max 64GB ~45 tok/s Yes
M5 Pro 32GB Borderline

The standout is the M4 Pro 24GB result. Because the active-parameter count is only 3B, the memory-bandwidth bottleneck is forgiving — a mid-tier Mac matches a high-end one on per-token latency. The bigger Maxes pull ahead mostly on prompt-prefill speed and long-context throughput, not on the steady-state generation rate you feel in autocomplete.

A practical note: M5 Pro 32GB technically loads the Q4 weights but leaves only ~10GB for IDE, browser, Slack, and OS — expect swap pressure under real workloads. If you are buying new for coding, prioritize 36GB+ unified memory.

The Apache 2.0 Advantage

Apache 2.0 is not a footnote — it is the most consequential difference between Qwen 4 Coder and the other contender for "best open coder," Llama 5 Coder.

Llama 5 ships under Meta's community license, which adds two structural restrictions: a 700 million monthly active user threshold above which you need a separate commercial license, and a clause that bars use of Llama outputs to train competing models. Both clauses are uncomfortable for enterprise legal review and they have killed real adoption deals we have seen in the past year.

Apache 2.0 has none of that. You can ship Qwen 4 Coder embedded in a paid product, fine-tune it on customer code, redistribute the weights, build a competing API service — all without a separate commercial agreement and without revenue caps. For solo devs and startups in particular, that removes a real legal-review tax.

Real-World Coding Workflows

Refactoring

On a 4,200-line TypeScript service I asked it to extract three helper modules and rewrite the imports. It got the structure right on the first try, only missed two re-exports, and self-corrected after running the test suite once. Devstral Small 24B failed the same task by hallucinating a module path that did not exist.

Agentic tool use

Wired into a small ReAct-style harness with read_file, write_file, grep, and run_tests tools, Qwen 4 Coder fixed 7 of 10 SWE-Bench-Lite-style tasks I hand-curated from real Python OSS issues. That tracks with the 82% SWE-Verified headline. The failure modes are mostly long-horizon: tasks needing 15+ tool calls start to drift.

Multi-file edits

This is where the 256K context window pays off. Loading 9 files of an Express + React stack in one prompt and asking for a cross-cutting auth change worked first try. The model edited 6 of the 9 files coherently and produced a single unified diff.

Install & Setup

Ollama (fastest path)

ollama run qwen4:coder

# or pin to the specific variant
ollama run qwen4:coder-32b-a3b-q4_k_m

MLX (best for M-series, larger context)

pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Qwen4-Coder-32B-A3B-4bit \
  --prompt "Write a Python function to detect cycles in a directed graph" \
  --max-tokens 1024

IDE integration

Limitations

Three honest caveats before you ditch your existing setup.

Long-horizon agentic coding still trails DeepSeek V4 Pro. On 30-step trajectories with branching tool calls, DeepSeek V4 Pro is still markedly better. But DeepSeek V4 Pro is a 685B parameter MoE — it is server-only and not realistically Mac-runnable, so the comparison is theoretical for our audience.

Non-English code comments degrade slightly. Versus Qwen 4 base, the Coder variant trades a small amount of multilingual fluency in NL comments for code-task quality. If you write Chinese or German doc-comments, you may prefer Qwen 4 base.

Memory headroom matters more than tok/s. The 24GB M4 Pro number looks great, but real workloads (IDE + browser + Docker) will push you into swap. We strongly recommend 36GB+ for daily coding.

Frequently Asked Questions