Quick Verdict: Who Should Switch Immediately
Switch to Qwen 4 Coder if…
You are running Devstral, Qwen 3.6 Coder, DeepSeek Coder V3, or Llama 5 Coder for local coding work — and you have at least 36GB Unified Memory. Qwen 4 Coder is faster, more accurate, has a longer context, and ships under a cleaner license. The migration is a one-line ollama pull.
Hold off if…
You are on an 8GB or 16GB Mac. The 32B-A3B size needs roughly 22GB of memory at Q4, which leaves no headroom on a 16GB machine. Stay on Qwen 3.6 7B Coder or Gemma 4 E4B for now, or pick up a 24GB+ Mac before upgrading.
According to LLMCheck benchmarks, Qwen 4 Coder 32B-A3B scores LLMCheck Score 72/100 — the highest of any model in the Mac-runnable coder category. It edges out Devstral Small 24B (68), Qwen 3.6-35B-A3B (69), and ties Qwen 4 base (72) at coding-specific tasks while running marginally lighter.
What is New in Qwen 4 Coder
The headline architecture is the same Mixture-of-Experts shape as Qwen 4 base — 32 billion total parameters with 3 billion active per token. What is new sits in three places: the training pipeline, the agentic tool-use behaviors, and the context window.
A new coder-specific training pipeline
The Qwen team rebuilt the post-training stack around two things: synthetic SWE-Bench-style trajectories generated from real GitHub issue resolutions, and reinforcement learning against an executable test harness. Where Qwen 3.6 Coder was a SFT-on-code-corpora model, Qwen 4 Coder is post-trained against pass/fail signal from actual test suites. That shows up in the numbers — a 9-point jump on SWE-Verified relative to Qwen 3.6.
Agentic tool use is finally usable
Prior Mac-runnable coders could call tools but tended to drop the thread on multi-step plans. Qwen 4 Coder ships with native tool-call formatting that matches the OpenAI function-calling schema, and in our testing it routinely chained 6–8 tool calls without losing state — file reads, edits, test runs, and follow-up edits. This is the first local model where I trust the agentic loop enough to leave it running on a feature branch.
256K context, and it actually uses it
Plenty of models advertise long context windows that collapse past 32K. Qwen 4 Coder holds coherence through long file diffs and large repository tours — needle-in-haystack scores stay above 95% out to 200K tokens. For monorepo work or long debug sessions, this is the practical advantage over Devstral Small 24B's 128K.
Benchmarks Head-to-Head
Here is how Qwen 4 Coder stacks up against the other models you would realistically consider for Mac-local coding work — plus GPT-5-mini as a frontier API reference point:
| Benchmark | Qwen 4 Coder | Qwen 4 (base) | Devstral 24B | Qwen 3.6-35B | GPT-5-mini |
|---|---|---|---|---|---|
| SWE-Verified | 82% | 81% | 79% | 73.4% | 78% |
| HumanEval | 96% | 95% | 93% | 91% | 96% |
| MBPP | 89% | 89% | 85% | 83% | 88% |
| LiveCodeBench | 78% | 76% | 71% | 68% | 77% |
| Context Window | 256K | 256K | 128K | 128K | 200K |
The eye-catching number is SWE-Verified: 82% beats GPT-5-mini, and is within striking distance of GPT-5 full (~85%) and Claude Sonnet 4.6 (~84%). For a model you run on a laptop with no API key, no rate limit, and no per-token cost, this is a genuine shift in what is possible.
Mac Performance by Chip
The MoE architecture means Qwen 4 Coder only activates 3B parameters per token, so inference speeds are excellent for the model's quality tier. Measured on Q4_K_M quantization through Ollama:
| Chip / RAM | Tokens/sec | Fits? |
|---|---|---|
| M5 Max 128GB | ~78 tok/s | Yes |
| M5 Max 64GB | ~65 tok/s | Yes |
| M4 Max 48GB | ~62 tok/s | Yes |
| M4 Pro 24GB | ~58 tok/s | Yes (tight) |
| M3 Max 64GB | ~45 tok/s | Yes |
| M5 Pro 32GB | — | Borderline |
The standout is the M4 Pro 24GB result. Because the active-parameter count is only 3B, the memory-bandwidth bottleneck is forgiving — a mid-tier Mac matches a high-end one on per-token latency. The bigger Maxes pull ahead mostly on prompt-prefill speed and long-context throughput, not on the steady-state generation rate you feel in autocomplete.
A practical note: M5 Pro 32GB technically loads the Q4 weights but leaves only ~10GB for IDE, browser, Slack, and OS — expect swap pressure under real workloads. If you are buying new for coding, prioritize 36GB+ unified memory.
The Apache 2.0 Advantage
Apache 2.0 is not a footnote — it is the most consequential difference between Qwen 4 Coder and the other contender for "best open coder," Llama 5 Coder.
Llama 5 ships under Meta's community license, which adds two structural restrictions: a 700 million monthly active user threshold above which you need a separate commercial license, and a clause that bars use of Llama outputs to train competing models. Both clauses are uncomfortable for enterprise legal review and they have killed real adoption deals we have seen in the past year.
Apache 2.0 has none of that. You can ship Qwen 4 Coder embedded in a paid product, fine-tune it on customer code, redistribute the weights, build a competing API service — all without a separate commercial agreement and without revenue caps. For solo devs and startups in particular, that removes a real legal-review tax.
Real-World Coding Workflows
Refactoring
On a 4,200-line TypeScript service I asked it to extract three helper modules and rewrite the imports. It got the structure right on the first try, only missed two re-exports, and self-corrected after running the test suite once. Devstral Small 24B failed the same task by hallucinating a module path that did not exist.
Agentic tool use
Wired into a small ReAct-style harness with read_file, write_file, grep, and run_tests tools, Qwen 4 Coder fixed 7 of 10 SWE-Bench-Lite-style tasks I hand-curated from real Python OSS issues. That tracks with the 82% SWE-Verified headline. The failure modes are mostly long-horizon: tasks needing 15+ tool calls start to drift.
Multi-file edits
This is where the 256K context window pays off. Loading 9 files of an Express + React stack in one prompt and asking for a cross-cutting auth change worked first try. The model edited 6 of the 9 files coherently and produced a single unified diff.
Install & Setup
Ollama (fastest path)
ollama run qwen4:coder
# or pin to the specific variant
ollama run qwen4:coder-32b-a3b-q4_k_m
MLX (best for M-series, larger context)
pip install mlx-lm
python -m mlx_lm.generate \
--model mlx-community/Qwen4-Coder-32B-A3B-4bit \
--prompt "Write a Python function to detect cycles in a directed graph" \
--max-tokens 1024
IDE integration
- Cursor — point the "Custom OpenAI-compatible" endpoint at
http://localhost:11434/v1and selectqwen4:coderas the model. - Zed — add
{"language_models": {"ollama": {"available_models": [{"name": "qwen4:coder"}]}}}tosettings.json. - Continue.dev — works out of the box once Ollama is running; select Qwen 4 Coder in the model picker.
Limitations
Three honest caveats before you ditch your existing setup.
Long-horizon agentic coding still trails DeepSeek V4 Pro. On 30-step trajectories with branching tool calls, DeepSeek V4 Pro is still markedly better. But DeepSeek V4 Pro is a 685B parameter MoE — it is server-only and not realistically Mac-runnable, so the comparison is theoretical for our audience.
Non-English code comments degrade slightly. Versus Qwen 4 base, the Coder variant trades a small amount of multilingual fluency in NL comments for code-task quality. If you write Chinese or German doc-comments, you may prefer Qwen 4 base.
Memory headroom matters more than tok/s. The 24GB M4 Pro number looks great, but real workloads (IDE + browser + Docker) will push you into swap. We strongly recommend 36GB+ for daily coding.