The stack at a glance
A local coding assistant has two pieces: a model server that runs the LLM on your hardware, and an editor integration that feeds it your code and shows you suggestions. We use Ollama for the server and Qwen 4 Coder 32B-A3B for the model — Alibaba's Apache 2.0 coding model that, according to LLMCheck benchmarks, hits 82% on SWE-Verified, the best of any open model you can run on a Mac.[1]
Like Qwen 4.1, it is a mixture-of-experts model: 32B total parameters, 3B active per token. That is what lets a near-frontier coder fit in ~18 GB and run at ~58 tok/s on an M4 Pro. For the editor, you have three good choices, and this guide covers all of them: Continue.dev (VS Code), Zed (native Mac editor with built-in AI), and Cursor (pointed at a custom endpoint).
| Editor | Best for | Offline? |
|---|---|---|
| Continue.dev | Deepest local-model control in VS Code | Fully local |
| Zed | Fast native editor, least setup | Fully local |
| Cursor | Already a Cursor user | Mostly (some cloud features) |
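The memory and speed figures above follow from two separate quantities: resident memory scales with the total parameter count, while per-token compute scales with the active parameters. The sketch below works through that arithmetic. The 4.5 bits-per-weight value is an assumed average for a Q4 build, not a number taken from the model card, so treat the result as a ballpark.

```python
# Rough mixture-of-experts arithmetic: memory follows TOTAL parameters,
# per-token compute follows ACTIVE parameters.

# Assumption: a Q4 quantization averages about 4.5 bits per weight once
# scales and other metadata are included. Real builds vary.
ASSUMED_BITS_PER_WEIGHT = 4.5


def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = total_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the weights that take part in computing each token."""
    return active_params_billion / total_params_billion


size = weights_gb(32, ASSUMED_BITS_PER_WEIGHT)   # 32B total parameters
share = active_fraction(3, 32)                   # 3B active per token
print(f"weights: ~{size:.0f} GB, active per token: {share:.1%}")
```

Under that assumption the weights come to about 18 GB, which matches the figure quoted for the Q4 build, while only about 9% of them are exercised for any single token.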
Step 1: Hardware Check
Qwen 4 Coder 32B-A3B needs about 18 GB of unified memory at Q4, plus headroom for your editor and the code context you feed it. That makes a 24 GB Mac the recommended minimum. Check yours under Apple menu → About This Mac → Memory.
- 24 GB+: ideal. Run the full 32B-A3B model.
- 16 GB: drop to a smaller coder such as qwen4-coder:7b, which fits comfortably and is still strong for autocomplete and routine edits.
- 32 GB+: raise the context window for whole-file and multi-file work.
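The tiers above can be expressed as a small lookup. This is a sketch, not an official sizing tool: the function names are illustrative, and treating Macs with less than 16 GB the same as the 16 GB tier is an assumption the guide does not make explicitly.

```python
import subprocess

FULL_MODEL = "qwen4-coder"
SMALL_MODEL = "qwen4-coder:7b"


def recommend(ram_gb: float) -> tuple[str, str]:
    """Map unified memory (GB) to a model tag and a short note."""
    if ram_gb >= 32:
        return FULL_MODEL, "raise the context window for whole-file and multi-file work"
    if ram_gb >= 24:
        return FULL_MODEL, "run the full model and keep the context window moderate"
    # Assumption: anything below 24 GB gets the smaller coder, including
    # machines under 16 GB, which the tiers above do not cover.
    return SMALL_MODEL, "use the smaller coder for autocomplete and routine edits"


def installed_ram_gb() -> float:
    """Read physical memory on macOS with `sysctl -n hw.memsize` (bytes)."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip()) / 2**30


# Usage on a Mac:
#   model, note = recommend(installed_ram_gb())
#   print(model, "-", note)
```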
Buying a Mac for local coding? A 24–32 GB M4 Pro or M5 is the value sweet spot. Our Mac hardware buying hub ranks configurations by real-world tok/s per dollar so you don't overspend on memory you won't use — or under-buy and stall.
Step 2: Install Ollama & Pull Qwen 4 Coder
Install Ollama from ollama.com (full walkthrough in our Install Ollama on Mac guide), then pull the coding model:
ollama pull qwen4-coder
This downloads the Q4 build (~18 GB). To confirm it is ready and check that Ollama's local server is serving it:
ollama list # should show qwen4-coder
ollama serve # starts the API at http://localhost:11434 (usually already running)
Ollama normally runs its server automatically in the background, so ollama serve is only needed if the API isn't already up. You can do a quick smoke test:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen4-coder","messages":[{"role":"user","content":"Write a Python one-liner to flatten a nested list."}]}'
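If you would rather script the smoke test, the same request can be sent from Python using only the standard library. This is a minimal sketch: it assumes the server is on the default port and that the model tag is qwen4-coder as pulled above, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt: str, model: str = "qwen4-coder") -> dict:
    """Build the same JSON body the curl command sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, url: str = OLLAMA_CHAT_URL, timeout: float = 120.0) -> str:
    """POST the prompt to the local server and return the model's reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Usage (needs a running Ollama server with the model pulled):
#   print(ask("Write a Python one-liner to flatten a nested list."))
```

A connection error here usually means the server is not running; an HTTP error mentioning the model usually means the tag does not match what was pulled.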
Step 3: Install Continue.dev (or Zed / Cursor)
Pick the editor that fits how you work. The rest of this guide uses Continue.dev as the primary path because it offers the most control, with Zed and Cursor notes alongside.
Continue.dev (VS Code)
Open VS Code, go to the Extensions panel (Cmd+Shift+X), search for Continue, and click Install. A Continue icon appears in your sidebar. It also works in JetBrains IDEs via their plugin marketplace.
Zed
Download Zed from zed.dev. AI is built in — no extension needed. You'll configure its Ollama provider in the next step.
Cursor
If you already use Cursor, you can keep it and point it at your local model. Open Settings → Models, enable a custom OpenAI-compatible base URL, and you'll wire it up in Step 4.
Step 4: Point the Extension at Your Local Model
Continue.dev config
Open the Continue config (click the gear in the Continue sidebar, or edit ~/.continue/config.yaml) and add Qwen 4 Coder as both your chat model and your autocomplete model:
models:
  - name: Qwen 4 Coder (local)
    provider: ollama
    model: qwen4-coder
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 4 Coder Autocomplete
    provider: ollama
    model: qwen4-coder
    apiBase: http://localhost:11434
    roles:
      - autocomplete
Save the file. Continue picks up the change immediately — you'll see "Qwen 4 Coder (local)" in the model dropdown.
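If the model does not show up, or requests fail, one likely cause is a mismatch between the model value in the config and the tags Ollama actually has installed. The sketch below checks that using Ollama's /api/tags endpoint. The helper names are illustrative, and the handling of the implicit :latest tag reflects how Ollama usually lists untagged pulls.

```python
import json
import urllib.request


def installed_models(tags_response: dict) -> set[str]:
    """Collect model names from an /api/tags response body."""
    return {entry["name"] for entry in tags_response.get("models", [])}


def is_installed(model: str, tags_response: dict) -> bool:
    """True if `model` (with or without an explicit tag) is installed."""
    # A pull without a tag is normally listed as "<name>:latest".
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in installed_models(tags_response)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server which models it has."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage (needs a running Ollama server):
#   print(is_installed("qwen4-coder", fetch_tags()))
```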
Zed config
In Zed, open settings.json (Cmd+,) and register Ollama as a language-model provider:
{
  "language_models": {
    "ollama": {
      "api_url": "http://localhost:11434"
    }
  },
  "assistant": {
    "default_model": {
      "provider": "ollama",
      "model": "qwen4-coder"
    }
  }
}
Cursor config
In Cursor's Settings → Models, add a custom model with the OpenAI-compatible base URL pointing at Ollama:
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string works locally)
Model: qwen4-coder
Heads-up on Cursor: some Cursor features (like its tab autocomplete and indexing) still route through Cursor's cloud even with a custom model. For a guaranteed fully-offline assistant, prefer Continue.dev or Zed.
Step 5: Use It — Autocomplete, Chat & Agent Mode
With the model wired in, you now have a full coding assistant running on-device. Three things to try:
- Tab autocomplete: start typing a function and Qwen 4 Coder suggests the rest inline. Press Tab to accept. Great for boilerplate, tests, and repetitive patterns.
- Inline chat & edit: select code and press Cmd+I (Continue) to ask for a refactor, a bug fix, or an explanation. The model rewrites the selection in place and shows a diff you can accept or reject.
- Agent mode: in Continue's chat, switch to Agent and give a higher-level task ("add input validation to this endpoint and a test for it"). It reads relevant files, proposes multi-file changes, and applies them with your approval.
A good first prompt to feel it out, with a file open:
Refactor this function to use early returns and add a docstring.
Then write a pytest test that covers the edge cases.
With Continue.dev or Zed, everything here runs through your local Ollama server: no network calls, no data sent to any vendor, and it keeps working on a plane or behind a firewall.
Tip: For autocomplete that stays snappy, some teams pair a small fast model (e.g. qwen4-coder:7b) for tab completion with the 32B model for chat and agent mode. Continue.dev lets you assign different models per role, exactly as shown in Step 4.
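To make the split concrete, the sketch below renders a Continue models section in which the small model handles autocomplete and the full model handles chat, edit, and apply. It only reuses the keys and model tags already shown in this guide; the render_models helper itself is hypothetical and simply builds the YAML text.

```python
def render_models(entries: list[dict], api_base: str = "http://localhost:11434") -> str:
    """Render a Continue-style `models:` section as YAML text."""
    lines = ["models:"]
    for entry in entries:
        lines.append(f"  - name: {entry['name']}")
        lines.append("    provider: ollama")
        lines.append(f"    model: {entry['model']}")
        lines.append(f"    apiBase: {api_base}")
        lines.append("    roles:")
        lines.extend(f"      - {role}" for role in entry["roles"])
    return "\n".join(lines) + "\n"


# Full model for chat and edits, small model for tab completion.
SPLIT_SETUP = [
    {"name": "Qwen 4 Coder (local)", "model": "qwen4-coder",
     "roles": ["chat", "edit", "apply"]},
    {"name": "Qwen 4 Coder Autocomplete", "model": "qwen4-coder:7b",
     "roles": ["autocomplete"]},
]

print(render_models(SPLIT_SETUP))
```

Paste the printed section into ~/.continue/config.yaml in place of the Step 4 version, after pulling qwen4-coder:7b so both tags exist locally.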
Step 6: Tips & Scaling Beyond Your Mac
Give it more context on bigger Macs
The more of your codebase the model can see, the better its edits. On 32 GB+ Macs, raise the context window so it can hold larger files and more surrounding code. With Ollama you can bake a larger window into a custom model:
# Modelfile
FROM qwen4-coder
PARAMETER num_ctx 32768
Save that as a file named Modelfile, then build the larger-context variant:
ollama create qwen4-coder-32k -f Modelfile
# then reference qwen4-coder-32k in your editor config
On a 24 GB Mac, keep context moderate (8K–16K) so the model doesn't swap and slow down.
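The reason context costs memory is the key-value cache, which grows linearly with the number of tokens held in the window. The sketch below shows the standard estimate. The layer count, KV-head count, and head dimension are placeholders chosen for illustration, not published figures for Qwen 4 Coder, and a 16-bit cache is assumed.

```python
def kv_cache_gib(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> float:
    """Estimate KV-cache size in GiB for a given context length."""
    # Factor of 2: each layer stores one key and one value per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


# Placeholder architecture for illustration only; substitute the real
# values from the model's configuration if you know them.
PLACEHOLDER_ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for ctx in (8192, 16384, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx, **PLACEHOLDER_ARCH):.2f} GiB")
```

With these placeholder values the cache grows from 0.75 GiB at 8K to 3 GiB at 32K, on top of the weights. The real numbers depend on the model's architecture and on how Ollama is configured, but the linear growth is why a 24 GB machine has less room for long windows.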
When you need a coder bigger than your Mac
Some 2026 frontier coders — DeepSeek V4 Pro, Kimi K3 — are simply too large for any Mac's unified memory. When a hard task outruns what Qwen 4 Coder can do locally, the practical option is to rent a GPU by the hour and run the big model there, keeping the same Ollama/OpenAI-compatible workflow — just point your editor at the remote endpoint instead of localhost.
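Because both the local and the rented setup speak the same OpenAI-compatible protocol, switching between them can come down to a single base URL. The sketch below illustrates that idea for your own scripts. The CODER_BASE_URL environment variable is a hypothetical name invented for this example, and the remote address in the usage note is a documentation placeholder.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"


def resolve_base_url(env=None) -> str:
    """Return the remote base URL if one is set, otherwise the local server."""
    env = os.environ if env is None else env
    # CODER_BASE_URL is a hypothetical variable name used for illustration.
    override = env.get("CODER_BASE_URL", "").strip()
    return (override or LOCAL_BASE_URL).rstrip("/")


def chat_completions_url(base_url: str) -> str:
    """Build the chat endpoint for any OpenAI-compatible base URL."""
    return f"{base_url.rstrip('/')}/chat/completions"


print(chat_completions_url(resolve_base_url()))
```

Setting CODER_BASE_URL to something like http://203.0.113.5:11434/v1 would send the same requests to the rented machine; unsetting it falls back to localhost. In the editors, the equivalent change is the apiBase, api_url, or Base URL field from Step 4.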
A cost-effective place to do this is Vast.ai, a marketplace for on-demand GPUs where an H100 or a comparable 80 GB card rents for a few dollars an hour, far cheaper than buying hardware for an occasional frontier-model task.
Disclosure: the Vast.ai link is a referral; if you sign up through it, LLMCheck may earn a small credit at no extra cost to you. We only recommend it because renting beats buying for occasional big-model jobs.
You're set. You now have a private coding assistant — autocomplete, chat, and agent edits — running entirely on your Mac, with a clear path to rent extra horsepower only when you actually need it.