The stack at a glance

A local coding assistant has two pieces: a model server that runs the LLM on your hardware, and an editor integration that feeds it your code and shows you suggestions. We use Ollama for the server and Qwen 4 Coder 32B-A3B for the model — Alibaba's Apache 2.0 coding model that, according to LLMCheck benchmarks, hits 82% on SWE-Verified, the best of any open model you can run on a Mac.[1]

Like Qwen 4.1, it is a mixture-of-experts model: 32B total parameters, 3B active per token. All 32B have to sit in memory, which is where the ~18 GB footprint at Q4 comes from; only 3B do work on each token, which is what lets a near-frontier coder run at ~58 tok/s on an M4 Pro.

For the editor, you have three good choices, and this guide covers all of them: Continue.dev (VS Code), Zed (native Mac editor with built-in AI), and Cursor (pointed at a custom endpoint).

Editor        Best for                                 Offline?
Continue.dev  Deepest local-model control in VS Code   Fully local
Zed           Fast native editor, least setup          Fully local
Cursor        Already a Cursor user                    Mostly (some cloud features)

Step 1: Hardware Check

Qwen 4 Coder 32B-A3B needs about 18 GB of unified memory at Q4, plus headroom for your editor and the code context you feed it. That makes a 24 GB Mac the recommended minimum. Check yours under Apple menu → About This Mac → Memory.
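If you want to see where the 18 GB comes from, the arithmetic is short. This is a rough sketch, assuming a Q4 build averages about 4.5 bits per weight (an assumption; real quantized builds vary) and using illustrative function names. The last helper reads installed memory with the macOS `sysctl -n hw.memsize` command, so it only works on a Mac.

```python
import subprocess


def quantized_weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    A mixture-of-experts model keeps every expert in memory, so the
    total parameter count is used here, not the active count.
    """
    return total_params_billions * bits_per_weight / 8


def headroom_gb(installed_gb: float, model_gb: float) -> float:
    """Memory left over for macOS, the editor, and the context cache."""
    return installed_gb - model_gb


def installed_memory_gb() -> float:
    """Read installed memory on macOS via sysctl (reported in bytes)."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip()) / 2**30
```

With 32B parameters at the assumed 4.5 bits each, `quantized_weights_gb(32, 4.5)` gives 18.0, and `headroom_gb(24, 18.0)` leaves 6.0 GB on a 24 GB Mac for everything else.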

Buying a Mac for local coding? A 24–32 GB M4 Pro or M5 is the value sweet spot. Our Mac hardware buying hub ranks configurations by real-world tok/s per dollar so you don't overspend on memory you won't use — or under-buy and stall.

Step 2: Install Ollama & Pull Qwen 4 Coder

Install Ollama from ollama.com (full walkthrough in our Install Ollama on Mac guide), then pull the coding model:

ollama pull qwen4-coder

This downloads the Q4 build (~18 GB). To confirm it is ready and check that Ollama's local server is serving it:

ollama list          # should show qwen4-coder
ollama serve         # starts the API at http://localhost:11434 (usually already running)
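You can run the same check from a script. Ollama's HTTP API lists installed models at `/api/tags`; the sketch below fetches that list and looks for the model by name, with or without a tag such as `:latest`. The helper names are illustrative, and the matching rule is this sketch's own choice.

```python
import json
import urllib.request


def installed_models(tags_response: dict) -> list[str]:
    """Pull the model names out of an /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def has_model(tags_response: dict, wanted: str) -> bool:
    """True if `wanted` matches an installed model, ignoring the tag."""
    for name in installed_models(tags_response):
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server which models it has."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `has_model(fetch_tags(), "qwen4-coder")` should return True once the pull has finished; a connection error from `fetch_tags()` means the server is not up.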

Ollama normally runs its server automatically in the background, so ollama serve is only needed if the API isn't already up. You can do a quick smoke test:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen4-coder","messages":[{"role":"user","content":"Write a Python one-liner to flatten a nested list."}]}'
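The same smoke test in Python, using only the standard library. It is a minimal sketch of the curl call above: it posts to the OpenAI-compatible endpoint and pulls the reply text out of `choices[0].message.content`. The function names and the 120-second timeout are choices made for this example.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "qwen4-coder",
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send one prompt to the local server and return the reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

If `ask("Write a Python one-liner to flatten a nested list.")` returns text, the server and model are both working.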

Step 3: Install Continue.dev (or Zed / Cursor)

Pick the editor that fits how you work. The rest of this guide uses Continue.dev as the primary path because it offers the most control, with Zed and Cursor notes alongside.

Continue.dev (VS Code)

Open VS Code, go to the Extensions panel (Cmd+Shift+X), search for Continue, and click Install. A Continue icon appears in your sidebar. It also works in JetBrains IDEs via their plugin marketplace.

Zed

Download Zed from zed.dev. AI is built in — no extension needed. You'll configure its Ollama provider in the next step.

Cursor

If you already use Cursor, you can keep it and point it at your local model. Open Settings → Models and enable a custom OpenAI-compatible base URL; you'll fill in the details in Step 4.

Step 4: Point the Extension at Your Local Model

Continue.dev config

Open the Continue config (click the gear in the Continue sidebar, or edit ~/.continue/config.yaml) and add Qwen 4 Coder as both your chat model and your autocomplete model:

models:
  - name: Qwen 4 Coder (local)
    provider: ollama
    model: qwen4-coder
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply

  - name: Qwen 4 Coder Autocomplete
    provider: ollama
    model: qwen4-coder
    apiBase: http://localhost:11434
    roles:
      - autocomplete

Save the file. Continue picks up the change immediately — you'll see "Qwen 4 Coder (local)" in the model dropdown.
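A quick way to sanity-check the role assignments is to turn the config into a role-to-model map and see whether anything is missing. The sketch below mirrors the two entries above as plain Python dicts (the standard library has no YAML parser). The "first entry that lists a role wins" rule is this sketch's simplification, not a statement about how Continue resolves roles.

```python
# The two model entries from the config, written as Python data.
CONFIG_MODELS = [
    {"name": "Qwen 4 Coder (local)", "model": "qwen4-coder",
     "roles": ["chat", "edit", "apply"]},
    {"name": "Qwen 4 Coder Autocomplete", "model": "qwen4-coder",
     "roles": ["autocomplete"]},
]


def models_by_role(models: list[dict]) -> dict[str, str]:
    """Map each role to the first entry that lists it (sketch rule)."""
    assignment: dict[str, str] = {}
    for entry in models:
        for role in entry.get("roles", []):
            assignment.setdefault(role, entry["name"])
    return assignment


def missing_roles(models: list[dict],
                  required=("chat", "edit", "apply", "autocomplete")) -> list[str]:
    """List the required roles that no entry covers."""
    covered = models_by_role(models)
    return [role for role in required if role not in covered]
```

For the config above, `missing_roles(CONFIG_MODELS)` is empty; drop the second entry and it reports `autocomplete`, which is the usual reason tab completion stays silent.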

Zed config

In Zed, open settings.json (Cmd+,) and register Ollama as a language-model provider:

{
  "language_models": {
    "ollama": {
      "api_url": "http://localhost:11434"
    }
  },
  "assistant": {
    "default_model": {
      "provider": "ollama",
      "model": "qwen4-coder"
    }
  }
}

Cursor config

In Cursor's Settings → Models, add a custom model with the OpenAI-compatible base URL pointing at Ollama:

Base URL:  http://localhost:11434/v1
API Key:   ollama        (any non-empty string works locally)
Model:     qwen4-coder
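The two fields that most often go wrong here are the base URL (the `/v1` suffix is easy to forget) and an empty API key. As a sketch of what an OpenAI-compatible client does with those settings, the helpers below normalize the URL and build the request headers. They are illustrative only and are not Cursor's code.

```python
def normalize_base_url(url: str) -> str:
    """Strip trailing slashes and make sure the URL ends in /v1."""
    url = url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"


def request_headers(api_key: str) -> dict[str, str]:
    """Headers an OpenAI-compatible client sends with each request.

    The key's value is arbitrary for a local Ollama server, but clients
    generally refuse to send an empty one.
    """
    if not api_key:
        raise ValueError("API key must be a non-empty string")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

So `normalize_base_url("http://localhost:11434")` yields `http://localhost:11434/v1`, matching the Base URL line above.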

Heads-up on Cursor: some Cursor features (like its tab autocomplete and indexing) still route through Cursor's cloud even with a custom model. For a guaranteed fully-offline assistant, prefer Continue.dev or Zed.

Step 5: Use It — Autocomplete, Chat & Agent Mode

With the model wired in, you now have a full coding assistant running on-device. Three things to try: autocomplete, chat, and agent mode.

A good first prompt to feel it out, with a file open:

Refactor this function to use early returns and add a docstring.
Then write a pytest test that covers the edge cases.

With Continue.dev or Zed, everything here runs through your local Ollama server: nothing leaves your machine, no data is sent to any vendor, and it keeps working on a plane or behind a firewall.

Tip: For autocomplete that stays snappy, some teams pair a small fast model (e.g. qwen4-coder:7b) for tab completion with the 32B model for chat and agent mode. Continue.dev lets you assign different models per role, exactly as shown in Step 4.
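To see why speed matters more for tab completion than for chat, it helps to put numbers on it. This sketch uses the ~58 tok/s figure quoted earlier for the 32B model on an M4 Pro; the 30-token completion length and the 0.25-second budget are assumptions for illustration, and prompt-processing time is ignored.

```python
def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` at a steady rate (prompt time ignored)."""
    return tokens / tok_per_s


def tokens_within(budget_s: float, tok_per_s: float) -> int:
    """How many whole tokens fit in a latency budget."""
    return int(budget_s * tok_per_s)
```

At 58 tok/s, a 30-token suggestion takes about half a second (`generation_seconds(30, 58)` is roughly 0.52), and only 14 tokens fit inside a quarter-second budget. That is fine for chat and noticeable on every keystroke, which is the case for a smaller autocomplete model.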

Step 6: Tips & Scaling Beyond Your Mac

Give it more context on bigger Macs

The more of your codebase the model can see, the better its edits. On 32 GB+ Macs, raise the context window so it can hold larger files and more surrounding code. With Ollama you can bake a larger window into a custom model:

# Modelfile
FROM qwen4-coder
PARAMETER num_ctx 32768

# then, in the terminal:
ollama create qwen4-coder-32k -f Modelfile
# and reference qwen4-coder-32k in your editor config
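If you call Ollama from a script rather than an editor, you may not need a custom model at all: the native `/api/chat` endpoint accepts a per-request `options.num_ctx`. A minimal sketch of the request body follows, with an illustrative function name; send it as JSON in a POST to `http://localhost:11434/api/chat`.

```python
def chat_payload(prompt: str, num_ctx: int, model: str = "qwen4-coder") -> dict:
    """Body for Ollama's native /api/chat with a per-request context size."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # context window for this call only
        "stream": False,                  # return one JSON object, not a stream
    }
```

The Modelfile route is still the better fit for editors, since most integrations only let you pick a model name.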

On a 24 GB Mac, keep context moderate (8K–16K) so the model doesn't swap and slow down.
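Context costs memory because the model caches a key and a value vector per layer for every token in the window. The sketch below uses the standard estimate (2 × layers × KV heads × head dimension × bytes per element × tokens). The architecture numbers in the example are hypothetical placeholders, not Qwen 4 Coder's published configuration, so treat the result as an order-of-magnitude guide only.

```python
def kv_cache_gib(context_tokens: int,
                 layers: int,
                 kv_heads: int,
                 head_dim: int,
                 bytes_per_element: int = 2) -> float:
    """Estimate KV-cache size in GiB for a given context length.

    The factor of 2 covers keys and values; bytes_per_element=2
    assumes a 16-bit cache.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens / 2**30


# Hypothetical architecture, for illustration only.
EXAMPLE = {"layers": 48, "kv_heads": 4, "head_dim": 128}
```

With those placeholder numbers, `kv_cache_gib(8192, **EXAMPLE)` is 0.75 GiB and `kv_cache_gib(32768, **EXAMPLE)` is 3.0 GiB. On a 24 GB Mac with ~18 GB already taken by the weights, a few extra GiB is a large share of what is left.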

When you need a coder bigger than your Mac

Some 2026 frontier coders — DeepSeek V4 Pro, Kimi K3 — are simply too large for any Mac's unified memory. When a hard task outruns what Qwen 4 Coder can do locally, the practical option is to rent a GPU by the hour and run the big model there, keeping the same Ollama/OpenAI-compatible workflow — just point your editor at the remote endpoint instead of localhost.
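In your own scripts, the switch between local and rented hardware can be a single setting. This sketch reads a remote base URL from an environment variable and falls back to localhost. The variable name `LLM_REMOTE_BASE_URL` is made up for this example, and the sketch assumes the remote server exposes the same OpenAI-compatible `/v1` API.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"


def resolve_base_url(env=None) -> str:
    """Use the remote endpoint if one is configured, else localhost.

    `env` defaults to os.environ; pass a dict to test without
    touching the real environment.
    """
    env = os.environ if env is None else env
    remote = env.get("LLM_REMOTE_BASE_URL", "").strip()
    return remote.rstrip("/") if remote else LOCAL_BASE_URL
```

Unset the variable and everything goes back to the Mac; nothing else in the workflow changes.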

A cost-effective place to do this is Vast.ai, a marketplace for on-demand GPUs where an H100 or another 80 GB card runs a few dollars an hour, far cheaper than buying hardware for an occasional frontier-model task.

Disclosure: the Vast.ai link is a referral; if you sign up through it, LLMCheck may earn a small credit at no extra cost to you. We only recommend it because renting beats buying for occasional big-model jobs.

You're set. You now have a private coding assistant — autocomplete, chat, and agent edits — running entirely on your Mac, with a clear path to rent extra horsepower only when you actually need it.