Fix 1: Check Metal GPU Is Active

The single biggest speed difference comes from whether your Mac is using its GPU (via Metal) or running on CPU only. Metal-accelerated inference is 3-5x faster than CPU-only mode.

How to check:

  1. Open Activity Monitor (search in Spotlight)
  2. Click the GPU tab (or Window → GPU History)
  3. Run a model and check if your inference engine shows GPU usage
  4. In Ollama, run ollama ps — the PROCESSOR column shows the GPU/CPU split for each loaded model

If GPU usage is zero, see our dedicated GPU not used guide for the full fix.

Quick check: Run ollama ps while a model is loaded. The PROCESSOR column shows how much of the model sits on GPU vs CPU. If it reads 100% CPU, Metal is not active.
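The check above can be scripted. A minimal sketch that scans ollama ps output for models stuck on CPU (the sample output here is hypothetical, but the real command prints a PROCESSOR column in the same shape, e.g. "100% GPU" or "100% CPU"):

```shell
# Hypothetical `ollama ps` output for illustration
sample_output='NAME          SIZE    PROCESSOR    UNTIL
qwen3.5:9b    6.2 GB  100% CPU     4 minutes from now'

# Flag any loaded model running entirely on CPU (Metal inactive);
# skip the header row (NR > 1)
echo "$sample_output" | awk 'NR > 1 && /100% CPU/ {print $1 ": Metal is NOT active"}'
```

In practice you would pipe the real command instead: ollama ps | awk 'NR > 1 && /100% CPU/ {print $1}'.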

Fix 2: Close Memory-Hungry Apps

Apple Silicon Macs use unified memory — your GPU and CPU share the same RAM pool. Every gigabyte used by another app is a gigabyte unavailable for your model. According to LLMCheck testing, closing background apps can improve inference speed by 30-50% on memory-constrained machines.

Worst offenders are typically web browsers with many open tabs, Electron-based apps such as Slack and Discord, Docker Desktop, and virtual machines.

# Check what's using your RAM
# Open Activity Monitor → Memory tab → Sort by Memory column
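The same check works from the terminal. A quick sketch using ps, which works on both macOS and Linux (RSS is resident memory in kilobytes):

```shell
# List the five processes using the most resident memory (RSS, in KB)
ps axo rss=,comm= | sort -rn | head -5
```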

Fix 3: Use Smaller Quantization

Quantization reduces model precision to shrink file size and memory usage. According to LLMCheck benchmarks, Q4_K_M is the sweet spot — it runs about 2x faster than Q8 with only 1-3% quality degradation on most benchmarks.

| Quantization | Size (7B model) | Speed (M3 Pro) | Quality Loss |
|---|---|---|---|
| F16 | 14 GB | 8 tok/s | None (baseline) |
| Q8_0 | 7.5 GB | 15 tok/s | ~0.5% |
| Q4_K_M | 4.1 GB | 28 tok/s | ~2% |
| Q4_K_S | 3.9 GB | 30 tok/s | ~3% |
| Q2_K | 2.8 GB | 35 tok/s | ~8-10% |
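The sizes in the table follow directly from bits per weight. A rough sanity check, assuming roughly 4.7 effective bits per weight for Q4_K_M (an approximation that includes format overhead, not an official figure):

```shell
# Approximate file size = parameters × bits-per-weight ÷ 8 bits/byte
# 7B parameters at ~4.7 effective bits/weight (Q4_K_M)
awk 'BEGIN { printf "%.1f GB\n", 7e9 * 4.7 / 8 / 1e9 }'
```

This lands on about 4.1 GB, matching the table row for Q4_K_M.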

In Ollama, most models default to Q4_K_M. To explicitly choose a quantization:

# Pull specific quantization
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0

Fix 4: Switch to MLX for 20-50% Speedup

MLX is Apple's native machine learning framework, built specifically for Apple Silicon. It uses unified memory more efficiently than llama.cpp and has deeper Metal integration. According to LLMCheck benchmarks, MLX delivers 20-50% faster tok/s than Ollama for the same model and quantization.

# Install MLX
pip install mlx-lm

# Run a model
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Explain quantum computing simply"

The tradeoff: MLX requires more technical setup than Ollama, and the model ecosystem is smaller. But if speed is your priority, it is worth the effort.
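In concrete terms, the gain is easy to estimate. A quick sketch, taking a 22 tok/s Ollama baseline as an illustrative starting point:

```shell
# Projected MLX throughput at 20% and 50% above a 22 tok/s Ollama baseline
awk 'BEGIN { printf "%.1f-%.1f tok/s\n", 22 * 1.2, 22 * 1.5 }'
```

That works out to roughly 26-33 tok/s for the same model and quantization.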

Fix 5: Reduce Context Length

Context length determines how much text the model can process at once. Larger context uses more memory and slows inference. Most casual use cases work fine with 4096 tokens.

# Set context length server-wide for Ollama
export OLLAMA_CONTEXT_LENGTH=4096
ollama serve

# Or per-model in a Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 4096

Reducing context from 8192 to 4096 frees approximately 500 MB-1 GB of RAM depending on the model, which translates directly to faster inference.
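The saving comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming a 7B-class model with 32 layers, 8 KV heads, head dimension 128, and an FP16 cache (illustrative numbers, not from any specific model card):

```shell
# KV cache bytes = 2 (K and V) × layers × ctx × kv_heads × head_dim × 2 bytes (FP16)
awk 'BEGIN { printf "%.2f GB\n", 2 * 32 * 4096 * 8 * 128 * 2 / 1073741824 }'
```

That is about 0.5 GB per 4096 tokens of context, so halving from 8192 to 4096 saves roughly that much — consistent with the 500 MB-1 GB range above.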

Fix 6: Use MoE Models

Mixture of Experts (MoE) models are a game-changer for speed. They have many parameters but only activate a fraction per token. Gemma 4 26B-A4B has 26B total parameters but activates only 4B per token — meaning it runs at the speed of a 4B model while delivering quality closer to a 26B model.

| Model | Active Params | Speed (M3 Pro) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B (dense) | 26B | 8 tok/s | 82% |
| Gemma 4 26B-A4B (MoE) | 4B active | 24 tok/s | 79% |
| Qwen 3.5 9B (dense) | 9B | 22 tok/s | 72% |

# Pull a MoE model in Ollama
ollama pull gemma4:26b-a4b
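The speedup tracks active parameters rather than total size, though not perfectly. A quick sanity check on the table's M3 Pro figures (the parameter ratio sets a rough theoretical ceiling; routing and other overhead explain the gap):

```shell
# Parameter ratio suggests up to 26/4 = 6.5x; the table shows 24/8 = 3.0x
awk 'BEGIN { printf "theoretical %.1fx, observed %.1fx\n", 26 / 4, 24 / 8 }'
```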

Fix 7: Upgrade Your Inference Engine

Each new release of Ollama, LM Studio, and llama.cpp includes Metal optimizations that improve speed. According to LLMCheck testing, Ollama v0.20+ is 15-25% faster than v0.15 for the same model on the same hardware.

# Check your Ollama version
ollama --version

# Update Ollama (re-download from ollama.com or use brew)
brew upgrade ollama
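The "am I on at least v0.20?" check can be scripted with sort -V (version ordering, supported by GNU and modern BSD sort). A sketch — the hard-coded version string is a stand-in for whatever ollama --version reports on your machine:

```shell
# Return success if version $1 >= version $2 in semantic-version order
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -1)" = "$2" ]
}

current="0.20.1"   # stand-in; e.g. parse the last field of `ollama --version`
if version_ge "$current" "0.20"; then
  echo "Ollama is recent enough"
else
  echo "Upgrade recommended"
fi
```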

Impact Comparison: All 7 Fixes

| Fix | Speed Improvement | Effort | Time to Apply |
|---|---|---|---|
| Check Metal GPU | 3-5x if GPU was off | Low | 2 min |
| Close apps | 30-50% on constrained RAM | Low | 1 min |
| Smaller quantization | 2x (Q8 to Q4_K_M) | Low | 5 min |
| Switch to MLX | 20-50% | Medium | 15 min |
| Reduce context | 10-25% | Low | 2 min |
| Use MoE model | 2-3x vs equivalent dense | Low | 5 min |
| Upgrade engine | 15-25% | Low | 5 min |

Sources