Fix 1: Check Metal GPU Is Active
The single biggest speed difference comes from whether your Mac is using its GPU (via Metal) or running on CPU only. Metal-accelerated inference is 3-5x faster than CPU-only mode.
How to check:
- Open Activity Monitor (search in Spotlight)
- Click the GPU tab (or Window → GPU History)
- Run a model and check if your inference engine shows GPU usage
- In Ollama, run `ollama ps` and look for the GPU percentage in the output
If GPU usage is zero, see our dedicated "GPU not used" guide for the full fix.
Quick check: Run `ollama ps` while a model is loaded. The PROCESSOR column shows the CPU/GPU split for the model. If it reads 100% CPU, Metal is not active.
Fix 2: Close Memory-Hungry Apps
Apple Silicon Macs use unified memory — your GPU and CPU share the same RAM pool. Every gigabyte used by another app is a gigabyte unavailable for your model. According to LLMCheck testing, closing background apps can improve inference speed by 30-50% on memory-constrained machines.
Worst offenders:
- Safari with many tabs — each tab can use 100-500 MB
- Docker Desktop — reserves 2-4 GB by default
- Xcode — indexing can consume 2+ GB
- Chrome — notorious for memory usage, 50-300 MB per tab
- Slack / Teams / Discord — Electron apps using 500 MB+ each
```shell
# Check what's using your RAM:
# Activity Monitor → Memory tab → sort by the Memory column, or from a terminal:
top -l 1 -o mem -n 10
```
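To see why headroom matters, here is a back-of-the-envelope sketch. The 75% GPU-addressable cap is a common rule of thumb for Metal's working-set limit, and the app figures are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope unified-memory headroom (all figures illustrative).
total_ram_gb = 16.0
gpu_cap_gb = total_ram_gb * 0.75      # rough Metal working-set limit (assumption)
background_apps_gb = 6.0              # e.g. Docker + Chrome + Slack
model_gb = 4.1                        # typical 7B model at Q4_K_M

headroom_gb = gpu_cap_gb - background_apps_gb
print(f"Headroom: {headroom_gb:.1f} GB; model needs {model_gb} GB")
print("Fits in memory" if headroom_gb >= model_gb else "Will swap -> slow")
```

Close 2 GB of background apps in this scenario and the headroom doubles, which is where the 30-50% speedups on constrained machines come from.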
Fix 3: Use Smaller Quantization
Quantization reduces model precision to shrink file size and memory usage. According to LLMCheck benchmarks, Q4_K_M is the sweet spot — it runs about 2x faster than Q8 with only 1-3% quality degradation on most benchmarks.
| Quantization | Size (7B model) | Speed (M3 Pro) | Quality Loss |
|---|---|---|---|
| F16 | 14 GB | 8 tok/s | None (baseline) |
| Q8_0 | 7.5 GB | 15 tok/s | ~0.5% |
| Q4_K_M | 4.1 GB | 28 tok/s | ~2% |
| Q4_K_S | 3.9 GB | 30 tok/s | ~3% |
| Q2_K | 2.8 GB | 35 tok/s | ~8-10% |
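The sizes in the table follow from a simple rule of thumb: file size ≈ parameter count × bits per weight. A sketch, where the bits-per-weight values are approximate averages for these quant formats, not exact figures:

```python
# Estimate model file size from parameter count and bits per weight.
# The bits-per-weight values are approximate averages for llama.cpp quant types.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 3.35)]:
    print(f"{name}: ~{model_size_gb(7, bpw):.1f} GB for a 7B model")
```

Since decode speed on Apple Silicon is largely bound by how many bytes of weights are read per token, halving the bits per weight roughly doubles tokens per second.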
In Ollama, most models default to Q4_K_M. To explicitly choose a quantization:
```shell
# Pull a specific quantization
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0
```
Fix 4: Switch to MLX for 20-50% Speedup
MLX is Apple's native machine learning framework, built specifically for Apple Silicon. It uses unified memory more efficiently than llama.cpp and has deeper Metal integration. According to LLMCheck benchmarks, MLX delivers 20-50% faster tok/s than Ollama for the same model and quantization.
```shell
# Install MLX
pip install mlx-lm

# Run a model
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Explain quantum computing simply"
```
The tradeoff: MLX requires more technical setup than Ollama, and the model ecosystem is smaller. But if speed is your priority, it is worth the effort.
Fix 5: Reduce Context Length
Context length determines how much text the model can process at once. A larger context window reserves more RAM for the KV cache and slows inference. Most casual use cases work fine with 4096 tokens.
```shell
# Set context length server-wide
export OLLAMA_CONTEXT_LENGTH=4096
ollama serve
```

```shell
# Or per-model in a Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 4096
```
Reducing context from 8192 to 4096 frees approximately 500 MB-1 GB of RAM depending on the model, which translates directly to faster inference.
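That RAM estimate comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming a hypothetical 7B-class model with grouped-query attention; the layer and head counts are illustrative values, not any specific model's specs:

```python
# Rough KV-cache size; layer/head counts are illustrative for a 7B-class
# model with grouped-query attention, not any specific model's specs.
def kv_cache_gb(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; one fp16 entry per layer, head, dim, and token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for ctx in (4096, 8192):
    print(f"num_ctx={ctx}: ~{kv_cache_gb(ctx):.2f} GB of KV cache")
```

Halving the context halves the cache, which for these assumed dimensions frees roughly half a gigabyte; models without grouped-query attention cache far more per token, which is where the upper end of the 500 MB-1 GB range comes from.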
Fix 6: Use MoE Models
Mixture of Experts (MoE) models store many parameters but activate only a fraction of them per token, which makes them dramatically faster than dense models of the same total size. Gemma 4 26B-A4B has 26B total parameters but activates only 4B per token, meaning it runs at roughly the speed of a 4B model while delivering quality closer to a 26B model.
| Model | Active Params | Speed (M3 Pro) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B (dense) | 26B | 8 tok/s | 82% |
| Gemma 4 26B-A4B (MoE) | 4B active | 24 tok/s | 79% |
| Qwen 3.5 9B (dense) | 9B | 22 tok/s | 72% |
```shell
# Pull a MoE model in Ollama
ollama pull gemma4:26b-a4b
```
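The speedup follows from decoding being memory-bandwidth bound: each generated token requires reading roughly the active weights once. A rough sketch, assuming ~150 GB/s bandwidth (approximate for an M3 Pro) and Q4-class quantization; both figures are assumptions:

```python
# Decode speed ceiling from memory bandwidth: each generated token reads the
# active weights once. Bandwidth and bits-per-weight are assumptions.
BANDWIDTH_GB_S = 150        # approximate M3 Pro unified memory bandwidth
BITS_PER_WEIGHT = 4.8       # Q4-class quantization

def decode_ceiling_tok_s(active_params_billion: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * BITS_PER_WEIGHT / 8
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

print(f"26B dense:     ~{decode_ceiling_tok_s(26):.0f} tok/s ceiling")
print(f"4B-active MoE: ~{decode_ceiling_tok_s(4):.0f} tok/s ceiling")
```

Measured throughput lands well below these ceilings (attention, expert routing, and prompt processing add overhead), but the dense-vs-MoE ratio is what produces the gap in the table above.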
Fix 7: Upgrade Your Inference Engine
Each new release of Ollama, LM Studio, and llama.cpp includes Metal optimizations that improve speed. According to LLMCheck testing, Ollama v0.20+ is 15-25% faster than v0.15 for the same model on the same hardware.
```shell
# Check your Ollama version
ollama --version

# Update Ollama (re-download from ollama.com or use Homebrew)
brew upgrade ollama
```
Impact Comparison: All 7 Fixes
| Fix | Speed Improvement | Effort | Time to Apply |
|---|---|---|---|
| Check Metal GPU | 3-5x if GPU was off | Low | 2 min |
| Close apps | 30-50% on constrained RAM | Low | 1 min |
| Smaller quantization | 2x (Q8 to Q4_K_M) | Low | 5 min |
| Switch to MLX | 20-50% | Medium | 15 min |
| Reduce context | 10-25% | Low | 2 min |
| Use MoE model | 2-3x vs equivalent dense | Low | 5 min |
| Upgrade engine | 15-25% | Low | 5 min |
Sources
- Ollama GitHub repository — Release notes and performance benchmarks
- MLX GitHub repository — Apple's ML framework documentation
- Apple Metal documentation — GPU acceleration details
- LLMCheck Leaderboard — Benchmark data for 42+ models on Apple Silicon
- LLMCheck Troubleshooting Hub — More troubleshooting guides