Fix 1: Check Metal GPU Is Active

The single biggest speed difference comes from whether your Mac is using its GPU (via Metal) or running on CPU only. Metal-accelerated inference is 3-5x faster than CPU-only mode.

How to check:

  1. Open Activity Monitor (search in Spotlight)
  2. Click the GPU tab (or Window → GPU History)
  3. Run a model and check if your inference engine shows GPU usage
  4. In Ollama, run ollama ps — the PROCESSOR column shows the GPU/CPU split for each loaded model

If GPU usage is zero, see our dedicated GPU not used guide for the full fix.

Quick check: Run ollama ps while a model is loaded. The PROCESSOR column shows how much of the model sits on GPU vs CPU. If it reads 100% CPU, Metal is not active.
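The check above can be scripted. A minimal sketch that scans ollama ps output for models stuck on CPU (the sample output here is hypothetical, but the real command prints a PROCESSOR column in the same shape, e.g. "100% GPU" or "100% CPU"):

```shell
# Hypothetical `ollama ps` output for illustration
sample_output='NAME          SIZE    PROCESSOR    UNTIL
qwen3.5:9b    6.2 GB  100% CPU     4 minutes from now'

# Flag any loaded model running entirely on CPU (Metal inactive);
# skip the header row (NR > 1)
echo "$sample_output" | awk 'NR > 1 && /100% CPU/ {print $1 ": Metal is NOT active"}'
```

In practice you would pipe the real command instead: ollama ps | awk 'NR > 1 && /100% CPU/ {print $1}'.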

Fix 2: Close Memory-Hungry Apps

Apple Silicon Macs use unified memory — your GPU and CPU share the same RAM pool. Every gigabyte used by another app is a gigabyte unavailable for your model. According to LLMCheck testing, closing background apps can improve inference speed by 30-50% on memory-constrained machines.

Worst offenders are typically web browsers with many open tabs, Electron-based apps such as Slack and Discord, Docker Desktop, and virtual machines.

# Check what's using your RAM
# Open Activity Monitor → Memory tab → Sort by Memory column
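The same check works from the terminal. A quick sketch using ps, which works on both macOS and Linux (RSS is resident memory in kilobytes):

```shell
# List the five processes using the most resident memory (RSS, in KB)
ps axo rss=,comm= | sort -rn | head -5
```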

Fix 3: Use Smaller Quantization

Quantization reduces model precision to shrink file size and memory usage. According to LLMCheck benchmarks, Q4_K_M is the sweet spot — it runs about 2x faster than Q8 with only 1-3% quality degradation on most benchmarks.

| Quantization | Size (7B model) | Speed (M3 Pro) | Quality Loss |
|---|---|---|---|
| F16 | 14 GB | 8 tok/s | None (baseline) |
| Q8_0 | 7.5 GB | 15 tok/s | ~0.5% |
| Q4_K_M | 4.1 GB | 28 tok/s | ~2% |
| Q4_K_S | 3.9 GB | 30 tok/s | ~3% |
| Q2_K | 2.8 GB | 35 tok/s | ~8-10% |
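The sizes in the table follow directly from bits per weight. A rough sanity check, assuming roughly 4.7 effective bits per weight for Q4_K_M (an approximation that includes format overhead, not an official figure):

```shell
# Approximate file size = parameters × bits-per-weight ÷ 8 bits/byte
# 7B parameters at ~4.7 effective bits/weight (Q4_K_M)
awk 'BEGIN { printf "%.1f GB\n", 7e9 * 4.7 / 8 / 1e9 }'
```

This lands on about 4.1 GB, matching the table row for Q4_K_M.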

In Ollama, most models default to Q4_K_M. To explicitly choose a quantization:

# Pull specific quantization
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0

Fix 4: Switch to MLX for 20-50% Speedup

MLX is Apple's native machine learning framework, built specifically for Apple Silicon. It uses unified memory more efficiently than llama.cpp and has deeper Metal integration. According to LLMCheck benchmarks, MLX delivers 20-50% faster tok/s than Ollama for the same model and quantization.

# Install MLX
pip install mlx-lm

# Run a model
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Explain quantum computing simply"

The tradeoff: MLX requires more technical setup than Ollama, and the model ecosystem is smaller. But if speed is your priority, it is worth the effort.
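In concrete terms, the gain is easy to estimate. A quick sketch, taking a 22 tok/s Ollama baseline as an illustrative starting point:

```shell
# Projected MLX throughput at 20% and 50% above a 22 tok/s Ollama baseline
awk 'BEGIN { printf "%.1f-%.1f tok/s\n", 22 * 1.2, 22 * 1.5 }'
```

That works out to roughly 26-33 tok/s for the same model and quantization.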

Fix 5: Reduce Context Length

Context length determines how much text the model can process at once. Larger context uses more memory and slows inference. Most casual use cases work fine with 4096 tokens.

# Set context length server-wide for Ollama
export OLLAMA_CONTEXT_LENGTH=4096
ollama serve

# Or per-model in a Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 4096

Reducing context from 8192 to 4096 frees approximately 500 MB-1 GB of RAM depending on the model, which translates directly to faster inference.
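The saving comes from the KV cache, which grows linearly with context length. A back-of-the-envelope sketch, assuming a 7B-class model with 32 layers, 8 KV heads, head dimension 128, and an FP16 cache (illustrative numbers, not from any specific model card):

```shell
# KV cache bytes = 2 (K and V) × layers × ctx × kv_heads × head_dim × 2 bytes (FP16)
awk 'BEGIN { printf "%.2f GB\n", 2 * 32 * 4096 * 8 * 128 * 2 / 1073741824 }'
```

That is about 0.5 GB per 4096 tokens of context, so halving from 8192 to 4096 saves roughly that much — consistent with the 500 MB-1 GB range above.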

Fix 6: Use MoE Models

Mixture of Experts (MoE) models are a game-changer for speed. They have many parameters but only activate a fraction per token. Gemma 4 26B-A4B has 26B total parameters but activates only 4B per token — meaning it runs at the speed of a 4B model while delivering quality closer to a 26B model.

| Model | Active Params | Speed (M3 Pro) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B (dense) | 26B | 8 tok/s | 82% |
| Gemma 4 26B-A4B (MoE) | 4B active | 24 tok/s | 79% |
| Qwen 3.5 9B (dense) | 9B | 22 tok/s | 72% |

# Pull a MoE model in Ollama
ollama pull gemma4:26b-a4b
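The speedup tracks active parameters rather than total size, though not perfectly. A quick sanity check on the table's M3 Pro figures (the parameter ratio sets a rough theoretical ceiling; routing and other overhead explain the gap):

```shell
# Parameter ratio suggests up to 26/4 = 6.5x; the table shows 24/8 = 3.0x
awk 'BEGIN { printf "theoretical %.1fx, observed %.1fx\n", 26 / 4, 24 / 8 }'
```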

Fix 7: Upgrade Your Inference Engine

Each new release of Ollama, LM Studio, and llama.cpp includes Metal optimizations that improve speed. According to LLMCheck testing, Ollama v0.20+ is 15-25% faster than v0.15 for the same model on the same hardware.

# Check your Ollama version
ollama --version

# Update Ollama (re-download from ollama.com or use brew)
brew upgrade ollama
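The "am I on at least v0.20?" check can be scripted with sort -V (version ordering, supported by GNU and modern BSD sort). A sketch — the hard-coded version string is a stand-in for whatever ollama --version reports on your machine:

```shell
# Return success if version $1 >= version $2 in semantic-version order
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -1)" = "$2" ]
}

current="0.20.1"   # stand-in; e.g. parse the last field of `ollama --version`
if version_ge "$current" "0.20"; then
  echo "Ollama is recent enough"
else
  echo "Upgrade recommended"
fi
```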

Impact Comparison: All 7 Fixes

| Fix | Speed Improvement | Effort | Time to Apply |
|---|---|---|---|
| Check Metal GPU | 3-5x if GPU was off | Low | 2 min |
| Close apps | 30-50% on constrained RAM | Low | 1 min |
| Smaller quantization | 2x (Q8 to Q4_K_M) | Low | 5 min |
| Switch to MLX | 20-50% | Medium | 15 min |
| Reduce context | 10-25% | Low | 2 min |
| Use MoE model | 2-3x vs equivalent dense | Low | 5 min |
| Upgrade engine | 15-25% | Low | 5 min |

Sources