Diagnose: Is Your GPU Active?
Before fixing anything, confirm whether your GPU is actually being used. There are three ways to check:
Method 1: Activity Monitor
- Open Activity Monitor (Spotlight → type "Activity Monitor")
- Go to Window → GPU History
- Run your model and watch for GPU usage spikes
- If GPU stays at 0% during generation, Metal is not active
Method 2: Ollama ps
ollama ps
# Example output (GPU active):
# NAME SIZE PROCESSOR UNTIL
# qwen3.5:9b 5.5 GB 100% GPU 4 minutes from now
# Example output (CPU only):
# NAME SIZE PROCESSOR UNTIL
# qwen3.5:9b 5.5 GB 100% CPU 4 minutes from now
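The `ollama ps` check can be scripted. A minimal sketch, with two assumptions: `metal_status` is a made-up helper name, and the sample line is hard-coded so the snippet runs without Ollama installed. In practice you would pipe real `ollama ps` output into it.

```shell
# Hypothetical helper: report Metal status from `ollama ps`-style output.
metal_status() {
  if grep -q '100% GPU'; then echo "Metal active"; else echo "CPU only"; fi
}

# Hard-coded sample line (matches the CPU-only example above):
echo 'qwen3.5:9b  5.5 GB  100% CPU  4 minutes from now' | metal_status
# prints: CPU only
```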
Method 3: Ollama serve output
# Stop Ollama, then start manually to see logs
ollama serve
# Look for these lines indicating Metal is active:
# ggml_metal_init: allocating
# ggml_metal_init: loaded kernel_*
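The log check can also be done without reading the output by eye. This sketch counts the Metal init lines in a hard-coded two-line sample standing in for real server output; against a live server you would run `ollama serve 2>&1 | grep ggml_metal_init` instead.

```shell
# Sample lines standing in for real `ollama serve` output:
log='ggml_metal_init: allocating
ggml_metal_init: loaded kernel_add'

# Count Metal init lines; any count above 0 means Metal initialized.
echo "$log" | grep -c 'ggml_metal_init'
# prints: 2
```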
Key indicator: If ollama ps shows "100% CPU" instead of "100% GPU", your model is running without Metal acceleration. Follow the fix for your specific engine below.
Fix for Ollama
Ollama v0.15+ automatically enables Metal on Apple Silicon. If GPU is not active, try these steps:
- Update Ollama to the latest version (v0.20+ recommended):
brew upgrade ollama  # Or re-download from ollama.com
- Set the Metal environment variable explicitly:
export OLLAMA_METAL=1
ollama serve
- Check that the model fits in RAM — if the model exceeds 75% of your total RAM, Ollama falls back to partial or full CPU mode:
# Check model size
ollama show qwen3.5:9b --modelfile | grep size
- Restart Ollama completely:
# Kill all Ollama processes
pkill ollama
# Wait 2 seconds, then restart
ollama serve
Fix for LM Studio
LM Studio has a dedicated GPU settings panel. According to LLMCheck testing, the most common issue is GPU layers being set to 0 (CPU-only mode).
- Open LM Studio and go to Settings
- Navigate to Hardware or GPU Settings
- Set GPU Layers to max (or a specific number like 35-99)
- Reload the model for changes to take effect
If GPU Layers is set to 0, the model runs entirely on the CPU. Setting it to max offloads as many layers to the GPU as memory allows.
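If you prefer a specific layer count over max, a back-of-envelope estimate works: assuming the weights are spread roughly evenly across layers, the number of layers that fit is about available memory divided by per-layer size. The sketch below uses a made-up helper (`layers_that_fit`) and illustrative numbers, not values measured from LM Studio.

```shell
# Hypothetical estimate: how many layers fit in a given memory budget,
# assuming weights are distributed evenly across layers.
layers_that_fit() {
  model_gb="$1"; total_layers="$2"; budget_gb="$3"
  awk -v m="$model_gb" -v l="$total_layers" -v b="$budget_gb" \
    'BEGIN { fit = int(b / (m / l)); if (fit > l) fit = l; print fit }'
}

# Illustrative numbers: a 5.5 GB model with 40 layers, 4 GB to spare.
layers_that_fit 5.5 40 4.0   # prints 29
```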
Fix for MLX
MLX always uses Metal by design. There is no CPU fallback and no configuration needed. If MLX runs on your Apple Silicon Mac, Metal is active.
# Install MLX
pip install mlx-lm
# Run a model — Metal is always active
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
--prompt "Hello world"
If MLX throws an error, it typically means your macOS version is too old (requires macOS 13.3+) or you are on an Intel Mac (MLX only supports Apple Silicon).
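Those two prerequisites can be checked up front. A sketch: `check_mlx_prereqs` is a made-up helper, and the architecture and macOS version are passed in explicitly (normally from `uname -m` and `sw_vers -productVersion`) so the logic can be exercised on any machine.

```shell
# Hypothetical preflight check for the two MLX requirements above.
check_mlx_prereqs() {
  arch="$1"    # e.g. output of `uname -m`
  macos="$2"   # e.g. output of `sw_vers -productVersion`
  if [ "$arch" != "arm64" ]; then
    echo "unsupported: MLX needs Apple Silicon (got $arch)"
  elif [ "$(printf '%s\n' 13.3 "$macos" | sort -V | head -n1)" != "13.3" ]; then
    echo "unsupported: MLX needs macOS 13.3+ (got $macos)"
  else
    echo "ok"
  fi
}

check_mlx_prereqs arm64 14.4    # prints: ok
check_mlx_prereqs x86_64 14.4   # prints the Apple Silicon error
check_mlx_prereqs arm64 12.6    # prints the macOS version error
```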
Fix for llama.cpp
If you compiled llama.cpp yourself, Metal must be explicitly enabled at build time:
# Clone and build with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_METAL=ON
cmake --build . --config Release
Without the -DLLAMA_METAL=ON flag, llama.cpp builds in CPU-only mode. According to LLMCheck, this is the single most common reason for GPU not being used when running llama.cpp directly.
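One way to confirm the flag actually took effect is to look for it in the CMake cache. A sketch with two assumptions: `metal_in_cache` is a made-up helper, and a throwaway sample cache is written so the snippet runs anywhere. In a real checkout you would point it at `build/CMakeCache.txt`.

```shell
# Hypothetical check: confirm the Metal flag landed in the CMake cache.
metal_in_cache() {
  grep -q 'LLAMA_METAL.*=ON' "$1" && echo "Metal ON" || echo "Metal OFF"
}

# Throwaway sample cache so the sketch is self-contained:
sample=$(mktemp)
printf 'CMAKE_BUILD_TYPE:STRING=Release\nLLAMA_METAL:BOOL=ON\n' > "$sample"
metal_in_cache "$sample"   # prints: Metal ON
rm -f "$sample"
```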
Common Causes of GPU Not Being Used
- Outdated engine version — Older versions of Ollama (pre-v0.15) had limited or no Metal support
- Intel Mac — Metal acceleration for LLMs requires Apple Silicon (M1+). Intel Macs cannot use Metal for this purpose
- Model too large — If the model exceeds available memory, the engine falls back to CPU for overflow layers
- macOS too old — Metal compute shaders for LLMs require macOS 13 Ventura or later
- Background GPU processes — Other apps using the GPU (video editing, gaming) can cause resource contention
- llama.cpp built without Metal flag — The -DLLAMA_METAL=ON cmake flag is required
GPU vs CPU Inference Speed Comparison
According to LLMCheck benchmarks, the speed difference between Metal GPU and CPU-only inference is dramatic across all model sizes:
| Model | GPU (Metal) tok/s | CPU Only tok/s | Speedup |
|---|---|---|---|
| Phi-4 Mini 3.8B Q4 | 52 tok/s | 14 tok/s | 3.7x |
| Qwen 3.5 9B Q4 | 28 tok/s | 7 tok/s | 4.0x |
| Gemma 4 26B-A4B Q4 | 24 tok/s | 5 tok/s | 4.8x |
| Llama 4 Scout Q4 | 18 tok/s | 4 tok/s | 4.5x |
| Qwen 3.5 32B Q4 | 12 tok/s | 3 tok/s | 4.0x |
Benchmarks on M3 Pro 18 GB. GPU tok/s measured with full Metal offloading. CPU tok/s with Metal disabled.
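The Speedup column is simply GPU tok/s divided by CPU tok/s, rounded to one decimal place. A quick sketch (`speedup` is a made-up helper) reproducing two of the rows above:

```shell
# Speedup = GPU tok/s / CPU tok/s, to one decimal place.
speedup() { awk -v g="$1" -v c="$2" 'BEGIN { printf "%.1fx\n", g / c }'; }

speedup 52 14   # Phi-4 Mini row: prints 3.7x
speedup 28 7    # Qwen 3.5 9B row: prints 4.0x
```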
Sources
- Ollama GitHub repository — Metal support documentation
- Apple Metal documentation — GPU compute framework
- llama.cpp GitHub repository — Build instructions with Metal
- MLX GitHub repository — Apple's native ML framework
- LLMCheck Leaderboard — GPU benchmark data for 42+ models