Diagnose: Is Your GPU Active?

Before fixing anything, confirm whether your GPU is actually being used. There are three ways to check:

Method 1: Activity Monitor

  1. Open Activity Monitor (Spotlight → type "Activity Monitor")
  2. Go to Window → GPU History (or click the GPU tab)
  3. Run your model and watch for GPU usage spikes
  4. If GPU stays at 0% during generation, Metal is not active

Method 2: Ollama ps

ollama ps

# Example output (GPU active):
# NAME              SIZE    PROCESSOR  UNTIL
# qwen3.5:9b       5.5 GB  100% GPU   4 minutes from now

# Example output (CPU only):
# NAME              SIZE    PROCESSOR  UNTIL
# qwen3.5:9b       5.5 GB  100% CPU   4 minutes from now
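
This check is easy to script. A minimal sketch that flags any model not fully offloaded, using the sample output above in place of a live `ollama ps` call (the `check_gpu` helper is ours, not part of Ollama; in practice you would pipe `ollama ps | check_gpu`):

```shell
# Flag any loaded model whose PROCESSOR column is not "100% GPU".
# Exit status 0 = all on GPU, 1 = at least one model on CPU.
check_gpu() {
  awk 'NR > 1 && $5 != "GPU" { print $1 " is on " $4 " " $5; found = 1 }
       END { exit found ? 1 : 0 }'
}

# Sample CPU-only output, as shown above
printf 'NAME SIZE PROCESSOR UNTIL\nqwen3.5:9b 5.5 GB 100%% CPU 4m\n' | check_gpu
# prints: qwen3.5:9b is on 100% CPU
```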

Method 3: Ollama serve output

# Stop Ollama, then start manually to see logs
ollama serve

# Look for these lines indicating Metal is active:
# ggml_metal_init: allocating
# ggml_metal_init: loaded kernel_*

Key indicator: If ollama ps shows "100% CPU" instead of "100% GPU", your model is running without Metal acceleration. Follow the fix for your specific engine below.

Fix for Ollama

Ollama v0.15+ automatically enables Metal on Apple Silicon. If GPU is not active, try these steps:

  1. Update Ollama to the latest version (v0.20+ recommended):
    brew upgrade ollama
    # Or re-download from ollama.com
  2. Set the Metal environment variable explicitly:
    export OLLAMA_METAL=1
    ollama serve
  3. Check that the model fits in RAM — if the model exceeds 75% of your total RAM, Ollama falls back to partial or full CPU mode:
    # Check the installed model's size (the SIZE column)
    ollama list | grep qwen3.5
  4. Restart Ollama completely:
    # Kill all Ollama processes
    pkill ollama
    # Wait 2 seconds, then restart
    ollama serve
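
Step 3 is just a ratio check. A sketch of the 75% rule with hardcoded byte counts (on a real Mac the RAM figure would come from `sysctl -n hw.memsize`; the threshold is the one quoted above, not an official Ollama constant):

```shell
# Does the model fit under the ~75%-of-RAM threshold described above?
# ram_bytes would normally come from: sysctl -n hw.memsize
fits_in_ram() {
  model_bytes=$1
  ram_bytes=$2
  # Integer arithmetic: model_bytes <= 0.75 * ram_bytes
  [ $(( model_bytes * 100 )) -le $(( ram_bytes * 75 )) ]
}

# Example: a 5.5 GB model on an 18 GB machine (5.5 <= 13.5, so it fits)
fits_in_ram $(( 55 * 1024 * 1024 * 1024 / 10 )) $(( 18 * 1024 * 1024 * 1024 )) \
  && echo "fits: full GPU offload possible"
```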

Fix for LM Studio

LM Studio has a dedicated GPU settings panel. According to LLMCheck testing, the most common issue is GPU layers being set to 0 (CPU-only mode).

  1. Open LM Studio and go to Settings
  2. Navigate to Hardware or GPU Settings
  3. Set GPU Layers to max (or a specific number like 35-99)
  4. Reload the model for changes to take effect

If GPU Layers is set to 0, the model runs entirely on the CPU. Setting it to max offloads as many layers to the GPU as memory allows.
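
As a rough mental model of what the GPU Layers setting controls (an illustration only, not LM Studio's actual allocator): the layers that fit are the memory budget divided by the per-layer footprint, capped at the model's layer count.

```shell
# Rough sketch of GPU layer offloading: budget / per-layer cost, capped.
# Not LM Studio's real logic; the figures below are made up for illustration.
layers_on_gpu() {
  total_layers=$1   # layers in the model
  layer_mb=$2       # approximate memory per layer, in MB
  budget_mb=$3      # memory available to the GPU, in MB
  n=$(( budget_mb / layer_mb ))
  [ "$n" -gt "$total_layers" ] && n=$total_layers
  echo "$n"
}

# e.g. a 32-layer model at ~150 MB/layer with a 3 GB budget
layers_on_gpu 32 150 3072   # prints 20
```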

Fix for MLX

MLX uses Metal by design on Apple Silicon: there is no CPU-only fallback mode to configure and no setting to flip. If MLX runs on your Apple Silicon Mac, Metal is active.

# Install MLX
pip install mlx-lm

# Run a model — Metal is always active
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Hello world"

If MLX throws an error, it typically means your macOS version is too old (requires macOS 13.3+) or you are on an Intel Mac (MLX only supports Apple Silicon).
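
Both failure conditions can be checked up front. A sketch of the gate, with architecture and version passed as arguments so the comparison logic is visible (on a real Mac they would come from `uname -m` and `sw_vers -productVersion`):

```shell
# Preflight for MLX: Apple Silicon (arm64) and macOS 13.3 or newer.
# Real inputs: mlx_supported "$(uname -m)" "$(sw_vers -productVersion)"
mlx_supported() {
  arch=$1
  ver=$2
  [ "$arch" = "arm64" ] || return 1
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  # Pass on any macOS 14+, or macOS 13 with minor version >= 3
  [ "$major" -gt 13 ] || { [ "$major" -eq 13 ] && [ "$minor" -ge 3 ]; }
}

mlx_supported arm64 14.5 && echo "MLX should run"
mlx_supported x86_64 14.5 || echo "Intel Mac: MLX unsupported"
```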

Fix for llama.cpp

If you compiled llama.cpp yourself, Metal must be explicitly enabled at build time:

# Clone and build with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_METAL=ON
cmake --build . --config Release

Without Metal enabled at configure time, llama.cpp builds in CPU-only mode. (Note: recent llama.cpp releases renamed the option to -DGGML_METAL=ON, and on Apple Silicon it defaults to ON; check which spelling your checkout expects.) According to LLMCheck, a CPU-only build is the single most common reason for the GPU not being used when running llama.cpp directly.

GPU vs CPU Inference Speed Comparison

According to LLMCheck benchmarks, the speed difference between Metal GPU and CPU-only inference is dramatic across all model sizes:

Model              | GPU (Metal) tok/s | CPU only tok/s | Speedup
Phi-4 Mini 3.8B Q4 | 52                | 14             | 3.7x
Qwen 3.5 9B Q4     | 28                | 7              | 4.0x
Gemma 4 26B-A4B Q4 | 24                | 5              | 4.8x
Llama 4 Scout Q4   | 18                | 4              | 4.5x
Qwen 3.5 32B Q4    | 12                | 3              | 4.0x

Benchmarks on M3 Pro 18 GB. GPU tok/s measured with full Metal offloading. CPU tok/s with Metal disabled.
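
The Speedup column is simply GPU throughput divided by CPU throughput; for example, the Phi-4 Mini and Gemma rows:

```shell
# Speedup = GPU tok/s / CPU tok/s, rounded to one decimal place
awk 'BEGIN { printf "%.1fx\n", 52 / 14 }'   # prints 3.7x
awk 'BEGIN { printf "%.1fx\n", 24 / 5 }'    # prints 4.8x
```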