Diagnose: Is Your GPU Active?
Before fixing anything, confirm whether your GPU is actually being used. There are three ways to check:
Method 1: Activity Monitor
- Open Activity Monitor (Spotlight → type "Activity Monitor")
- Go to Window → GPU History
- Run your model and watch for GPU usage spikes
- If GPU stays at 0% during generation, Metal is not active
Method 2: Ollama ps
ollama ps
# Example output (GPU active):
# NAME SIZE PROCESSOR UNTIL
# qwen3.5:9b 5.5 GB 100% GPU 4 minutes from now
# Example output (CPU only):
# NAME SIZE PROCESSOR UNTIL
# qwen3.5:9b 5.5 GB 100% CPU 4 minutes from now
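The `ollama ps` check can be scripted. A minimal sketch, with two assumptions: `metal_status` is a made-up helper name, and the sample line is hard-coded so the snippet runs without Ollama installed. In practice you would pipe real `ollama ps` output into it.

```shell
# Hypothetical helper: report Metal status from `ollama ps`-style output.
metal_status() {
  if grep -q '100% GPU'; then echo "Metal active"; else echo "CPU only"; fi
}

# Hard-coded sample line (matches the CPU-only example above):
echo 'qwen3.5:9b  5.5 GB  100% CPU  4 minutes from now' | metal_status
# prints: CPU only
```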
Method 3: Ollama serve output
# Stop Ollama, then start manually to see logs
ollama serve
# Look for these lines indicating Metal is active:
# ggml_metal_init: allocating
# ggml_metal_init: loaded kernel_*
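The log check can also be done without reading the output by eye. This sketch counts the Metal init lines in a hard-coded two-line sample standing in for real server output; against a live server you would run `ollama serve 2>&1 | grep ggml_metal_init` instead.

```shell
# Sample lines standing in for real `ollama serve` output:
log='ggml_metal_init: allocating
ggml_metal_init: loaded kernel_add'

# Count Metal init lines; any count above 0 means Metal initialized.
echo "$log" | grep -c 'ggml_metal_init'
# prints: 2
```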
Key indicator: If ollama ps shows "100% CPU" instead of "100% GPU", your model is running without Metal acceleration. Follow the fix for your specific engine below.
Fix for Ollama
Ollama v0.15+ automatically enables Metal on Apple Silicon. If GPU is not active, try these steps:
- Update Ollama to the latest version (v0.20+ recommended):
brew upgrade ollama  # Or re-download from ollama.com
- Set the Metal environment variable explicitly:
export OLLAMA_METAL=1
ollama serve
- Check that the model fits in RAM — if the model exceeds 75% of your total RAM, Ollama falls back to partial or full CPU mode:
# Check model size
ollama show qwen3.5:9b --modelfile | grep size
- Restart Ollama completely:
# Kill all Ollama processes
pkill ollama
# Wait 2 seconds, then restart
ollama serve
Fix for LM Studio
LM Studio has a dedicated GPU settings panel. According to LLMCheck testing, the most common issue is GPU layers being set to 0 (CPU-only mode).
- Open LM Studio and go to Settings
- Navigate to Hardware or GPU Settings
- Set GPU Layers to max (or a specific number like 35-99)
- Reload the model for changes to take effect
If GPU Layers is set to 0, the model runs entirely on the CPU. Setting it to max offloads as many layers to the GPU as memory allows.
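If you prefer a specific layer count over max, a back-of-envelope estimate works: assuming the weights are spread roughly evenly across layers, the number of layers that fit is about available memory divided by per-layer size. The sketch below uses a made-up helper (`layers_that_fit`) and illustrative numbers, not values measured from LM Studio.

```shell
# Hypothetical estimate: how many layers fit in a given memory budget,
# assuming weights are distributed evenly across layers.
layers_that_fit() {
  model_gb="$1"; total_layers="$2"; budget_gb="$3"
  awk -v m="$model_gb" -v l="$total_layers" -v b="$budget_gb" \
    'BEGIN { fit = int(b / (m / l)); if (fit > l) fit = l; print fit }'
}

# Illustrative numbers: a 5.5 GB model with 40 layers, 4 GB to spare.
layers_that_fit 5.5 40 4.0   # prints 29
```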
Fix for MLX
MLX always uses Metal by design. There is no CPU fallback and no configuration needed. If MLX runs on your Apple Silicon Mac, Metal is active.
# Install MLX
pip install mlx-lm
# Run a model — Metal is always active
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
--prompt "Hello world"
If MLX throws an error, it typically means your macOS version is too old (requires macOS 13.3+) or you are on an Intel Mac (MLX only supports Apple Silicon).
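Those two prerequisites can be checked up front. A sketch: `check_mlx_prereqs` is a made-up helper, and the architecture and macOS version are passed in explicitly (normally from `uname -m` and `sw_vers -productVersion`) so the logic can be exercised on any machine.

```shell
# Hypothetical preflight check for the two MLX requirements above.
check_mlx_prereqs() {
  arch="$1"    # e.g. output of `uname -m`
  macos="$2"   # e.g. output of `sw_vers -productVersion`
  if [ "$arch" != "arm64" ]; then
    echo "unsupported: MLX needs Apple Silicon (got $arch)"
  elif [ "$(printf '%s\n' 13.3 "$macos" | sort -V | head -n1)" != "13.3" ]; then
    echo "unsupported: MLX needs macOS 13.3+ (got $macos)"
  else
    echo "ok"
  fi
}

check_mlx_prereqs arm64 14.4    # prints: ok
check_mlx_prereqs x86_64 14.4   # prints the Apple Silicon error
check_mlx_prereqs arm64 12.6    # prints the macOS version error
```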
Fix for llama.cpp
If you compiled llama.cpp yourself, Metal must be explicitly enabled at build time:
# Clone and build with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_METAL=ON
cmake --build . --config Release
Without the -DLLAMA_METAL=ON flag, llama.cpp builds in CPU-only mode. According to LLMCheck, this is the single most common reason for GPU not being used when running llama.cpp directly.
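One way to confirm the flag actually took effect is to look for it in the CMake cache. A sketch with two assumptions: `metal_in_cache` is a made-up helper, and a throwaway sample cache is written so the snippet runs anywhere. In a real checkout you would point it at `build/CMakeCache.txt`.

```shell
# Hypothetical check: confirm the Metal flag landed in the CMake cache.
metal_in_cache() {
  grep -q 'LLAMA_METAL.*=ON' "$1" && echo "Metal ON" || echo "Metal OFF"
}

# Throwaway sample cache so the sketch is self-contained:
sample=$(mktemp)
printf 'CMAKE_BUILD_TYPE:STRING=Release\nLLAMA_METAL:BOOL=ON\n' > "$sample"
metal_in_cache "$sample"   # prints: Metal ON
rm -f "$sample"
```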
Common Causes of GPU Not Being Used
- Outdated engine version — Older versions of Ollama (pre-v0.15) had limited or no Metal support
- Intel Mac — Metal acceleration for LLMs requires Apple Silicon (M1+). Intel Macs cannot use Metal for this purpose
- Model too large — If the model exceeds available memory, the engine falls back to CPU for overflow layers
- macOS too old — Metal compute shaders for LLMs require macOS 13 Ventura or later
- Background GPU processes — Other apps using the GPU (video editing, gaming) can cause resource contention
- llama.cpp built without Metal flag — The -DLLAMA_METAL=ON cmake flag is required
GPU vs CPU Inference Speed Comparison
According to LLMCheck benchmarks, the speed difference between Metal GPU and CPU-only inference is dramatic across all model sizes:
| Model | GPU (Metal) tok/s | CPU Only tok/s | Speedup |
|---|---|---|---|
| Phi-4 Mini 3.8B Q4 | 52 tok/s | 14 tok/s | 3.7x |
| Qwen 3.5 9B Q4 | 28 tok/s | 7 tok/s | 4.0x |
| Gemma 4 26B-A4B Q4 | 24 tok/s | 5 tok/s | 4.8x |
| Llama 4 Scout Q4 | 18 tok/s | 4 tok/s | 4.5x |
| Qwen 3.5 32B Q4 | 12 tok/s | 3 tok/s | 4.0x |
Benchmarks on M3 Pro 18 GB. GPU tok/s measured with full Metal offloading. CPU tok/s with Metal disabled.
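The Speedup column is simply GPU tok/s divided by CPU tok/s, rounded to one decimal place. A quick sketch (`speedup` is a made-up helper) reproducing two of the rows above:

```shell
# Speedup = GPU tok/s / CPU tok/s, to one decimal place.
speedup() { awk -v g="$1" -v c="$2" 'BEGIN { printf "%.1fx\n", g / c }'; }

speedup 52 14   # Phi-4 Mini row: prints 3.7x
speedup 28 7    # Qwen 3.5 9B row: prints 4.0x
```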
Sources
- Ollama GitHub repository — Metal support documentation
- Apple Metal documentation — GPU compute framework
- llama.cpp GitHub repository — Build instructions with Metal
- MLX GitHub repository — Apple's native ML framework
- LLMCheck Leaderboard — GPU benchmark data for 42+ models