Troubleshooting Guides
Select the issue that matches your problem. Each guide includes diagnostic steps, root cause analysis, and verified fixes. According to LLMCheck testing, most local AI issues on Mac can be resolved in under 10 minutes.
Slow Inference — 7 Fixes for Faster Speed
Your local LLM is running at 2 tok/s when it should be doing 20+. Diagnose the bottleneck and fix it.
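Before opening the full guide, you can measure your actual throughput. A minimal sketch, assuming Ollama's local API on its default port 11434 (the model name llama3.2 is only an example — substitute one you have pulled); the /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds):

```shell
# Rough tokens/second check. Assumes Ollama's local API on the default
# port; "llama3.2" is an example model name -- substitute your own.
if resp=$(curl -sf http://localhost:11434/api/generate \
      -d '{"model":"llama3.2","prompt":"Hi","stream":false}'); then
  # tok/s = eval_count / (eval_duration in seconds)
  python3 -c 'import json,sys; r=json.loads(sys.argv[1]); print(round(r["eval_count"]/(r["eval_duration"]/1e9),1),"tok/s")' "$resp"
else
  echo "Ollama is not reachable on localhost:11434"
fi
```

If the number is in single digits on Apple Silicon, the guide's bottleneck diagnosis is worth running in full.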
GPU Not Used — Enable Metal Acceleration
Your Mac has a powerful GPU but the model is running on CPU only. Force Metal acceleration on.
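A quick way to see which processor a loaded model is actually using: recent Ollama releases report it in the PROCESSOR column of ollama ps (e.g. "100% GPU" vs "100% CPU"). A small sketch, guarded so it degrades gracefully if Ollama is not installed:

```shell
# Show where loaded models are running. In recent Ollama versions,
# "ollama ps" includes a PROCESSOR column (e.g. "100% GPU" vs "100% CPU").
if command -v ollama >/dev/null 2>&1; then
  ollama ps
else
  echo "ollama not found in PATH"
fi
```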
Model Too Large for RAM — Fit Big LLMs
The model you want needs more RAM than your Mac has. Quantize, switch architectures, or offload.
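A back-of-envelope way to see whether quantization will make a model fit: weight size is roughly parameters times bits per weight, divided by 8. Real GGUF files add some overhead for metadata and embeddings, so treat this as a floor, not an exact figure; the numbers below are illustrative:

```shell
# Rough size estimate: params (billions) * bits-per-weight / 8 = GB.
# Illustrative numbers; real model files carry extra overhead.
params_b=7   # a 7B-parameter model
bits=4       # ~4 bits/weight for a q4 quantization
size_gb=$(python3 -c "print(round($params_b * $bits / 8, 1))")
echo "~${size_gb} GB at ${bits}-bit (vs ~$((params_b * 2)) GB at FP16)"
```

So a 7B model that needs about 14 GB at FP16 shrinks to roughly 3.5 GB at 4-bit, which is why quantization is the first fix the guide recommends.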
Common Ollama Errors — Quick Fixes
Decoding cryptic Ollama error messages. Connection refused, model not found, memory errors, and more.
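For the most common of these, "connection refused", the first check is whether the server is running at all. A minimal sketch against Ollama's default port (the root endpoint answers when the server is live):

```shell
# "connection refused" almost always means the server is not running.
if curl -sf --max-time 2 http://localhost:11434/ >/dev/null; then
  echo "Ollama server is up"
else
  echo "Ollama server is down -- start it with: ollama serve"
fi
```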
Quick Diagnostic Checklist
Before diving into a specific guide, run through this quick checklist. According to LLMCheck data, these five checks resolve about 60% of all local AI issues on Mac:
- Check your macOS version — Metal acceleration requires macOS 13 Ventura or later
- Check available RAM — Open Activity Monitor and look at Memory Pressure (green is good, red means trouble)
- Update your inference engine — Run ollama --version and compare to the latest release
- Verify model size vs. RAM — The model file size should not exceed 75% of your total RAM
- Close background apps — Docker, Chrome with many tabs, and Xcode are the worst memory offenders
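The checks above can be scripted in one pass. A minimal sketch, assuming macOS (sw_vers and sysctl are standard macOS tools) with Ollama installed; the 75% figure is the rule of thumb from the checklist:

```shell
# Quick-check script for the checklist above. macOS-specific commands
# are guarded so the script degrades gracefully elsewhere.
if command -v sw_vers >/dev/null 2>&1; then
  echo "macOS version: $(sw_vers -productVersion)"   # Metal needs 13+
fi
if ram_bytes=$(sysctl -n hw.memsize 2>/dev/null); then
  ram_gb=$((ram_bytes / 1073741824))
  echo "Total RAM: ${ram_gb} GB"
  # Rule of thumb: model file should stay under 75% of total RAM
  echo "Largest safe model file: ~$((ram_gb * 75 / 100)) GB"
fi
if command -v ollama >/dev/null 2>&1; then
  ollama --version
fi
```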
Tip: If you are not sure which issue you have, start with the slow inference guide — it covers the broadest range of problems and includes a diagnostic flowchart.
Sources
- Ollama GitHub repository — Official documentation and issue tracker
- Apple Metal documentation — GPU acceleration framework
- MLX GitHub repository — Apple's machine learning framework
- LLMCheck Leaderboard — Benchmark data for 42+ models on Apple Silicon