The 75% Rule
Not all of your Mac's RAM is available for AI models. macOS itself, system processes, the inference engine, and any open apps consume memory. According to LLMCheck testing, approximately 25% of total RAM is used by the system even with minimal apps running.
The formula: Available RAM for models = Total RAM × 0.75
- 8 GB Mac → ~6 GB available for models
- 16 GB Mac → ~12 GB available
- 24 GB Mac → ~18 GB available
- 32 GB Mac → ~24 GB available
- 64 GB Mac → ~48 GB available
- 128 GB Mac → ~96 GB available
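The tiers above can be sketched as a quick helper. This is a minimal sketch; the 0.75 factor is simply the LLMCheck estimate quoted above, not a hard limit:

```python
def available_ram_gb(total_gb: float, usable_fraction: float = 0.75) -> float:
    """Estimate RAM available for models after macOS, the inference
    engine, and open apps take their roughly 25% cut."""
    return total_gb * usable_fraction

# The tiers from the list above:
for total in (8, 16, 24, 32, 64, 128):
    print(f"{total} GB Mac -> ~{available_ram_gb(total):.0f} GB available for models")
```

Treat the result as a planning number: if Memory Pressure in Activity Monitor turns yellow, your real headroom is smaller than the formula suggests.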
How to check: Open Activity Monitor → Memory tab. Look at "Memory Used" and "Memory Pressure." If pressure is green, you have headroom. Yellow means you are close to the limit. Red means you are swapping — the model is too large.
Solution 1: Use Smaller Quantization
Quantization reduces model precision from 16-bit floating point to 4-bit or even 2-bit integers. This shrinks the model dramatically with surprisingly little quality loss. According to LLMCheck benchmarks, Q4_K_M is the optimal choice — it reduces size by 70-75% with only 1-3% quality degradation.
| Quantization | 70B Model Size | Fits In | Quality Loss |
|---|---|---|---|
| F16 (full) | 140 GB | 192+ GB | None (baseline) |
| Q8_0 | 75 GB | 128 GB | ~0.5% |
| Q4_K_M | 40 GB | 64 GB | ~2% |
| Q4_K_S | 38 GB | 64 GB | ~3% |
| Q2_K | 27 GB | 36 GB | ~8-10% |
```shell
# Pull a specific quantization in Ollama
ollama pull llama3.1:70b-q4_K_M

# Check model size
ollama show llama3.1:70b-q4_K_M
```
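The sizes in the table follow almost directly from bits per weight. Here is a back-of-envelope estimator; the effective bits-per-weight values are my approximations (K-quants store extra scale data, so they run slightly above their nominal bit width), not official figures:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-RAM size: parameter count times bits per weight,
    converted to gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Assumed effective bits per weight for common quantization formats.
QUANT_BITS = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 3.0}

for name, bits in QUANT_BITS.items():
    print(f"70B at {name}: ~{model_size_gb(70, bits):.0f} GB")
```

These estimates land within a couple of gigabytes of the table: 140 GB at F16, roughly 74 GB at Q8_0, and about 42 GB at Q4_K_M.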
Solution 2: Switch to MoE Models
Mixture of Experts (MoE) models have a large total parameter count but activate only a small subset of expert weights per token. All experts must still be loaded, so RAM scales with total parameters, but per-token compute scales with the active parameters. Compared with a dense model of similar quality, a MoE therefore needs less RAM and runs noticeably faster.
| Model | Type | RAM Needed (Q4) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B | Dense | ~16 GB | 82% |
| Gemma 4 26B-A4B | MoE (4B active) | ~18 GB | 79% |
| Llama 4 Scout | MoE (17B active) | ~65 GB | 84% |
According to LLMCheck, Gemma 4 26B-A4B is the best MoE option for 24-32 GB Macs. It needs only 18 GB of RAM at Q4 while delivering quality that rivals dense models twice its active parameter count.
```shell
ollama pull gemma4:26b-a4b
```
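The MoE trade-off can be sketched in two numbers: RAM is set by all experts, while per-token work resembles a much smaller dense model. The shape below (~109B total, 17B active, in the style of Llama 4 Scout) and the 4.8 effective bits per weight are illustrative assumptions:

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float = 4.8) -> tuple[float, float]:
    """Return (ram_gb, active_gb): RAM is set by ALL experts being
    resident, per-token compute only by the active subset."""
    to_gb = lambda p: p * bits_per_weight / 8  # billions of params -> GB
    return to_gb(total_params_b), to_gb(active_params_b)

# Illustrative Llama 4 Scout-style shape: ~109B total, 17B active.
ram, active = moe_footprint(109, 17)
print(f"RAM ~{ram:.0f} GB, per-token work comparable to a ~{active:.0f} GB dense model")
```

Under these assumptions the RAM estimate comes out near the 65 GB figure in the table, while the per-token compute resembles a dense model a sixth that size.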
Solution 3: Use a Smaller Model in the Same Family
Model families like Qwen 3.5 and Gemma 4 offer multiple sizes. Dropping to a smaller variant in the same family preserves the model's training quality while fitting in less RAM.
- Qwen 3.5: 4B (2.8 GB) → 9B (5.5 GB) → 14B (8.5 GB) → 35B (20 GB) → 72B (42 GB)
- Gemma 4: E2B (1.5 GB) → E4B (3 GB) → 26B-A4B (18 GB) → 31B (19 GB)
- Llama: 3.2 3B (2 GB) → 3.1 8B (5 GB) → 3.1 70B (42 GB) → 4 Scout (65 GB)
Rule of thumb: a smaller model that fits entirely in RAM will always feel faster and more responsive than a larger model that forces swapping. According to LLMCheck, Qwen 3.5 9B at Q4 (5.5 GB, 28 tok/s) is faster and more responsive than Qwen 3.5 35B at Q4 (20 GB) on a 16 GB Mac, where the larger model causes swapping.
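The rule of thumb can be turned into a simple chooser: take the largest variant whose Q4 size fits your available RAM with some headroom. The sizes are copied from the Qwen 3.5 list above; the 1 GB headroom default is my assumption:

```python
# (variant name, Q4 size in GB) for the Qwen 3.5 family, from the list above.
QWEN_35 = [("4B", 2.8), ("9B", 5.5), ("14B", 8.5), ("35B", 20.0), ("72B", 42.0)]

def best_fit(available_gb: float, family=QWEN_35, headroom_gb: float = 1.0):
    """Largest variant that fits with a little headroom left for context."""
    fitting = [name for name, size in family if size + headroom_gb <= available_gb]
    return fitting[-1] if fitting else None

print(best_fit(18))  # 24 GB Mac, ~18 GB available
print(best_fit(48))  # 64 GB Mac, ~48 GB available
```

In practice you may want a larger headroom value: long context windows can easily consume several gigabytes beyond the weights.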
Solution 4: Partial GPU Offloading
If a model is slightly too large for full GPU offloading, you can split it between GPU and CPU. This is slower than full GPU but much faster than full swapping.
In Ollama, this happens automatically — when a model cannot fully fit in GPU memory, Ollama places as many layers as possible on GPU and the remainder on CPU. You can see the split with ollama ps:
```shell
ollama ps
# Partial offload example:
# NAME          SIZE    PROCESSOR         UNTIL
# qwen3.5:35b   20 GB   60% GPU/40% CPU   4 minutes from now
```
In LM Studio, set the GPU layers slider to a specific number rather than "max" to control how many layers go to GPU.
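A sketch of how such a split can be estimated: divide the model evenly across its layers and count how many layers fit in the memory left for the GPU. The even per-layer split is my simplifying assumption; real engines also account for the embedding and output layers:

```python
import math

def gpu_layer_split(model_gb: float, n_layers: int, gpu_budget_gb: float):
    """How many transformer layers fit in the GPU budget, assuming each
    layer holds an equal share of the weights."""
    per_layer = model_gb / n_layers
    on_gpu = min(n_layers, math.floor(gpu_budget_gb / per_layer))
    return on_gpu, n_layers - on_gpu

# A 20 GB model with 64 layers, with 12 GB left for the GPU after OS/apps:
gpu, cpu = gpu_layer_split(20.0, 64, 12.0)
print(f"{gpu} layers on GPU, {cpu} on CPU (~{gpu / 64:.0%} GPU)")
```

With these illustrative numbers the split lands near the 60% GPU / 40% CPU shown in the ollama ps example.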
RAM Tier Guide: What Fits in Your Mac
According to LLMCheck benchmarks, here is what you can comfortably run at each RAM tier with Q4_K_M quantization:
| Total RAM | Available (~75%) | Best Models | Max Dense Size |
|---|---|---|---|
| 8 GB | ~6 GB | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 4B | Up to 4B |
| 16 GB | ~12 GB | Qwen 3.5 9B, Gemma 4 E4B, Llama 3.1 8B | Up to 9B |
| 24 GB | ~18 GB | Qwen 3.5 14B, Gemma 4 26B-A4B (MoE) | Up to 14B |
| 32 GB | ~24 GB | Qwen 3.5 35B, Gemma 4 31B Dense | Up to 32B |
| 64 GB | ~48 GB | Llama 3.1 70B, Qwen 3.5 72B, Llama 4 Scout (MoE) | Up to 70B |
| 128 GB | ~96 GB | Any model at any quantization | Up to 120B+ |
When to Accept You Need More RAM
Sometimes optimization is not enough. You need more RAM if:
- You consistently need 70B+ models — no amount of quantization makes 70B fit in 16 GB RAM
- You need long context — 32K+ context windows consume significant additional memory on top of the model
- You are running multiple models simultaneously — agents and tool-use workflows need several models loaded at once
- Quality from smaller models is insufficient — if Q4_K_M of a smaller model does not meet your needs, you need a bigger model which needs more RAM
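The long-context point can be quantified: the KV cache grows linearly with context length, on top of the weights. A sketch using the grouped-query-attention shape of Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128), which I am assuming here; other models differ:

```python
def kv_cache_gb(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) x layers x KV heads x head dim
    x context length x element size (fp16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"8K context:  ~{kv_cache_gb(8_192):.1f} GB on top of the weights")
print(f"32K context: ~{kv_cache_gb(32_768):.1f} GB on top of the weights")
```

For this shape a 32K context adds roughly 4 GB, which is why a model that "fits" at short context can start swapping once the window grows.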
According to LLMCheck, 24 GB is the sweet spot for most local AI users in 2026. It comfortably runs 14B dense models and MoE models up to 26B, which covers the vast majority of use cases.
Sources
- Ollama GitHub repository — Model management and quantization docs
- LLMCheck RAM Guide — Detailed RAM requirements for every model
- LLMCheck Leaderboard — Model sizes and performance data
- HuggingFace Quantization Guide — Technical quantization details