The 75% Rule

Not all of your Mac's RAM is available for AI models. macOS itself, system processes, the inference engine, and any open apps consume memory. According to LLMCheck testing, approximately 25% of total RAM is used by the system even with minimal apps running.

The formula: Available RAM for models ≈ Total RAM × 0.75

How to check: Open Activity Monitor → Memory tab. Look at "Memory Used" and "Memory Pressure." If pressure is green, you have headroom. Yellow means you are close to the limit. Red means you are swapping — the model is too large.
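The 75% rule is easy to turn into a quick check. The sketch below is illustrative (the function names and the fixed 25% overhead figure are assumptions based on the LLMCheck estimate above, not a macOS API):

```python
# Sketch of the 75% rule: estimate how much RAM is left for model
# weights after macOS and background processes take their share.

SYSTEM_OVERHEAD = 0.25  # ~25% of total RAM used by the system (per LLMCheck)

def available_for_models(total_ram_gb: float) -> float:
    """Available RAM for models = Total RAM x 0.75."""
    return total_ram_gb * (1 - SYSTEM_OVERHEAD)

def fits(model_size_gb: float, total_ram_gb: float) -> bool:
    """True if the model should fit without swapping."""
    return model_size_gb <= available_for_models(total_ram_gb)

print(available_for_models(32))  # 24.0 GB usable on a 32 GB Mac
print(fits(40, 32))              # False: a 40 GB model will swap
```

If `fits` returns False for the model you want, the solutions below are your options, in roughly the order you should try them.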

Solution 1: Use Smaller Quantization

Quantization reduces model precision from 16-bit floating point to 4-bit or even 2-bit integers. This shrinks the model dramatically with surprisingly little quality loss. According to LLMCheck benchmarks, Q4_K_M is the optimal choice — it reduces size by 70-75% with only 1-3% quality degradation.

| Quantization | 70B Model Size | Fits In | Quality Loss |
|---|---|---|---|
| F16 (full) | 140 GB | 192+ GB | None (baseline) |
| Q8_0 | 75 GB | 128 GB | ~0.5% |
| Q4_K_M | 40 GB | 64 GB | ~2% |
| Q4_K_S | 38 GB | 64 GB | ~3% |
| Q2_K | 27 GB | 36 GB | ~8-10% |
# Pull a specific quantization in Ollama
ollama pull llama3.1:70b-q4_K_M

# Check model size
ollama show llama3.1:70b-q4_K_M
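You can approximate these sizes yourself from parameter count and bits per weight. The sketch below is my own back-of-the-envelope estimate, not an Ollama calculation; the effective bits-per-weight figures are rough assumptions (K-quants mix precisions, so Q4_K_M averages a bit above 4 bits):

```python
# Rough size estimate: parameters x effective bits per weight.
# Bits-per-weight values are approximate assumptions for illustration.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 2.6}

def est_size_gb(params_billions: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # GB (1B params x 1 byte = 1 GB)

print(round(est_size_gb(70, "F16")))     # ~140 GB, matches the table
print(round(est_size_gb(70, "Q4_K_M")))  # ~42 GB, close to the 40 GB above
```

The estimate ignores the KV cache and runtime overhead, which is why the "Fits In" column leaves extra headroom beyond the raw model size.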

Solution 2: Switch to MoE Models

Mixture of Experts (MoE) models route each token through only a small subset of their parameters. All of the weights still need to fit in RAM, but because only a few billion parameters are active per token, inference runs at the speed of a much smaller dense model. In practice, this means better quality per gigabyte of RAM than a dense model of comparable speed.
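The trade-off can be sketched in two lines: RAM scales with total parameters, speed with active parameters. The numbers below are illustrative assumptions (the ~4.8 bits/weight Q4 figure and the inverse-speed model are simplifications, not benchmarks):

```python
# MoE trade-off sketch: RAM is driven by TOTAL parameters (all experts
# must be resident), speed by ACTIVE parameters per token.

def q4_ram_gb(total_params_b: float) -> float:
    return total_params_b * 4.8 / 8  # ~4.8 effective bits/weight at Q4

def relative_speed(active_params_b: float) -> float:
    return 1.0 / active_params_b  # tokens/sec roughly inverse in active params

dense_26b = (q4_ram_gb(26), relative_speed(26))   # same RAM, slow decode
moe_26b_a4b = (q4_ram_gb(26), relative_speed(4))  # same RAM, ~6x faster
```

This is why the table below shows a 26B MoE needing about as much RAM as a 26B dense model, while behaving like a 4B model at decode time.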

| Model | Type | RAM Needed (Q4) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B | Dense | ~16 GB | 82% |
| Gemma 4 26B-A4B | MoE (4B active) | ~18 GB | 79% |
| Llama 4 Scout | MoE (17B active) | ~65 GB | 84% |

According to LLMCheck, Gemma 4 26B-A4B is the best MoE option for 24-32 GB Macs. It needs only 18 GB of RAM at Q4 while delivering quality that rivals dense models twice its active parameter count.

ollama pull gemma4:26b-a4b

Solution 3: Use a Smaller Model in the Same Family

Model families like Qwen 3.5 and Gemma 4 offer multiple sizes. Dropping to a smaller variant in the same family preserves the model's training quality while fitting in less RAM.

Rule of thumb: A smaller model that fits entirely in RAM will always outperform a larger model that requires swapping. According to LLMCheck, Qwen 3.5 9B at Q4 (5.5 GB, 28 tok/s) is faster and more responsive than Qwen 3.5 35B at Q4 (20 GB) on a 16 GB Mac where the larger model causes swapping.

Solution 4: Partial GPU Offloading

If a model is slightly too large for full GPU offloading, you can split it between GPU and CPU. This is slower than full GPU but much faster than full swapping.

In Ollama, this happens automatically — when a model cannot fully fit in GPU memory, Ollama places as many layers as possible on GPU and the remainder on CPU. You can see the split with ollama ps:

ollama ps

# Partial offload example:
# NAME              SIZE    PROCESSOR      UNTIL
# qwen3.5:35b      20 GB   60% GPU/40% CPU  4 minutes from now

In LM Studio, set the GPU layers slider to a specific number rather than "max" to control how many layers go to GPU.
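If you want to predict the split before loading, a rough estimate follows the same logic Ollama uses: fit as many layers on the GPU as the budget allows, put the rest on CPU. The layer count and memory budget below are illustrative assumptions, not real model metadata:

```python
# Estimate a GPU/CPU layer split: as many layers as fit go to GPU.
# Assumes layers are roughly equal in size (true for most transformer LLMs).

def split_layers(model_gb: float, n_layers: int, gpu_budget_gb: float):
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int(gpu_budget_gb / per_layer_gb))
    return gpu_layers, n_layers - gpu_layers

gpu, cpu = split_layers(model_gb=20, n_layers=80, gpu_budget_gb=12)
print(f"{gpu} layers on GPU, {cpu} on CPU")  # 48 on GPU, 32 on CPU
```

With these numbers the split lands at 60% GPU / 40% CPU, matching the `ollama ps` output shown above.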

RAM Tier Guide: What Fits in Your Mac

According to LLMCheck benchmarks, here is what you can comfortably run at each RAM tier with Q4_K_M quantization:

| Total RAM | Available (~75%) | Best Models | Max Dense Size |
|---|---|---|---|
| 8 GB | ~6 GB | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 4B | Up to 4B |
| 16 GB | ~12 GB | Qwen 3.5 9B, Gemma 4 E4B, Llama 3.1 8B | Up to 9B |
| 24 GB | ~18 GB | Qwen 3.5 14B, Gemma 4 26B-A4B (MoE) | Up to 14B |
| 32 GB | ~24 GB | Qwen 3.5 35B, Gemma 4 31B Dense | Up to 32B |
| 64 GB | ~48 GB | Llama 3.1 70B, Qwen 3.5 72B, Llama 4 Scout (MoE) | Up to 70B |
| 128 GB | ~96 GB | Any model at any quantization | Up to 120B+ |
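The tier table above reduces to a simple lookup. This sketch encodes it as data (the tier boundaries mirror the table; the function name is my own):

```python
# RAM tier lookup: largest dense model size that comfortably fits at Q4_K_M.
TIERS = [(8, "4B"), (16, "9B"), (24, "14B"),
         (32, "32B"), (64, "70B"), (128, "120B+")]

def max_dense_size(total_ram_gb: int) -> str:
    """Return the max dense size for the highest tier you meet."""
    best = "none"
    for tier_ram, size in TIERS:
        if total_ram_gb >= tier_ram:
            best = size
    return best

print(max_dense_size(24))  # 14B
print(max_dense_size(64))  # 70B
```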

When to Accept You Need More RAM

Sometimes optimization is not enough. If the models you need consistently push Activity Monitor's memory pressure into yellow or red even after aggressive quantization, the hardware is the limit.

According to LLMCheck, 24 GB is the sweet spot for most local AI users in 2026. It comfortably runs 14B dense models and MoE models up to 26B, which covers the vast majority of use cases.

Sources