How Unified Memory Works for AI
On traditional PCs, running a large language model requires loading the model weights into a dedicated GPU's VRAM. An NVIDIA RTX 4090 has 24 GB of VRAM, which limits the model size you can run at full speed. Any overflow spills to system RAM, and speed drops dramatically.
Apple Silicon works differently. The CPU, GPU, and Neural Engine all share a single pool of Unified Memory. When you load a 20 GB model on a 32 GB Mac, the GPU accesses those weights directly without any copying or bus transfers. This is why a Mac with 32 GB of Unified Memory can outperform a PC with 24 GB of dedicated VRAM for certain model sizes — there is no artificial boundary between "GPU memory" and "system memory."
According to LLMCheck testing, this architecture makes Macs uniquely efficient for running models that are slightly too large for discrete GPUs. A 32 GB Mac runs a 20 GB model at full GPU speed, while a 24 GB GPU PC would need to partially offload the same model to slower system RAM.
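The comparison can be sketched numerically. This is an illustrative helper, not an LLMCheck tool; it applies the 1.5x memory rule explained later in this article to decide whether a model fits entirely in a single memory pool.

```python
def fits(model_gb: float, pool_gb: float, factor: float = 1.5) -> bool:
    """A model runs at full GPU speed only if its weights plus
    KV-cache and runtime overhead (~1.5x the file size) fit in
    the memory pool the GPU can address directly."""
    return model_gb * factor <= pool_gb

model_gb = 20.0  # the 20 GB model from the example above

print(fits(model_gb, 24.0))  # False -> 24 GB VRAM: overflow spills to slower system RAM
print(fits(model_gb, 32.0))  # True  -> 32 GB unified memory: everything stays in one pool
```

The point is not the exact threshold but the asymmetry: on a discrete GPU only the VRAM counts toward the fast path, while on Apple Silicon the entire unified pool does.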
RAM Tier Breakdown with Recommendations
Here is what each RAM configuration can realistically run, based on our standardized benchmarks across Apple Silicon chips:
| RAM | Best Model | File Size | Free RAM Needed | tok/s | Quality Level |
|---|---|---|---|---|---|
| 8 GB | Phi-4 Mini (3.8B) | 2.4 GB | ~4 GB | ~135 | Basic |
| 16 GB | Qwen 3.5 9B | 5.5 GB | ~8 GB | ~100 | Strong |
| 24 GB | Llama 3.1 14B | 8.5 GB | ~13 GB | ~65 | Very Strong |
| 32 GB | Qwen 3.5 35B MoE | 20 GB | ~30 GB | ~45 | Near-Frontier |
| 64 GB | DeepSeek R1 70B | 40 GB | ~60 GB | ~10 | Frontier |
| 128 GB | Qwen 3.5 122B MoE | 70 GB | ~105 GB | ~8 | Frontier+ |
The sweet spot: According to LLMCheck data, 32 GB offers the best value-for-intelligence ratio. The Qwen 3.5 35B MoE model available at this tier scores within 10-15% of models requiring twice the RAM, thanks to its efficient Mixture-of-Experts architecture.
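The tier table can be restated as a small lookup for sanity-checking a configuration. `best_model_for` is a hypothetical helper; the data simply mirrors the table above.

```python
# (RAM in GB, recommended model) -- restating the tier table above
TIERS = [
    (8,   "Phi-4 Mini (3.8B)"),
    (16,  "Qwen 3.5 9B"),
    (24,  "Llama 3.1 14B"),
    (32,  "Qwen 3.5 35B MoE"),
    (64,  "DeepSeek R1 70B"),
    (128, "Qwen 3.5 122B MoE"),
]

def best_model_for(ram_gb: int) -> str:
    """Return the largest recommended model whose tier your RAM meets."""
    eligible = [model for tier_ram, model in TIERS if ram_gb >= tier_ram]
    if not eligible:
        raise ValueError("8 GB is the minimum tier for local LLMs")
    return eligible[-1]

print(best_model_for(32))  # Qwen 3.5 35B MoE
print(best_model_for(48))  # Qwen 3.5 35B MoE -- the next tier starts at 64 GB
```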
The 1.5x Memory Rule Explained
A common mistake is assuming you need exactly as much RAM as the model file size. In practice, your Mac needs roughly 1.5 times the model's file size in free memory. Here is why:
- Model weights (1x): The base model file must be loaded entirely into memory. A Q4-quantized 70B model is approximately 40 GB on disk.
- KV-cache (~0.3x): As you chat with the model, it maintains a key-value cache that stores the conversation context. This grows with longer conversations and can consume several gigabytes for large context windows.
- Inference engine (~0.1x): Ollama, LM Studio, or the MLX framework itself needs working memory for computation buffers and the Metal shader pipeline.
- macOS overhead (~0.1x): The operating system, Finder, and background processes need memory too. macOS itself uses 3-5 GB at idle.
If the total exceeds your available RAM, macOS starts swapping to the SSD. According to LLMCheck benchmarks, even partial swapping drops token generation speed by 5-10x, making the model effectively unusable for interactive work.
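The arithmetic of the rule can be written out directly. This is a minimal sketch; the multipliers are the rough fractions listed above, not exact measurements (the macOS share in particular is closer to a fixed 3-5 GB than a true multiple of model size).

```python
def required_ram_gb(model_file_gb: float) -> float:
    """Sum the components of the 1.5x memory rule described above."""
    weights  = 1.0 * model_file_gb  # model weights, loaded in full
    kv_cache = 0.3 * model_file_gb  # conversation context (grows with context window)
    engine   = 0.1 * model_file_gb  # Ollama / LM Studio / MLX compute buffers
    macos    = 0.1 * model_file_gb  # OS and background processes (roughly)
    return weights + kv_cache + engine + macos

# Q4-quantized 70B model, ~40 GB on disk
print(required_ram_gb(40.0))  # 60.0 -> matches the 64 GB tier in the table
```

Anything beyond this total comes out of swap, which is exactly the 5-10x slowdown described above.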
The MoE Advantage: Big Models on Small RAM
Mixture-of-Experts (MoE) models are a game-changer for memory-constrained Macs. Traditional "dense" models activate every parameter for every token. MoE models only activate a fraction of their parameters per token, while still benefiting from the full model's training knowledge.
The practical impact is dramatic. Qwen 3.5 35B is an MoE model with 35 billion total parameters but only activates roughly 8 billion per forward pass. This means it fits in 32 GB of RAM while delivering intelligence closer to a traditional 30B+ dense model. The trade-off is a larger file size relative to its "active" parameter count, but the quality jump is substantial.
For Mac users, MoE models effectively give you one tier of intelligence above what your RAM would normally allow. A 32 GB Mac with an MoE model approaches what used to require 64 GB with dense architectures.
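The memory/compute split behind this advantage can be sketched as follows. The 0.5 bytes-per-parameter figure assumes 4-bit (Q4) quantization and ignores embedding tables and metadata, which is why real files run somewhat larger (the 20 GB figure in the table above).

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 0.5) -> tuple[float, float]:
    """Memory scales with TOTAL parameters (every expert must stay
    resident in RAM), while per-token compute scales with ACTIVE
    parameters. bytes_per_param ~= 0.5 for Q4 (4 bits per weight)."""
    memory_gb = total_params_b * bytes_per_param      # what fills your RAM
    compute_ratio = active_params_b / total_params_b  # fraction of work per token
    return memory_gb, compute_ratio

# Qwen 3.5 35B MoE: 35B total parameters, ~8B active per forward pass
mem, ratio = moe_footprint(total_params_b=35, active_params_b=8)
print(f"{mem:.1f} GB resident, {ratio:.0%} of parameters active per token")
```

In other words, you pay for all 35B parameters in RAM but only about a quarter of them in per-token compute, which is why MoE models generate tokens faster than dense models of the same total size.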
Future-Proofing Your Mac Purchase
Mac RAM cannot be upgraded after purchase. Every Apple Silicon Mac has its memory soldered onto the system-on-chip package. This makes your initial RAM choice a decision that lasts the entire 5-7 year lifespan of the machine.
Model efficiency is improving rapidly. Models that required 64 GB two years ago now have distilled versions running on 16 GB with 80% of the quality. However, the frontier keeps advancing too. If you want to run the best available local model three years from now, buy one tier above what you need today.
- Casual AI use (chat, summarization): 16 GB minimum, 24 GB recommended
- Professional AI use (coding, analysis, writing): 32 GB minimum, 64 GB recommended
- AI development and research: 64 GB minimum, 128 GB recommended