How Unified Memory Works for AI

On traditional PCs, running a large language model requires loading the model weights into a dedicated GPU's VRAM. An NVIDIA RTX 4090 has 24 GB of VRAM, which limits the model size you can run at full speed. Any overflow spills to system RAM, and speed drops dramatically.

Apple Silicon works differently. The CPU, GPU, and Neural Engine all share a single pool of Unified Memory. When you load a 20 GB model on a 32 GB Mac, the GPU accesses those weights directly without any copying or bus transfers. This is why a Mac with 32 GB of Unified Memory can outperform a PC with 24 GB of dedicated VRAM for certain model sizes — there is no artificial boundary between "GPU memory" and "system memory."

According to LLMCheck testing, this architecture makes Macs uniquely efficient for running models that are slightly too large for discrete GPUs. A 32 GB Mac runs a 20 GB model at full GPU speed, while a 24 GB GPU PC would need to partially offload the same model to slower system RAM.
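A back-of-the-envelope sketch of why partial offload hurts so much. The slowdown factor here is an illustrative assumption, not an LLMCheck measurement:

```python
def effective_speed(tok_s_full: float, frac_offloaded: float,
                    slowdown: float = 8.0) -> float:
    """Tokens/sec when a fraction of the weights lives in slower memory.
    `slowdown` is an illustrative penalty factor, not a measured constant."""
    # Per-token time is a weighted mix: layers in fast memory run at full
    # speed, offloaded layers run `slowdown` times slower.
    return tok_s_full / ((1 - frac_offloaded) + frac_offloaded * slowdown)

effective_speed(45, 0.0)   # full speed: the whole model fits on the GPU
effective_speed(45, 0.25)  # a quarter offloaded already cuts speed ~3x
```

Because generation is memory-bandwidth-bound, even a small offloaded fraction dominates the per-token time.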

RAM Tier Breakdown with Recommendations

Here is what each RAM configuration can realistically run, based on our standardized benchmarks across Apple Silicon chips:

| RAM    | Best Model         | File Size | Free RAM Needed | tok/s | Quality Level |
|--------|--------------------|-----------|-----------------|-------|---------------|
| 8 GB   | Phi-4 Mini (3.8B)  | 2.4 GB    | ~4 GB           | ~135  | Basic         |
| 16 GB  | Qwen 3.5 9B        | 5.5 GB    | ~8 GB           | ~100  | Strong        |
| 24 GB  | Llama 3.1 14B      | 8.5 GB    | ~13 GB          | ~65   | Very Strong   |
| 32 GB  | Qwen 3.5 35B MoE   | 20 GB     | ~30 GB          | ~45   | Near-Frontier |
| 64 GB  | DeepSeek R1 70B    | 40 GB     | ~60 GB          | ~10   | Frontier      |
| 128 GB | Qwen 3.5 122B MoE  | 70 GB     | ~105 GB         | ~8    | Frontier+     |
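The table can be encoded as a simple lookup. This is a sketch using the figures above; the function name and structure are ours, not an LLMCheck tool:

```python
# LLMCheck tier table as data: (RAM GB, model, file GB, tok/s, quality)
TIERS = [
    (8,   "Phi-4 Mini (3.8B)",  2.4, 135, "Basic"),
    (16,  "Qwen 3.5 9B",        5.5, 100, "Strong"),
    (24,  "Llama 3.1 14B",      8.5,  65, "Very Strong"),
    (32,  "Qwen 3.5 35B MoE",  20.0,  45, "Near-Frontier"),
    (64,  "DeepSeek R1 70B",   40.0,  10, "Frontier"),
    (128, "Qwen 3.5 122B MoE", 70.0,   8, "Frontier+"),
]

def best_model_for(ram_gb: int):
    """Return the name of the highest tier that fits in ram_gb."""
    candidates = [t for t in TIERS if t[0] <= ram_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t[0])[1]

best_model_for(32)  # "Qwen 3.5 35B MoE"
```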

The sweet spot: According to LLMCheck data, 32 GB offers the best value-for-intelligence ratio. The Qwen 3.5 35B MoE model available at this tier scores within 10-15% of models requiring twice the RAM, thanks to its efficient Mixture-of-Experts architecture.

The 1.5x Memory Rule Explained

A common mistake is assuming you need exactly as much RAM as the model file size. In practice, your Mac needs approximately 1.5 times the model's file size in free memory. The extra headroom covers more than the weights: the KV cache grows with context length, inference creates temporary activation buffers, and macOS and your other apps are already claiming part of the pool.

If the total exceeds your available RAM, macOS starts swapping to the SSD. According to LLMCheck benchmarks, even partial swapping drops token generation speed by 5-10x, making the model effectively unusable for interactive work.

The MoE Advantage: Big Models on Small RAM

Mixture-of-Experts (MoE) models are a game-changer for memory-constrained Macs. Traditional "dense" models activate every parameter for every token. MoE models only activate a fraction of their parameters per token, while still benefiting from the full model's training knowledge.

The practical impact is dramatic. Qwen 3.5 35B is an MoE model with 35 billion total parameters, but it only activates roughly 8 billion per forward pass. It therefore generates tokens at something like 8B-model speed while delivering quality closer to a 30B-class dense model, and it still fits in 32 GB of RAM. The trade-off is a larger file size relative to its "active" parameter count, but the quality jump is substantial.

For Mac users, MoE models effectively give you one tier of intelligence above what your RAM would normally allow. A 32 GB Mac with an MoE model approaches what used to require 64 GB with dense architectures.
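The MoE arithmetic can be sketched directly. The bytes-per-parameter figure below is an assumption approximating a ~4-5 bit quantization (actual size varies by format); note that memory scales with total parameters while per-token compute scales with active ones:

```python
def moe_estimates(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 0.6):
    """Rough MoE sizing. bytes_per_param=0.6 is an illustrative value
    for a ~4-5 bit quant. Returns (file size in GB, fraction of the
    weights touched per token)."""
    file_gb = total_params_b * bytes_per_param       # memory: ALL params
    active_fraction = active_params_b / total_params_b  # compute: ACTIVE
    return file_gb, active_fraction

file_gb, frac = moe_estimates(35, 8)
# ~21 GB on disk (close to the table's 20 GB figure), yet each token
# only touches about 23% of the weights
```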

Future-Proofing Your Mac Purchase

Mac RAM cannot be upgraded after purchase. Every Apple Silicon Mac has its memory soldered onto the system-on-chip package. This makes your initial RAM choice a decision that lasts the entire 5-7 year lifespan of the machine.

Model efficiency is improving rapidly. Models that required 64 GB two years ago now have distilled versions running on 16 GB with 80% of the quality. However, the frontier keeps advancing too. If you want to run the best available local model three years from now, buy one tier above what you need today.