RAM Requirements at a Glance
Gemma 4 ships in four sizes, and the hardware gap between them is enormous. The table below shows what each variant demands at INT4 quantization (the most common for local inference) and at full FP16 precision.[1]
| Variant | INT4 RAM | FP16 RAM | Storage | Minimum Mac |
|---|---|---|---|---|
| Gemma 4 E2B | ~1.5 GB | ~5 GB | ~1 GB | Any Mac (8 GB+) |
| Gemma 4 E4B | ~3 GB | ~10 GB | ~2 GB | Any Mac (8 GB+) |
| Gemma 4 26B-A4B | ~18 GB | ~52 GB | ~16 GB | M3/M4/M5 Pro 24 GB+ |
| Gemma 4 31B | ~20 GB | ~62 GB | ~18 GB | M4/M5 Pro 24 GB+ |
Key insight: The E2B and E4B variants are remarkably efficient. Gemma 4 E4B is the default when you run `ollama run gemma4`, and it fits comfortably on any 8 GB Mac with room to spare for the OS and other apps. The 26B-A4B is a Mixture-of-Experts model that activates only 4B parameters per token, so it punches well above its active parameter count in quality while keeping active memory reasonable.
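The table's minimums follow a simple rule: installed RAM minus OS headroom must cover the INT4 footprint. A minimal sketch of that check, where the rounded-up footprints and the 4 GB headroom are assumptions taken from the table above, not output of any official tool:

```shell
# Rough "will it fit?" check against the INT4 column above.
# Footprints are rounded up to whole GB; 4 GB is an assumed OS headroom.
fits_int4() {  # fits_int4 <installed_ram_gb>
  local ram_gb=$1 headroom=4 entry name need
  for entry in E2B:2 E4B:3 26B-A4B:18 31B:20; do
    name=${entry%:*}   # variant name before the colon
    need=${entry#*:}   # INT4 footprint in GB after the colon
    if (( ram_gb - headroom >= need )); then
      echo "$name: fits"
    else
      echo "$name: does not fit"
    fi
  done
}

fits_int4 16
```

On a 16 GB Mac this reports the two efficient variants as fitting and the two large ones as not, matching the "Minimum Mac" column; at 24 GB all four clear the bar individually.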
Apple Silicon Performance by Chip
According to LLMCheck hardware testing, here are the estimated tokens per second for each Gemma 4 variant across Apple Silicon chips at Q4_K_M quantization.[2]
| Chip (RAM) | E2B (tok/s) | E4B | 26B-A4B | 31B |
|---|---|---|---|---|
| M1 8 GB | ~60 | ~45 | -- | -- |
| M2 16 GB | ~75 | ~55 | -- | -- |
| M3 Pro 18 GB | ~85 | ~65 | ~18 (tight) | -- |
| M4 Pro 24 GB | ~95 | ~75 | ~28 | ~14 |
| M5 Pro 24 GB | ~110 | ~90 | ~35 | ~18 |
| M4 Max 64 GB | ~120 | ~100 | ~40 | ~20 |
| M5 Max 128 GB | ~155 | ~125 | ~48 | ~24 |
| M5 Ultra | ~160 | ~130 | ~55 | ~28 |
The "--" entries mean the model does not fit in that chip's RAM at INT4 quantization. The M3 Pro 18 GB can technically load the 26B-A4B, but with essentially no headroom left it will swap aggressively, and the ~18 tok/s figure degrades quickly under real workloads.
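These throughput numbers track a simple ceiling: for a dense model, decode speed is bounded by memory bandwidth divided by the bytes of weights read per token. A back-of-envelope sketch, using bandwidth and weight-size figures from this article (real throughput lands noticeably below the ceiling):

```shell
# Ceiling on decode speed for a dense model:
#   tok/s <= memory bandwidth (GB/s) / quantized weight size (GB)
# Ignores KV cache reads and scheduling overhead, so it is an upper bound.
est_ceiling() {  # est_ceiling <bandwidth_GB_per_s> <weights_GB>
  awk -v bw="$1" -v w="$2" 'BEGIN { printf "%.0f\n", bw / w }'
}

est_ceiling 600 20   # M5 Max + 31B at INT4 (~20 GB)
```

For the M5 Max running the 31B at INT4 this gives a ceiling of ~30 tok/s; the measured ~24 tok/s in the table is roughly 80% of that, which is why bandwidth, not compute, dominates the chip-to-chip differences.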
M5 Max & M5 Pro: Why They're Ideal for Gemma 4
The M5 generation brings three advantages that matter specifically for Gemma 4 inference:
- Memory Bandwidth: The M5 Max delivers 600 GB/s (vs. 546 GB/s on the M4 Max), a ~10% increase that translates directly into faster token generation; the M5 Pro provides 273 GB/s, with its Neural Engine rated at 20 TOPS. Since LLM token generation is memory-bandwidth bound, bandwidth is the single biggest performance lever.
- MLX Day-One Support: According to LLMCheck testing, Apple's MLX framework provides optimized inference paths for Gemma 4 on Apple Silicon from launch. MLX uses Metal GPU shaders tuned to the M5's specific architecture, delivering 20-50% higher throughput than llama.cpp for the same models.
- Neural Accelerators: The M5's redesigned Neural Engine accelerates transformer attention and embedding operations during prefill. This means faster time-to-first-token, especially for the 26B-A4B MoE model where routing computation benefits from dedicated hardware.
- Unified Memory: Unlike discrete GPUs, where VRAM is separate from system RAM, Apple Silicon's Unified Memory architecture lets model weights draw on the same pool as everything else. A 64 GB M5 Max can devote the bulk of its 64 GB to model weights -- no CPU/GPU split, no PCIe bottleneck -- with only macOS's own reservation taken off the top.
M5 Max vs M4 Max for Gemma 4 31B: The M5 Max hits ~24 tok/s compared to the M4 Max's ~20 tok/s -- a 20% improvement that makes the difference between a slightly sluggish and a comfortable conversational experience.
Which Mac Should You Buy for Gemma 4?
Budget: MacBook Air M3/M4/M5 (8-16 GB) -- $999-$1,299
Runs Gemma 4 E2B and E4B at excellent speeds. The E4B is the sweet spot here -- it is the default Ollama model and delivers strong coding, writing, and reasoning performance in a model that barely touches your RAM. This is the entry point for local Gemma 4.
Mid-Range: MacBook Pro M4/M5 Pro 24 GB -- $1,999-$2,499
Unlocks the 26B-A4B MoE model, which is where Gemma 4 gets seriously capable. At ~28-35 tok/s this is fast enough for real-time coding assistance and extended conversations. Also runs the 31B dense model at a functional 14-18 tok/s.
Performance: Mac Studio M4/M5 Max 64 GB+ -- $3,199+
The best option for running the 26B-A4B and 31B models at full speed, plus enough headroom to keep other applications open. The M5 Max 128 GB configuration can run all four Gemma 4 variants simultaneously.
Workstation: Mac Studio M5 Ultra 128 GB+ -- $6,999+
For researchers and teams who need to serve Gemma 4 models to multiple users, run FP16 precision variants, or maintain several models in memory while running other intensive workloads.
Quantization Guide: Q4 vs Q6 vs Q8
Quantization compresses model weights to reduce RAM usage at the cost of some quality. Here is how it affects each Gemma 4 variant:
- Q4_K_M (4-bit): The default for most users. Minimal quality loss for the E2B and E4B models. The 26B-A4B and 31B show a small but measurable quality drop on complex reasoning tasks (~2-3% on benchmarks).
- Q6_K (6-bit): Roughly a third more RAM than Q4. Recovers most of the quality gap for the 31B model. Recommended if you have the headroom -- the 31B at Q6 needs ~25 GB, fitting on a 36 GB Mac.
- Q8_0 (8-bit): Near-lossless quality. E2B at Q8 still only needs ~4 GB, making it an easy choice on 16 GB+ Macs. The 31B at Q8 requires ~35 GB -- tight on a 36 GB machine but comfortable on 48 GB+.
- FP16 (full precision): Reference quality, but the 26B-A4B at FP16 demands ~52 GB and the 31B needs ~62 GB. Only practical on M5 Max 64 GB+ or M5 Ultra.
Running Multiple Gemma 4 Models
A common workflow is to keep a fast model (E4B) loaded for quick tasks while routing complex queries to the 26B-A4B. According to LLMCheck hardware testing, here is what you need:
- E4B + 26B-A4B simultaneously: ~21 GB at INT4. Works on any 24 GB+ Mac with a few GB left for the OS. On a 36 GB M5 Pro, you get comfortable headroom.
- E4B + 31B simultaneously: ~23 GB at INT4. Needs 24 GB+ with minimal overhead, or 36 GB+ for comfort.
- All four variants loaded: ~43 GB at INT4. Requires M5 Max 64 GB or M5 Ultra. Ollama handles automatic model loading and unloading, but having all models resident in memory eliminates cold-start latency.
```shell
# Run E4B (default) and 26B-A4B side by side with Ollama
ollama run gemma4            # loads E4B (~3 GB)
ollama run gemma4:26b-a4b    # loads 26B-A4B (~18 GB)

# Check memory usage
ollama ps
```
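The multi-model totals above can be sanity-checked the same way as the single-model minimums: sum the resident INT4 footprints and compare against installed RAM minus OS headroom. A minimal sketch, where the footprints come from the RAM table earlier and the 3 GB headroom is an assumption:

```shell
# Do several resident models fit together? Sum of INT4 footprints vs.
# installed RAM minus an assumed 3 GB OS headroom.
can_colocate() {  # can_colocate <ram_gb> <model_gb>...
  local ram_gb=$1 headroom=3 total=0 s
  shift
  for s in "$@"; do
    total=$(( total + s ))
  done
  if (( ram_gb - headroom >= total )); then
    echo "fits: ${total} GB resident"
  else
    echo "does not fit: ${total} GB resident"
  fi
}

can_colocate 24 3 18      # E4B + 26B-A4B on a 24 GB Mac
can_colocate 24 2 3 18 20 # all four variants need a 64 GB class machine
```

The first call confirms the ~21 GB pairing squeaks onto a 24 GB Mac; the second shows why keeping all four variants resident (~43 GB) pushes you to the M5 Max 64 GB or M5 Ultra.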