Apple Silicon uses Unified Memory: the CPU and GPU share one pool of RAM connected by a wide memory bus. The width of that bus — how many gigabytes per second it can move — is what governs LLM decode speed. To generate each new token, the chip must read a large fraction of the model's active weights out of that pool. No matter how many GPU cores you have, they sit idle waiting for those bytes to arrive. Bandwidth is the bottleneck; cores are not.
The Full Bandwidth Table (M1–M5)
Here is the Unified Memory bandwidth for every Apple Silicon tier across five generations. Higher is faster for token generation, full stop.
| Generation | Base | Pro | Max | Ultra |
|---|---|---|---|---|
| M1 | 68 GB/s | 200 GB/s | 400 GB/s | 800 GB/s |
| M2 | 100 GB/s | 200 GB/s | 400 GB/s | 800 GB/s |
| M3 | 100 GB/s | 150 GB/s | 400 GB/s | 800 GB/s |
| M4 | 120 GB/s | 273 GB/s | 546 GB/s | 1092 GB/s |
| M5 | ~150 GB/s | ~273 GB/s | ~600 GB/s | ~1100 GB/s* |
*M5 Ultra not yet released; figure is an expectation based on the two-Max-die UltraFusion pattern. See M5 Ultra Expectations below.
A few things jump out. The Max tier roughly quadruples the base tier within a generation, and the Ultra doubles the Max again — because an Ultra is literally two Max dies fused together over UltraFusion, which doubles the memory interface. Note also the M3 Pro's odd dip to 150 GB/s (Apple narrowed its bus that generation), recovered emphatically by the M4 Pro's 273 GB/s.
According to LLMCheck, the bandwidth column above predicts local-LLM generation speed better than GPU core count, clock speed, or even the chip generation. When in doubt, follow the GB/s.
How to Estimate tok/s From Bandwidth
Because token generation reads roughly the model's active weights once per token, you can estimate an upper bound on speed with a single division:
The intuition: if your model occupies 4 GB of memory and your chip moves 546 GB/s, then in one second the bus can do about 546 ÷ 4 ≈ 137 full passes over the weights — and each pass produces roughly one token. So 137 tok/s is the theoretical ceiling.
Real machines never hit the ceiling. KV-cache reads grow with context length, attention has overhead, and the inference runtime is never perfectly efficient. In practice you land at roughly 50–80% of the ideal, so that 4 GB Q4 model on a 546 GB/s M4 Max realistically delivers around 50–80 tok/s. Worked examples:
| Model size in memory | Chip / bandwidth | Ideal ceiling | Realistic |
|---|---|---|---|
| 4 GB (8B Q4) | M4 Max / 546 GB/s | ~137 tok/s | ~50–80 tok/s |
| 4 GB (8B Q4) | M4 base / 120 GB/s | ~30 tok/s | ~15–24 tok/s |
| 20 GB (32B Q4) | M4 Max / 546 GB/s | ~27 tok/s | ~14–22 tok/s |
| 20 GB (32B Q4) | M5 Ultra* / ~1100 GB/s | ~55 tok/s | ~28–44 tok/s |
Two practical lessons fall out of the formula. First, smaller (or more aggressively quantized) models run faster on the same chip, because the divisor shrinks. Second, the same model gets faster on a higher-bandwidth chip in near-linear proportion — double the GB/s, roughly double the tok/s.
One important caveat: this formula governs generation (decode). The other phase, prompt processing (prefill) — reading your entire prompt before the first token — is more compute-bound than bandwidth-bound, which is where GPU cores and the M5's per-core Neural Accelerators actually earn their keep. So bandwidth rules decode; compute rules prefill.
Why Max & Ultra Crush Base/Pro for Big Models
The bigger the model, the more the bandwidth gap between tiers matters — and it compounds with a capacity advantage too. There are two separate reasons to favor Max and Ultra for serious local LLMs:
- Bandwidth scales the speed. A 20 GB model that crawls at single-digit tok/s on a 120 GB/s base chip becomes genuinely usable on a 546 GB/s Max, and comfortable on a 1,092 GB/s Ultra. The divisor (model size) is large, so the dividend (bandwidth) decides everything.
- RAM ceilings unlock the model at all. Max and Ultra configurations offer far higher Unified Memory caps (up to 128 GB on laptops, more on desktop Ultra). A model that doesn't fit in RAM either won't load or spills to disk and grinds to a halt — so the high-RAM tiers are often the only ones that can run a large model in the first place.
For small 7–8B models, a base or Pro chip is perfectly pleasant and the premium tiers are overkill. The case for Max and Ultra is built entirely on large models and long contexts, where both their bandwidth and their RAM headroom pull decisively ahead. If you're shopping by workload rather than spec sheet, our Mac hardware guide breaks down which tier fits which model size.
How to Measure Bandwidth on Your Mac
You don't have to trust the spec sheet — you can observe what your machine actually delivers:
- Run a real LLM decode test (best). Load a known model in Ollama or LM Studio and read the reported tok/s, then compare it against the bandwidth-derived ceiling for that model's memory footprint. This is the most honest measure of AI performance, because it captures the runtime overhead the synthetic numbers ignore.
- Synthetic memory benchmark. Tools like the classic Stream benchmark, or Metal-based bandwidth probes, report sustained GB/s directly. Useful for sanity-checking that your hardware hits its rated bus speed.
- Cross-check against published data. Compare your result against standardized measurements — our benchmarks page lists tok/s for many model-and-chip pairs so you can see whether your machine is performing as expected.
If your measured tok/s lands well below the realistic range for your bandwidth tier, the usual culprits are an unquantized model, an oversized context window inflating KV-cache traffic, or another app contending for memory bandwidth in the background.
M5 Ultra Expectations
The M5 Ultra hasn't shipped yet, but its bandwidth is reasonably predictable. Apple builds every Ultra by fusing two Max dies with the UltraFusion interconnect, which doubles the Max's memory interface — that's exactly how the M1/M2/M3 Ultras hit 800 GB/s (2× their 400 GB/s Max) and the M4 Ultra reached 1,092 GB/s (2× the 546 GB/s M4 Max).
Likely M5 Ultra bandwidth
With the M5 Max sitting near 600 GB/s, doubling it puts the M5 Ultra around ~1,100 GB/s. That would make it the fastest Apple Silicon yet for local-LLM decode — and, paired with very high RAM ceilings, the natural home for the largest models you can run on a Mac.
What to do until then
If you need top-tier bandwidth today, the M4 Ultra's 1,092 GB/s is already excellent and widely available. The M5 Max at ~600 GB/s is the strongest laptop-class option. Match RAM to your largest model first, then chase GB/s.
Treat the ~1,100 GB/s figure as an informed expectation, not a confirmed spec — until Apple ships the part and independent benchmarks land, the M4 Ultra remains the measured bandwidth king.