Apple Silicon uses Unified Memory: the CPU and GPU share one pool of RAM connected by a wide memory bus. The width of that bus — how many gigabytes per second it can move — is what governs LLM decode speed. To generate each new token, the chip must read a large fraction of the model's active weights out of that pool. No matter how many GPU cores you have, they sit idle waiting for those bytes to arrive. Bandwidth is the bottleneck; cores are not.

The Full Bandwidth Table (M1–M5)

Here is the Unified Memory bandwidth for every Apple Silicon tier across five generations. Higher is faster for token generation, full stop.

Generation Base Pro Max Ultra
M1 68 GB/s 200 GB/s 400 GB/s 800 GB/s
M2 100 GB/s 200 GB/s 400 GB/s 800 GB/s
M3 100 GB/s 150 GB/s 400 GB/s 800 GB/s
M4 120 GB/s 273 GB/s 546 GB/s 1092 GB/s
M5 ~150 GB/s ~273 GB/s ~600 GB/s ~1100 GB/s*

*M5 Ultra not yet released; figure is an expectation based on the two-Max-die UltraFusion pattern. See M5 Ultra Expectations below.

A few things jump out. The Max tier roughly quadruples the base tier within a generation, and the Ultra doubles the Max again — because an Ultra is literally two Max dies fused together over UltraFusion, which doubles the memory interface. Note also the M3 Pro's odd dip to 150 GB/s (Apple narrowed its bus that generation), recovered emphatically by the M4 Pro's 273 GB/s.

According to LLMCheck, the bandwidth column above predicts local-LLM generation speed better than GPU core count, clock speed, or even the chip generation. When in doubt, follow the GB/s.

How to Estimate tok/s From Bandwidth

Because token generation reads roughly the model's active weights once per token, you can estimate an upper bound on speed with a single division:

tok/s (ceiling) ≈ bandwidth (GB/s) ÷ model-size-in-memory (GB)

The intuition: if your model occupies 4 GB of memory and your chip moves 546 GB/s, then in one second the bus can do about 546 ÷ 4 ≈ 137 full passes over the weights — and each pass produces roughly one token. So 137 tok/s is the theoretical ceiling.

Real machines never hit the ceiling. KV-cache reads grow with context length, attention has overhead, and the inference runtime is never perfectly efficient. In practice you land at roughly 50–80% of the ideal, so that 4 GB Q4 model on a 546 GB/s M4 Max realistically delivers around 50–80 tok/s. Worked examples:

Model size in memory Chip / bandwidth Ideal ceiling Realistic
4 GB (8B Q4) M4 Max / 546 GB/s ~137 tok/s ~50–80 tok/s
4 GB (8B Q4) M4 base / 120 GB/s ~30 tok/s ~15–24 tok/s
20 GB (32B Q4) M4 Max / 546 GB/s ~27 tok/s ~14–22 tok/s
20 GB (32B Q4) M5 Ultra* / ~1100 GB/s ~55 tok/s ~28–44 tok/s

Two practical lessons fall out of the formula. First, smaller (or more aggressively quantized) models run faster on the same chip, because the divisor shrinks. Second, the same model gets faster on a higher-bandwidth chip in near-linear proportion — double the GB/s, roughly double the tok/s.

One important caveat: this formula governs generation (decode). The other phase, prompt processing (prefill) — reading your entire prompt before the first token — is more compute-bound than bandwidth-bound, which is where GPU cores and the M5's per-core Neural Accelerators actually earn their keep. So bandwidth rules decode; compute rules prefill.

Why Max & Ultra Crush Base/Pro for Big Models

The bigger the model, the more the bandwidth gap between tiers matters — and it compounds with a capacity advantage too. There are two separate reasons to favor Max and Ultra for serious local LLMs:

For small 7–8B models, a base or Pro chip is perfectly pleasant and the premium tiers are overkill. The case for Max and Ultra is built entirely on large models and long contexts, where both their bandwidth and their RAM headroom pull decisively ahead. If you're shopping by workload rather than spec sheet, our Mac hardware guide breaks down which tier fits which model size.

How to Measure Bandwidth on Your Mac

You don't have to trust the spec sheet — you can observe what your machine actually delivers:

If your measured tok/s lands well below the realistic range for your bandwidth tier, the usual culprits are an unquantized model, an oversized context window inflating KV-cache traffic, or another app contending for memory bandwidth in the background.

M5 Ultra Expectations

The M5 Ultra hasn't shipped yet, but its bandwidth is reasonably predictable. Apple builds every Ultra by fusing two Max dies with the UltraFusion interconnect, which doubles the Max's memory interface — that's exactly how the M1/M2/M3 Ultras hit 800 GB/s (2× their 400 GB/s Max) and the M4 Ultra reached 1,092 GB/s (2× the 546 GB/s M4 Max).

Likely M5 Ultra bandwidth

With the M5 Max sitting near 600 GB/s, doubling it puts the M5 Ultra around ~1,100 GB/s. That would make it the fastest Apple Silicon yet for local-LLM decode — and, paired with very high RAM ceilings, the natural home for the largest models you can run on a Mac.

What to do until then

If you need top-tier bandwidth today, the M4 Ultra's 1,092 GB/s is already excellent and widely available. The M5 Max at ~600 GB/s is the strongest laptop-class option. Match RAM to your largest model first, then chase GB/s.

Treat the ~1,100 GB/s figure as an informed expectation, not a confirmed spec — until Apple ships the part and independent benchmarks land, the M4 Ultra remains the measured bandwidth king.