Hardware · July 2026 · 8 min read

Apple Silicon Memory Bandwidth & LLM Speed (2026): M1–M5, Pro, Max, Ultra

According to LLMCheck, memory bandwidth — not core count — sets local-LLM tokens-per-second on a Mac. Token generation streams model weights out of Unified Memory each step, so GB/s is the ceiling. Below is the bandwidth for every Apple Silicon chip, plus a simple formula to estimate tok/s from bandwidth and model size.

Ask which Mac is fastest for local AI and most buyers reach for the GPU core count. That's the wrong number. For the part of inference you actually feel — the model streaming its reply token by token — the decisive spec is memory bandwidth, measured in gigabytes per second. Here is the full GB/s table for every Apple Silicon chip from M1 to M5, and a back-of-envelope formula to turn that number into tokens per second.

Apple Silicon uses Unified Memory: the CPU and GPU share one pool of RAM connected by a wide memory bus. The width of that bus — how many gigabytes per second it can move — is what governs LLM decode speed. To generate each new token, the chip must read a large fraction of the model's active weights out of that pool. No matter how many GPU cores you have, they sit idle waiting for those bytes to arrive. Bandwidth is the bottleneck; cores are not.

The Full Bandwidth Table (M1–M5)

Here is the Unified Memory bandwidth for every Apple Silicon tier across five generations. Higher is faster for token generation, full stop.

Generation	Base	Pro	Max	Ultra
M1	68 GB/s	200 GB/s	400 GB/s	800 GB/s
M2	100 GB/s	200 GB/s	400 GB/s	800 GB/s
M3	100 GB/s	150 GB/s	400 GB/s	800 GB/s
M4	120 GB/s	273 GB/s	546 GB/s	1092 GB/s
M5	~150 GB/s	~273 GB/s	~600 GB/s	~1100 GB/s*

*M5 Ultra not yet released; figure is an expectation based on the two-Max-die UltraFusion pattern. See M5 Ultra Expectations below.

A few things jump out. The Max tier roughly quadruples the base tier within a generation, and the Ultra doubles the Max again — because an Ultra is literally two Max dies fused together over UltraFusion, which doubles the memory interface. Note also the M3 Pro's odd dip to 150 GB/s (Apple narrowed its bus that generation), recovered emphatically by the M4 Pro's 273 GB/s.

According to LLMCheck, the bandwidth column above predicts local-LLM generation speed better than GPU core count, clock speed, or even the chip generation. When in doubt, follow the GB/s.

How to Estimate tok/s From Bandwidth

Because token generation reads roughly the model's active weights once per token, you can estimate an upper bound on speed with a single division:

tok/s (ceiling) ≈ bandwidth (GB/s) ÷ model-size-in-memory (GB)

The intuition: if your model occupies 4 GB of memory and your chip moves 546 GB/s, then in one second the bus can do about 546 ÷ 4 ≈ 137 full passes over the weights — and each pass produces roughly one token. So 137 tok/s is the theoretical ceiling.

Real machines never hit the ceiling. KV-cache reads grow with context length, attention has overhead, and the inference runtime is never perfectly efficient. In practice you land at roughly 50–80% of the ideal, so that 4 GB Q4 model on a 546 GB/s M4 Max realistically delivers around 50–80 tok/s. Worked examples:

Model size in memory	Chip / bandwidth	Ideal ceiling	Realistic
4 GB (8B Q4)	M4 Max / 546 GB/s	~137 tok/s	~50–80 tok/s
4 GB (8B Q4)	M4 base / 120 GB/s	~30 tok/s	~15–24 tok/s
20 GB (32B Q4)	M4 Max / 546 GB/s	~27 tok/s	~14–22 tok/s
20 GB (32B Q4)	M5 Ultra* / ~1100 GB/s	~55 tok/s	~28–44 tok/s

Two practical lessons fall out of the formula. First, smaller (or more aggressively quantized) models run faster on the same chip, because the divisor shrinks. Second, the same model gets faster on a higher-bandwidth chip in near-linear proportion — double the GB/s, roughly double the tok/s.

One important caveat: this formula governs generation (decode). The other phase, prompt processing (prefill) — reading your entire prompt before the first token — is more compute-bound than bandwidth-bound, which is where GPU cores and the M5's per-core Neural Accelerators actually earn their keep. So bandwidth rules decode; compute rules prefill.

Why Max & Ultra Crush Base/Pro for Big Models

The bigger the model, the more the bandwidth gap between tiers matters — and it compounds with a capacity advantage too. There are two separate reasons to favor Max and Ultra for serious local LLMs:

Bandwidth scales the speed. A 20 GB model that crawls at single-digit tok/s on a 120 GB/s base chip becomes genuinely usable on a 546 GB/s Max, and comfortable on a 1,092 GB/s Ultra. The divisor (model size) is large, so the dividend (bandwidth) decides everything.
RAM ceilings unlock the model at all. Max and Ultra configurations offer far higher Unified Memory caps (up to 128 GB on laptops, more on desktop Ultra). A model that doesn't fit in RAM either won't load or spills to disk and grinds to a halt — so the high-RAM tiers are often the only ones that can run a large model in the first place.

For small 7–8B models, a base or Pro chip is perfectly pleasant and the premium tiers are overkill. The case for Max and Ultra is built entirely on large models and long contexts, where both their bandwidth and their RAM headroom pull decisively ahead. If you're shopping by workload rather than spec sheet, our Mac hardware guide breaks down which tier fits which model size.

How to Measure Bandwidth on Your Mac

You don't have to trust the spec sheet — you can observe what your machine actually delivers:

Run a real LLM decode test (best). Load a known model in Ollama or LM Studio and read the reported tok/s, then compare it against the bandwidth-derived ceiling for that model's memory footprint. This is the most honest measure of AI performance, because it captures the runtime overhead the synthetic numbers ignore.
Synthetic memory benchmark. Tools like the classic Stream benchmark, or Metal-based bandwidth probes, report sustained GB/s directly. Useful for sanity-checking that your hardware hits its rated bus speed.
Cross-check against published data. Compare your result against standardized measurements — our benchmarks page lists tok/s for many model-and-chip pairs so you can see whether your machine is performing as expected.

If your measured tok/s lands well below the realistic range for your bandwidth tier, the usual culprits are an unquantized model, an oversized context window inflating KV-cache traffic, or another app contending for memory bandwidth in the background.

M5 Ultra Expectations

The M5 Ultra hasn't shipped yet, but its bandwidth is reasonably predictable. Apple builds every Ultra by fusing two Max dies with the UltraFusion interconnect, which doubles the Max's memory interface — that's exactly how the M1/M2/M3 Ultras hit 800 GB/s (2× their 400 GB/s Max) and the M4 Ultra reached 1,092 GB/s (2× the 546 GB/s M4 Max).

Likely M5 Ultra bandwidth

With the M5 Max sitting near 600 GB/s, doubling it puts the M5 Ultra around ~1,100 GB/s. That would make it the fastest Apple Silicon yet for local-LLM decode — and, paired with very high RAM ceilings, the natural home for the largest models you can run on a Mac.

What to do until then

If you need top-tier bandwidth today, the M4 Ultra's 1,092 GB/s is already excellent and widely available. The M5 Max at ~600 GB/s is the strongest laptop-class option. Match RAM to your largest model first, then chase GB/s.

Treat the ~1,100 GB/s figure as an informed expectation, not a confirmed spec — until Apple ships the part and independent benchmarks land, the M4 Ultra remains the measured bandwidth king.

LLMCheck Research Team

We benchmark local AI models on real Apple Silicon hardware. Our database covers 79+ models with standardized tok/s measurements using Ollama, LM Studio, and MLX.

Frequently Asked Questions

What's the memory bandwidth of the M5 Max?

The M5 Max delivers roughly 600 GB/s of Unified Memory bandwidth, up from about 546 GB/s on the M4 Max. According to LLMCheck, that is the single most important spec for local-LLM token generation, because each generated token requires streaming the model's active weights out of memory. The roughly 10% jump translates fairly directly into faster decode speeds on the same model.

Why does bandwidth matter more than GPU cores for LLMs?

Token generation is memory-bandwidth bound, not compute bound. To produce each new token, the chip must read a large slice of model weights out of Unified Memory, and that read speed — not how many GPU cores you have — sets the ceiling on tokens per second. Extra cores sit idle waiting for data. Core count and the M5's Neural Accelerators help with prompt processing (prefill), which is compute-heavy, but steady-state decode tracks GB/s.

How do I estimate tok/s from bandwidth?

Use a rough ceiling: tok/s ≈ bandwidth (GB/s) ÷ model-size-in-memory (GB). A 4 GB Q4 model on a 546 GB/s M4 Max gives about 137 tok/s as a theoretical ceiling. Real-world throughput lands lower — typically 50–80% of that ideal — due to KV-cache traffic, attention overhead, and software inefficiency, so expect roughly 50–80 tok/s in practice for that example.

M3 vs M5 bandwidth — how big is the gap?

It depends on tier. The base M3 sits at about 100 GB/s versus about 150 GB/s on the base M5 — roughly 50% more. The M3 Max and M5 Max are closer in ratio: about 400 GB/s versus about 600 GB/s, a 50% gain at the top laptop tier. The M3 Pro's 150 GB/s trails the M5 Pro's ~273 GB/s by a wide margin. Across the board the M5 generation moves meaningfully more memory per second than the M3.

What will the M5 Ultra's bandwidth be?

The M5 Ultra has not been released yet. Based on Apple's pattern of fusing two Max dies with an UltraFusion interconnect — which doubles the Max bandwidth — and the M5 Max sitting near 600 GB/s, the M5 Ultra is expected to land around 1,100 GB/s. That would make it the fastest Apple Silicon for local LLMs, ideal for very large models that need both high capacity and high bandwidth.

How do I measure my Mac's bandwidth?

The practical way is to run a known model in Ollama or LM Studio and read the reported tok/s, then compare against the bandwidth-derived ceiling for that model size. For a synthetic figure, the Stream benchmark or Apple's Metal-based memory bandwidth tools report sustained GB/s. In practice, a real LLM decode test is the most honest measure of what your machine delivers for AI workloads.

Sources & References

See How Your Mac's Bandwidth Translates to tok/s

Want real numbers instead of a formula? Our free hardware checker lets you pick your Mac's chip and RAM to get instant tok/s estimates and model recommendations — bandwidth math done for you, no guesswork required.

Check My Mac at LLMCheck.net

Apple Silicon Memory Bandwidth & LLM Speed (2026): M1–M5, Pro, Max, Ultra

The Full Bandwidth Table (M1–M5)

How to Estimate tok/s From Bandwidth

Why Max & Ultra Crush Base/Pro for Big Models

How to Measure Bandwidth on Your Mac

M5 Ultra Expectations

Likely M5 Ultra bandwidth

What to do until then

Frequently Asked Questions

What's the memory bandwidth of the M5 Max?

Why does bandwidth matter more than GPU cores for LLMs?

How do I estimate tok/s from bandwidth?

M3 vs M5 bandwidth — how big is the gap?

What will the M5 Ultra's bandwidth be?

How do I measure my Mac's bandwidth?

Sources & References

Related Articles

See How Your Mac's Bandwidth Translates to tok/s