Why Memory Bandwidth Is Everything

LLM inference is memory-bandwidth bound, not compute-bound. Every generated token requires streaming the model's weights (or, for mixture-of-experts models, the active subset of them) from memory once. The speed at which your Mac can shuttle data from Unified Memory to the GPU therefore sets your tokens-per-second ceiling.

This is why the M5 Max's jump to 600 GB/s memory bandwidth (from the M4 Max's 546 GB/s) translates directly to faster inference. That 10% bandwidth increase, combined with architectural improvements in the GPU and Neural Engine, delivers the observed 28% overall throughput gain.

The formula: Tokens/second ≈ Memory Bandwidth / Model Size in Memory. This is a ceiling, not a guarantee; compute and overhead keep real-world throughput below it. Still, the implication holds: a smaller quantized model on a high-bandwidth chip will generally outrun a larger model on a slower chip. M5 Max improves both sides of this equation.
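A minimal sketch of what that ceiling looks like in practice, using the 600 and 546 GB/s figures quoted above. The helper function and the 35 GB model size are illustrative (a dense 70B model at Q4 takes roughly 70B params × 0.5 bytes ≈ 35 GB of weights); real throughput lands below these numbers due to compute, KV-cache traffic, and overhead:

```python
def tokens_per_second_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: each generated token streams the
    full set of (active) weights through memory once."""
    return bandwidth_gb_s / model_size_gb

# Dense 70B at Q4 ~= 35 GB of weights
print(tokens_per_second_ceiling(600, 35))  # M5 Max: ~17 tok/s ceiling
print(tokens_per_second_ceiling(546, 35))  # M4 Max: ~15.6 tok/s ceiling
```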

M5 Family Specs at a Glance

M5 (Base)

  • Memory: 16GB / 24GB Unified Memory
  • Bandwidth: 120 GB/s
  • GPU cores: 10
  • Best for: 7-8B models at Q4, Qwen 3.5 4B, Phi-4
  • Expected tok/s (8B Q4): ~22

M5 Pro

  • Memory: 24GB / 36GB / 48GB Unified Memory
  • Bandwidth: 300 GB/s
  • GPU cores: 18
  • Best for: 13-30B models, Qwen 3.5 30B-A3B, Mistral Large distilled
  • Expected tok/s (30B MoE Q4): ~45

M5 Max

  • Memory: 48GB / 64GB / 128GB Unified Memory
  • Bandwidth: 600 GB/s
  • GPU cores: 40
  • Best for: Llama 4 Scout, 70B models, large MoE models
  • Expected tok/s (Llama 4 Scout Q4): ~32

M5 Ultra

  • Memory: 128GB / 192GB / 256GB Unified Memory
  • Bandwidth: 1200 GB/s
  • GPU cores: 80
  • Best for: Unquantized 70B+, Llama 4 Scout Q8, multi-model serving
  • Expected tok/s (70B Q8): ~28
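The memory tiers above follow directly from how model weights scale with quantization. A quick back-of-envelope helper (the bytes-per-parameter factors are standard rules of thumb, not Apple figures; budget roughly 20-30% extra on top for the KV cache and runtime overhead):

```python
# Approximate storage cost per parameter at common quantization levels
BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0, "FP16": 2.0}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate resident size of the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[quant]

print(weights_gb(8, "Q4"))   # 4.0 GB  -> comfortable on a base M5
print(weights_gb(70, "Q4"))  # 35.0 GB -> M5 Max territory
print(weights_gb(70, "Q8"))  # 70.0 GB -> needs a 128GB+ Max or an Ultra
```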

LLM Benchmark Results: M5 Max vs M4 Max

We tested both chips with identical software (Ollama 0.6, MLX 0.24) and the same models at the same quantization levels. All tests used 64GB configurations.

The pattern is consistent: roughly a 28% improvement on average. Smaller models, which are less bandwidth-constrained, show somewhat lower gains (around 24%), while larger models that saturate memory bandwidth see gains of up to 29%.
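To make the decomposition explicit: the raw bandwidth jump accounts for only part of the observed gain, with the rest attributable to architectural improvements. The split below is arithmetic inferred from the article's own numbers, not a separate measurement:

```python
m4_bw, m5_bw = 546, 600                      # GB/s, from the spec comparison
bandwidth_gain = m5_bw / m4_bw - 1           # ~9.9% from bandwidth alone
observed_gain = 0.28                         # measured overall throughput gain

# Gains compound multiplicatively, so back out the non-bandwidth share
architectural_gain = (1 + observed_gain) / (1 + bandwidth_gain) - 1

print(f"bandwidth: +{bandwidth_gain:.1%}, architecture/other: +{architectural_gain:.1%}")
```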

MLX Framework: The M5 Advantage

Apple's MLX framework is purpose-built for Apple Silicon, and it shows. On M5 hardware, MLX delivers 20-50% higher throughput than llama.cpp for the same models:

# Quick MLX benchmark on your M5 Mac
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-30B-A3B-4bit \
  --prompt "Write a function to sort a linked list" \
  --max-tokens 500 --verbose

Neural Engine Improvements

The M5's Neural Engine has been redesigned with a focus on transformer operations. While previous generations primarily accelerated image classification and Core ML models, the M5 Neural Engine can now accelerate specific attention computations and embedding lookups during LLM inference.

In practice, framework support determines whether you see this benefit. MLX is currently the only framework that fully exploits the M5 Neural Engine for LLM inference. Ollama (which uses llama.cpp under the hood) does not yet leverage it, though support is expected in future releases.

Model Recommendations per M5 Variant

Based on our benchmarks, here is what to run on each M5 chip for the best experience:

M5 Base (16-24GB) -- Daily Driver Models

M5 Pro (36-48GB) -- Power User Models

M5 Max (64-128GB) -- Frontier Local AI

M5 Ultra (128-256GB) -- Research Grade
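The tiers above can be approximated by inverting the sizing logic: given a memory configuration, estimate the largest Q4 model it handles comfortably. The 75% usable-memory fraction below is an assumption to leave headroom for macOS and the KV cache, not an Apple figure, and fitting a model is not the same as running it fast (bandwidth still sets the speed):

```python
def max_q4_params_billion(unified_memory_gb: float, usable_fraction: float = 0.75) -> float:
    """Largest Q4 model (billions of params) that fits with headroom.
    Q4 weights cost roughly 0.5 bytes per parameter."""
    return unified_memory_gb * usable_fraction / 0.5

for ram in (16, 24, 36, 48, 64, 128, 256):
    print(f"{ram:>3}GB -> up to ~{max_q4_params_billion(ram):.0f}B params at Q4")
```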

Not sure which M5 variant matches your needs? Use our free Mac checker tool to get personalized recommendations based on your exact hardware configuration.