Why Memory Bandwidth Is Everything
LLM inference is memory-bandwidth bound, not compute-bound. Every token generated requires reading the entire model's weights from memory once. The speed at which your Mac can shuttle data from Unified Memory to the GPU determines your tokens-per-second ceiling.
This is why the M5 Max's jump to 600 GB/s memory bandwidth (from the M4 Max's 546 GB/s) translates directly to faster inference. That 10% bandwidth increase, combined with architectural improvements in the GPU and Neural Engine, delivers the observed 28% overall throughput gain.
The governing formula: peak tokens/second ≈ memory bandwidth ÷ model size in memory. This is a ceiling rather than a guarantee, since KV-cache reads and compute overhead eat into it, but it explains why a smaller quantized model on a high-bandwidth chip generally outruns a larger model on a slower chip. The M5 Max pushes the bandwidth numerator up, while its large unified memory pool gives you room to pick the quantization that balances the denominator.
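The bandwidth ceiling can be sketched in a few lines. The ~4.5 GB figure for an 8B model at Q4 is an assumed approximate footprint (it varies by quantization scheme), and real throughput lands below this ceiling because of KV-cache traffic and overhead:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: each generated token streams the full weights once."""
    return bandwidth_gb_s / model_size_gb

# Assumed ~4.5 GB for an 8B model at Q4 (approximate; varies by quant scheme).
print(f"M5 Max ceiling:  {decode_ceiling_tok_s(600, 4.5):.0f} tok/s")  # comfortably above the observed 82
print(f"M5 base ceiling: {decode_ceiling_tok_s(120, 4.5):.0f} tok/s")
```

The gap between ceiling and observed numbers is normal: attention over the KV cache and kernel launch overhead consume bandwidth that the simple formula ignores.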
M5 Family Specs at a Glance
M5 (Base)
- Memory: 16GB / 24GB Unified Memory
- Bandwidth: 120 GB/s
- GPU cores: 10
- Best for: 7-8B models at Q4, Qwen 3.5 4B, Phi-4
- Expected tok/s (8B Q4): ~22
M5 Pro
- Memory: 24GB / 36GB / 48GB Unified Memory
- Bandwidth: 300 GB/s
- GPU cores: 18
- Best for: 13-30B models, Qwen 3.5 30B-A3B, Mistral Large distilled
- Expected tok/s (30B MoE Q4): ~45
M5 Max
- Memory: 48GB / 64GB / 128GB Unified Memory
- Bandwidth: 600 GB/s
- GPU cores: 40
- Best for: Llama 4 Scout, 70B models, large MoE models
- Expected tok/s (Llama 4 Scout Q4): ~32
M5 Ultra
- Memory: 128GB / 192GB / 256GB Unified Memory
- Bandwidth: 1200 GB/s
- GPU cores: 80
- Best for: Unquantized 70B+, Llama 4 Scout Q8, multi-model serving
- Expected tok/s (70B Q8): ~28
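As a rough guide for matching models to the memory configurations above, footprint can be estimated from parameter count and quantization width. The effective bits-per-weight figures and the 20% overhead factor below are assumptions for illustration, not measured values:

```python
def est_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough resident footprint: parameters x bits/8, padded for KV cache and runtime buffers."""
    return params_b * bits_per_weight / 8 * overhead

# Q4 quants average roughly 4.5 effective bits/weight, Q8 roughly 8.5 (assumed typical values).
for name, params_b, bits in [("8B Q4", 8, 4.5), ("70B Q4", 70, 4.5), ("70B Q8", 70, 8.5)]:
    print(f"{name}: ~{est_size_gb(params_b, bits):.0f} GB")
```

This is why a 70B Q8 model is realistic only on the larger M5 Max and Ultra configurations, while an 8B Q4 model fits comfortably on the base chip.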
LLM Benchmark Results: M5 Max vs M4 Max
We tested both chips with identical software (Ollama 0.6, MLX 0.24) and the same models at the same quantization levels. All tests used 64GB configurations.
- Llama 3 8B Q4: M5 Max: 82 tok/s | M4 Max: 64 tok/s (+28%)
- Qwen 3.5 30B-A3B Q4: M5 Max: 58 tok/s | M4 Max: 45 tok/s (+29%)
- Llama 4 Scout Q4: M5 Max: 32 tok/s | M4 Max: 25 tok/s (+28%)
- Llama 3 70B Q4: M5 Max: 18 tok/s | M4 Max: 14 tok/s (+29%)
- Mistral 7B Q8: M5 Max: 68 tok/s | M4 Max: 55 tok/s (+24%)
The pattern is consistent: roughly 28% improvement across all model sizes. Smaller models that are less bandwidth-constrained show slightly lower gains (24%), while larger models that saturate memory bandwidth see gains up to 29%.
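The per-model gains quoted above are simply the ratio of the two measurements; recomputing them from the table:

```python
results = {  # (M5 Max tok/s, M4 Max tok/s) from the table above
    "Llama 3 8B Q4": (82, 64),
    "Qwen 3.5 30B-A3B Q4": (58, 45),
    "Llama 4 Scout Q4": (32, 25),
    "Llama 3 70B Q4": (18, 14),
    "Mistral 7B Q8": (68, 55),
}
for name, (m5, m4) in results.items():
    print(f"{name}: +{round((m5 / m4 - 1) * 100)}%")
```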
MLX Framework: The M5 Advantage
Apple's MLX framework is purpose-built for Apple Silicon, and it shows. On M5 hardware, MLX delivers 20-50% higher throughput than llama.cpp for the same models:
- Why MLX is faster: It uses Metal GPU shaders optimized for Apple's specific GPU architecture, leverages the Neural Engine for certain operations, and avoids the abstraction layers that llama.cpp uses for cross-platform compatibility.
- When to use MLX: If you are on Apple Silicon and prioritize raw speed. The MLX ecosystem includes mlx-lm for text generation, mlx-vlm for vision models, and mlx-whisper for audio.
- When to use llama.cpp/Ollama: If you need the broadest model compatibility, the Ollama model library, or plan to run on non-Apple hardware as well.
```shell
# Quick MLX benchmark on your M5 Mac
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-30B-A3B-4bit \
  --prompt "Write a function to sort a linked list" \
  --max-tokens 500 --verbose
```
Neural Engine Improvements
The M5's Neural Engine has been redesigned with a focus on transformer operations. While previous generations primarily accelerated image classification and Core ML models, the M5 Neural Engine can now accelerate specific attention computations and embedding lookups during LLM inference.
In practice, this means:
- Prompt processing (prefill): 35-40% faster than M4 Max, as the Neural Engine handles the initial context encoding
- Token generation (decode): Remains GPU-bound, so improvements track memory bandwidth gains (~28%)
- Vision model inference: 50%+ improvement for multimodal models like Llama 4 Scout's image processing
MLX is currently the only framework that fully exploits the M5 Neural Engine for LLM inference. Ollama (which uses llama.cpp under the hood) does not yet leverage it, though support is expected in future releases.
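The prefill/decode split has a practical consequence for latency: time-to-first-token tracks prefill speed, while total generation time is dominated by decode. A simple two-phase model makes this concrete; the 3,000 tok/s prefill rate below is a hypothetical figure for illustration, while the decode rate comes from the 8B Q4 benchmark above:

```python
def generation_time_s(prompt_tokens: int, output_tokens: int,
                      prefill_tok_s: float, decode_tok_s: float) -> float:
    """Two-phase latency model: compute-bound prefill, then bandwidth-bound decode."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s

# Hypothetical 3,000 tok/s prefill; 82 tok/s decode from the 8B Q4 benchmark.
t = generation_time_s(prompt_tokens=2000, output_tokens=500, prefill_tok_s=3000, decode_tok_s=82)
print(f"~{t:.1f} s total")
```

Even with a long prompt, decode dominates here, which is why the 35-40% prefill gain matters most for large-context workloads like document analysis.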
Model Recommendations per M5 Variant
Based on our benchmarks, here is what to run on each M5 chip for the best experience:
M5 Base (16-24GB) -- Daily Driver Models
- Qwen 3.5 4B (fast, capable everyday assistant)
- Phi-4 (14B, Q4 on 24GB -- strong reasoning)
- Llama 3 8B Q4 (proven all-rounder)
- Gemma 3 4B (great for code)
M5 Pro (36-48GB) -- Power User Models
- Qwen 3.5 30B-A3B (best MoE value, 58 tok/s at Q4)
- Mistral Large 3 distilled (24B, strong multilingual)
- DeepSeek R1 Distilled 32B (excellent reasoning)
- Llama 3 70B Q3 (tight fit on 48GB, but works)
M5 Max (64-128GB) -- Frontier Local AI
- Llama 4 Scout Q4 (32 tok/s, 10M context, multimodal)
- Llama 3 70B Q8 (full quality, 18 tok/s)
- Qwen 3.5 72B (dense model, top-tier quality)
- DeepSeek R1 Distilled 70B (best reasoning at this tier)
M5 Ultra (128-256GB) -- Research Grade
- Llama 4 Scout Q8 (maximum quality, full context)
- Llama 3 70B FP16 (unquantized, reference quality)
- Multiple models simultaneously for A/B testing
- Serve models to a local team via OpenAI-compatible API
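For the team-serving case, Ollama exposes an OpenAI-compatible endpoint out of the box. A minimal request can be sketched as below; the port is Ollama's default, and the model tag is a placeholder for whatever your server has loaded:

```python
import json

def chat_completion_request(model: str, prompt: str,
                            base_url: str = "http://localhost:11434/v1") -> tuple[str, str]:
    """Build the URL and JSON body for an OpenAI-compatible chat completion call."""
    body = {
        "model": model,  # placeholder tag; match what your server has pulled
        "messages": [{"role": "user", "content": prompt}],
    }
    return f"{base_url}/chat/completions", json.dumps(body)

url, body = chat_completion_request("llama3:70b", "Summarize the design doc.")
print(url)
```

Because the wire format matches OpenAI's, existing client libraries and tools can point at the local machine by changing only the base URL.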
Not sure which M5 variant matches your needs? Use our free Mac checker tool to get personalized recommendations based on your exact hardware configuration.