The Three Pillars: GPU, Unified Memory & Neural Engine

Every Apple Silicon chip — from the M1 in a MacBook Air to the M5 Ultra in a Mac Studio — contains three distinct hardware components that work together for AI workloads. Understanding what each one does is the key to understanding why your Mac performs the way it does with local models.

The Metal GPU: Your Primary AI Engine

The GPU is where the heavy lifting happens during LLM inference. Apple Silicon GPUs range from 8 cores (M3 base) to 40 cores (M5 Max) to 80 cores (M5 Ultra). Each core excels at the matrix multiplications that form the backbone of transformer neural networks. When you run a model through Ollama or LM Studio, the GPU is performing billions of multiply-accumulate operations per second to generate each token.
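As a back-of-envelope check on that claim: each generated token requires roughly one multiply-accumulate per weight, or about 2 FLOPs per parameter. A minimal sketch (`flops_per_token` is an illustrative helper, not a profiler):

```python
def flops_per_token(n_params: float) -> float:
    """Rough floating-point operations needed to generate one token:
    each weight participates in roughly one multiply-accumulate (~2 FLOPs)."""
    return 2.0 * n_params

# An 8B-parameter model at 30 tokens/second:
per_token = flops_per_token(8e9)   # 1.6e10 FLOPs per token
sustained = per_token * 30         # 4.8e11 FLOPs/s of useful work
```

This counts only the matrix-multiply work in the weights; attention over the KV cache adds more on top, which is why it is a lower bound rather than an exact figure.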

Unified Memory: The Secret Weapon

This is arguably Apple's biggest architectural advantage for AI. Unlike traditional PCs where the CPU has system RAM (say, 32 GB) and the GPU has separate VRAM (say, 24 GB), Apple Silicon uses a single pool of memory shared by everything. A 64 GB Mac makes all 64 GB available to the GPU for loading model weights. On a PC with 64 GB of system RAM and an RTX 4090 (24 GB VRAM), the GPU can only access 24 GB directly.
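The capacity difference can be expressed as a simple fit check. In this sketch, the 4 GB `overhead_gb` allowance for KV cache and runtime buffers is an assumed figure, not a measurement:

```python
def model_fits(model_gb: float, pool_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if the quantized weights plus a working-memory allowance fit in
    the memory pool the GPU can actually address."""
    return model_gb + overhead_gb <= pool_gb

# ~40 GB of Q4 weights (roughly a 70B-parameter model):
fits_on_mac = model_fits(40, 64)   # 64 GB unified pool: True
fits_on_4090 = model_fits(40, 24)  # 24 GB of VRAM: False
```

On the PC, the remaining layers would spill to system RAM and cross the PCIe bus, which typically collapses throughput; on the Mac, the whole model stays in one pool.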

The Neural Engine: The Specialist

The Neural Engine is a dedicated accelerator designed for specific neural network operations. It delivers up to 38 TOPS (trillion operations per second) on the M5 Max. However, for LLM token generation, the Neural Engine plays a supporting role. It handles preprocessing tasks, image model inference, and on-device ML features like Siri and photo recognition. The GPU remains the primary workhorse for large language model inference.

Common misconception: Many people assume the Neural Engine runs LLMs. In practice, the Metal GPU handles the vast majority of LLM inference workload. The Neural Engine's fixed-function architecture is optimized for specific operations that do not map well to the autoregressive token generation that LLMs require.

How LLM Inference Works on Apple Silicon

When you type a prompt and hit enter, here is what happens inside your Mac, step by step:

  1. Model loading: The quantized model weights (e.g., 20 GB for a 35B model at Q4) are loaded from SSD into Unified Memory. This happens once at startup and takes 5-15 seconds depending on model size.
  2. Prompt encoding: Your text prompt is tokenized and converted into embedding vectors. The GPU processes these through every layer of the transformer in a single forward pass (prompt processing or "prefill" phase). This is compute-bound and benefits from more GPU cores.
  3. Token generation: The model generates one token at a time. For each token, the GPU reads the entire model weights from Unified Memory, performs matrix multiplications, and produces probability distributions for the next token. This is memory-bandwidth-bound.
  4. KV cache management: As the conversation grows, key-value attention caches accumulate in memory. A 128K context window can consume 8-16 GB of additional RAM beyond the model weights.
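The KV-cache footprint in step 4 can be estimated from the model's attention geometry. A rough sketch, where the geometry below (32 layers, 8 grouped-query KV heads of dimension 128, roughly an 8B-class model) and the fp16 cache width are assumed examples:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    """Size of the key/value attention cache: one K and one V tensor
    per layer, per KV head, per token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val

# Full 128K-token context with an fp16 cache:
gib = kv_cache_bytes(32, 8, 128, 128 * 1024) / 2**30  # 16.0 GiB
```

That lands at the top of the 8-16 GB range cited above; an 8-bit quantized KV cache would halve it, and larger models with more layers grow proportionally.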

The critical bottleneck is step 3. According to LLMCheck testing, token generation speed correlates almost linearly with memory bandwidth, not GPU core count or clock speed.

Why Bandwidth Matters More Than Compute

This is the single most important concept for understanding AI performance on Macs. During token generation, the GPU must read the entire model weights from memory for every single token. For a 20 GB model generating 30 tokens per second, that is 600 GB of data read per second from memory.

According to LLMCheck benchmarks, the relationship between memory bandwidth and tok/s is nearly linear across all Apple Silicon generations. A chip with 2x the bandwidth delivers approximately 2x the token generation speed for the same model.

The bandwidth formula: Maximum theoretical tok/s ≈ Memory Bandwidth (GB/s) / Model Size (GB). For a 20 GB model on an M5 Max (~600 GB/s bandwidth): 600 / 20 = 30 tok/s. Real-world numbers are typically 70-85% of this theoretical maximum.
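The formula above reduces to a one-line estimator; `theoretical_tok_s` is a hypothetical helper name for illustration:

```python
def theoretical_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/second: every token requires one full read
    of the weights, so the memory read rate caps generation speed."""
    return bandwidth_gb_s / model_gb

peak = theoretical_tok_s(600, 20)       # 30.0 tok/s ceiling (M5 Max, 20 GB model)
realistic = (0.70 * peak, 0.85 * peak)  # the 70-85% real-world range: 21-25.5 tok/s
```

Note what is absent from the formula: GPU core count and clock speed. That is the bandwidth-bound argument in miniature.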

Metal Framework Explained

Metal is Apple's low-level GPU programming framework — think of it as Apple's equivalent to NVIDIA's CUDA. When inference engines like Ollama, LM Studio, or MLX run on your Mac, they submit compute workloads to the GPU through Metal.

Apple's own MLX framework is specifically optimized for Metal and Apple Silicon. According to LLMCheck benchmarks, MLX typically delivers 20-40% better performance than generic llama.cpp Metal backends for the same model and hardware. This is because MLX leverages Apple-specific features like lazy evaluation and unified memory semantics that generic frameworks cannot exploit.

The Metal Performance Shaders (MPS) library provides pre-optimized matrix multiplication kernels that inference engines can call directly. These kernels are tuned for the specific GPU microarchitecture of each Apple Silicon generation, ensuring that M5 chips benefit from architecture-specific optimizations that would not be available on M1.

M5 Neural Accelerators: What Changed in 2025

The M5 generation introduced a significant architectural change: dedicated Neural Accelerators embedded in each GPU core. Previous generations had the Neural Engine as a completely separate block on the chip. The M5 integrates neural processing directly into the GPU pipeline.

According to LLMCheck benchmarks, this delivers measurable improvements in two areas: prompt processing and sustained token generation. Compute-bound prompt processing (the prefill phase) sees the larger gain, since the Neural Accelerators add matrix-multiply throughput directly inside the GPU pipeline.

For sustained token generation, the improvement is more modest (25-35% over M4 Max) because generation remains bandwidth-bound rather than compute-bound. The M5 Max's ~600 GB/s memory bandwidth versus the M4 Max's ~546 GB/s accounts for part of that gain, with the Neural Accelerators contributing the remainder.

Comparison with the NVIDIA Approach

Apple and NVIDIA take fundamentally different approaches to running AI, each with distinct advantages. According to LLMCheck, the right choice depends entirely on model size.

| Component | M1 (2020) | M3 (2023) | M5 Max (2025) | NVIDIA RTX 4090 | Role in AI |
|---|---|---|---|---|---|
| GPU cores | 8 | 10 | 40 (+ Neural Accel.) | 16,384 CUDA | Matrix multiplications |
| Memory (max) | 16 GB | 24 GB | 128 GB | 24 GB VRAM | Model weight storage |
| Bandwidth | 68 GB/s | 200 GB/s | ~600 GB/s | 1,008 GB/s | Feeds weights to GPU |
| Neural Engine | 16-core, 11 TOPS | 16-core, 18 TOPS | 16-core, 38 TOPS | N/A (Tensor Cores) | Specialized ML tasks |
| Max model (Q4) | ~7B | ~13B | ~70B (Q8) | ~13B (VRAM only) | Largest model that fits |
| 8B model tok/s | ~25 | ~55 | ~105 | ~150 | Real-world speed |
| Power draw (AI load) | ~15 W | ~22 W | ~92 W | ~350 W | Electricity cost |

The takeaway is clear: NVIDIA wins on raw speed for small models that fit in 24 GB of VRAM. Apple Silicon wins on model capacity thanks to Unified Memory, power efficiency, and total cost of ownership. For developers running 30B+ parameter models — which deliver the best quality for coding and reasoning — Apple Silicon is currently the most practical consumer platform.
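Dividing the table's tok/s rows by power draw makes the efficiency claim concrete; this is arithmetic on the figures above, not an independent measurement:

```python
def tokens_per_joule(tok_s: float, watts: float) -> float:
    """Energy efficiency of inference: tokens generated per joule drawn."""
    return tok_s / watts

m5_max = tokens_per_joule(105, 92)    # ~1.14 tok/J
rtx_4090 = tokens_per_joule(150, 350) # ~0.43 tok/J
```

By this measure the M5 Max generates roughly 2.7x more tokens per unit of energy, even though the RTX 4090 is faster in absolute terms.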