How Dense Models Work
In a dense model, every parameter participates in every computation. When you send a prompt to Llama 3 70B, all 70 billion parameters activate for every single token it generates. The model reads 70B weights from memory, performs matrix multiplications across all of them, and produces one output token.
This has two implications for your Mac:
- Memory: All 70B parameters must be loaded into Unified Memory. At Q4 quantization, that is approximately 40GB.
- Speed: Your generation speed in tokens per second is limited by how fast your Mac can read those 40GB from memory for each token. On an M4 Max (546 GB/s of memory bandwidth), that works out to roughly 14 tok/s.
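The bandwidth limit above is easy to check with back-of-envelope math. A minimal sketch (the function name is illustrative, and this gives a theoretical ceiling -- real throughput lands somewhat below it due to compute and overhead):

```python
def roofline_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    # Generating one token forces one full read of the weights, so the
    # memory-bandwidth ceiling on throughput is bandwidth / model size.
    return bandwidth_gb_s / model_gb

# Llama 3 70B at Q4 (~40GB) on an M4 Max (546 GB/s of bandwidth)
print(f"{roofline_tok_s(40, 546):.1f} tok/s")  # roughly 14 tok/s
```

Swap in your own chip's bandwidth and quantized model size to estimate the ceiling for any dense model.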
Dense models are simple, predictable, and battle-tested. But they hit a fundamental scaling wall: making them smarter requires making them bigger, which requires more memory and slows them down.
How MoE Models Work
MoE models take a radically different approach. Instead of one monolithic network, they contain multiple specialized sub-networks called experts. A small router network decides which experts handle each token.
The key insight: A 30B MoE model with 3B active parameters has the knowledge capacity of a 30B model but close to the inference speed of a 3B model. You get far more capability for each unit of compute spent per token.
Take Qwen 3.5 30B-A3B as an example:
- Total parameters: 30B (stored across multiple expert networks)
- Active parameters per token: 3B (only 2 experts activate, plus shared layers)
- Experts: 64 total, top-2 routing
When you send this model a prompt, only 3B parameters activate for each token. The router selects the 2 most relevant experts (out of 64), those experts process the token, and the result is combined with the output from shared attention layers. The other 62 experts sit idle for that token.
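The per-token flow just described can be sketched as a toy MoE layer. This is an illustration, not any model's actual implementation: the router is a single linear layer, each "expert" is a stand-in matrix rather than a real FFN, and all weights are random. The 64-expert, top-2 shape mirrors the Qwen example above.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 8, 64, 2   # tiny hidden size; 64 experts, top-2 routing
router_w = rng.normal(size=(D, N_EXPERTS))          # router: one linear layer
experts = rng.normal(size=(N_EXPERTS, D, D)) * 0.1  # each expert: a DxD stand-in FFN

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over all 64 experts
    top = np.argsort(probs)[-TOP_K:]           # indices of the 2 highest-scoring experts
    weights = probs[top] / probs[top].sum()    # renormalize their router scores
    # Only the 2 chosen experts run; the other 62 are never touched for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.normal(size=D))
print(out.shape)  # (8,)
```

A real model repeats this at every MoE layer and adds the shared attention path; the sketch only shows the select-then-combine step.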
Expert Routing Explained
The router is a small neural network that takes each token as input and outputs a probability distribution across all experts. The top-K experts with the highest probabilities are selected.
- Top-1 routing: Only one expert activates per token. Used by Llama 4 Scout (16 experts, top-1). Fastest inference but potentially less diverse responses.
- Top-2 routing: Two experts activate per token. Used by Qwen 3.5, Mixtral. Slightly slower but more robust output quality.
- Top-8 routing: Eight experts activate per token. Used by DeepSeek V3.2 (256 experts, top-8). Server-class models where compute is less constrained.
Routing happens at every transformer layer, and different tokens in the same prompt often activate different expert combinations. A question about Python code might route to coding-specialized experts, while a follow-up about biology routes to science experts -- all within the same model.
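One way to see what the routing schemes above mean for compute: work out what fraction of expert weights each configuration touches per token. The Mixtral figure (top-2 of 8) is its published configuration; the others use the expert counts quoted in this article. Note this counts expert weights only -- the shared attention layers always run.

```python
configs = {
    "Llama 4 Scout":    (1, 16),    # top-1 of 16 experts
    "Qwen 3.5 30B-A3B": (2, 64),    # top-2 of 64
    "Mixtral 8x7B":     (2, 8),     # top-2 of 8
    "DeepSeek V3.2":    (8, 256),   # top-8 of 256
}
for name, (k, n) in configs.items():
    # Fraction of expert weights read per token under top-k routing.
    print(f"{name}: {k}/{n} experts -> {k / n:.1%} of expert weights active")
```

Notice that a higher top-K does not automatically mean a higher active fraction: DeepSeek's top-8 of 256 touches the same share of expert weights as Qwen's top-2 of 64.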
Why MoE Changes Everything for Mac Users
Here is the concrete difference on a 24GB M5 Pro MacBook Pro:
Dense 30B Model (e.g., Llama 3 30B if it existed)
- Memory needed: ~18GB at Q4 (barely fits in 24GB with OS overhead)
- Speed: ~14 tok/s (reading 18GB per token)
- Active compute: All 30B parameters every token
MoE 30B-A3B Model (Qwen 3.5 30B-A3B)
- Memory needed: ~18GB at Q4 (same total footprint)
- Speed: ~58 tok/s (only reading 3B active params per token)
- Active compute: 3B parameters per token
Same memory. Same hardware. 4x faster generation. That is the MoE advantage in one comparison.
There is an important caveat: MoE models still need to load all expert weights into memory. The total memory footprint is based on total parameters (30B), not active parameters (3B). What MoE saves you is compute time per token -- the actual matrix multiplications happen on a much smaller parameter set.
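The memory-versus-speed split in this comparison can be made explicit in a few lines. The constants are assumptions for illustration: roughly 0.6 bytes per parameter at Q4 including overhead, and 273 GB/s of bandwidth (Apple's M4 Pro figure, used here as a stand-in since this article does not quote the M5 Pro's). The speed numbers are roofline ceilings; real-world throughput, like the 58 tok/s above, lands below them.

```python
BYTES_PER_PARAM_Q4 = 0.6   # rough Q4 footprint incl. overhead (assumption)
BANDWIDTH_GB_S = 273       # assumed Pro-class chip bandwidth (assumption)

def footprint_gb(total_params_b: float) -> float:
    # Memory: ALL parameters must be resident, active or not.
    return total_params_b * BYTES_PER_PARAM_Q4

def ceiling_tok_s(active_params_b: float) -> float:
    # Speed: only the ACTIVE parameters are read per token.
    return BANDWIDTH_GB_S / (active_params_b * BYTES_PER_PARAM_Q4)

for name, total, active in [("dense 30B", 30, 30), ("MoE 30B-A3B", 30, 3)]:
    print(f"{name}: {footprint_gb(total):.0f}GB resident, "
          f"<= {ceiling_tok_s(active):.0f} tok/s ceiling")
```

Both models show the same 18GB footprint, but the MoE ceiling is 10x higher because only 3B of the 30B parameters are read per token.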
MoE Models in 2026: Real Examples
Nearly every major model release in 2026 uses MoE. Here are the ones that matter most for Mac users:
- Qwen 3.5 30B-A3B: 30B total, 3B active, 64 experts. Runs on 16GB Mac. The best value MoE model for consumer hardware.
- Llama 4 Scout: 109B total, 17B active, 16 experts. Runs on 64GB Mac. 10M token context window.
- Llama 4 Maverick: 400B total, 17B active, 128 experts. Server-only. Shows how MoE scales expert count without increasing per-token compute.
- Kimi K2.5: 1 trillion total parameters. The largest MoE model publicly released. Server-only but demonstrates the architecture's ceiling.
- Mistral Large 3: 675B total, MoE architecture. Server-class but distilled variants run locally.
- DeepSeek V3.2: 685B total, 37B active, 256 experts. Beats GPT-5 on math benchmarks. Server-only.
Filter our leaderboard by MoE architecture to see all compatible models ranked by Mac performance.
The Future of Local AI Is Sparse
The trend is clear: every frontier model lab is converging on MoE. The architecture lets you scale model knowledge without proportionally scaling inference cost. For local AI on Mac, this means:
- Today: A 24GB Mac runs models with the knowledge of 30B parameters at 58 tok/s
- Near-term: Expect MoE models with 100B+ total parameters and 2-5B active that, with aggressive quantization and partial expert offloading, could bring ChatGPT-level quality to 16GB Macs
- Longer-term: Expert offloading to SSD could enable trillion-parameter MoE models on consumer hardware, loading experts from NVMe storage on demand
MoE is not a niche optimization. It is the dominant paradigm for AI model architecture in 2026, and understanding it is essential for making informed decisions about which model to run on your Mac.