Hardware Requirements
Llama 4 comes in two variants, and the hardware requirements are very different:
- Llama 4 Scout (109B MoE, 17B active) — the model you can actually run on a Mac. It uses a Mixture-of-Experts architecture with 16 experts, activating only 17B parameters per token. In INT4 quantization, it requires approximately 58GB of memory. You need a 64GB Mac minimum, though 96GB or 128GB is recommended for comfortable headroom.
- Llama 4 Maverick (400B MoE, 17B active) — this is server-only territory. At roughly 210GB in INT4, no current Mac can fit it in memory. You would need dedicated GPU servers or a multi-node setup to run Maverick.
For this guide, we are focusing entirely on Llama 4 Scout, the variant that is practical for local Mac use.
Important: The 64GB requirement is tight. macOS itself uses 8-12GB of RAM, so a 64GB Mac leaves roughly 52-56GB for the model. Scout at Q4_K_M fits, but you should close all other applications before running it. A 96GB or 128GB Mac gives you much more breathing room.
Install Ollama
If you do not have Ollama installed yet, follow our complete Ollama installation guide. It takes under five minutes.
If Ollama is already installed, make sure you are running the latest version. Llama 4 support requires Ollama 0.6 or later. Check your version in Terminal:
ollama --version
If you need to update, download the latest installer from ollama.com and install it over your existing version. Your models will be preserved.
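If you want to script the version check, one way (a sketch assuming the version-sort flag sort -V is available, as it is on current macOS and Linux) is to compare the installed version against the 0.6 minimum:

```shell
# Compare a version string against the 0.6 minimum using sort -V.
# INSTALLED is an example value -- substitute the real output of
# `ollama --version` (just the number) on your machine.
INSTALLED="0.5.7"
MINIMUM="0.6.0"
# sort -V puts the older version first; if the minimum sorts first,
# the installed version is new enough.
if [ "$(printf '%s\n%s\n' "$MINIMUM" "$INSTALLED" | sort -V | head -n1)" = "$MINIMUM" ]; then
  echo "OK: $INSTALLED meets the $MINIMUM minimum"
else
  echo "Update needed: $INSTALLED is older than $MINIMUM"
fi
```

With the example value above, the script reports that an update is needed.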
Pull Llama 4 Scout
Download Llama 4 Scout with a single command:
ollama pull llama4-scout
This pulls the default Q4_K_M quantization, which is approximately 58GB. On a 100 Mbps connection, expect the download to take roughly 75-80 minutes at full line speed. Make sure you have at least 70GB of free disk space to account for temporary files during download.
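To confirm you have enough room before starting the pull, you can check free space from Terminal. This sketch assumes Ollama's model store lives on the same volume as your home directory (~/.ollama/models is the default location):

```shell
# Report free space (in GB) on the volume holding the home directory,
# where Ollama stores models by default. The pull needs ~70GB free.
FREE_GB=$(df -Pk "$HOME" | awk 'NR==2 { printf "%d", $4 / 1024 / 1024 }')
echo "Free space: ${FREE_GB}GB"
if [ "$FREE_GB" -lt 70 ]; then
  echo "Warning: less than 70GB free -- clear space before pulling"
fi
```

If you have relocated the model directory, point df at that path instead of $HOME.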
You can verify the download completed successfully:
ollama show llama4-scout
This displays the model's parameter count, quantization level, context window size, and total file size on disk.
Choose Your Quantization
The default Q4_K_M quantization offers the best balance of speed and quality for most users, but you have options:
- Q4_K_M (4-bit, ~58GB) — the default. Fastest inference speed, roughly 22 tokens per second on an M5 Max with 128GB. Quality loss is minimal for most tasks including coding, conversation, and analysis.
- Q5_K_M (5-bit, ~72GB) — noticeably better output quality, especially for nuanced reasoning and creative writing. Requires 96GB+ Mac. Runs at roughly 16-18 tok/s on high-end hardware.
To pull a specific quantization:
# Higher quality, needs 96GB+ Mac
ollama pull llama4-scout:q5_K_M
For a deeper dive into quantization trade-offs, see our quantization guide.
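A rough rule of thumb for estimating quantized model size is parameters times bits-per-weight divided by 8. The bits-per-weight figures below are approximations (Q4_K_M and Q5_K_M use mixed precision, and real GGUF files carry some tensors at higher precision), so the estimates land a few GB above the actual ~58GB and ~72GB file sizes:

```shell
# Back-of-the-envelope size estimate: params x bits-per-weight / 8.
# 4.5 and 5.5 bits/weight are approximate averages for these schemes.
awk 'BEGIN {
  params = 109e9
  printf "Q4_K_M: ~%.0fGB\n", params * 4.5 / 8 / 1e9
  printf "Q5_K_M: ~%.0fGB\n", params * 5.5 / 8 / 1e9
}'
```

The same arithmetic explains why Maverick's ~400B parameters put it far beyond any Mac's memory at any practical quantization.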
Run and Benchmark
Start Llama 4 Scout with:
ollama run llama4-scout
The first run takes 30-60 seconds as the model loads into memory. Subsequent runs are faster if the model is still cached. Once loaded, you will see a chat prompt where you can start typing.
To benchmark your performance, run the model with the --verbose flag (ollama run llama4-scout --verbose) and use a consistent prompt. Ollama then prints timing statistics, including generation speed, after each response. Expected speeds by hardware:
- M5 Max 128GB — ~22 tok/s (Q4_K_M), ~16 tok/s (Q5_K_M)
- M4 Ultra 192GB — ~28 tok/s (Q4_K_M), ~21 tok/s (Q5_K_M)
- M3 Max 96GB — ~14 tok/s (Q4_K_M)
- M2 Ultra 128GB — ~18 tok/s (Q4_K_M)
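Ollama's --verbose output reports an "eval count" (tokens generated) and an "eval duration"; dividing one by the other gives the tok/s figure to compare against the table above. The numbers below are illustrative placeholders, not measurements:

```shell
# Convert `ollama run --verbose` stats into tokens per second.
EVAL_COUNT=220        # "eval count" from verbose output (example value)
EVAL_DURATION_S=10    # "eval duration" in seconds (example value)
awk -v t="$EVAL_COUNT" -v s="$EVAL_DURATION_S" \
  'BEGIN { printf "%.1f tok/s\n", t / s }'
```

Run the same prompt two or three times and use the later runs, since the first response includes model load time.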
For detailed benchmarks across many models and Mac configurations, visit our benchmarks page.
Tip: If you see significantly lower speeds than expected, check Activity Monitor for memory pressure. A yellow or red indicator means the model is swapping to disk, which destroys performance. Close other applications and try again.
Performance Tips
Getting the best possible performance from Llama 4 Scout on your Mac requires some optimization:
- Close everything else. Web browsers with many tabs can use 4-8GB of RAM. Close Safari, Chrome, and any other memory-heavy apps before running Scout. Every gigabyte matters when you are near the memory limit.
- Try MLX for faster inference. Apple's MLX framework can deliver 20-50% faster inference than Ollama on the same hardware. If you are comfortable with Python and want maximum speed, MLX is worth the extra setup.
- Compare against Qwen 3.5 35B. Before committing to Scout, consider whether Qwen 3.5 35B meets your needs. It runs on 32GB Macs at 30+ tok/s and scores remarkably well on coding and reasoning benchmarks. Check the leaderboard for head-to-head comparisons.
- Reduce context length. If you do not need the full context window, you can limit it with the /set parameter num_ctx 4096 command inside the Ollama chat. Shorter context means less memory usage and faster generation.
- Use a wired power connection. Apple Silicon throttles under battery power. Always plug in your Mac when running large models to ensure full performance from the GPU and memory bandwidth.
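If you want the smaller context window to persist across sessions rather than setting it in each chat, one option is to bake it into a model variant using Ollama's standard Modelfile syntax (the llama4-scout-4k name below is just a suggestion):

```shell
# Persist a reduced context window by creating a model variant.
# FROM inherits the base model; PARAMETER overrides its defaults.
cat > Modelfile <<'EOF'
FROM llama4-scout
PARAMETER num_ctx 4096
EOF
echo "Modelfile written:"
cat Modelfile
```

Build and use the variant with ollama create llama4-scout-4k -f Modelfile, then ollama run llama4-scout-4k.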
For more model recommendations tailored to your specific Mac, read our Llama 4 Scout & Maverick analysis.