Best Model for Your Air, by RAM
On a MacBook Air, unified memory is everything. The chip generation (M1 vs M4 vs M5) affects speed, but the amount of RAM decides which models will even fit without forcing macOS to swap to disk — which tanks performance. Find your RAM tier below and start with the matching top pick.
| RAM | Top pick | Also great | Speed (Q4) |
|---|---|---|---|
| 8 GB | Qwen 4 4B | Gemma 4 E2B, Phi-5 Mini | ~45–65 tok/s |
| 16 GB | Phi-5 Mini | Gemma 4.5 12B (Q4), Llama 5 8B | ~35–55 tok/s |
| 24 GB | Qwen 4.1 32B-A3B | Phi-5 Medium 14B | ~40 tok/s |
8 GB Air — small but mighty
An 8 GB Air is the entry point, and it is genuinely useful. Stick to 2B–4B models that leave headroom for the OS. Qwen 4 4B is the best all-rounder at roughly 45–65 tok/s; Gemma 4 E2B is the speed champion at ~58 tok/s and barely touches memory; Phi-5 Mini is the sharpest at reasoning for its size. According to LLMCheck benchmarks, Gemma 4 E2B is among the fastest models on Apple Silicon overall.[1]
16 GB Air — the sweet spot
16 GB is where a MacBook Air becomes a serious local-AI laptop. You can comfortably run 8B–14B models: Phi-5 Mini for crisp reasoning, Gemma 4.5 12B at Q4 for well-rounded chat and writing, and Llama 5 8B for broad general knowledge. This tier handles real coding help, summarization, and RAG over local documents without swapping.
24 GB Air — punching above its weight
A 24 GB Air can run the king of Mac-runnable models: Qwen 4.1 32B-A3B, the current #1 on the LLMCheck leaderboard. Its mixture-of-experts design keeps only ~3B parameters active per token, so it fits in ~18 GB at Q4 and runs near 40 tok/s even on the Air — tight, but very usable. Phi-5 Medium 14B is the roomier alternative if you want more memory headroom for long contexts.
No MLX penalty on the Air. A MacBook Air and a MacBook Pro with the same chip run the same tok/s on a fresh run — the GPU, Neural Engine, and MLX framework are identical. The only difference between them is thermal throttling on long workloads, covered next.
Is the MacBook Air Good Enough?
For most people, yes. The thing to understand is the one trade-off Apple made to keep the Air thin and silent: there is no fan. The chip cools passively through the aluminum chassis. That has a clear, predictable effect on local LLMs:
- Short bursts run at full speed. A one-off question, a paragraph of code, a quick summary — the Air hits the same tok/s as a fan-cooled Mac with the same chip.
- Sustained generation throttles. After several minutes of continuous heavy output (or back-to-back long prompts), the chip warms up and macOS dials back clock speed to stay cool. Tok/s gradually drops on long runs.
- Memory pressure, not heat, is the real ceiling. Picking a model that fits your RAM matters far more than the fanless design for everyday use.
In practice, the Air is excellent for chat, short coding bursts, drafting, and on-the-go private AI. If your workflow is occasional questions rather than hours of nonstop batch generation, the fanless design will rarely get in your way.
How to Run an LLM on Your Air (5 Steps)
Step 1 — Check your MacBook Air's RAM
Click the Apple menu → About This Mac and read the Memory line. That number (8, 16, or 24 GB) is what decides your model. Note it before going further — it maps directly to the table above.
Step 2 — Install Ollama
Ollama is the simplest way to run models on a Mac. Download it from ollama.com, open the .dmg, drag Ollama to Applications, and launch it once so it installs its command-line tool. Then open Terminal and confirm:
ollama --version
You should see a version like ollama version 0.7.x. For a full walkthrough with screenshots, see our Install Ollama on Mac guide.
Step 3 — Pull the right model for your RAM
Run the single command that matches your Air's memory:
# 8 GB Air
ollama run qwen4:4b
# 16 GB Air
ollama run phi5:mini
# 24 GB Air
ollama run qwen4.1
The first run downloads the model (a few hundred MB up to ~18 GB for Qwen 4.1), then caches it locally so future launches take seconds.
Step 4 — Run and chat
When you see the >>> prompt, you are talking to the model entirely on-device — nothing leaves your Air. Type a question, press Enter, and watch it generate. Type /bye or press Ctrl+D to exit.
Step 5 — Manage heat & throttling
To keep your fanless Air running at its best during longer sessions:
- Keep prompts and outputs reasonable. Very long generations build up heat; break big tasks into chunks.
- Drop a model tier for sustained work. If you are batch-processing for an hour, a smaller model (e.g. Qwen 4 4B) throttles less and finishes faster overall.
- Give it air. Use the Air on a hard surface, not a blanket or your lap, so the chassis can shed heat.
- Mind RAM, not just heat. Quit memory-hungry apps before loading a model near your RAM ceiling to avoid swapping.
When to Choose a Pro or Studio Instead
The MacBook Air is the right machine for most local-AI users. But if your work is sustained and heavy — running an LLM as a coding agent all day, batch-processing documents, or serving a local API for hours — the fanless design becomes a real limit, and a model that needs more than 24 GB simply won't fit. That is when active cooling and more memory pay off.
- MacBook Pro (M4 Pro / M5 Pro, 24–48 GB): same portability, but a fan holds peak tok/s on long runs and higher RAM ceilings unlock bigger models.
- Mac Studio (M4 Max / Ultra, 64–192 GB): the desktop powerhouse for the largest Mac-runnable models and full 256K-context work.
Our Mac hardware buying hub breaks down exactly which configuration gives the best tok/s per dollar for local LLMs, across the whole lineup.
Just want the best Air? A MacBook Air M4 or M5 with 24 GB of unified memory clears the bar for Qwen 4.1 32B-A3B while staying silent and light. You can grab one here: MacBook Air M4 (24 GB) on Amazon.
As an Amazon Associate, LLMCheck earns from qualifying purchases. This does not affect our rankings.
Bottom line. Match the model to your RAM, run it through Ollama, and the MacBook Air handles real local AI for free and offline. Buy the Air for portability and silence; step up to a Pro or Studio only when you genuinely need sustained, heavy throughput.