What's New in Phi-5 Large

Microsoft's "phi" family has always been about punching above its weight. The original Phi models proved that a small network trained on meticulously curated, largely synthetic "textbook-quality" data could match models several times its size. Phi-5 Large is the moment that recipe finally gets scaled up to a serious 28B dense parameter count.

The result is a model that behaves like a much larger one on reasoning tasks while keeping the lean memory footprint that made the smaller phi models so deployable. The headline numbers — 88% MMLU, 80% AIME, 76% GPQA — are frontier-class for anything you can fit on a laptop. And crucially, Microsoft kept the MIT license, meaning zero restrictions on commercial use, fine-tuning, or redistribution.

The other quiet upgrade is context. Phi-5 Large ships with a 256K-token window, a tenfold jump over earlier phi releases. That is enough to drop an entire mid-size codebase or a book-length document into a single prompt — assuming you have the RAM for the KV cache, which we will get to.

According to LLMCheck benchmarks, Phi-5 Large 28B is the highest-scoring dense model that fits comfortably in the 24-32 GB RAM tier — beating every other dense model in its weight class on aggregate reasoning while staying under a 16 GB Q4 footprint.

Benchmarks vs the 32GB-Tier Field

The 27-41B range is the most competitive segment in local AI right now. Here is how Phi-5 Large stacks up against the three models you are most likely to be choosing between: Google's Gemma 4.5 27B, Mistral Medium 4, and Alibaba's Qwen 4.1 32B-A3B.

Metric Phi-5 Large 28B Gemma 4.5 27B Mistral Medium 4 Qwen 4.1 32B-A3B
Architecture 28B dense 27B dense 41B-A13B MoE 32B-A3B MoE
MMLU 88% 86% 87% 85%
HumanEval 85% 82% 84% 83%
AIME (math) 80% 62% 71% 74%
GPQA 76% 68% 73% 70%
Context 256K 1M 256K 256K
Multimodal No Yes No No
License MIT Gemma Apache 2.0 Qwen
Speed (M5 Max) ~38 tok/s ~42 tok/s ~48 tok/s ~44 tok/s

The story the table tells is clear: Phi-5 Large wins on raw intelligence — it sweeps every reasoning benchmark, and its 80% AIME math score is a genuine outlier in this class (Gemma 4.5 27B lands 18 points behind). The trade-off is that it gives up ground on three fronts: it is text-only, its context tops out at 256K versus Gemma's million, and the MoE models generate faster because they activate fewer parameters per token.

If your work is reasoning, math, and code, Phi-5 Large is the most capable thing in the tier. If you need to look at images or stuff a million tokens into context, Gemma 4.5 27B remains the better generalist.

Mac Performance by Chip

Because Phi-5 Large is a 28B dense model, its generation speed scales almost linearly with memory bandwidth — the defining spec for local inference on Apple Silicon. Here are the throughput figures we measured at Q4 quantization using the MLX backend:

Chip / Config Speed (Q4, MLX) Experience
M5 Max 128GB ~38 tok/s Effortless, faster than you read
M5 Max 64GB ~34 tok/s Excellent for daily coding
M4 Pro 32GB ~28 tok/s Comfortable interactive chat
M3 Max 64GB ~26 tok/s Smooth, slightly behind M5

The M4 Pro 32GB number is the one that matters most here. At ~28 tok/s, an entry-level-Pro chip with the smallest practical RAM config runs a frontier-reasoning model faster than most people read. That is the whole pitch: you do not need a maxed-out $4,000 machine to run something genuinely smart locally. A mid-tier MacBook Pro is enough.

At Q4, the model weights occupy roughly 16 GB of Unified Memory. On a 32GB Mac that leaves around 12-14 GB for macOS, your browser, your editor, and the KV cache — which is exactly why this model fits the tier so gracefully.

The Dense-vs-MoE Angle

Here is the counterintuitive part that confuses a lot of buyers. Mixture-of-Experts (MoE) models like Mistral Medium 4 are faster than Phi-5 Large because they only activate a fraction of their parameters per token (13B of 41B). So why not just run the faster MoE?

Because MoE models still have to keep every parameter resident in RAM. Mistral Medium 4 may only compute with 13B active weights, but all 41B must be loaded into Unified Memory simultaneously. At Q4 that pushes its footprint past 24 GB — which means on a 32GB Mac you are left with almost nothing for context and other apps, and you risk swapping to disk the moment you open a few browser tabs.

A 28B dense model sidesteps this entirely. Every parameter is used on every token, so there is no wasted resident weight: the full Q4 footprint is just ~16 GB. You trade a few tokens per second of speed for a model that actually leaves you room to work.

The rule of thumb: on a 32GB Mac, total parameter count drives your RAM budget, not active parameters. A dense 28B is the largest "smart" model that fits with breathing room — which is precisely the niche Phi-5 Large was built for.

Who Should Run It

Phi-5 Large is not the universal answer, but for a specific and very common profile it is the best option available today.

Who should look elsewhere? If you need image understanding, run Gemma 4.5 27B. If you have 64GB+ and want maximum speed, an MoE like Mistral Medium 4 will feel snappier. And if you are on an 8GB Mac, Phi-5 Large is too big — reach for Phi-5 Mini instead.

Install + Cursor / Continue.dev Setup

The fastest way onto Phi-5 Large is Ollama, which handles the download, quantization, and serving in one command:

# Pull and run Phi-5 Large 28B (Q4 by default) $ ollama run phi5-large # Ollama exposes an OpenAI-compatible API at: # http://localhost:11434/v1

For coding inside Continue.dev (VS Code or JetBrains), point the extension at your local Ollama endpoint by adding this to config.json:

// ~/.continue/config.json { "models": [ { "title": "Phi-5 Large (local)", "provider": "ollama", "model": "phi5-large" } ] }

For Cursor, open Settings → Models, enable "Override OpenAI Base URL," and set it to http://localhost:11434/v1 with model name phi5-large. You now have a frontier-reasoning autocomplete and chat model that never sends a byte of your source code to the cloud.

If you prefer the MLX path for maximum throughput on Apple Silicon, the model is also available in MLX-quantized form and runs through LM Studio's MLX engine with no extra configuration — just search for "Phi-5 Large" in the model browser and pick the 4-bit build.

Limitations

Phi-5 Large is excellent, but it is not magic, and the phi recipe has well-known trade-offs worth flagging before you commit.

None of these are dealbreakers for the target audience. They are simply the shape of the trade you make when you choose the smartest dense model that fits your RAM over a faster or more multimodal alternative.