The 30-Day Recap (TL;DR)

Every major open-weights release between April 9 and May 9, 2026, with a one-line takeaway. Six of these seven models now appear in the LLMCheck top 25; five are in the top 10.

Headline story: Open source no longer trails by 6–9 months. As of May 2026, the gap to GPT-5o is measured in single-digit percentage points on most benchmarks — and DeepSeek R2 has actually surpassed it on AIME and other competition-math benchmarks. The center of gravity in AI development is shifting back to weights you can host yourself.


The new #1: Qwen 4 Preview 32B-A3B

Alibaba dropped Qwen 4 Preview on May 5 with surgical precision: same 3B active-parameter MoE shape that made Qwen 3.6 the runaway hit of April, but trained on 2 trillion more tokens with a knowledge cutoff of April 2026. The result is the most capable open model that still runs comfortably on a 24 GB Mac.

Architecture

Total params: 32B
Active params: 3B (9.4%)
Experts: 128 (4 active)
Reasoning mode: Hybrid (auto-switch)
Context: 262K (1M extended)
Training tokens: 18T

Benchmarks

SWE-Verified: 76%
MMLU: 88%
HumanEval: 92%
AIME 2025: 89%
MATH: 94%
LLMCheck Score: 73 / 100

The hybrid reasoning mode is the headline feature. Qwen 3.6 had a manual /think toggle that users had to remember to flip for hard prompts. Qwen 4 Preview replaces this with an automatic classifier that detects reasoning-shaped prompts and engages internal chain-of-thought before answering. In LLMCheck testing, the auto-router correctly engages thinking mode on 94% of math, code-debugging, and multi-step planning prompts — while keeping non-reasoning queries at full speed.
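
To make the behavior concrete, here is a minimal Python sketch of what an auto-router does. This is illustrative only — Qwen 4's actual classifier is learned inside the model, not a keyword heuristic — and every name below is invented for the example.

# Illustrative sketch of auto-routing between fast and thinking modes.
# Qwen 4's real classifier is learned; this heuristic just shows the control flow.
import re

REASONING_CUES = re.compile(
    r"\b(prove|derive|debug|refactor|step[- ]by[- ]step|how many|plan|optimi[sz]e)\b",
    re.IGNORECASE,
)

def needs_thinking(prompt: str) -> bool:
    """Crude stand-in for the learned reasoning-vs-chat classifier."""
    has_math = bool(re.search(r"\d+\s*[-+*/^=]\s*\d+", prompt))
    has_code = "def " in prompt or "Traceback" in prompt
    return has_math or has_code or bool(REASONING_CUES.search(prompt))

def route(prompt: str) -> str:
    # Engage chain-of-thought only for reasoning-shaped prompts;
    # everything else stays on the fast path.
    return "think" if needs_thinking(prompt) else "fast"

print(route("What's a good name for a cat?"))       # fast
print(route("Debug this function step by step."))   # think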

Mac performance is the second story. According to LLMCheck benchmarks, the Q4_K_M quantization measures 78 tok/s on M5 Max 128 GB, 65 tok/s on M5 Max 64 GB, and 58 tok/s on M4 Pro 24 GB — the first 32B-class model that hits production-usable speeds on the entry-tier MacBook Pro. The MoE structure (only 3B active per token) is doing the heavy lifting; from a memory-bandwidth perspective, Qwen 4 Preview behaves like a 3B dense model while delivering 32B-class quality.
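
A quick back-of-envelope check makes the bandwidth argument concrete: decode speed on Apple Silicon is roughly bounded by how fast the active weights stream out of unified memory. The sketch below assumes ~4.8 bits per weight for Q4_K_M (a typical llama.cpp average) and Apple's published ~273 GB/s figure for the M4 Pro; both are approximations, and real throughput lands well below the ceiling once attention and router overhead are counted.

# Bandwidth-bound ceiling on decode speed (ignores attention, KV cache,
# and routing overhead -- a rough upper bound, not a prediction).
def ceiling_tok_s(active_params_billion: float, bits_per_weight: float,
                  bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(ceiling_tok_s(3, 4.8, 273))   # ~152 tok/s ceiling: 3B active (MoE)
print(ceiling_tok_s(32, 4.8, 273))  # ~14 tok/s ceiling: 32B dense equivalent

The measured 58 tok/s sits comfortably under the 3B-active ceiling; a dense 32B model could never reach it on the same hardware.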

License is fully Apache 2.0 — commercial use is unambiguously permitted, no MAU caps, no field-of-use restrictions. This puts Qwen 4 Preview in the same permissive legal posture as Apache-licensed infrastructure like Kubernetes for downstream products. Combined with the 76% SWE-Verified score (within 4 points of GPT-5o), the practical implication is that startups can now ship coding agents on locally hosted weights with no API spend and no licensing risk.

Install via the standard channels:

# Ollama (one-line install, ~20 GB download)
ollama run qwen4:32b-a3b

# MLX (fastest on Apple Silicon)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen4-32B-A3B-4bit \
--prompt "Refactor this function for clarity..."

# LM Studio: search "Qwen 4 Preview 32B A3B" in Discover tab

Verdict: Qwen 4 Preview is the new default recommendation for any Mac with 24 GB RAM or more. It replaces Qwen 3.6-35B-A3B at the top of the LLMCheck leaderboard and is the strongest open-weights coding model that runs on consumer hardware as of May 9, 2026.


Llama 5 — Meta is back in the open game

After a quiet year following Llama 4 Maverick, Meta shipped Llama 5 on May 1 with two simultaneous releases: a dense Llama 5 8B aimed at the mainstream Mac tier, and a Scout 109B-A17B MoE that targets developer workstations. Both ship with multimodal capability and the new Llama 5 Community License.

Llama 5 8B (dense)

The 8B dense model is a meaningful step up from Llama 3.1 8B and a clear competitor to Qwen 3.5 9B in the 16 GB Mac tier. According to LLMCheck benchmarks, it measures ~110 tok/s on M5 Max via MLX, scores 78% on MMLU and 80% on HumanEval, and uses approximately 6.5 GB of RAM at Q4_K_M. Where Qwen 3.5 9B leans toward coding, Llama 5 8B is broader — stronger on creative writing, multilingual chat, and instruction following.

Llama 5 Scout 109B-A17B (MoE)

Scout is the more interesting release. It is a 109-billion-parameter MoE with 17 billion active parameters per token, which puts it in the "edge-optimized frontier" niche — too large for 24 GB Macs but comfortable on 64 GB and up. The architecture borrows from Llama 4 Maverick but with a denser router and what Meta calls "activation pruning," a technique that skips experts whose contribution falls below a learned threshold. The result is roughly 38 tok/s on M5 Max 64 GB at Q4_K_M.
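
Meta has not published the details of activation pruning, so the numpy sketch below is only a plausible reading of the description: route top-k as usual, then drop any selected expert whose gate weight falls below a threshold and renormalize. All names and the threshold value are invented for illustration.

# Hypothetical sketch of threshold-based expert skipping ("activation
# pruning"); Meta's actual mechanism may differ.
import numpy as np

def route_with_pruning(router_logits, top_k=4, tau=0.10):
    gates = np.exp(router_logits - router_logits.max())
    gates /= gates.sum()                          # softmax over experts
    top = np.argsort(gates)[-top_k:]              # standard top-k routing
    kept = [e for e in top if gates[e] >= tau]    # skip weak contributors
    weights = gates[kept] / gates[kept].sum()     # renormalize survivors
    return kept, weights

logits = np.array([2.0, 0.1, 1.8, -1.0, 0.0, 0.3, 1.0, -0.5])
experts, weights = route_with_pruning(logits)
print(experts, weights)  # fewer than top_k experts fire -> fewer bytes per token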

Scout supports a full 256K-token context window and is the first open Llama with native multimodal input across text, image, and audio — an input path previously the domain of closed models like Gemini 2.5 Flash. According to LLMCheck testing, Scout transcribes 30-second audio clips with ~94% word-level accuracy, which puts it in the same band as Whisper Large v3 while also producing reasoned answers about the content.

Llama 5 Community License — the asterisk

Both models ship under the Llama 5 Community License, Meta's house license. It is permissive for the vast majority of users but carries the same 700-million-monthly-active-user clause that gated Llama 3 and 4. Companies above that threshold must request a separate commercial license. For 99.9% of developers, the license is functionally Apache 2.0; for hyperscalers, it is a hard wall. The Open Source Initiative continues to decline to classify Llama as "open source" in the strict OSI sense.

# Install both Llama 5 variants
ollama run llama5:8b # ~6.5 GB, dense
ollama run llama5:scout # ~62 GB, MoE 17B-active

# MLX equivalents
mlx_lm.generate --model mlx-community/Llama-5-8B-Instruct-4bit
mlx_lm.generate --model mlx-community/Llama-5-Scout-109B-A17B-4bit

Strategically, the Scout release is Meta's clearest statement yet that it sees on-device AI as a category. Audio + vision + 256K context on weights you can host locally is the kind of capability that closed APIs charge per-token for. For Mac power users with 64 GB+ machines, Scout instantly becomes one of the most interesting models to experiment with this month.


Phi-5 Mini owns the 8 GB tier

Microsoft shipped Phi-5 Mini on April 28 and it has quietly become the most-downloaded model on Ollama's library this month. The pitch is simple: a 4-billion-parameter dense model under MIT license that beats every model under 10B parameters on MMLU and AIME. According to LLMCheck benchmarks, Phi-5 Mini scores 82% MMLU, 61% AIME 2025, and 86% HumanEval at just 4B parameters — numbers that 13B models from Q1 2026 could not reach.

The "phi recipe" — Microsoft's deliberately curated synthetic-data training pipeline — has scaled cleanly from Phi-3 to Phi-5. The team is now generating training data from frontier models, filtering aggressively for textbook-quality samples, and using that to train smaller students. The trick that previously hit a quality ceiling around 7B parameters now appears to scale through 4B with no degradation, mostly because the synthetic data pipeline itself has gotten dramatically better as the teacher models have improved.

Mac performance is exceptional. Phi-5 Mini runs at ~140 tok/s on M5 Max, ~115 tok/s on M4 Max, and ~85 tok/s on M2 Air 16 GB — meaning it is the first model that's both genuinely capable and genuinely fast on entry-tier MacBook Air hardware. RAM footprint at Q4_K_M is approximately 2.6 GB, which leaves plenty of headroom for the OS and other apps even on 8 GB Macs.

Native context is 256K with full attention (no sliding-window compression), and the MIT license removes every legal concern. This combination gives Phi-5 Mini the highest LLMCheck score (70) in the 8 GB tier, edging out Gemma 4 E2B and E4B for the first time.

# 8 GB Mac install (~2.6 GB on disk)
ollama run phi5:mini

# MLX
mlx_lm.generate --model mlx-community/Phi-5-Mini-4B-Instruct-4bit \
--prompt "Explain transformers to me"

For users on 8 GB or 16 GB Macs — which remain the majority of Apple Silicon laptops in the wild — Phi-5 Mini is the new default recommendation for general-purpose chat, reasoning, and coding assistance. The combination of MIT licensing and 140 tok/s makes it ideal for embedded use cases too: shipping it inside a Mac app no longer means shipping a slow or weak model.


DeepSeek R2 enters the frontier

DeepSeek released R2 on May 7 and it is genuinely a frontier-class reasoning model with weights you can download. The architecture is 671 billion total parameters with 37 billion active per token — a direct successor to R1 that retains the MIT license and adds substantial improvements to math, multi-step reasoning, and test-time compute scaling.

DeepSeek R2 Benchmarks

AIME 2025: 91%
MATH: 88%
GPQA-Diamond: 84%
SWE-Verified: 71%
vs. GPT-5o (AIME): +4 pp open lead
License: MIT

Mac Viability

M5 Max 128 GB, Q2_K: ~8 tok/s
M4 Ultra 192 GB, Q3_K_M: ~12 tok/s
Disk size (Q2_K): ~220 GB
Disk size (Q3_K_M): ~290 GB
Min RAM (Q2_K): 128 GB
Recommended: M4 Ultra 192 GB

The 91% AIME 2025 score is the most consequential number in the May release cycle. According to LLMCheck cross-referencing with public closed-model benchmarks, DeepSeek R2 has surpassed GPT-5o (87%) and matches Claude 4.6 Opus on competition mathematics. This is the first time an MIT-licensed model has led a frontier benchmark by a clear margin.

Test-time compute is the underlying mechanism. R2 scales reasoning quality with extra inference budget — doubling the thinking-token budget reliably improves AIME by 3–5 percentage points. This means R2 isn't a single fixed model; it's a family of speed-quality tradeoffs you control at inference time. For users on Mac Studio M4 Ultra hardware, this is a unique capability: you can dial the model up to "spend extra seconds, get a better answer" mode without any API call.
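
In practice, budget control can be as simple as capping the thinking tokens. The sketch below uses mlx-lm's Python API; the repo name is hypothetical, and it assumes R2 keeps R1's <think> tag convention — check the model card before relying on either.

# Budget-tunable reasoning, sketched with mlx-lm. The repo name and the
# <think> convention are assumptions, not confirmed details of the R2 release.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R2-Q3_K_M")  # hypothetical repo

def answer(prompt: str, thinking_budget: int) -> str:
    # Phase 1: spend up to `thinking_budget` tokens on chain-of-thought.
    thoughts = generate(model, tokenizer,
                        prompt=f"{prompt}\n<think>\n",
                        max_tokens=thinking_budget)
    # Phase 2: close the think block and generate the final answer.
    return generate(model, tokenizer,
                    prompt=f"{prompt}\n<think>\n{thoughts}\n</think>\n",
                    max_tokens=512)

fast = answer("How many primes are there below 1000?", thinking_budget=256)
slow = answer("How many primes are there below 1000?", thinking_budget=2048)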

Mac viability is real but constrained. The 671B total parameter count means even Q2_K quantization needs ~220 GB on disk and 128 GB+ unified memory. Practical local inference means a Mac Studio M4 Ultra 192 GB at Q3_K_M (~12 tok/s with usable quality). Below that tier, R2 is server-class only. But because the license is MIT and the weights are open, any team can rent a pair of H200 GPUs — the ~220 GB of Q2_K weights exceed a single card's 141 GB — and serve R2 internally for a fraction of comparable frontier-API spend.
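
The disk figures in the table above fall straight out of bits-per-weight arithmetic, which is worth sanity-checking. The values below are approximate llama.cpp averages (Q2_K around 2.6 bits/weight, Q3_K_M around 3.5), not exact format specs.

# Sanity-check the quantized disk sizes from first principles.
def gguf_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(gguf_size_gb(671, 2.6))  # ~218 GB -> matches the ~220 GB Q2_K figure
print(gguf_size_gb(671, 3.5))  # ~294 GB -> matches the ~290 GB Q3_K_M figure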

Why this matters strategically: MIT-licensed frontier reasoning means startups can ship products built on top-tier mathematical and analytical reasoning without API costs, rate limits, or per-token billing. The competitive moat that closed reasoning models held since OpenAI o1 has now been substantially eroded.


The full landscape — Open-Source Top 10 (May 9, 2026)

According to LLMCheck benchmarks across 109 standardized data points, here is the open-source leaderboard as of May 9, 2026. Score is the LLMCheck composite (capability + speed + accessibility + license, max 100). Mac Tier is the minimum unified memory needed to run Q4_K_M comfortably.

LLMCheck Open-Source Top 10 — May 9, 2026. See full leaderboard for all 46 models.
Rank Model Family Active Params License Mac Tier Score
1 Qwen 4 Preview 32B-A3B Alibaba 3B Apache 2.0 24 GB 73
2 Phi-5 Mini Microsoft 4B MIT 8 GB 70
3 Qwen 3.6-35B-A3B Alibaba 3B Apache 2.0 24 GB 69
4 Gemma 4 E2B Google 2.3B Gemma 8 GB 67
5 Gemma 4 E4B Google 4B Gemma 8 GB 66
6 DeepSeek R2 DeepSeek 37B MIT Server 66
7 Qwen 3.5 9B Alibaba 9B Apache 2.0 8 GB 66
8 Llama 5 8B Meta 8B Llama 5 16 GB 64
9 Llama 5 Scout Meta 17B Llama 5 64 GB 62
10 Mistral Voyage 24B Mistral 24B Apache 2.0 24 GB 60

Three observations from the table. First, Apache 2.0 and MIT account for six of the top 10 entries — permissive licensing has become a competitive requirement, not a nice-to-have. Second, five of the top 10 have 8 GB or 16 GB Mac tiers — a year ago that number was two. Third, three different families now place models in the top 5 (Alibaba, Microsoft, Google), suggesting the open-source ecosystem is healthier and more competitive than at any prior point.


1. MoE is now the default

Six of the top 10 open models now use mixture-of-experts or related conditional-compute architectures. Six months ago that number was three. The reason is straightforward: MoE delivers better tok/s per gigabyte of unified memory on Apple Silicon, and Apple Silicon is where the bulk of local-LLM users actually run inference. Expect to see "dense" become a niche choice for specific quality-maximizing scenarios rather than the architectural default.

2. On-device reasoning models are real

Qwen 4 Preview's hybrid mode and DeepSeek R2's test-time compute scaling are both production-quality reasoning systems running on local hardware. Six months ago, "reasoning model" meant "OpenAI o1 via API" — today it means "ollama run qwen4". The category has bifurcated into auto-routing reasoners (Qwen 4) and budget-tunable reasoners (DeepSeek R2), each with different ergonomics.

3. MIT and Apache won the license war

Three years ago, the open-LLM ecosystem was dominated by custom licenses with usage restrictions (Llama 2, Falcon, etc.). Today, the highest-capability open models — Qwen 4, Phi-5, DeepSeek R2, Mistral Voyage — ship under MIT or Apache 2.0. According to LLMCheck data, models under permissive OSI-approved licenses have grown from 22% of the top 25 in May 2024 to 64% in May 2026.

4. Edge models doubled in capability

Phi-5 Mini at 4B parameters scores 82% MMLU. Qwen 2.5-14B scored 78% MMLU at its late-2024 launch. The edge tier (3B–5B parameters) now meaningfully overlaps with what 13B-class models could do in Q1 2026, which fundamentally changes the economics of shipping LLMs inside consumer apps. A 4B model fits in 2.6 GB and runs at 140 tok/s; a 13B-class model needed 8 GB and ran at 60 tok/s. The capability cost has collapsed.

5. Apple Silicon is the de facto open-dev platform

HuggingFace model cards now include MLX install commands by default for major releases. Qwen 4 Preview, Phi-5 Mini, Llama 5, and Mistral Voyage all shipped with day-one MLX-converted weights on the mlx-community namespace. This was not the case in May 2025. The combination of unified memory, the MLX framework, and a large user base of developers running M-series Macs has created a soft monopoly: if a model can't run on Apple Silicon, it doesn't make it into mainstream open-LLM workflows.


By Mac tier — what to run TODAY (May 2026)

Updated recommendations as of May 9, 2026. All numbers are LLMCheck-measured tok/s at Q4_K_M unless noted. Pick by RAM tier:

Mac tier recommendations, May 9, 2026.
Mac RAM Primary pick Speed Backup
8 GB Phi-5 Mini 140 tok/s Gemma 4 E2B (155 tok/s)
16 GB Llama 5 8B 110 tok/s Phi-5 Mini, Qwen 3.5 9B
24 GB Qwen 4 Preview 32B-A3B 58 tok/s Mistral Voyage 24B
32–48 GB Qwen 4 Preview + Voyage 24B 65 tok/s Hermes 4 70B
64 GB Llama 5 Scout 38 tok/s Everything below
128 GB DeepSeek R2 Q2 ~8 tok/s Llama 5 Scout, Voyage
M4 Ultra 192 GB DeepSeek R2 Q3_K_M ~12 tok/s R2 + Qwen 4 simultaneously

The 24 GB sweet spot has decisively shifted to Qwen 4 Preview. If you bought a base-tier 24 GB MacBook Pro in the past two years, you now have access to the most capable open-source model in existence at production speed. This is a meaningful change — in May 2025 the equivalent recommendation was Llama 3.3 70B at painful quantization, and the user experience was mediocre.


What's coming next month

Speculative section — this is what LLMCheck is watching for in the next 30 days based on public roadmaps, leaked release notes, and credible community signals.

Watch list: The single most consequential possible release is a permissively licensed reasoning model that fits in 24 GB. Qwen 4 Preview already leans this way with its hybrid mode. If a competitor ships a reasoning model in the 32B-A3B class that runs well at Q4 quantization and exposes explicit test-time compute control, the local-LLM landscape changes again.