The 30-Day Recap (TL;DR)
Every major open-weights release between April 9 and May 9, 2026, with a one-line takeaway. Six of these seven releases now appear in the LLMCheck top 25; they account for six of the top 10 slots.
- Qwen 4 Preview 32B-A3B May 5 — Apache 2.0. New #1 open model, score 73, hybrid reasoning mode, 76% SWE-Verified.
- Llama 5 8B + Llama 5 Scout 109B-A17B May 1 — Meta's open comeback. 256K context, multimodal (vision + audio) on Scout.
- Phi-5 Mini 4B Apr 28 — Microsoft's small-model king. MIT, 140 tok/s, 82% MMLU at 4B parameters.
- DeepSeek R2 671B-A37B May 7 — MIT-licensed frontier reasoning. 91% AIME, beats GPT-5o on math.
- Mistral Voyage 24B May 3 — Apache 2.0, dense balanced workhorse. 32K context, strong tool use.
- Hermes 4 70B May 6 — Nous Research's Llama 5 finetune with system-prompt control and reduced refusals.
- SmolLM3 3B Apr 30 — HuggingFace's open recipe edge model with full training data published.
Headline story: Open source no longer trails by 6–9 months. As of May 2026, the gap to GPT-5o is measured in single-digit percentage points on most benchmarks — and DeepSeek R2 has actually surpassed it on AIME and competition math. The center of gravity in AI development is shifting back to weights you can host yourself.
The new #1: Qwen 4 Preview 32B-A3B
Alibaba dropped Qwen 4 Preview on May 5 with surgical precision: same 3B active-parameter MoE shape that made Qwen 3.6 the runaway hit of April, but trained on 2 trillion more tokens with a knowledge cutoff of April 2026. The result is the most capable open model that still runs comfortably on a 24 GB Mac.
The hybrid reasoning mode is the headline feature. Qwen 3.6 had a manual /think toggle that users had to remember to flip for hard prompts. Qwen 4 Preview replaces this with an automatic classifier that detects reasoning-shaped prompts and engages internal chain-of-thought before answering. In LLMCheck testing, the auto-router correctly engages thinking mode on 94% of math, code-debugging, and multi-step planning prompts — while keeping non-reasoning queries at full speed.
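The classifier itself is internal to the model, but the behavior it automates is easy to picture client-side. A minimal sketch of the idea, where the patterns and routing logic are illustrative assumptions rather than Qwen's actual classifier:

```python
import re

# Client-side approximation: detect reasoning-shaped prompts and engage
# thinking mode only for those. These patterns are invented for illustration;
# Qwen's real classifier is model-internal and learned.
REASONING_PATTERNS = [
    r"\bprove\b", r"\bsolve\b", r"\bdebug\b", r"\bstep[- ]by[- ]step\b",
    r"\bwhy does\b", r"\bplan\b", r"\d+\s*[-+*/^]\s*\d+",
]

def needs_thinking(prompt: str) -> bool:
    """True if the prompt looks like it benefits from chain-of-thought."""
    text = prompt.lower()
    return any(re.search(p, text) for p in REASONING_PATTERNS)

def route(prompt: str) -> str:
    # Automates the Qwen 3.6-era manual toggle: prepend /think only when needed.
    return ("/think " + prompt) if needs_thinking(prompt) else prompt

print(route("Debug this off-by-one error in my loop"))
print(route("What's the capital of France?"))
```

The point of the auto-router is exactly this shape: non-reasoning queries pass through untouched and keep full decode speed.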
Mac performance is the second story. According to LLMCheck benchmarks, the Q4_K_M quantization measures 78 tok/s on M5 Max 128 GB, 65 tok/s on M5 Max 64 GB, and 58 tok/s on M4 Pro 24 GB — the first 32B-class model that hits production-usable speeds on the entry-tier MacBook Pro. The MoE structure (only 3B active per token) is doing the heavy lifting; from a memory-bandwidth perspective, Qwen 4 Preview behaves like a 3B dense model while delivering 32B-class quality.
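Some back-of-envelope arithmetic shows why. Decoding is roughly memory-bandwidth bound: each generated token streams the active weights through the memory bus once, so active parameter count sets the ceiling. A sketch using Apple's published M4 Pro bandwidth and an approximate Q4_K_M bits-per-weight figure (both assumptions for scale, not LLMCheck measurements):

```python
# Why a 3B-active MoE decodes like a 3B dense model: per-token cost is set
# by how many bytes of weights must be read, not by total parameter count.
active_params = 3e9          # Qwen 4 Preview: 3B active per token (32B total)
bytes_per_weight = 4.5 / 8   # ~Q4_K_M average (about 4.5 bits per weight)
bandwidth_gbs = 273          # Apple's published M4 Pro memory bandwidth, GB/s

bytes_per_token = active_params * bytes_per_weight     # ~1.7 GB per token
ceiling_toks = bandwidth_gbs * 1e9 / bytes_per_token   # theoretical maximum

print(f"~{ceiling_toks:.0f} tok/s bandwidth ceiling")
# The measured 58 tok/s sits well under this ceiling; attention, KV-cache
# reads, and router overhead consume the remaining budget.
```

Run the same arithmetic on a dense 32B model (all 32B weights read per token) and the ceiling drops by roughly 10x, which is the whole MoE argument in one line.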
License is fully Apache 2.0 — commercial use is unambiguously permitted, no MAU caps, no field-of-use restrictions. This puts Qwen 4 Preview in the same legal posture as the Linux kernel for downstream products. Combined with the 76% SWE-Verified score (within 4 points of GPT-5o), the practical implication is that startups can now ship coding agents on locally-hosted weights with no API spend and no licensing risk.
Install via the standard channels:
```
# Ollama (one-line install, ~20 GB download)
ollama run qwen4:32b-a3b

# MLX (fastest on Apple Silicon)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen4-32B-A3B-4bit \
  --prompt "Refactor this function for clarity..."

# LM Studio: search "Qwen 4 Preview 32B A3B" in the Discover tab
```

Verdict: Qwen 4 Preview is the new default recommendation for any Mac with 24 GB RAM or more. It replaces Qwen 3.6-35B-A3B at the top of the LLMCheck leaderboard and is the strongest open-weights coding model that runs on consumer hardware as of May 9, 2026.
Llama 5 — Meta is back in the open game
After a quiet six months following Llama 4 Maverick, Meta shipped Llama 5 on May 1 with two simultaneous releases: a dense Llama 5 8B aimed at the mainstream Mac tier, and a Scout 109B-A17B MoE that targets developer workstations. Both ship with full multimodal capability and the new Llama 5 Community License.
Llama 5 8B (dense)
The 8B dense model is a meaningful step up from Llama 3.3 8B and a clear competitor to Qwen 3.5 9B in the 16 GB Mac tier. According to LLMCheck benchmarks, it measures ~110 tok/s on M5 Max via MLX, scores 78% on MMLU and 80% on HumanEval, and uses approximately 6.5 GB of RAM at Q4_K_M. Where Qwen 3.5 9B leans coding, Llama 5 8B is broader — stronger on creative writing, multilingual chat, and instruction following.
Llama 5 Scout 109B-A17B (MoE)
Scout is the more interesting release. It is a 109-billion-parameter MoE with 17 billion active parameters per token, which puts it in the "edge-optimized frontier" niche — too large for 24 GB Macs but comfortable on 64 GB and up. The architecture borrows from Llama 4 Maverick but with a denser router and what Meta calls "activation pruning," a technique that skips experts whose contribution falls below a learned threshold. The result is roughly 38 tok/s on M5 Max 64 GB at Q4_K_M.
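Meta has not published implementation details for activation pruning, so the following is only a minimal sketch of the described idea: standard top-k routing followed by a threshold cut. The expert count, top_k, and threshold value are invented for illustration; in Scout the threshold is learned.

```python
import math, random

# Sketch of "activation pruning": after top-k routing, skip any selected
# expert whose router weight falls below a threshold. All constants here
# are illustrative assumptions, not Meta's actual values.
random.seed(0)
num_experts, top_k, threshold = 16, 4, 0.10

logits = [random.gauss(0, 1) for _ in range(num_experts)]
z = sum(math.exp(x) for x in logits)
weights = [math.exp(x) / z for x in logits]            # softmax over experts

top = sorted(range(num_experts), key=lambda i: weights[i])[-top_k:]
active = [i for i in top if weights[i] >= threshold]   # the pruning step

print(f"routed to {top_k} experts, kept {len(active)} after pruning")
```

Every pruned expert is a skipped matrix multiply, which is where the tok/s gain over a vanilla top-k MoE would come from.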
Scout supports a full 256K-token context window and is the first open Llama with native multimodal input across text, image, and audio. The audio path was previously exclusive to closed models like Gemini 2.5 Flash. According to LLMCheck testing, Scout transcribes 30-second audio clips with ~94% word-level accuracy, which puts it in the same band as Whisper Large v3 while also producing reasoned answers about the content.
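Word-level accuracy here is the complement of word error rate (WER), the standard transcription metric computed from word-level edit distance. A self-contained sketch of the metric itself; the sample strings are invented:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over the lazy"
print(f"word accuracy: {1 - word_error_rate(ref, hyp):.0%}")  # prints: word accuracy: 78%
```

A reported ~94% word-level accuracy corresponds to a WER of about 0.06 under this definition.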
Llama 5 Community License — the asterisk
Both models ship under the Llama 5 Community License, Meta's house license. It is permissive for the vast majority of users but carries the same 700-million-monthly-active-user clause that gated Llama 3 and 4. Companies above that threshold must request a separate commercial license. For 99.9% of developers, the license is functionally Apache 2.0; for hyperscalers, it is a hard wall. The Open Source Initiative continues to decline to classify Llama as "open source" in the strict OSI sense.
```
# Install both Llama 5 variants
ollama run llama5:8b      # ~6.5 GB, dense
ollama run llama5:scout   # ~62 GB, MoE 17B-active

# MLX equivalents
mlx_lm.generate --model mlx-community/Llama-5-8B-Instruct-4bit
mlx_lm.generate --model mlx-community/Llama-5-Scout-109B-A17B-4bit
```

Strategically, the Scout release is Meta's clearest statement yet that it sees on-device AI as a category. Audio + vision + 256K context on weights you can host locally is the kind of capability that closed APIs charge per-token for. For Mac power users with 64 GB+ machines, Scout instantly becomes one of the most interesting models to experiment with this month.
Phi-5 Mini owns the 8 GB tier
Microsoft shipped Phi-5 Mini on April 28 and it has quietly become the most-downloaded model on Ollama's library this month. The pitch is simple: a 4-billion-parameter dense model under MIT license that beats every model under 10B parameters on MMLU and AIME. According to LLMCheck benchmarks, Phi-5 Mini scores 82% MMLU, 61% AIME 2025, and 86% HumanEval at just 4B parameters — numbers that 13B models from Q1 2026 could not reach.
The "phi recipe" — Microsoft's deliberately curated synthetic-data training pipeline — has scaled cleanly from Phi-3 to Phi-5. The team is now generating training data from frontier models, filtering aggressively for textbook-quality samples, and using that to train smaller students. The trick that previously hit a quality ceiling around 7B parameters now appears to scale through 4B with no degradation, mostly because the synthetic data pipeline itself has gotten dramatically better as the teacher models have improved.
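The actual phi pipeline is not public, so the following sketches only the shape described above: generate candidates with a teacher, score them for "textbook quality," keep the top slice. The score_quality() heuristic is a hypothetical stand-in, not Microsoft's real filter:

```python
# Hypothetical generate-score-filter loop. Nothing here reflects Microsoft's
# actual pipeline; it illustrates the described shape only.

def score_quality(sample: str) -> float:
    words = sample.split()
    if not words:
        return 0.0
    # Reward longer, explanatory, self-contained prose (purely illustrative).
    explains = any(w.strip(",.") in ("because", "therefore", "thus") for w in words)
    return min(len(words) / 10, 1.0) * (1.5 if explains else 1.0)

candidates = [
    "x = 5",
    "We add the two equations because the y terms cancel, therefore x = 5.",
]
kept = [c for c in candidates if score_quality(c) > 0.5]
print(f"kept {len(kept)} of {len(candidates)} samples")
```

The aggressive-filtering claim amounts to running a loop like this at enormous scale, with a frontier model on both the generation and scoring ends.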
Mac performance is exceptional. Phi-5 Mini runs at ~140 tok/s on M5 Max, ~115 tok/s on M4 Max, and ~85 tok/s on M2 Air 16 GB — meaning it is the first model that's both genuinely capable and genuinely fast on entry-tier MacBook Air hardware. RAM footprint at Q4_K_M is approximately 2.6 GB, which leaves plenty of headroom for the OS and other apps even on 8 GB Macs.
Native context is 256K with full attention (no sliding-window compression), and the MIT license removes every legal concern. This combination gives Phi-5 Mini the highest LLMCheck score (70) in the 8 GB tier, edging out Gemma 4 E2B and E4B for the first time.
```
# 8 GB Mac install (~2.6 GB on disk)
ollama run phi5:mini

# MLX
mlx_lm.generate --model mlx-community/Phi-5-Mini-4B-Instruct-4bit \
  --prompt "Explain transformers to me"
```

For users on 8 GB or 16 GB Macs — which remain the majority of Apple Silicon laptops in the wild — Phi-5 Mini is the new default recommendation for general-purpose chat, reasoning, and coding assistance. The combination of MIT licensing and 140 tok/s makes it ideal for embedded use cases too: shipping it inside a Mac app no longer means shipping a slow or weak model.
DeepSeek R2 enters the frontier
DeepSeek released R2 on May 7 and it is genuinely a frontier-class reasoning model with weights you can download. The architecture is 671 billion total parameters with 37 billion active per token — a direct successor to R1 that retains the MIT license and adds substantial improvements to math, multi-step reasoning, and test-time compute scaling.
The 91% AIME 2025 score is the most consequential number in the May release cycle. According to LLMCheck cross-reference with public closed-model benchmarks, DeepSeek R2 has surpassed GPT-5o (87%) and matches Claude 4.6 Opus on competition mathematics. This is the first time an MIT-licensed model has held the top spot on a frontier benchmark by a clear margin.
Test-time compute is the underlying mechanism. R2 scales reasoning quality with extra inference budget — doubling the thinking-token budget reliably improves AIME by 3–5 percentage points. This means R2 isn't a single fixed model; it's a family of speed-quality tradeoffs you control at inference time. For users on Mac Studio M4 Ultra hardware, this is a unique capability: you can dial the model up to "spend extra seconds, get a better answer" mode without any API call.
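In code, the dial reduces to a single inference-time parameter. A sketch with a hypothetical generate() stand-in for a local inference call; the real budget mechanism lives inside R2's serving stack:

```python
# Budget-tunable inference, as described for R2. generate() is a hypothetical
# placeholder (a real version would wrap llama.cpp or MLX bindings); the
# budget-to-quality numbers quoted in the text come from the article, not
# from this code.

def generate(prompt: str, thinking_budget: int) -> str:
    # A real implementation would cap the model's hidden chain-of-thought at
    # `thinking_budget` tokens before forcing a final answer.
    return f"[answer produced after up to {thinking_budget} thinking tokens]"

# The speed-quality dial: same model, same prompt, different budgets.
for budget in (1_000, 2_000, 4_000, 8_000):
    print(budget, generate("How many primes are below 100?", thinking_budget=budget))
```

The practical consequence is that "which model should I run?" becomes "how many seconds is this answer worth?"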
Mac viability is real but constrained: at 671B total parameters, even Q2_K quantization needs ~220 GB on disk and 128 GB+ of unified memory. The practical local setup is a Mac Studio M4 Ultra 192 GB running Q3_K_M (~12 tok/s with usable quality). Below that tier, R2 is server-class only. But because the license is MIT and the weights are open, anyone can rent a single H200 GPU and serve R2 to their team for less than a single Claude API subscription.
Why this matters strategically: MIT-licensed frontier reasoning means startups can ship products built on top-tier mathematical and analytical reasoning without API costs, rate limits, or per-token billing. The competitive moat that closed reasoning models held since OpenAI o1 has now been substantially eroded.
The full landscape — Open-Source Top 10 (May 9, 2026)
According to LLMCheck benchmarks across 109 standardized data points, here is the open-source leaderboard as of May 9, 2026. Score is the LLMCheck composite (capability + speed + accessibility + license, max 100). Mac Tier is the minimum unified memory needed to run Q4_K_M comfortably.
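For readers who want to sanity-check the composite, here is its shape as a weighted sum. The 40/25/20/15 split below is an illustrative assumption: the text names the four components but not their exact weighting.

```python
# LLMCheck-style composite: four components summing to a 100-point maximum.
# The weights are assumed for illustration, not LLMCheck's published split.
WEIGHTS = {"capability": 40, "speed": 25, "accessibility": 20, "license": 15}

def composite(fractions: dict) -> float:
    """Each component is a 0-1 fraction of its weight; result is out of 100."""
    return sum(WEIGHTS[k] * fractions[k] for k in WEIGHTS)

# A hypothetical entry, not any real model's component breakdown:
print(composite({"capability": 0.80, "speed": 0.70,
                 "accessibility": 0.65, "license": 1.0}))  # → 77.5
```

The structure explains why a server-only model like DeepSeek R2 can top a capability benchmark yet sit at #6: the accessibility component caps its composite.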
| Rank | Model | Family | Active Params | License | Mac Tier | Score |
|---|---|---|---|---|---|---|
| 1 | Qwen 4 Preview 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 73 |
| 2 | Phi-5 Mini | Microsoft | 4B | MIT | 8 GB | 70 |
| 3 | Qwen 3.6-35B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 69 |
| 4 | Gemma 4 E2B | Google | 2.3B | Gemma | 8 GB | 67 |
| 5 | Gemma 4 E4B | Google | 4B | Gemma | 8 GB | 66 |
| 6 | DeepSeek R2 | DeepSeek | 37B | MIT | Server | 66 |
| 7 | Qwen 3.5 9B | Alibaba | 9B | Apache 2.0 | 8 GB | 66 |
| 8 | Llama 5 8B | Meta | 8B | Llama 5 | 16 GB | 64 |
| 9 | Llama 5 Scout | Meta | 17B | Llama 5 | 64 GB | 62 |
| 10 | Mistral Voyage 24B | Mistral | 24B | Apache 2.0 | 24 GB | 60 |
Three observations from the table. First, Apache 2.0 and MIT account for 6 of the top 10 entries — permissive licensing has become a competitive requirement, not a nice-to-have. Second, five of the top 10 have 8 GB or 16 GB Mac tiers — a year ago that number was two. Third, three different families now place models in the top 5 (Alibaba, Microsoft, Google), suggesting the open-source ecosystem is healthier and more competitive than at any prior point.
5 things that changed in May 2026
1. MoE is now the default
Four of the top 10 open models — including the #1 Qwen 4 Preview and the frontier-class DeepSeek R2 — now use mixture-of-experts architectures. Six months ago that number was three. The reason is straightforward: MoE delivers better tok/s per gigabyte of unified memory on Apple Silicon, and Apple Silicon is where the bulk of local-LLM users actually run inference. Expect to see "dense" become a niche choice for specific quality-maximizing scenarios rather than the architectural default.
2. On-device reasoning models are real
Qwen 4 Preview's hybrid mode and DeepSeek R2's test-time compute scaling are both production-quality reasoning systems running on local hardware. Six months ago, "reasoning model" meant "OpenAI o1 via API" — today it means "ollama run qwen4". The category has bifurcated into auto-routing reasoners (Qwen 4) and budget-tunable reasoners (DeepSeek R2), each with different ergonomics.
3. MIT and Apache won the license war
Three years ago, the open-LLM ecosystem was dominated by custom licenses with usage restrictions (Llama 2, Falcon, etc.). Today, the highest-capability open models — Qwen 4, Phi-5, DeepSeek R2, Mistral Voyage — ship under MIT or Apache 2.0. According to LLMCheck data, models under permissive OSI-approved licenses have grown from 22% of the top 25 in May 2024 to 64% in May 2026.
4. Edge models doubled in capability
Phi-5 Mini at 4B parameters scores 82% MMLU. Qwen 2.5-13B scored 78% MMLU one year ago. The edge tier (3B–5B parameters) now meaningfully overlaps with what 13B models could do in Q1 2026, which fundamentally changes the economics of shipping LLMs inside consumer apps. A 4B model fits in 2.6 GB and runs at 140 tok/s; a 13B model needed 8 GB and ran at 60 tok/s. The capability cost has collapsed.
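The footprint numbers follow from quantization arithmetic: Q4_K_M averages roughly 4.5 bits per weight, plus runtime overhead. A sketch with an assumed 15% overhead factor for embeddings and buffers (a rough illustration, not a measured constant):

```python
# Back-of-envelope Q4_K_M memory estimate. The 4.5 bits/weight average and
# 15% overhead factor are assumptions for illustration.

def q4_footprint_gb(params_billions: float, overhead: float = 0.15) -> float:
    weights_gb = params_billions * (4.5 / 8)  # billions of weights -> GB
    return weights_gb * (1 + overhead)

print(f"4B  model: ~{q4_footprint_gb(4):.1f} GB")   # in line with the 2.6 GB cited
print(f"13B model: ~{q4_footprint_gb(13):.1f} GB")  # in line with the ~8 GB cited
```

The same formula is a quick way to check whether any newly announced model will fit a given Mac tier before downloading it.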
5. Apple Silicon is the de facto open-dev platform
HuggingFace model cards now include MLX install commands by default for major releases. Qwen 4 Preview, Phi-5 Mini, Llama 5, and Mistral Voyage all shipped with day-one MLX-converted weights on the mlx-community namespace. This was not the case in May 2025. The combination of unified memory, the MLX framework, and a large user base of developers running M-series Macs has created a soft monopoly: if a model can't run on Apple Silicon, it doesn't make it into mainstream open-LLM workflows.
By Mac tier — what to run TODAY (May 2026)
Updated recommendations as of May 9, 2026. All numbers are LLMCheck-measured tok/s at Q4_K_M unless noted. Pick by RAM tier:
| Mac RAM | Primary pick | Speed | Backup |
|---|---|---|---|
| 8 GB | Phi-5 Mini | 140 tok/s | Gemma 4 E2B (155 tok/s) |
| 16 GB | Llama 5 8B | 110 tok/s | Phi-5 Mini, Qwen 3.5 9B |
| 24 GB | Qwen 4 Preview 32B-A3B | 58 tok/s | Mistral Voyage 24B |
| 32–48 GB | Qwen 4 Preview + Voyage 24B | 65 tok/s | Hermes 4 70B |
| 64 GB | Llama 5 Scout | 38 tok/s | Everything below |
| 128 GB | DeepSeek R2 Q2 | ~8 tok/s | Llama 5 Scout, Voyage |
| M4 Ultra 192 GB | DeepSeek R2 Q3_K_M | ~12 tok/s | R2 + Qwen 4 simultaneously |
The 24 GB sweet spot has decisively shifted to Qwen 4 Preview. If you bought a base-tier 24 GB MacBook Pro in the past two years, you now have access to the most capable open-source model in existence at production speed. This is a meaningful change — in May 2025 the equivalent recommendation was Llama 3.3 70B at painful quantization, and the user experience was mediocre.
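The tier table collapses naturally into a small lookup helper. A sketch whose thresholds and picks come straight from the rows above:

```python
# RAM-tier picker mirroring the May 2026 recommendation table.
TIER_PICKS = [  # (minimum unified memory in GB, primary pick)
    (192, "DeepSeek R2 Q3_K_M"),
    (128, "DeepSeek R2 Q2"),
    (64,  "Llama 5 Scout"),
    (32,  "Qwen 4 Preview 32B-A3B + Mistral Voyage 24B"),
    (24,  "Qwen 4 Preview 32B-A3B"),
    (16,  "Llama 5 8B"),
    (8,   "Phi-5 Mini"),
]

def pick(ram_gb: int) -> str:
    for min_ram, model in TIER_PICKS:  # sorted high to low; first match wins
        if ram_gb >= min_ram:
            return model
    return "Phi-5 Mini"  # floor recommendation for sub-8 GB machines

print(pick(24))   # → Qwen 4 Preview 32B-A3B
print(pick(128))  # → DeepSeek R2 Q2
```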
What's coming next month
Speculative section — this is what LLMCheck is watching for in the next 30 days based on public roadmaps, leaked release notes, and credible community signals.
- Qwen 4 full release (non-preview). Alibaba's pattern since Qwen 2 has been Preview → full release within 4–8 weeks. Expect the production Qwen 4 32B-A3B in late May or early June, likely with improved instruction tuning and possibly extended context to 1M native (rather than the current 1M extended via YaRN).
- Mistral Voyage Pro 70B. Mistral's roadmap teased a "Voyage Pro" tier following the May 3 Voyage 24B release. A 70B-class dense Apache 2.0 model would slot into the 64 GB Mac tier and compete directly with Llama 5 Scout. Internal benchmarks have leaked at ~75% SWE-Verified.
- Apple MLX 1.0. Apple's MLX team has been hinting at a 1.0 release with stabilized API surface, optimized Metal kernels for MoE inference, and possibly first-party quantization tooling. If the hints land, expect a 15–25% tok/s lift on existing models from the same hardware — effectively a free speed upgrade.
- Llama 5 70B. Meta released the 8B and Scout 109B-A17B simultaneously but conspicuously held back a 70B dense variant. Community speculation suggests a June release, which would directly target the Mac Studio tier and compete with DeepSeek R2 on reasoning while keeping costs much lower.
Watch list: The single most consequential possible release is a permissively-licensed reasoning model that fits in 24 GB. Qwen 4 Preview already leans this way with its hybrid mode. If a competitor ships a true Q4 reasoning model in the 32B-A3B class with explicit test-time compute control, the local-LLM landscape changes again.