The 30-Day Recap (TL;DR)
Every major open-weights release between April 9 and May 9, 2026, with a one-line takeaway. Six of these seven releases now appear in the LLMCheck top 25; they account for six of the top 10 slots.
- Qwen 4 Preview 32B-A3B May 5 — Apache 2.0. New #1 open model, score 73, hybrid reasoning mode, 76% SWE-Verified.
- Llama 5 8B + Llama 5 Scout 109B-A17B May 1 — Meta's open comeback. 256K context, multimodal (vision + audio) on Scout.
- Phi-5 Mini 4B Apr 28 — Microsoft's small-model king. MIT, 140 tok/s, 82% MMLU at 4B parameters.
- DeepSeek R2 671B-A37B May 7 — MIT-licensed frontier reasoning. 91% AIME, beats GPT-5o on math.
- Mistral Voyage 24B May 3 — Apache 2.0, dense balanced workhorse. 32K context, strong tool use.
- Hermes 4 70B May 6 — Nous Research's Llama 5 finetune with system-prompt control and reduced refusals.
- SmolLM3 3B Apr 30 — HuggingFace's open recipe edge model with full training data published.
Headline story: Open source no longer trails by 6–9 months. As of May 2026, the gap to GPT-5o is measured in single-digit percentage points on most benchmarks — and DeepSeek R2 has actually surpassed it on AIME and competition math. The center of gravity in AI development is shifting back to weights you can host yourself.
The new #1: Qwen 4 Preview 32B-A3B
Alibaba dropped Qwen 4 Preview on May 5 with surgical precision: same 3B active-parameter MoE shape that made Qwen 3.6 the runaway hit of April, but trained on 2 trillion more tokens with a knowledge cutoff of April 2026. The result is the most capable open model that still runs comfortably on a 24 GB Mac.
The hybrid reasoning mode is the headline feature. Qwen 3.6 had a manual /think toggle that users had to remember to flip for hard prompts. Qwen 4 Preview replaces this with an automatic classifier that detects reasoning-shaped prompts and engages internal chain-of-thought before answering. In LLMCheck testing, the auto-router correctly engages thinking mode on 94% of math, code-debugging, and multi-step planning prompts — while keeping non-reasoning queries at full speed.
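The classifier itself is internal to the model, but the behavior it automates is easy to picture client-side. A minimal sketch of the idea, where the patterns and routing logic are illustrative assumptions rather than Qwen's actual classifier:

```python
import re

# Client-side approximation: detect reasoning-shaped prompts and engage
# thinking mode only for those. These patterns are invented for illustration;
# Qwen's real classifier is model-internal and learned.
REASONING_PATTERNS = [
    r"\bprove\b", r"\bsolve\b", r"\bdebug\b", r"\bstep[- ]by[- ]step\b",
    r"\bwhy does\b", r"\bplan\b", r"\d+\s*[-+*/^]\s*\d+",
]

def needs_thinking(prompt: str) -> bool:
    """True if the prompt looks like it benefits from chain-of-thought."""
    text = prompt.lower()
    return any(re.search(p, text) for p in REASONING_PATTERNS)

def route(prompt: str) -> str:
    # Automates the Qwen 3.6-era manual toggle: prepend /think only when needed.
    return ("/think " + prompt) if needs_thinking(prompt) else prompt

print(route("Debug this off-by-one error in my loop"))
print(route("What's the capital of France?"))
```

The point of the auto-router is exactly this shape: non-reasoning queries pass through untouched and keep full decode speed.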
Mac performance is the second story. According to LLMCheck benchmarks, the Q4_K_M quantization measures 78 tok/s on M5 Max 128 GB, 65 tok/s on M5 Max 64 GB, and 58 tok/s on M4 Pro 24 GB — the first 32B-class model that hits production-usable speeds on the entry-tier MacBook Pro. The MoE structure (only 3B active per token) is doing the heavy lifting; from a memory-bandwidth perspective, Qwen 4 Preview behaves like a 3B dense model while delivering 32B-class quality.
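Some back-of-envelope arithmetic shows why. Decoding is roughly memory-bandwidth bound: each generated token streams the active weights through the memory bus once, so active parameter count sets the ceiling. A sketch using Apple's published M4 Pro bandwidth and an approximate Q4_K_M bits-per-weight figure (both assumptions for scale, not LLMCheck measurements):

```python
# Why a 3B-active MoE decodes like a 3B dense model: per-token cost is set
# by how many bytes of weights must be read, not by total parameter count.
active_params = 3e9          # Qwen 4 Preview: 3B active per token (32B total)
bytes_per_weight = 4.5 / 8   # ~Q4_K_M average (about 4.5 bits per weight)
bandwidth_gbs = 273          # Apple's published M4 Pro memory bandwidth, GB/s

bytes_per_token = active_params * bytes_per_weight     # ~1.7 GB per token
ceiling_toks = bandwidth_gbs * 1e9 / bytes_per_token   # theoretical maximum

print(f"~{ceiling_toks:.0f} tok/s bandwidth ceiling")
# The measured 58 tok/s sits well under this ceiling; attention, KV-cache
# reads, and router overhead consume the remaining budget.
```

Run the same arithmetic on a dense 32B model (all 32B weights read per token) and the ceiling drops by roughly 10x, which is the whole MoE argument in one line.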
License is fully Apache 2.0 — commercial use is unambiguously permitted, no MAU caps, no field-of-use restrictions. This puts Qwen 4 Preview in the same legal posture as the Linux kernel for downstream products. Combined with the 76% SWE-Verified score (within 4 points of GPT-5o), the practical implication is that startups can now ship coding agents on locally-hosted weights with no API spend and no licensing risk.
Install via the standard channels:
```
# Ollama (one-line install, ~20 GB download)
ollama run qwen4:32b-a3b

# MLX (fastest on Apple Silicon)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen4-32B-A3B-4bit \
  --prompt "Refactor this function for clarity..."

# LM Studio: search "Qwen 4 Preview 32B A3B" in the Discover tab
```

Verdict: Qwen 4 Preview is the new default recommendation for any Mac with 24 GB RAM or more. It replaces Qwen 3.6-35B-A3B at the top of the LLMCheck leaderboard and is the strongest open-weights coding model that runs on consumer hardware as of May 9, 2026.
Llama 5 — Meta is back in the open game
After a quiet six months following Llama 4 Maverick, Meta shipped Llama 5 on May 1 with two simultaneous releases: a dense Llama 5 8B aimed at the mainstream Mac tier, and a Scout 109B-A17B MoE that targets developer workstations. Both ship with full multimodal capability and the new Llama 5 Community License.
Llama 5 8B (dense)
The 8B dense model is a meaningful step up from Llama 3.3 8B and a clear competitor to Qwen 3.5 9B in the 16 GB Mac tier. According to LLMCheck benchmarks, it measures ~110 tok/s on M5 Max via MLX, scores 78% on MMLU and 80% on HumanEval, and uses approximately 6.5 GB of RAM at Q4_K_M. Where Qwen 3.5 9B leans coding, Llama 5 8B is broader — stronger on creative writing, multilingual chat, and instruction following.
Llama 5 Scout 109B-A17B (MoE)
Scout is the more interesting release. It is a 109-billion-parameter MoE with 17 billion active parameters per token, which puts it in the "edge-optimized frontier" niche — too large for 24 GB Macs but comfortable on 64 GB and up. The architecture borrows from Llama 4 Maverick but with a denser router and what Meta calls "activation pruning," a technique that skips experts whose contribution falls below a learned threshold. The result is roughly 38 tok/s on M5 Max 64 GB at Q4_K_M.
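Meta has not published implementation details for activation pruning, so the following is only a minimal sketch of the described idea: standard top-k routing followed by a threshold cut. The expert count, top_k, and threshold value are invented for illustration; in Scout the threshold is learned.

```python
import math, random

# Sketch of "activation pruning": after top-k routing, skip any selected
# expert whose router weight falls below a threshold. All constants here
# are illustrative assumptions, not Meta's actual values.
random.seed(0)
num_experts, top_k, threshold = 16, 4, 0.10

logits = [random.gauss(0, 1) for _ in range(num_experts)]
z = sum(math.exp(x) for x in logits)
weights = [math.exp(x) / z for x in logits]            # softmax over experts

top = sorted(range(num_experts), key=lambda i: weights[i])[-top_k:]
active = [i for i in top if weights[i] >= threshold]   # the pruning step

print(f"routed to {top_k} experts, kept {len(active)} after pruning")
```

Every pruned expert is a skipped matrix multiply, which is where the tok/s gain over a vanilla top-k MoE would come from.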
Scout supports a full 256K-token context window and is the first open Llama with native multimodal input across text, image, and audio. The audio path was previously exclusive to closed models like Gemini 2.5 Flash. According to LLMCheck testing, Scout transcribes 30-second audio clips with ~94% word-level accuracy, which puts it in the same band as Whisper Large v3 while also producing reasoned answers about the content.
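Word-level accuracy here is the complement of word error rate (WER), the standard transcription metric computed from word-level edit distance. A self-contained sketch of the metric itself; the sample strings are invented:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over the lazy"
print(f"word accuracy: {1 - word_error_rate(ref, hyp):.0%}")  # prints: word accuracy: 78%
```

A reported ~94% word-level accuracy corresponds to a WER of about 0.06 under this definition.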
Llama 5 Community License — the asterisk
Both models ship under the Llama 5 Community License, Meta's house license. It is permissive for the vast majority of users but carries the same 700-million-monthly-active-user clause that gated Llama 3 and 4. Companies above that threshold must request a separate commercial license. For 99.9% of developers, the license is functionally Apache 2.0; for hyperscalers, it is a hard wall. The Open Source Initiative continues to decline to classify Llama as "open source" in the strict OSI sense.
```
# Install both Llama 5 variants
ollama run llama5:8b      # ~6.5 GB, dense
ollama run llama5:scout   # ~62 GB, MoE 17B-active

# MLX equivalents
mlx_lm.generate --model mlx-community/Llama-5-8B-Instruct-4bit
mlx_lm.generate --model mlx-community/Llama-5-Scout-109B-A17B-4bit
```

Strategically, the Scout release is Meta's clearest statement yet that it sees on-device AI as a category. Audio + vision + 256K context on weights you can host locally is the kind of capability that closed APIs charge per-token for. For Mac power users with 64 GB+ machines, Scout instantly becomes one of the most interesting models to experiment with this month.
Phi-5 Mini owns the 8 GB tier
Microsoft shipped Phi-5 Mini on April 28 and it has quietly become the most-downloaded model on Ollama's library this month. The pitch is simple: a 4-billion-parameter dense model under MIT license that beats every model under 10B parameters on MMLU and AIME. According to LLMCheck benchmarks, Phi-5 Mini scores 82% MMLU, 61% AIME 2025, and 86% HumanEval at just 4B parameters — numbers that 13B models from Q1 2026 could not reach.
The "phi recipe" — Microsoft's deliberately curated synthetic-data training pipeline — has scaled cleanly from Phi-3 to Phi-5. The team is now generating training data from frontier models, filtering aggressively for textbook-quality samples, and using that to train smaller students. The trick that previously hit a quality ceiling around 7B parameters now appears to scale through 4B with no degradation, mostly because the synthetic data pipeline itself has gotten dramatically better as the teacher models have improved.
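The actual phi pipeline is not public, so the following sketches only the shape described above: generate candidates with a teacher, score them for "textbook quality," keep the top slice. The score_quality() heuristic is a hypothetical stand-in, not Microsoft's real filter:

```python
# Hypothetical generate-score-filter loop. Nothing here reflects Microsoft's
# actual pipeline; it illustrates the described shape only.

def score_quality(sample: str) -> float:
    words = sample.split()
    if not words:
        return 0.0
    # Reward longer, explanatory, self-contained prose (purely illustrative).
    explains = any(w.strip(",.") in ("because", "therefore", "thus") for w in words)
    return min(len(words) / 10, 1.0) * (1.5 if explains else 1.0)

candidates = [
    "x = 5",
    "We add the two equations because the y terms cancel, therefore x = 5.",
]
kept = [c for c in candidates if score_quality(c) > 0.5]
print(f"kept {len(kept)} of {len(candidates)} samples")
```

The aggressive-filtering claim amounts to running a loop like this at enormous scale, with a frontier model on both the generation and scoring ends.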
Mac performance is exceptional. Phi-5 Mini runs at ~140 tok/s on M5 Max, ~115 tok/s on M4 Max, and ~85 tok/s on M2 Air 16 GB — meaning it is the first model that's both genuinely capable and genuinely fast on entry-tier MacBook Air hardware. RAM footprint at Q4_K_M is approximately 2.6 GB, which leaves plenty of headroom for the OS and other apps even on 8 GB Macs.
Native context is 256K with full attention (no sliding-window compression), and the MIT license removes every legal concern. This combination gives Phi-5 Mini the highest LLMCheck score (70) in the 8 GB tier, edging out Gemma 4 E2B and E4B for the first time.
```
# 8 GB Mac install (~2.6 GB on disk)
ollama run phi5:mini

# MLX
mlx_lm.generate --model mlx-community/Phi-5-Mini-4B-Instruct-4bit \
  --prompt "Explain transformers to me"
```

For users on 8 GB or 16 GB Macs — which remain the majority of Apple Silicon laptops in the wild — Phi-5 Mini is the new default recommendation for general-purpose chat, reasoning, and coding assistance. The combination of MIT licensing and 140 tok/s makes it ideal for embedded use cases too: shipping it inside a Mac app no longer means shipping a slow or weak model.
DeepSeek R2 enters the frontier
DeepSeek released R2 on May 7 and it is genuinely a frontier-class reasoning model with weights you can download. The architecture is 671 billion total parameters with 37 billion active per token — a direct successor to R1 that retains the MIT license and adds substantial improvements to math, multi-step reasoning, and test-time compute scaling.
The 91% AIME 2025 score is the most consequential number in the May release cycle. According to LLMCheck cross-reference with public closed-model benchmarks, DeepSeek R2 has surpassed GPT-5o (87%) and matches Claude 4.6 Opus on competition mathematics. This is the first time an MIT-licensed model has held the top spot on a frontier benchmark by a clear margin.
Test-time compute is the underlying mechanism. R2 scales reasoning quality with extra inference budget — doubling the thinking-token budget reliably improves AIME by 3–5 percentage points. This means R2 isn't a single fixed model; it's a family of speed-quality tradeoffs you control at inference time. For users on Mac Studio M4 Ultra hardware, this is a unique capability: you can dial the model up to "spend extra seconds, get a better answer" mode without any API call.
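In code, the dial reduces to a single inference-time parameter. A sketch with a hypothetical generate() stand-in for a local inference call; the real budget mechanism lives inside R2's serving stack:

```python
# Budget-tunable inference, as described for R2. generate() is a hypothetical
# placeholder (a real version would wrap llama.cpp or MLX bindings); the
# budget-to-quality numbers quoted in the text come from the article, not
# from this code.

def generate(prompt: str, thinking_budget: int) -> str:
    # A real implementation would cap the model's hidden chain-of-thought at
    # `thinking_budget` tokens before forcing a final answer.
    return f"[answer produced after up to {thinking_budget} thinking tokens]"

# The speed-quality dial: same model, same prompt, different budgets.
for budget in (1_000, 2_000, 4_000, 8_000):
    print(budget, generate("How many primes are below 100?", thinking_budget=budget))
```

The practical consequence is that "which model should I run?" becomes "how many seconds is this answer worth?"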
Mac viability is real but constrained: at 671B total parameters, even Q2_K quantization needs ~220 GB on disk and 128 GB+ of unified memory. The practical local setup is a Mac Studio M4 Ultra 192 GB running Q3_K_M (~12 tok/s with usable quality). Below that tier, R2 is server-class only. But because the license is MIT and the weights are open, anyone can rent a single H200 GPU and serve R2 to their team for less than a single Claude API subscription.
Why this matters strategically: MIT-licensed frontier reasoning means startups can ship products built on top-tier mathematical and analytical reasoning without API costs, rate limits, or per-token billing. The competitive moat that closed reasoning models held since OpenAI o1 has now been substantially eroded.
The full landscape — Open-Source Top 10 (May 9, 2026)
According to LLMCheck benchmarks across 109 standardized data points, here is the open-source leaderboard as of May 9, 2026. Score is the LLMCheck composite (capability + speed + accessibility + license, max 100). Mac Tier is the minimum unified memory needed to run Q4_K_M comfortably.
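For readers who want to sanity-check the composite, here is its shape as a weighted sum. The 40/25/20/15 split below is an illustrative assumption: the text names the four components but not their exact weighting.

```python
# LLMCheck-style composite: four components summing to a 100-point maximum.
# The weights are assumed for illustration, not LLMCheck's published split.
WEIGHTS = {"capability": 40, "speed": 25, "accessibility": 20, "license": 15}

def composite(fractions: dict) -> float:
    """Each component is a 0-1 fraction of its weight; result is out of 100."""
    return sum(WEIGHTS[k] * fractions[k] for k in WEIGHTS)

# A hypothetical entry, not any real model's component breakdown:
print(composite({"capability": 0.80, "speed": 0.70,
                 "accessibility": 0.65, "license": 1.0}))  # → 77.5
```

The structure explains why a server-only model like DeepSeek R2 can top a capability benchmark yet sit at #6: the accessibility component caps its composite.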
| Rank | Model | Family | Active Params | License | Mac Tier | Score |
|---|---|---|---|---|---|---|
| 1 | Qwen 4 Preview 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 73 |
| 2 | Phi-5 Mini | Microsoft | 4B | MIT | 8 GB | 70 |
| 3 | Qwen 3.6-35B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 69 |
| 4 | Gemma 4 E2B | Google | 2.3B | Gemma | 8 GB | 67 |
| 5 | Gemma 4 E4B | Google | 4B | Gemma | 8 GB | 66 |
| 6 | DeepSeek R2 | DeepSeek | 37B | MIT | Server | 66 |
| 7 | Qwen 3.5 9B | Alibaba | 9B | Apache 2.0 | 8 GB | 66 |
| 8 | Llama 5 8B | Meta | 8B | Llama 5 | 16 GB | 64 |
| 9 | Llama 5 Scout | Meta | 17B | Llama 5 | 64 GB | 62 |
| 10 | Mistral Voyage 24B | Mistral | 24B | Apache 2.0 | 24 GB | 60 |
Three observations from the table. First, Apache 2.0 and MIT account for 6 of the top 10 entries — permissive licensing has become a competitive requirement, not a nice-to-have. Second, five of the top 10 have 8 GB or 16 GB Mac tiers — a year ago that number was two. Third, three different families now place models in the top 5 (Alibaba, Microsoft, Google), suggesting the open-source ecosystem is healthier and more competitive than at any prior point.
5 things that changed in May 2026
1. MoE is now the default
Four of the top 10 open models — including the #1 Qwen 4 Preview and the frontier-class DeepSeek R2 — now use mixture-of-experts architectures. Six months ago that number was three. The reason is straightforward: MoE delivers better tok/s per gigabyte of unified memory on Apple Silicon, and Apple Silicon is where the bulk of local-LLM users actually run inference. Expect to see "dense" become a niche choice for specific quality-maximizing scenarios rather than the architectural default.
2. On-device reasoning models are real
Qwen 4 Preview's hybrid mode and DeepSeek R2's test-time compute scaling are both production-quality reasoning systems running on local hardware. Six months ago, "reasoning model" meant "OpenAI o1 via API" — today it means "ollama run qwen4". The category has bifurcated into auto-routing reasoners (Qwen 4) and budget-tunable reasoners (DeepSeek R2), each with different ergonomics.
3. MIT and Apache won the license war
Three years ago, the open-LLM ecosystem was dominated by custom licenses with usage restrictions (Llama 2, Falcon, etc.). Today, the highest-capability open models — Qwen 4, Phi-5, DeepSeek R2, Mistral Voyage — ship under MIT or Apache 2.0. According to LLMCheck data, models under permissive OSI-approved licenses have grown from 22% of the top 25 in May 2024 to 64% in May 2026.
4. Edge models doubled in capability
Phi-5 Mini at 4B parameters scores 82% MMLU. Qwen 2.5-13B scored 78% MMLU one year ago. The edge tier (3B–5B parameters) now meaningfully overlaps with what 13B models could do in Q1 2026, which fundamentally changes the economics of shipping LLMs inside consumer apps. A 4B model fits in 2.6 GB and runs at 140 tok/s; a 13B model needed 8 GB and ran at 60 tok/s. The capability cost has collapsed.
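The footprint numbers follow from quantization arithmetic: Q4_K_M averages roughly 4.5 bits per weight, plus runtime overhead. A sketch with an assumed 15% overhead factor for embeddings and buffers (a rough illustration, not a measured constant):

```python
# Back-of-envelope Q4_K_M memory estimate. The 4.5 bits/weight average and
# 15% overhead factor are assumptions for illustration.

def q4_footprint_gb(params_billions: float, overhead: float = 0.15) -> float:
    weights_gb = params_billions * (4.5 / 8)  # billions of weights -> GB
    return weights_gb * (1 + overhead)

print(f"4B  model: ~{q4_footprint_gb(4):.1f} GB")   # in line with the 2.6 GB cited
print(f"13B model: ~{q4_footprint_gb(13):.1f} GB")  # in line with the ~8 GB cited
```

The same formula is a quick way to check whether any newly announced model will fit a given Mac tier before downloading it.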
5. Apple Silicon is the de facto open-dev platform
HuggingFace model cards now include MLX install commands by default for major releases. Qwen 4 Preview, Phi-5 Mini, Llama 5, and Mistral Voyage all shipped with day-one MLX-converted weights on the mlx-community namespace. This was not the case in May 2025. The combination of unified memory, the MLX framework, and a large user base of developers running M-series Macs has created a soft monopoly: if a model can't run on Apple Silicon, it doesn't make it into mainstream open-LLM workflows.
By Mac tier — what to run TODAY (May 2026)
Updated recommendations as of May 9, 2026. All numbers are LLMCheck-measured tok/s at Q4_K_M unless noted. Pick by RAM tier:
| Mac RAM | Primary pick | Speed | Backup |
|---|---|---|---|
| 8 GB | Phi-5 Mini | 140 tok/s | Gemma 4 E2B (155 tok/s) |
| 16 GB | Llama 5 8B | 110 tok/s | Phi-5 Mini, Qwen 3.5 9B |
| 24 GB | Qwen 4 Preview 32B-A3B | 58 tok/s | Mistral Voyage 24B |
| 32–48 GB | Qwen 4 Preview + Voyage 24B | 65 tok/s | Hermes 4 70B |
| 64 GB | Llama 5 Scout | 38 tok/s | Everything below |
| 128 GB | DeepSeek R2 Q2 | ~8 tok/s | Llama 5 Scout, Voyage |
| M4 Ultra 192 GB | DeepSeek R2 Q3_K_M | ~12 tok/s | R2 + Qwen 4 simultaneously |
The 24 GB sweet spot has decisively shifted to Qwen 4 Preview. If you bought a base-tier 24 GB MacBook Pro in the past two years, you now have access to the most capable open-source model in existence at production speed. This is a meaningful change — in May 2025 the equivalent recommendation was Llama 3.3 70B at painful quantization, and the user experience was mediocre.
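The tier table collapses naturally into a small lookup helper. A sketch whose thresholds and picks come straight from the rows above:

```python
# RAM-tier picker mirroring the May 2026 recommendation table.
TIER_PICKS = [  # (minimum unified memory in GB, primary pick)
    (192, "DeepSeek R2 Q3_K_M"),
    (128, "DeepSeek R2 Q2"),
    (64,  "Llama 5 Scout"),
    (32,  "Qwen 4 Preview 32B-A3B + Mistral Voyage 24B"),
    (24,  "Qwen 4 Preview 32B-A3B"),
    (16,  "Llama 5 8B"),
    (8,   "Phi-5 Mini"),
]

def pick(ram_gb: int) -> str:
    for min_ram, model in TIER_PICKS:  # sorted high to low; first match wins
        if ram_gb >= min_ram:
            return model
    return "Phi-5 Mini"  # floor recommendation for sub-8 GB machines

print(pick(24))   # → Qwen 4 Preview 32B-A3B
print(pick(128))  # → DeepSeek R2 Q2
```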
What's coming next month
Speculative section — this is what LLMCheck is watching for in the next 30 days based on public roadmaps, leaked release notes, and credible community signals.
- Qwen 4 full release (non-preview). Alibaba's pattern since Qwen 2 has been Preview → full release within 4–8 weeks. Expect the production Qwen 4 32B-A3B in late May or early June, likely with improved instruction tuning and possibly extended context to 1M native (rather than the current 1M extended via YaRN).
- Mistral Voyage Pro 70B. Mistral's roadmap teased a "Voyage Pro" tier following the May 3 Voyage 24B release. A 70B-class dense Apache 2.0 model would slot into the 64 GB Mac tier and compete directly with Llama 5 Scout. Internal benchmarks have leaked at ~75% SWE-Verified.
- Apple MLX 1.0. Apple's MLX team has been hinting at a 1.0 release with stabilized API surface, optimized Metal kernels for MoE inference, and possibly first-party quantization tooling. If the hints land, expect a 15–25% tok/s lift on existing models from the same hardware — effectively a free speed upgrade.
- Llama 5 70B. Meta released the 8B and Scout 109B-A17B simultaneously but conspicuously held back a 70B dense variant. Community speculation suggests a June release, which would directly target the Mac Studio tier and compete with DeepSeek R2 on reasoning while keeping costs much lower.
Watch list: The single most consequential possible release is a permissively-licensed reasoning model that fits in 24 GB. Qwen 4 Preview already leans this way with its hybrid mode. If a competitor ships a true Q4 reasoning model in the 32B-A3B class with explicit test-time compute control, the local-LLM landscape changes again.