The 30-Day Recap (TL;DR)
Every major open-weights release between May 9 and June 6, 2026, with a one-line takeaway. Eight separate flagship drops in 28 days — the highest-velocity month on record for the open-source local LLM ecosystem.
- Qwen 4 (full release) Jun 1 — Apache 2.0, NEW #1 open model, score 75, +2pp across every benchmark vs. the Preview.
- Qwen 4 Coder 32B-A3B Jun 2 — Apache 2.0, 82% SWE-Verified, the best open-source Mac coder — beats Devstral.
- Qwen 4 4B Jun 3 — Apache 2.0, beats Phi-5 Mini in the 8 GB tier at 135 tok/s on M5 Max.
- Llama 5 70B Jun 4 — Meta's bigger dense, MMLU 88%, fills the Scout/8B gap.
- Mistral Voyage Pro 70B Jun 4 — Apache 2.0 70B dense, agentic tool use is strong, the new license-friendly 70B.
- Gemma 4.5 12B Jun 2 — Google refresh, 1M context jump (from 256K), improved multimodal.
- Phi-5 Medium 14B May 30 — MIT, MMLU 86%, AIME 75%, tops the 14B tier outright.
- Grok 4 Open 100B-A20B Jun 5 — xAI's FIRST open weights ever, custom license, ~32 tok/s on M5 Max 64 GB.
Headline story: June 2026 finished what May started. Frontier-class open-weights now exist at every meaningful parameter count — 4B (Qwen 4), 14B (Phi-5 Medium), 32B-A3B (Qwen 4), 70B dense (Llama 5 + Voyage Pro), and 100B+ MoE (Grok 4 Open). For the first time, the open ecosystem fully spans every Mac tier from 8 GB to 192 GB without quality gaps.
Qwen 4 Goes Live — the new #1
Alibaba shipped the production Qwen 4 32B-A3B on June 1, exactly four weeks after the Preview that defined May's leaderboard. The pattern Alibaba has run since Qwen 2 held: Preview to full release in 4–8 weeks, same architecture, modestly improved benchmarks, vastly improved instruction tuning. This release is no exception — every public benchmark improved by roughly 2 percentage points, and the LLMCheck Score climbed from 73 to 75.
The +2pp improvement is bigger than it sounds. In benchmark land, 76% to 78% on SWE-Verified is the difference between "solid open coding model" and "competitive with closed frontier." According to LLMCheck cross-reference data, GPT-5o scores 80% on the same benchmark — meaning the full Qwen 4 release closes the open-vs-closed coding gap to just 2 percentage points. On MMLU and HumanEval the gap is statistically indistinguishable.
Native 1M context is the architectural change. The Preview shipped with 262K native context extended to 1M via YaRN; the full release ships with 1M token native context as the default. According to LLMCheck long-context evaluations, retrieval accuracy at 800K tokens improved from 71% (Preview, YaRN-extended) to 89% (full release, native). For long-document analysis, codebase-wide refactoring, and book-length reasoning, the production model is the first open weight that handles a million tokens without quality collapse.
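A retrieval figure like the one above usually comes from some form of needle-in-a-haystack test. The sketch below is a generic version of that idea, not LLMCheck's actual harness. `ask` is a hypothetical callable wrapping whatever local model you run, the filler and needle sentences are made up for illustration, and length is counted in filler sentences rather than tokens.

```python
from typing import Callable, Sequence


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat a filler sentence and bury the needle at a relative depth in [0, 1]."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def retrieval_accuracy(
    ask: Callable[[str], str],  # hypothetical: prompt in, model answer out
    n_sentences: int,
    depths: Sequence[float],
) -> float:
    """Fraction of depths at which the model's answer contains the hidden value."""
    hits = 0
    for i, depth in enumerate(depths):
        secret = f"code-{i:04d}"  # a fresh value per trial, so it cannot be memorized
        document = build_haystack(
            "The sky was grey that day.",
            f"The vault password is {secret}.",
            n_sentences,
            depth,
        )
        answer = ask(document + "\n\nWhat is the vault password?")
        hits += secret in answer
    return hits / len(depths)
```

Sweeping `depths` from 0.0 to 1.0 at a fixed document length shows whether accuracy drops for facts buried in the middle of the context, which is where extended-context models tend to struggle first.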
Mac speeds improved too. The MLX team shipped optimized Metal kernels alongside the production release. Q4_K_M now measures 80 tok/s on M5 Max 128 GB (up from 78), 67 tok/s on M5 Max 64 GB (up from 65), and 60 tok/s on M4 Pro 24 GB (up from 58). The gains come from better expert-routing fusion and a tuned KV-cache layout. For users running Ollama, the same speedup arrives via llama.cpp version 0.5.3 or later.
Install via the standard channels:
# Ollama (one-line install, ~20 GB download)
ollama run qwen4:32b-a3b
# MLX (fastest on Apple Silicon)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen4-32B-A3B-4bit \
--prompt "Refactor this function for clarity..."
# LM Studio: search "Qwen 4 32B A3B" in Discover tab
Verdict: The full Qwen 4 32B-A3B is the new default recommendation for any Mac with 24 GB RAM or more. It replaces the Preview at #1 on the LLMCheck leaderboard with a clean +2 point margin and is the highest-scoring open-weights model that runs on consumer hardware as of June 6, 2026.
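To check throughput figures like these on your own machine, one option is to read the timing fields that Ollama's local REST API returns with each non-streaming response. The sketch below assumes an Ollama server on its default port and uses the model tag from the install command above; the helper names are illustrative, and this is not how LLMCheck measures speed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(response: dict) -> float:
    """Decode speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> float:
    """Send one prompt and return the measured generation speed in tok/s."""
    with urllib.request.urlopen(build_request(model, prompt)) as reply:
        return tokens_per_second(json.load(reply))
```

With the server running, `measure("qwen4:32b-a3b", "Explain KV caching in two sentences.")` returns a single-run figure. Averaging several runs after a warm-up prompt gives a steadier number.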
Qwen 4 Coder — the top open-source Mac coder
One day after the full Qwen 4 release, Alibaba shipped Qwen 4 Coder 32B-A3B — the same MoE architecture, post-trained on roughly 4 trillion additional tokens of code, build logs, agentic trajectories, and curated PR reviews. The result is the single most important coding model release of 2026 so far. According to LLMCheck benchmarks, Qwen 4 Coder scores 82% on SWE-Verified, which beats every Mac-runnable open model and matches Claude 4.5 Sonnet's published score within margin.
The license is the story. Devstral, the previous open-source Mac coding leader, ships under a custom Mistral commercial license with a $1M-revenue commercial clause. Codestral 22B has similar restrictions. Qwen 4 Coder ships under full Apache 2.0 — meaning startups can finally ship coding agents and IDE plugins on locally-hosted weights with no licensing risk and no per-token API cost. This is the first time a permissively-licensed open-weights coding model has been the SWE-Verified leader on Mac-runnable hardware.
Agentic tool use is the second story. Qwen 4 Coder was specifically post-trained on multi-step coding trajectories — read file, edit file, run tests, observe output, iterate. According to LLMCheck agentic-SWE testing, the model successfully completes 61% of multi-turn coding tasks (compared to 48% for the base Qwen 4 and 52% for Devstral). For users building autonomous coding agents on Mac hardware, this is now the default model.
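The read, edit, test, observe, iterate cycle described above can be sketched as a small control loop. All three callables are hypothetical placeholders: `propose_patch` stands in for a call to the model, while `apply_patch` and `run_tests` stand in for whatever tooling the agent is given. This illustrates the loop shape only, not Qwen 4 Coder's training setup or LLMCheck's harness.

```python
from typing import Callable, Optional, Tuple


def agent_loop(
    task: str,
    propose_patch: Callable[[str], str],       # hypothetical: feedback in, patch out
    apply_patch: Callable[[str], None],        # hypothetical: writes the edit to disk
    run_tests: Callable[[], Tuple[bool, str]], # hypothetical: returns (passed, output)
    max_steps: int = 5,
) -> Optional[int]:
    """Return the step at which tests passed, or None if the budget ran out."""
    feedback = task
    for step in range(1, max_steps + 1):
        patch = propose_patch(feedback)  # model reads the context and proposes an edit
        apply_patch(patch)               # edit file
        passed, output = run_tests()     # run tests
        if passed:
            return step
        feedback = output                # observe output, then iterate
    return None
```

Capping the loop with `max_steps` matters in practice: a multi-turn task counts as failed once the budget is spent, which is one way a completion rate like the one quoted above can be defined.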
Install via the standard channels:
# Ollama (~19 GB download)
ollama run qwen4:coder
# MLX
mlx_lm.generate --model mlx-community/Qwen4-Coder-32B-A3B-4bit \
--prompt "Read this codebase and add OAuth flow"
# LM Studio: search "Qwen 4 Coder" in Discover tab
Llama 5 70B vs Mistral Voyage Pro 70B: the 70B race
The single most consequential June 2026 storyline is that the 70B-dense tier is suddenly competitive again. Meta shipped Llama 5 70B on June 4, and the same day Mistral countered with Voyage Pro 70B under Apache 2.0. Both are dense (no MoE), both target the 64 GB+ Mac tier, both are real frontier-adjacent models, and they trade blows on different axes.
| | Llama 5 70B | Mistral Voyage Pro 70B |
|---|---|---|
| Total params | 70B dense | 70B dense |
| Context | 256K | 512K |
| License | Llama 5 Community (700M MAU cap) | Apache 2.0 |
| MMLU | 88% | 85% |
| HumanEval | 86% | 84% |
| SWE-Verified | 65% | 68% |
| Agentic SWE | 54% | 58% |
| Tool-use accuracy | 87% | 91% |
| M5 Max 128 GB Q4 | ~18 tok/s | ~20 tok/s |
| M4 Ultra 192 GB Q4 | ~22 tok/s | ~20 tok/s |
| Multimodal | Text + image + audio | Text only |
| LLMCheck Score | 64 | 63 |
Llama 5 70B wins on raw reasoning. Meta's training run includes a larger and cleaner reasoning corpus, and it shows in the MMLU (88%) and HumanEval (86%) numbers. It also retains the multimodal capability Meta introduced with Llama 5 Scout — the 70B accepts image and audio inputs natively, which neither Voyage Pro nor Qwen 4 can do. For pure capability benchmarks, Llama 5 70B is the strongest dense open model in existence as of June 6, 2026.
Mistral Voyage Pro 70B wins on license and agentic. The Apache 2.0 license is the entire pitch — no MAU cap, no field-of-use restrictions, no separate commercial license to negotiate. Combined with the 91% tool-use accuracy and 58% Agentic SWE score, Voyage Pro is the clear pick for production agent systems where licensing matters as much as capability. The 512K context (vs Llama 5's 256K) is icing.
Mac viability is real on both, but constrained. Both 70B dense models need approximately 42 GB of unified memory at Q4_K_M, meaning 64 GB Macs can technically host them but with little headroom for context. The realistic home for both is a Mac Studio M5 Max 128 GB (~18–20 tok/s) or M4 Ultra 192 GB (~20–22 tok/s). On M4 Ultra, Llama 5 70B actually edges Voyage Pro on speed thanks to better matmul shape compatibility with the Ultra's matrix engine.
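The 42 GB figure follows from simple arithmetic on parameter count and quantization width. The sketch below makes that explicit under two assumptions that are not from the article: an average of roughly 4.8 bits per weight for Q4_K_M, and about 75% of unified memory being available to the GPU. Both are rough rules of thumb, and the function names are illustrative.

```python
Q4_K_M_BITS_PER_WEIGHT = 4.8  # assumption: rough average for the Q4_K_M quantization mix
GPU_USABLE_FRACTION = 0.75    # assumption: share of unified memory the GPU can claim


def weights_gb(params_billion: float, bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(ram_gb: float, params_billion: float) -> float:
    """Memory left for KV cache and other apps once the weights are loaded."""
    return ram_gb * GPU_USABLE_FRACTION - weights_gb(params_billion)


for ram in (64, 128, 192):
    print(f"{ram} GB Mac, 70B dense: {headroom_gb(ram, 70):.0f} GB headroom")
```

Under these assumptions a 70B dense model comes to about 42 GB of weights, which leaves a 64 GB Mac only a few gigabytes for the KV cache. That is why long contexts push both models toward the 128 GB tier.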
Practical recommendation: If your company has under 700 million monthly active users (i.e. you are not Apple, Meta, Google, or Amazon), pick Voyage Pro 70B for agent workloads and Llama 5 70B for multimodal or maximum reasoning. The license difference vanishes for 99.9% of users, but the agentic and tool-use gap is real and measurable.
Grok 4 Open — xAI's first open weights
On June 5, xAI did something it had never done before: it released model weights. Grok 4 Open is a 100-billion-parameter mixture-of-experts model with 20 billion active parameters per token, downloadable from HuggingFace, runnable in llama.cpp, MLX, and Ollama. This is the first time anyone outside xAI has been able to run a Grok model on local hardware, and the timing — the same week as Llama 5 70B and Voyage Pro 70B — was clearly not accidental.
The license is the asterisk. Grok 4 Open ships under the "xAI Custom License", which is permissive enough to allow commercial use without an MAU cap (a meaningful improvement over Llama 5) but carries two notable restrictions: attribution is required in any product that uses the model, and the weights cannot be used to train a competing foundation model. The Open Source Initiative has already declined to classify it as open source in the strict sense. The model is usable in practice, but for downstream open-source projects the attribution clause adds friction.
The capability profile is unusual. Grok 4 Open is below the leaders on most academic benchmarks (MMLU 86%, SWE-Verified 69%) but excels at integrated tool use (93%) and what xAI calls "real-time reasoning" — the model was trained alongside an integrated web-search and code-execution scaffold, and that training shows in agentic tasks. For users building Grok-style assistants with search and tool access, the model is unusually well-suited even though its raw benchmark numbers are mid-tier.
The vibes-check. Grok 4 Open is a culturally significant release — xAI joining the open-weights club shifts the political map of AI development. Every Western foundation lab except Anthropic and OpenAI now has at least one downloadable model. But on the practical leaderboard, Apache 2.0 and MIT models still win on license, and Qwen 4 still wins on capability per Mac dollar. Grok 4 Open is a notable moment, not a dethroning.
# Install Grok 4 Open
ollama run grok4:open # ~58 GB, MoE 20B-active
# MLX equivalent
mlx_lm.generate --model mlx-community/Grok-4-Open-100B-A20B-4bit \
--prompt "Summarize this PDF and answer questions"
Phi-5 Medium 14B tops its tier
Microsoft shipped Phi-5 Medium on May 30, two days before June, and it instantly became the top-scoring 14B-class model. The pitch: MIT-licensed, dense, 14 billion parameters, scoring 86% MMLU and 75% AIME 2025. These are numbers that the original GPT-4 (reportedly around 1.8T parameters) could not reach in early 2024. The "phi recipe" has scaled cleanly from Phi-5 Mini's 4B to Phi-5 Medium's 14B with no quality regression.
14B is the new 32 GB sweet spot. The model uses approximately 9 GB of RAM at Q4_K_M, runs at ~65 tok/s on M5 Max and ~32 tok/s on M2 Pro 16 GB, and fits comfortably alongside development tools on a 24 GB or 32 GB Mac. For users who want stronger reasoning than Qwen 4 4B can deliver but cannot afford to dedicate 20 GB of RAM to Qwen 4 32B-A3B, Phi-5 Medium is the new default. According to LLMCheck benchmarks, it beats every other 14B-class open model on MMLU and AIME by clear margins.
The MIT license, the strong AIME 75% score, and the 64K native context (extended via sliding-window to 256K) make Phi-5 Medium the strongest pure-reasoning model in the 16 GB Mac tier — and a credible second-pick for 24 GB Macs that want to keep Qwen 4 32B-A3B unloaded for occasional use.
ollama run phi5:medium # ~9 GB on disk
mlx_lm.generate --model mlx-community/Phi-5-Medium-14B-Instruct-4bit
Gemma 4.5: Google's June refresh
Google quietly shipped Gemma 4.5 12B on June 2, a refresh rather than a new generation. The headline change is context: Gemma 4 shipped with 256K context; Gemma 4.5 jumps to 1M native, matching Qwen 4. Multimodal capability improved measurably too — Gemma 4.5 now handles audio inputs natively (previously vision-only), and image understanding scores climbed roughly 4 percentage points across MMMU and ChartQA.
Mac speed remains a Gemma strength: 75 tok/s on M5 Max at Q4_K_M, 58 tok/s on M4 Max, and ~12 GB of RAM. The Gemma license retains its prohibited-use restrictions but allows commercial use, and the LLMCheck Score lands at 68 — slotting Gemma 4.5 12B into the upper half of the open-source top 10. For users who want strong multimodal in a 16 GB Mac footprint, Gemma 4.5 is now the top choice.
ollama run gemma4.5:12b
mlx_lm.generate --model mlx-community/Gemma-4.5-12B-IT-4bit
The full landscape: Open-Source Top 10 (June 6, 2026)
According to LLMCheck benchmarks across 109 standardized data points, here is the open-source leaderboard as of June 6, 2026. Score is the LLMCheck composite (capability + speed + accessibility + license, max 100). Mac Tier is the minimum unified memory needed to run Q4_K_M comfortably.
| Rank | Model | Family | Active | License | Mac Tier | Score |
|---|---|---|---|---|---|---|
| 1 | Qwen 4 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 75 |
| 2 | Qwen 4 Preview 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 73 |
| 3 | Qwen 4 Coder 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 72 |
| 4 | Qwen 4 4B | Alibaba | 4B | Apache 2.0 | 8 GB | 71 |
| 5 | Phi-5 Mini | Microsoft | 4B | MIT | 8 GB | 70 |
| 6 | Qwen 3.6-35B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 69 |
| 7 | Gemma 4.5 12B | Google | 12B | Gemma | 16 GB | 68 |
| 8 | Gemma 4 E2B | Google | 2.3B | Gemma | 8 GB | 67 |
| 9 | DeepSeek R2 | DeepSeek | 37B | MIT | Server | 66 |
| 10 | Phi-5 Medium 14B | Microsoft | 14B | MIT | 24 GB | 65 |
Three observations. First, Alibaba now holds half of the top 10: the Qwen family (full, Preview, Coder, 4B, and 3.6) occupies ranks 1, 2, 3, 4, and 6. This is unprecedented concentration in the open-LLM ecosystem and reflects how quickly Alibaba is iterating. Second, Apache 2.0 and MIT account for 8 of the top 10 entries, up from 7 in May. Permissive licensing has become a default expectation, not a differentiator. Third, four of the top 10 have an 8 GB or 16 GB Mac tier, so the entry-level MacBook Air has never had more credible model options.
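Tallies like these are easy to recheck once the table is treated as data. The sketch below transcribes the leaderboard rows by hand and recomputes the three counts. The `Entry` type and variable names are illustrative, and the Google family label on the two Gemma rows is taken from the Gemma section of this article.

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int
    model: str
    family: str
    license: str
    mac_tier: str


TOP10 = [
    Entry(1, "Qwen 4 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    Entry(2, "Qwen 4 Preview 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    Entry(3, "Qwen 4 Coder 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    Entry(4, "Qwen 4 4B", "Alibaba", "Apache 2.0", "8 GB"),
    Entry(5, "Phi-5 Mini", "Microsoft", "MIT", "8 GB"),
    Entry(6, "Qwen 3.6-35B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    Entry(7, "Gemma 4.5 12B", "Google", "Gemma", "16 GB"),
    Entry(8, "Gemma 4 E2B", "Google", "Gemma", "8 GB"),
    Entry(9, "DeepSeek R2", "DeepSeek", "MIT", "Server"),
    Entry(10, "Phi-5 Medium 14B", "Microsoft", "MIT", "24 GB"),
]

family_counts = Counter(entry.family for entry in TOP10)
permissive = [e.model for e in TOP10 if e.license in {"Apache 2.0", "MIT"}]
small_tier = [e.model for e in TOP10 if e.mac_tier in {"8 GB", "16 GB"}]

print("Entries per family:", dict(family_counts))
print("Apache 2.0 or MIT:", len(permissive))
print("8 GB or 16 GB tier:", small_tier)
```

Keeping the rows in one structure also makes month-over-month comparisons a matter of diffing two lists.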
5 things that changed in June 2026
1. Frontier-class is now 24 GB Mac territory
With Qwen 4 32B-A3B at score 75 (within striking distance of GPT-5o on every benchmark) and Qwen 4 Coder at 82% SWE-Verified, a 24 GB MacBook Pro now runs frontier-adjacent coding and reasoning models at production-usable speeds. Six months ago this combination required a $5,000+ Mac Studio. The democratization is real and the entry tier is now genuinely useful, not just symbolic.
2. 70B dense is the new battleground
Llama 5 70B and Mistral Voyage Pro 70B shipped on the same day. Both target the 64 GB+ Mac tier, both score in the 84–88% range on MMLU, and they trade blows on license vs. capability. Six months ago, the 70B tier was a Llama monopoly — today it is a competitive market with real choice. Expect Qwen 4 70B and a DeepSeek 70B-class entrant in the next 90 days.
3. xAI joined the open club
Grok 4 Open is a political milestone as much as a technical one. Every major Western lab except Anthropic and OpenAI now ships downloadable weights. The xAI Custom License is not Apache 2.0, but the gesture matters — it signals that the cost of staying closed is rising as the open ecosystem improves. According to LLMCheck cross-reference data, Grok 4 Open weights were downloaded 1.4 million times in its first 48 hours, comparable to a major Llama release.
4. 1M context became table stakes
Qwen 4 (1M native), Gemma 4.5 (1M native), Mistral Voyage Pro (512K), and DeepSeek V4 Pro (1M native) all shipped or refreshed with context windows of 512K to 1M tokens. Six months ago, 256K was the open-source ceiling. Today 1M is the spec sheet expectation. For real-world Mac use, only Qwen 4 and Gemma 4.5 maintain >85% retrieval accuracy past 800K tokens, but the architectural floor has shifted up.
5. Apache 2.0 dominance hit 60%
According to LLMCheck license tracking across the top 25 open-weights models, Apache 2.0 share crossed 60% in June 2026 for the first time. Qwen 4, Qwen 4 Coder, Qwen 4 4B, Mistral Voyage Pro 70B, and Mistral Voyage 24B all ship under unrestricted Apache 2.0. MIT (Phi-5 family, DeepSeek R2) adds another 16%. The era of license uncertainty in open-weights LLMs is closing — permissive OSI licensing now dominates the top tier without ambiguity.
By Mac tier — what to run TODAY (June 2026)
Updated recommendations as of June 6, 2026. All numbers are LLMCheck-measured tok/s at Q4_K_M unless noted. Pick by RAM tier:
| Mac RAM | Primary pick | Speed | Backup |
|---|---|---|---|
| 8 GB | Qwen 4 4B | 135 tok/s | Phi-5 Mini (140 tok/s) |
| 16 GB | Phi-5 Medium 14B | 32 tok/s (M2 Pro) | Qwen 4 4B (115 tok/s) |
| 24 GB | Qwen 4 Coder 32B-A3B | 58 tok/s | Qwen 4 32B-A3B (60 tok/s) |
| 32–48 GB | Qwen 4 + Voyage 24B | 67 tok/s | Phi-5 Medium + Qwen 4 Coder |
| 64 GB | Grok 4 Open | 32 tok/s | Llama 5 Scout (38 tok/s) |
| 128 GB | Llama 5 70B | 18 tok/s | Voyage Pro 70B (20 tok/s) |
| M4 Ultra 192 GB | Llama 5 70B | ~22 tok/s | Voyage Pro 70B (~20 tok/s) |
The 24 GB sweet spot is now Qwen 4 Coder. For users on base-tier MacBook Pro hardware, the question "what's the best coding model I can run?" has a clean answer for the first time in 2026: ollama run qwen4:coder. The 82% SWE-Verified score puts it within margin of Claude 4.5 Sonnet, the Apache 2.0 license removes commercial concerns, and 58 tok/s is genuinely fast. This is the recommendation we'll be giving for the rest of the summer unless something dramatic ships.
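The tier table above amounts to a lookup from installed RAM to the largest tier that fits. A minimal sketch of that lookup follows. The function name is illustrative, and treating the 32–48 GB row as starting at 32 GB is an assumption about how the table is meant to be read.

```python
from bisect import bisect_right

# (minimum RAM in GB, primary pick) transcribed from the table above.
# Assumption: the 32-48 GB row applies from 32 GB upward until the 64 GB row.
TIERS = [
    (8, "Qwen 4 4B"),
    (16, "Phi-5 Medium 14B"),
    (24, "Qwen 4 Coder 32B-A3B"),
    (32, "Qwen 4 + Voyage 24B"),
    (64, "Grok 4 Open"),
    (128, "Llama 5 70B"),
    (192, "Llama 5 70B"),
]


def primary_pick(ram_gb: int) -> str:
    """Return the pick for the largest tier that fits in the given RAM."""
    floors = [floor for floor, _ in TIERS]
    index = bisect_right(floors, ram_gb) - 1
    if index < 0:
        raise ValueError("no recommendation below 8 GB")
    return TIERS[index][1]


print(primary_pick(24))  # a base-tier MacBook Pro
print(primary_pick(96))  # falls back to the 64 GB row
```

In-between configurations such as 36 GB or 96 GB resolve to the next tier down, which matches how the table is meant to be read: pick the row your machine clears, not the one it almost reaches.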
What's coming next month
Speculative section — this is what LLMCheck is watching for in the next 30 days based on public roadmaps, leaked release notes, and credible community signals.
- DeepSeek R3. The DeepSeek team publicly stated in May that R3 would target a "meaningful jump on AIME and a more efficient routing layer." Community signals suggest a July release. If R3 keeps the MIT license and improves the routing efficiency by even 15%, it could meaningfully expand the Mac viability of frontier reasoning — potentially making it runnable on 128 GB hardware rather than only 192 GB.
- Qwen 5 Preview. Alibaba's cadence (Qwen 2 to 3 was nine months, Qwen 3 to 4 was seven months) suggests Qwen 5 lands in late August or early September. A July Preview drop is possible but not certain. If it ships, expect another architectural rebuild rather than another MoE refinement.
- Apple MLX 2.0. WWDC26 is later this month, and the MLX team has been signaling a 2.0 release with first-party fine-tuning APIs, a stabilized graph compiler, and optimized kernels for the M5 series. If MLX 2.0 lands, expect a measurable tok/s lift across every model running on Apple Silicon: a free speed upgrade for the entire ecosystem.
- Meta Llama 5 405B Frontier. Meta released the 8B, Scout, and 70B variants. A 405B "Frontier" tier has been rumored since the Llama 5 launch event mentioned "a larger model is coming." If it ships, it would be server-class only on Mac (needs 256 GB+ unified memory at Q2) but would directly target GPT-5o and Claude 4.5 Opus on closed benchmarks.
- Microsoft Phi-5 Large. Microsoft's pattern with Phi-3 and Phi-4 was to ship Mini, Medium, then a "Large" variant (typically 28B). Phi-5 Large would slot directly into the 24 GB Mac tier and compete with Qwen 4 on reasoning. Watch for it in mid-to-late July.
Watch list: The single most consequential possible July release is a permissively-licensed coding model that beats Qwen 4 Coder on SWE-Verified. Qwen has the lead by a wider margin than any other model in any other category. If a competitor ships an Apache 2.0 or MIT coder above 85% SWE-Verified, the agent-platform market resets again.