The 30-Day Recap (TL;DR)

Every major open-weights release between May 9 and June 6, 2026, with a one-line takeaway. Eight separate flagship drops in 28 days — the highest-velocity month on record for the open-source local LLM ecosystem.

Headline story: June 2026 finished what May started. Frontier-class open-weights now exist at every meaningful parameter count — 4B (Qwen 4), 14B (Phi-5 Medium), 32B-A3B (Qwen 4), 70B dense (Llama 5 + Voyage Pro), and 100B+ MoE (Grok 4 Open). For the first time, the open ecosystem fully spans every Mac tier from 8 GB to 192 GB without quality gaps.


Qwen 4 Goes Live — the new #1

Alibaba shipped the production Qwen 4 32B-A3B on June 1, exactly four weeks after the Preview that defined May's leaderboard. The pattern Alibaba has run since Qwen 2 held: Preview to full release in 4–8 weeks, same architecture, modestly improved benchmarks, vastly improved instruction tuning. This release is no exception — every public benchmark improved by roughly 2 percentage points, and the LLMCheck Score climbed from 73 to 75.

Architecture

Total params: 32B
Active params: 3B (9.4%)
Experts: 128 (4 active)
Reasoning mode: Hybrid (auto)
Context: 1M native
Training tokens: 20T

Benchmarks

SWE-Verified: 78%
MMLU: 89%
HumanEval: 94%
AIME 2025: 91%
MATH: 95%
LLMCheck Score: 75 / 100

The +2pp improvement is bigger than it sounds. In benchmark land, 76% to 78% on SWE-Verified is the difference between "solid open coding model" and "competitive with closed frontier." According to LLMCheck cross-reference data, GPT-5o scores 80% on the same benchmark, so the full Qwen 4 release narrows the open-vs-closed coding gap to just 2 percentage points. On MMLU and HumanEval, the two models' scores are statistically indistinguishable.

Native 1M context is the architectural change. The Preview shipped with 262K native context extended to 1M via YaRN; the full release ships with 1M token native context as the default. According to LLMCheck long-context evaluations, retrieval accuracy at 800K tokens improved from 71% (Preview, YaRN-extended) to 89% (full release, native). For long-document analysis, codebase-wide refactoring, and book-length reasoning, the production model is the first open weight that handles a million tokens without quality collapse.

Mac speeds improved too. The MLX team shipped optimized Metal kernels alongside the production release. Q4_K_M now measures 80 tok/s on M5 Max 128 GB (up from 78), 67 tok/s on M5 Max 64 GB (up from 65), and 60 tok/s on M4 Pro 24 GB (up from 58). The gains come from better expert-routing fusion and a tuned KV-cache layout. For users running Ollama, the same speedup arrives via llama.cpp version 0.5.3 or later.

Install via the standard channels:

# Ollama (one-line install, ~20 GB download)
ollama run qwen4:32b-a3b

# MLX (fastest on Apple Silicon)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen4-32B-A3B-4bit \
--prompt "Refactor this function for clarity..."

# LM Studio: search "Qwen 4 32B A3B" in Discover tab

Verdict: The full Qwen 4 32B-A3B is the new default recommendation for any Mac with 24 GB RAM or more. It replaces the Preview at #1 on the LLMCheck leaderboard with a clean +2 point margin and is the highest-scoring open-weights model that runs on consumer hardware as of June 6, 2026.


Qwen 4 Coder — the top open-source Mac coder

One day after the full Qwen 4 release, Alibaba shipped Qwen 4 Coder 32B-A3B — the same MoE architecture, post-trained on roughly 4 trillion additional tokens of code, build logs, agentic trajectories, and curated PR reviews. The result is the single most important coding model release of 2026 so far. According to LLMCheck benchmarks, Qwen 4 Coder scores 82% on SWE-Verified, which beats every Mac-runnable open model and matches Claude 4.5 Sonnet's published score within margin.

Coding Benchmarks

SWE-Verified: 82%
HumanEval: 96%
MBPP+: 91%
LiveCodeBench: 79%
Agentic SWE: 61%
LLMCheck Score: 72 / 100

Mac Performance

M5 Max 128 GB: 78 tok/s
M5 Max 64 GB: 65 tok/s
M4 Pro 24 GB: 58 tok/s
RAM (Q4_K_M): ~19 GB
License: Apache 2.0
Context: 1M native

The license is the story. Devstral, the previous open-source Mac coding leader, ships under a custom Mistral commercial license with a $1M-revenue commercial clause. Codestral 22B has similar restrictions. Qwen 4 Coder ships under full Apache 2.0 — meaning startups can finally ship coding agents and IDE plugins on locally-hosted weights with no licensing risk and no per-token API cost. This is the first time a permissively-licensed open-weights coding model has been the SWE-Verified leader on Mac-runnable hardware.

Agentic tool use is the second story. Qwen 4 Coder was specifically post-trained on multi-step coding trajectories — read file, edit file, run tests, observe output, iterate. According to LLMCheck agentic-SWE testing, the model successfully completes 61% of multi-turn coding tasks (compared to 48% for the base Qwen 4 and 52% for Devstral). For users building autonomous coding agents on Mac hardware, this is now the default model.

Install via the standard channels:

# Ollama (~19 GB download)
ollama run qwen4:coder

# MLX
mlx_lm.generate --model mlx-community/Qwen4-Coder-32B-A3B-4bit \
--prompt "Read this codebase and add OAuth flow"

# LM Studio: search "Qwen 4 Coder" in Discover tab
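
For readers wiring the model into an agent or IDE plugin rather than a terminal, the following is a minimal sketch of calling a locally running Ollama server over its HTTP API. It assumes the server is on Ollama's default port (11434) and that the `qwen4:coder` tag shown above has already been pulled. The helper names `build_request` and `complete` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumption: not reconfigured).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen4:coder") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str = "qwen4:coder", timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text.

    Requires a running Ollama server; raises URLError if none is reachable.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `complete("Write a function that reverses a linked list")` should return the model's reply as a string. Setting `stream` to `False` keeps the sketch simple; an interactive tool would more likely stream tokens as they arrive.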

Llama 5 70B vs Mistral Voyage Pro 70B — the 70B race

The single most consequential June 2026 storyline is that the 70B-dense tier is suddenly competitive again. Meta shipped Llama 5 70B on June 4, and the same day Mistral countered with Voyage Pro 70B under Apache 2.0. Both are dense (no MoE), both target the 64 GB+ Mac tier, both are real frontier-adjacent models, and they trade blows on different axes.

Llama 5 70B vs Mistral Voyage Pro 70B — June 6, 2026.
Metric | Llama 5 70B | Mistral Voyage Pro 70B
Total params | 70B dense | 70B dense
Context | 256K | 512K
License | Llama 5 Community (700M MAU cap) | Apache 2.0
MMLU | 88% | 85%
HumanEval | 86% | 84%
SWE-Verified | 65% | 68%
Agentic SWE | 54% | 58%
Tool-use accuracy | 87% | 91%
M5 Max 128 GB Q4 | ~18 tok/s | ~20 tok/s
M4 Ultra 192 GB Q4 | ~22 tok/s | ~20 tok/s
Multimodal | Text + image + audio | Text only
LLMCheck Score | 64 | 63

Llama 5 70B wins on raw reasoning. Meta's training run includes a larger and cleaner reasoning corpus, and it shows in the MMLU (88%) and HumanEval (86%) numbers. It also retains the multimodal capability Meta introduced with Llama 5 Scout — the 70B accepts image and audio inputs natively, which neither Voyage Pro nor Qwen 4 can do. For pure capability benchmarks, Llama 5 70B is the strongest dense open model in existence as of June 6, 2026.

Mistral Voyage Pro 70B wins on license and agentic. The Apache 2.0 license is the entire pitch — no MAU cap, no field-of-use restrictions, no separate commercial license to negotiate. Combined with the 91% tool-use accuracy and 58% Agentic SWE score, Voyage Pro is the clear pick for production agent systems where licensing matters as much as capability. The 512K context (vs Llama 5's 256K) is icing.

Mac viability is real on both, but constrained. Both 70B dense models need approximately 42 GB of unified memory at Q4_K_M, so a 64 GB Mac can technically host them but leaves little headroom for long contexts once macOS and other apps take their share. The realistic home for both is a Mac Studio M5 Max 128 GB (~18–20 tok/s) or M4 Ultra 192 GB (~20–22 tok/s). On M4 Ultra, Llama 5 70B actually edges Voyage Pro on speed thanks to better matmul shape compatibility with the Ultra's matrix engine.
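
The 42 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Q4_K_M averages roughly 4.85 bits per weight. That value is an approximation, not a number from this article, and the real average varies by model. The function names are hypothetical, and the headroom is a plain subtraction that ignores the memory macOS keeps for itself, so usable headroom will be smaller.

```python
def q4_weights_gb(total_params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.85 is an assumed average for Q4_K_M, not a measured value.
    """
    return total_params_billion * bits_per_weight / 8


def headroom_gb(mac_ram_gb: float, total_params_billion: float) -> float:
    """Unified memory left over for KV cache, macOS, and other apps."""
    return mac_ram_gb - q4_weights_gb(total_params_billion)


# Compare the three Mac tiers discussed above for a 70B dense model.
for ram in (64, 128, 192):
    print(
        f"{ram} GB Mac, 70B dense: ~{q4_weights_gb(70):.0f} GB weights, "
        f"~{headroom_gb(ram, 70):.0f} GB nominal headroom"
    )
```

For mixture-of-experts models, pass the total parameter count rather than the active count: every expert has to be resident in memory even though only a few run per token.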

Practical recommendation: If your company has under 700 million monthly active users (i.e. you are not Apple, Meta, Google, or Amazon), pick Voyage Pro 70B for agent workloads and Llama 5 70B for multimodal or maximum reasoning. The license difference vanishes for 99.9% of users, but the agentic and tool-use gap is real and measurable.


Grok 4 Open — xAI's first open weights

On June 5, xAI did something it had never done before: it released model weights. Grok 4 Open is a 100-billion-parameter mixture-of-experts model with 20 billion active parameters per token, downloadable from Hugging Face and runnable in llama.cpp, MLX, and Ollama. This is the first time anyone outside xAI has been able to run a Grok model on local hardware, and the timing (the same week as Llama 5 70B and Voyage Pro 70B) was clearly not accidental.

Grok 4 Open

Total params: 100B
Active params: 20B
MMLU: 86%
SWE-Verified: 69%
AIME 2025: 82%
Tool use: 93% (integrated)

License & Mac

License: xAI Custom
Commercial: Yes (with attribution)
M5 Max 64 GB Q4: ~32 tok/s
M5 Max 128 GB Q4: ~36 tok/s
RAM (Q4_K_M): ~58 GB
LLMCheck Score: 60 / 100

The license is the asterisk. Grok 4 Open ships under the "xAI Custom License" — permissive enough to allow commercial use without an MAU cap (a meaningful improvement over Llama 5), but with two notable restrictions: attribution is required in any product that uses the model, and the weights cannot be used to train a competing foundation model. The Open Source Initiative has already declined to classify it as open source in the strict sense. For practical use, it is usable, but for downstream open-source projects, the attribution clause adds friction.

The capability profile is unusual. Grok 4 Open is below the leaders on most academic benchmarks (MMLU 86%, SWE-Verified 69%) but excels at integrated tool use (93%) and what xAI calls "real-time reasoning" — the model was trained alongside an integrated web-search and code-execution scaffold, and that training shows in agentic tasks. For users building Grok-style assistants with search and tool access, the model is unusually well-suited even though its raw benchmark numbers are mid-tier.

The vibes-check. Grok 4 Open is a culturally significant release — xAI joining the open-weights club shifts the political map of AI development. Every Western foundation lab except Anthropic and OpenAI now has at least one downloadable model. But on the practical leaderboard, Apache 2.0 and MIT models still win on license, and Qwen 4 still wins on capability per Mac dollar. Grok 4 Open is a notable moment, not a dethroning.

# Install Grok 4 Open
ollama run grok4:open # ~58 GB, MoE 20B-active

# MLX equivalent
mlx_lm.generate --model mlx-community/Grok-4-Open-100B-A20B-4bit \
--prompt "Summarize this PDF and answer questions"

Phi-5 Medium 14B tops its tier

Microsoft shipped Phi-5 Medium on May 30, two days ahead of June, and it instantly became the top-scoring 14B-class model. The pitch: MIT-licensed, dense, 14 billion parameters, scoring 86% MMLU and 75% AIME 2025, numbers that the original GPT-4 (a reported but unconfirmed 1.8T parameters) could not reach in early 2024. The "phi recipe" has scaled cleanly from Phi-5 Mini's 4B to Phi-5 Medium's 14B with no quality regression.

14B is the new 32 GB sweet spot. The model uses approximately 9 GB of RAM at Q4_K_M, runs at ~65 tok/s on M5 Max and ~32 tok/s on M2 Pro 16 GB, and fits comfortably alongside development tools on a 24 GB or 32 GB Mac. For users who want stronger reasoning than Qwen 4 4B can deliver but cannot afford to dedicate 20 GB of RAM to Qwen 4 32B-A3B, Phi-5 Medium is the new default. According to LLMCheck benchmarks, it beats every other 14B-class open model on MMLU and AIME by clear margins.

The MIT license, the strong AIME 75% score, and the 64K native context (extended via sliding-window to 256K) make Phi-5 Medium the strongest pure-reasoning model in the 16 GB Mac tier — and a credible second-pick for 24 GB Macs that want to keep Qwen 4 32B-A3B unloaded for occasional use.

ollama run phi5:medium # ~9 GB on disk
mlx_lm.generate --model mlx-community/Phi-5-Medium-14B-Instruct-4bit

Gemma 4.5 — Google's June refresh

Google quietly shipped Gemma 4.5 12B on June 2, a refresh rather than a new generation. The headline change is context: Gemma 4 shipped with 256K context; Gemma 4.5 jumps to 1M native, matching Qwen 4. Multimodal capability improved measurably too — Gemma 4.5 now handles audio inputs natively (previously vision-only), and image understanding scores climbed roughly 4 percentage points across MMMU and ChartQA.

Mac speed remains a Gemma strength: 75 tok/s on M5 Max at Q4_K_M, 58 tok/s on M4 Max, and ~12 GB of RAM. The Gemma license retains its prohibited-use restrictions but allows commercial use, and the LLMCheck Score lands at 68 — slotting Gemma 4.5 12B into the upper half of the open-source top 10. For users who want strong multimodal in a 16 GB Mac footprint, Gemma 4.5 is now the top choice.

ollama run gemma4.5:12b
mlx_lm.generate --model mlx-community/Gemma-4.5-12B-IT-4bit

The full landscape — Open-Source Top 10 (June 6, 2026)

According to LLMCheck benchmarks across 109 standardized data points, here is the open-source leaderboard as of June 6, 2026. Score is the LLMCheck composite (capability + speed + accessibility + license, max 100). Mac Tier is the minimum unified memory needed to run Q4_K_M comfortably.

LLMCheck Open-Source Top 10 — June 6, 2026. See full leaderboard for all models.
Rank | Model | Family | Active | License | Mac Tier | Score
1 | Qwen 4 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 75
2 | Qwen 4 Preview 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 73
3 | Qwen 4 Coder 32B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 72
4 | Qwen 4 4B | Alibaba | 4B | Apache 2.0 | 8 GB | 71
5 | Phi-5 Mini | Microsoft | 4B | MIT | 8 GB | 70
6 | Qwen 3.6-35B-A3B | Alibaba | 3B | Apache 2.0 | 24 GB | 69
7 | Gemma 4.5 12B | Google | 12B | Gemma | 16 GB | 68
8 | Gemma 4 E2B | Google | 2.3B | Gemma | 8 GB | 67
9 | DeepSeek R2 | DeepSeek | 37B | MIT | Server | 66
10 | Phi-5 Medium 14B | Microsoft | 14B | MIT | 24 GB | 65

Three observations. First, Alibaba now holds half of the top 10: the Qwen family (full, Preview, Coder, 4B, and 3.6) occupies ranks 1, 2, 3, 4, and 6. This is unprecedented concentration in the open-LLM ecosystem and reflects how quickly Alibaba is iterating. Second, Apache 2.0 and MIT account for 8 of the top 10 entries, up from 7 in May. Permissive licensing has become a default expectation, not a differentiator. Third, four of the top 10 (Qwen 4 4B, Phi-5 Mini, Gemma 4.5 12B, and Gemma 4 E2B) have an 8 GB or 16 GB Mac tier, so the entry-level MacBook Air has never had more credible model options.
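
Tallies like these are easy to recompute from the leaderboard. The sketch below copies the family, license, and Mac-tier columns from the table above into a list and counts them. The variable names are illustrative only and are not part of any LLMCheck tooling.

```python
from collections import Counter

# (rank, model, family, license, mac_tier), copied from the top-10 table.
TOP10 = [
    (1, "Qwen 4 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    (2, "Qwen 4 Preview 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    (3, "Qwen 4 Coder 32B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    (4, "Qwen 4 4B", "Alibaba", "Apache 2.0", "8 GB"),
    (5, "Phi-5 Mini", "Microsoft", "MIT", "8 GB"),
    (6, "Qwen 3.6-35B-A3B", "Alibaba", "Apache 2.0", "24 GB"),
    (7, "Gemma 4.5 12B", "Google", "Gemma", "16 GB"),
    (8, "Gemma 4 E2B", "Google", "Gemma", "8 GB"),
    (9, "DeepSeek R2", "DeepSeek", "MIT", "Server"),
    (10, "Phi-5 Medium 14B", "Microsoft", "MIT", "24 GB"),
]

# How many entries each family holds.
family_counts = Counter(family for _, _, family, _, _ in TOP10)

# Entries under a permissive license (Apache 2.0 or MIT).
permissive = [model for _, model, _, lic, _ in TOP10 if lic in {"Apache 2.0", "MIT"}]

# Entries whose minimum Mac tier is 8 GB or 16 GB.
small_tier = [model for _, model, _, _, tier in TOP10 if tier in {"8 GB", "16 GB"}]

print("Entries per family:", dict(family_counts))
print("Permissive-license entries:", len(permissive))
print("8 GB / 16 GB tier entries:", small_tier)
```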


1. Frontier-class is now 24 GB Mac territory

With Qwen 4 32B-A3B at score 75 (within striking distance of GPT-5o on every benchmark) and Qwen 4 Coder at 82% SWE-Verified, a 24 GB MacBook Pro now runs frontier-adjacent coding and reasoning models at production-usable speeds. Six months ago this combination required a $5,000+ Mac Studio. The democratization is real and the entry tier is now genuinely useful, not just symbolic.

2. 70B dense is the new battleground

Llama 5 70B and Mistral Voyage Pro 70B shipped on the same day. Both target the 64 GB+ Mac tier, both score in the 85–88% range on MMLU, and they trade blows on license vs. capability. Six months ago, the 70B tier was a Llama monopoly; today it is a competitive market with real choice. Expect Qwen 4 70B and a DeepSeek 70B-class entrant in the next 90 days.

3. xAI joined the open club

Grok 4 Open is a political milestone as much as a technical one. Every major Western lab except Anthropic and OpenAI now ships downloadable weights. The xAI Custom License is not Apache 2.0, but the gesture matters — it signals that the cost of staying closed is rising as the open ecosystem improves. According to LLMCheck cross-reference data, Grok 4 Open weights were downloaded 1.4 million times in its first 48 hours, comparable to a major Llama release.

4. 1M context became table stakes

Qwen 4 (1M native), Gemma 4.5 (1M native), Mistral Voyage Pro (512K), and DeepSeek V4 Pro (1M native) all shipped or refreshed with context windows of 512K to 1M tokens. Six months ago, 256K was the open-source ceiling. Today 1M is the spec sheet expectation. For real-world Mac use, only Qwen 4 and Gemma 4.5 maintain >85% retrieval accuracy past 800K tokens, but the architectural floor has shifted up.

5. Apache 2.0 dominance hit 60%

According to LLMCheck license tracking across the top 25 open-weights models, Apache 2.0 share crossed 60% in June 2026 for the first time. Qwen 4, Qwen 4 Coder, Qwen 4 4B, Mistral Voyage Pro 70B, and Mistral Voyage 24B all ship under unrestricted Apache 2.0. MIT (Phi-5 family, DeepSeek R2) adds another 16%. The era of license uncertainty in open-weights LLMs is closing — permissive OSI licensing now dominates the top tier without ambiguity.


By Mac tier — what to run TODAY (June 2026)

Updated recommendations as of June 6, 2026. All numbers are LLMCheck-measured tok/s at Q4_K_M unless noted. Pick by RAM tier:

Mac tier recommendations, June 6, 2026.
Mac RAM | Primary pick | Speed | Backup
8 GB | Qwen 4 4B | 135 tok/s | Phi-5 Mini (140 tok/s)
16 GB | Phi-5 Medium 14B | 32 tok/s (M2 Pro) | Qwen 4 4B (115 tok/s)
24 GB | Qwen 4 Coder 32B-A3B | 58 tok/s | Qwen 4 32B-A3B (60 tok/s)
32–48 GB | Qwen 4 + Voyage 24B | 67 tok/s | Phi-5 Medium + Qwen 4 Coder
64 GB | Grok 4 Open | 32 tok/s | Llama 5 Scout (38 tok/s)
128 GB | Llama 5 70B | 18 tok/s | Voyage Pro 70B (20 tok/s)
M4 Ultra 192 GB | Llama 5 70B | ~22 tok/s | Voyage Pro 70B (~20 tok/s)

The 24 GB sweet spot is now Qwen 4 Coder. For users on base-tier MacBook Pro hardware, the question "what's the best coding model I can run?" has a clean answer for the first time in 2026: ollama run qwen4:coder. The 82% SWE-Verified score puts it within margin of Claude 4.5 Sonnet, the Apache 2.0 license removes commercial concerns, and 58 tok/s is genuinely fast. This is the recommendation we'll be giving for the rest of the summer unless something dramatic ships.
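
For anyone scripting a setup tool around these recommendations, the tier table reduces to a small lookup. The sketch below is one possible encoding: it assumes a machine falls into the largest tier whose minimum RAM it meets, so a 36 GB Mac lands in the 32–48 GB row. The names `TIER_PICKS` and `pick_for_ram` are hypothetical.

```python
from typing import Optional, Tuple

# (minimum RAM in GB, primary pick, backup), taken from the tier table.
# The 192 GB row repeats the 128 GB picks, so it needs no separate entry.
TIER_PICKS = [
    (8, "Qwen 4 4B", "Phi-5 Mini"),
    (16, "Phi-5 Medium 14B", "Qwen 4 4B"),
    (24, "Qwen 4 Coder 32B-A3B", "Qwen 4 32B-A3B"),
    (32, "Qwen 4 + Voyage 24B", "Phi-5 Medium + Qwen 4 Coder"),
    (64, "Grok 4 Open", "Llama 5 Scout"),
    (128, "Llama 5 70B", "Voyage Pro 70B"),
]


def pick_for_ram(ram_gb: float) -> Optional[Tuple[str, str]]:
    """Return (primary, backup) for the largest tier that fits, or None below 8 GB."""
    best = None
    for min_ram, primary, backup in TIER_PICKS:
        if ram_gb >= min_ram:
            best = (primary, backup)
    return best


print(pick_for_ram(24))
print(pick_for_ram(36))
```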


What's coming next month

Speculative section — this is what LLMCheck is watching for in the next 30 days based on public roadmaps, leaked release notes, and credible community signals.

Watch list: The single most consequential possible July release is a permissively-licensed coding model that beats Qwen 4 Coder on SWE-Verified. Qwen 4 Coder leads its category by a wider margin than the top model in any other category. If a competitor ships an Apache 2.0 or MIT coder above 85% SWE-Verified, the agent-platform market resets again.