LLMCheck Index Methodology
How the LLMCheck index ranks 79 local and frontier LLMs for Mac — a published estimation model, linked third-party benchmarks, and community submissions. Fully transparent, fully reproducible.
LLMCheck is an independent index of local-LLM performance on Apple Silicon — not a benchmark lab. Every figure is either a transparent estimate from the published model below, a sourced third-party benchmark (linked per row), or a community submission. Each data point is labeled with its provenance, and no figure is claimed as a first-party lab measurement.
Where Every Number Comes From
Every row in the leaderboard, the benchmarks table, and the open dataset carries one of three provenance labels. This mirrors the provenance_policy field published in benchmarks.json:
Estimated: derived from the LLMCheck estimation model (memory-bandwidth scaling plus quantization arithmetic, documented in full below). Estimates are useful for planning ("will this model fit and feel usable on my Mac?") but are not measurements. Every estimated speed figure on the site links back to this page.
Sourced: a published third-party benchmark with the source URL attached to the row, such as Arena AI ELO ratings, MMLU/HumanEval/SWE-Bench results from official model cards and papers, or independent hardware test data. We link; we don't re-host or re-run.
Community: a user-submitted run from a real Mac, with the submission URL as its source. Community numbers are sanity-checked against the estimation model and known baselines before inclusion. Own a Mac? Submit a benchmark: real runs beat estimates every time.
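As a rough illustration of the three-label policy, the sketch below models a dataset row that refuses to exist without the link its label requires. The class and field names (`Row`, `source_url`, `tok_per_s`) are hypothetical and are not taken from benchmarks.json; only the three label values come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    ESTIMATED = "estimated"   # computed from the published estimation model
    SOURCED = "sourced"       # linked third-party benchmark
    COMMUNITY = "community"   # user-submitted run


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape; the real benchmarks.json schema may differ.
    model: str
    tok_per_s: float
    provenance: Provenance
    source_url: Optional[str] = None

    def __post_init__(self) -> None:
        # Sourced and community rows must carry a link; estimates need none
        # because they point back to the methodology itself.
        if self.provenance is not Provenance.ESTIMATED and not self.source_url:
            raise ValueError(
                f"{self.model}: {self.provenance.value} rows need a source URL"
            )


estimate = Row("Llama 3.3 70B", 10.5, Provenance.ESTIMATED)
print(estimate.provenance.value)
```

Validating at construction time is one way to keep an unlabeled or unlinked figure from ever reaching the published table.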
Why an index and not a lab? A single lab machine can only test one chip, one runtime, one macOS version. An index that publishes its estimation math, links its sources, and accepts community runs covers the entire Apple Silicon range — and you can audit every step. When we're estimating, we say so.
How We Estimate Performance
Local LLM performance on Apple Silicon is unusually predictable, because two hardware numbers dominate everything: unified memory capacity (can the model fit?) and memory bandwidth (how fast can weights stream through the GPU?). The LLMCheck estimation model is built on those two numbers.
Step 1 — Will it fit? (RAM model)
Weights: parameters (in billions) × ~0.57 GB at Q4_K_M, the factor used in the worked example below
+ KV cache: grows with context length, ~1–4 GB typical at 4k–32k context
+ macOS overhead: ~2–3 GB for the OS and the inference runtime
Fit rule: the total must stay within ~75% of unified memory, because macOS lets the GPU address roughly three-quarters of RAM by default
That 75% rule is where the minimum-RAM guidance across the index comes from: a 16 GB Mac has a ~12 GB working budget, a 24 GB Mac ~18 GB, a 64 GB Mac ~48 GB, and so on.
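The fit rule can be sketched in a few lines of Python. This is a minimal sketch, not the index's own code: the 0.57 GB-per-billion-parameter factor is the Q4_K_M figure used in the worked example on this page, the KV-cache and overhead defaults are picked from the ranges above, and the list of RAM tiers is an assumption for illustration.

```python
from typing import Optional, Sequence

GB_PER_BILLION_PARAMS_Q4 = 0.57  # Q4_K_M factor from the page's worked example
GPU_BUDGET_FRACTION = 0.75       # the ~75% unified-memory fit rule


def total_ram_gb(params_b: float, kv_cache_gb: float = 2.0,
                 overhead_gb: float = 3.0) -> float:
    """Weights + KV cache + macOS/runtime overhead, in GB."""
    return params_b * GB_PER_BILLION_PARAMS_Q4 + kv_cache_gb + overhead_gb


def fits(params_b: float, mac_ram_gb: float) -> bool:
    """True when the total stays inside the ~75% working budget."""
    return total_ram_gb(params_b) <= mac_ram_gb * GPU_BUDGET_FRACTION


def min_mac_ram_gb(params_b: float,
                   tiers: Sequence[int] = (8, 16, 24, 32, 48, 64, 128)
                   ) -> Optional[int]:
    """Smallest RAM tier (assumed list) that passes the fit rule, else None."""
    return next((t for t in tiers if fits(params_b, t)), None)


print(round(total_ram_gb(70), 1))  # 44.9
print(min_mac_ram_gb(70))          # 64
```

Returning `None` when no tier fits keeps server-only models distinguishable from models that merely need a large Mac.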
Step 2 — How fast? (Speed model)
Token generation is memory-bandwidth-bound: for every token, the runtime must stream essentially all active model weights from RAM through the GPU. Compute is rarely the bottleneck on Apple Silicon. That gives a simple ceiling:
ceiling tok/s = memory bandwidth (GB/s) ÷ model_bytes_GB
est. tok/s ≈ ceiling tok/s × efficiency
model_bytes_GB: weight bytes read per token (total size for dense models, active-parameter bytes for MoE)
efficiency: ~0.6–0.8 empirically for llama.cpp / MLX-class runtimes on dense models; lower for MoE (routing and expert-loading overhead)
Cross-chip scaling: tok/s scales roughly with the bandwidth ratio, e.g. M5 Max (600 GB/s) vs M4 Pro (273 GB/s) ≈ 2.2×
This is also why Mixture-of-Experts models dominate the speed rankings: a 32B MoE with 3B active parameters reads ~1.7 GB per token instead of ~18 GB — a 10× smaller memory bill per token, while still storing the full 32B of knowledge in RAM.
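The speed model above reduces to two divisions and a multiplication. The sketch below is an illustration under the page's stated assumptions (bandwidth-bound generation, an efficiency factor of about 0.7 for dense models); the function names are made up for this example.

```python
def ceiling_tok_s(bandwidth_gb_s: float, model_bytes_gb: float) -> float:
    """Theoretical maximum: every token streams model_bytes_gb once."""
    if model_bytes_gb <= 0:
        raise ValueError("model_bytes_gb must be positive")
    return bandwidth_gb_s / model_bytes_gb


def estimated_tok_s(bandwidth_gb_s: float, model_bytes_gb: float,
                    efficiency: float = 0.7) -> float:
    """Ceiling scaled by an empirical runtime efficiency factor."""
    return ceiling_tok_s(bandwidth_gb_s, model_bytes_gb) * efficiency


def rescale(tok_s: float, from_bandwidth_gb_s: float,
            to_bandwidth_gb_s: float) -> float:
    """Cross-chip scaling: speed follows the bandwidth ratio."""
    return tok_s * to_bandwidth_gb_s / from_bandwidth_gb_s


# A 40 GB dense model on an M5 Max (600 GB/s), then rescaled to an M4 Pro.
m5_max = estimated_tok_s(600, 40)
print(round(m5_max, 1))                     # 10.5
print(round(rescale(m5_max, 600, 273), 1))  # 4.8
```

Keeping the ceiling and the efficiency factor as separate steps makes it easy to swap in a lower factor for MoE models without touching the bandwidth arithmetic.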
Step 3 — A worked example
Llama 3.3 70B (dense) on an M5 Max, Q4_K_M
- Weights: 70 × 0.57 ≈ 40 GB on disk and in RAM.
- Total RAM needed: 40 + ~2 (KV cache) + ~3 (macOS) ≈ 45 GB.
- Fit check: a 64 GB Mac budgets 64 × 0.75 = 48 GB, so it fits. A 48 GB Mac budgets 36 GB, and the 40 GB of weights alone don't fit. So the index lists min RAM = 64 GB.
- Speed ceiling: 600 GB/s ÷ 40 GB = 15 tok/s theoretical maximum.
- Apply efficiency: 15 × ~0.7 ≈ 10–11 tok/s, the estimated figure the index shows.
Contrast with an MoE model: Qwen 4.1 32B-A3B activates ~3B parameters per token, so it reads only 3 × 0.57 ≈ 1.7 GB per token. On an M4 Pro (273 GB/s) the ceiling is ~160 tok/s; realized MoE efficiency of ~0.4 (routing overhead, expert scatter across memory) lands the estimate near ~62 tok/s — on a mid-range chip. The full 32B of weights still need ~18 GB of RAM to be resident, which is why it's a 24 GB-Mac model despite generating like a 3B.
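The MoE contrast comes down to keeping two sizes apart: the bytes that must be resident in RAM and the bytes streamed per token. The sketch below is one hedged way to express that; the `Model` class and its property names are hypothetical, and the 0.57 factor and 0.4 efficiency are the figures quoted in the text above. With those inputs the estimate comes out at about 64 tok/s, within rounding distance of the ~62 quoted.

```python
from dataclasses import dataclass

GB_PER_BILLION_PARAMS_Q4 = 0.57  # Q4_K_M factor used in the worked example


@dataclass(frozen=True)
class Model:
    total_params_b: float   # all parameters, must be resident in RAM
    active_params_b: float  # parameters read per token (== total for dense)

    @property
    def resident_gb(self) -> float:
        return self.total_params_b * GB_PER_BILLION_PARAMS_Q4

    @property
    def streamed_gb(self) -> float:
        return self.active_params_b * GB_PER_BILLION_PARAMS_Q4


def estimate_tok_s(model: Model, bandwidth_gb_s: float,
                   efficiency: float) -> float:
    """Bandwidth divided by bytes streamed per token, times efficiency."""
    return bandwidth_gb_s / model.streamed_gb * efficiency


moe = Model(total_params_b=32, active_params_b=3)
print(round(moe.resident_gb, 1))                   # RAM the weights occupy
print(round(moe.streamed_gb, 2))                   # GB read per token
print(round(estimate_tok_s(moe, 273, 0.4)))        # M4 Pro, MoE efficiency
```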
What the estimates can't capture
Honest limits of the model: thermals (a fanless MacBook Air throttles on long generations; a Mac Studio doesn't), context length (speed degrades as the KV cache grows — long chats get slower), quantization variants (Q5, Q8, and MLX 4-bit all shift size and speed), and runtime differences (Ollama, LM Studio, llama.cpp, and MLX can differ 10–20% on the same hardware). Real numbers will vary from the estimates — that's exactly why every estimated row is labeled, and why community submissions are invited to replace estimates with real runs.
The LLMCheck Score Formula
The estimation model above feeds the Speed and Accessibility components of the composite 0–100 LLMCheck Score used to rank the catalog. The score is the sum of four components:
Capability (0–50): capScore, normalized from Arena AI ELO (40%), MMLU (35%), and coding benchmarks (25%); sourced, not estimated
Speed (0–25): est. tok/s on M5 Max × 0.25, capped at 25; estimated via the model above
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, xAI Custom = 4, CC-BY-NC / N/A = 2
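The component rules listed above, together with the 0–50 capScore described under Dimension 1, can be combined as in the sketch below. The function and table names are illustrative only; the point values are copied from this page, and server-only models are assumed to be passed in with an estimated speed of 0.

```python
# (upper RAM bound in GB, points), checked in ascending order
ACCESSIBILITY_TIERS = [(8, 15), (16, 12), (24, 10), (32, 9), (64, 5), (128, 2)]

LICENSE_POINTS = {
    "MIT": 10, "Apache 2.0": 8, "Gemma": 6,
    "Meta Custom": 5, "xAI Custom": 4, "CC-BY-NC": 2, "N/A": 2,
}


def speed_points(est_tok_s: float) -> float:
    """est. tok/s on the reference chip x 0.25, capped at 25."""
    return min(25.0, est_tok_s * 0.25)


def accessibility_points(min_ram_gb: float) -> int:
    """Step function over the minimum-RAM tiers; above 128 GB scores 0."""
    for limit_gb, points in ACCESSIBILITY_TIERS:
        if min_ram_gb <= limit_gb:
            return points
    return 0


def llmcheck_score(cap_score: float, est_tok_s: float,
                   min_ram_gb: float, license_name: str) -> float:
    """Sum of the four components; raises KeyError on an unlisted license."""
    return (cap_score
            + speed_points(est_tok_s)
            + accessibility_points(min_ram_gb)
            + LICENSE_POINTS[license_name])


# capScore 18, est. 100 tok/s, 8 GB minimum RAM, Apache 2.0
print(llmcheck_score(18, 100, 8, "Apache 2.0"))  # 66.0
```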
Dimension 1: Capability (50 points) — sourced
Raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from published third-party benchmark systems, linked per model.
The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. capScore is sourced, not estimated: the index normalizes it from three public benchmark systems:
- Arena AI ELO Rating (40% weight) — Human preference ranking from lmarena.ai. Measures how often real users prefer a model's response over competitors in blind A/B tests. The gold standard for "does this model feel good to use."
- MMLU Score (35% weight) — Massive Multitask Language Understanding. Tests knowledge across 57 academic subjects. Sourced from official model cards on HuggingFace and arXiv papers.
- Coding Benchmarks (25% weight) — HumanEval (code generation) and SWE-Bench Verified (real bug fixing). Sourced from llm-stats.com and official model publications.
When official benchmarks are unavailable for a model (common for very new releases), the index extrapolates from related models in the same family and marks the score as estimated — same provenance rules as everything else. These estimates are replaced with sourced scores within 2–4 weeks of release.
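The 40/35/25 weighting can be illustrated with a small weighted blend. This is only a sketch under loudly stated assumptions: the page does not publish how each benchmark is normalized or how missing benchmarks are handled, so the ELO bounds (1000 to 1600) and the renormalization of weights over whichever benchmarks are present are hypothetical choices. It is not expected to reproduce the capScores in the table.

```python
from typing import Optional

WEIGHTS = {"arena": 0.40, "mmlu": 0.35, "coding": 0.25}  # from this page

# Hypothetical normalization bounds; the index does not publish its own.
ELO_FLOOR, ELO_CEIL = 1000.0, 1600.0


def normalize_elo(elo: float) -> float:
    """Map an ELO rating onto 0..1, clamped to the assumed bounds."""
    return min(1.0, max(0.0, (elo - ELO_FLOOR) / (ELO_CEIL - ELO_FLOOR)))


def cap_score(arena_elo: Optional[float], mmlu_pct: Optional[float],
              coding_pct: Optional[float]) -> float:
    """Weighted blend on a 0..50 scale, renormalized over available inputs."""
    parts = {
        "arena": None if arena_elo is None else normalize_elo(arena_elo),
        "mmlu": None if mmlu_pct is None else mmlu_pct / 100.0,
        "coding": None if coding_pct is None else coding_pct / 100.0,
    }
    available = {k: v for k, v in parts.items() if v is not None}
    if not available:
        raise ValueError("at least one benchmark is required")
    weight_sum = sum(WEIGHTS[k] for k in available)
    blended = sum(WEIGHTS[k] * v for k, v in available.items()) / weight_sum
    return 50.0 * blended


print(round(cap_score(1300, None, None), 1))  # 25.0 under the assumed bounds
```

Renormalizing the weights is one plausible way to avoid penalizing a model for a benchmark that has not been published yet; extrapolating from the model family, as the text describes, is a different approach that this sketch does not attempt.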
Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.
Capability Score Table (All 79 Models)
| Model | Params | capScore | Arena ELO | MMLU | Coding | Sources |
|---|---|---|---|---|---|---|
| Kimi K2.5 | 1T MoE | 50 | ~1480 | — | HumanEval 99% | Arena |
| DeepSeek V4 Pro | 1.6T MoE | 50 | ~1545 | — | SWE-V 80.6%, GPQA 90.1% | HF |
| DeepSeek R2 | 671B MoE | 50 | ~1530 | — | AIME 91%, MATH 88% | HF |
| GLM 5.2 | 744B MoE | 50 | ~1565 | 92% | SWE-Pro 68.5%, beats GPT-5 & Claude | HF |
| DeepSeek R3 | 685B MoE | 50 | ~1558 | — | AIME 95%, MATH 92% | HF |
| Kimi K3 | 1T MoE | 49 | ~1545 | — | Agentic coding leader | HF |
| Kimi K2.6 | 1.05T MoE | 48 | ~1525 | — | Coding 78.6, Agentic 58.3 | HF |
| Qwen 4.1 32B-A3B | 32B MoE | 46 | ~1510 | 90% | SWE-V 80%, HumanEval 95% | HF |
| Qwen 4 | 32B MoE | 45 | ~1505 | 89% | SWE-V 78%, HumanEval 94% | HF |
| Llama 5 405B | 405B | 44 | ~1500 | 91% | HumanEval 90%, dense frontier | HF |
| Qwen 4 Coder 32B-A3B | 32B MoE | 44 | ~1495 | 87% | SWE-V 82%, HumanEval 96% | HF |
| Qwen 4 Preview 32B-A3B | 32B MoE | 42 | ~1490 | 88% | HumanEval 92%, SWE-V 76% | HF |
| Devstral Small 24B | 24B | 24 | ~1385 | 79% | Agentic coding specialist | HF |
| DeepSeek V3.2 | 685B MoE | 42 | ~1445 | 94.2% | SWE 58.2% | HF |
| GLM 5.2 Air | 106B MoE | 40 | ~1470 | 88% | SWE-Pro 58%, Mac-runnable | HF |
| Gemma 4 31B | 31B Dense | 40 | 1452 (#3) | ~88% | AIME 89.2% | |
| Qwen 3.5 (397B) | 397B MoE | 38 | ~1430 | ~90% | — | HF |
| Llama 5 70B | 70B | 38 | ~1450 | 88% | HumanEval 86%, dense | HF |
| Phi-5 Large 28B | 28B | 36 | ~1455 | 88% | AIME 80%, MIT | HF |
| Mistral Voyage Pro 70B | 70B | 36 | ~1440 | 85% | SWE-V 68%, agentic | HF |
| GLM-5.1 | 744B MoE | 48 | ~1460 | — | SWE-Pro 58.4% | HF |
| Qwen3-235B-A22B | 235B MoE | 46 | ~1455 | — | — | HF |
| Qwen 3.6-35B-A3B | 35B MoE | 38 | — | 82.6% | SWE 73.4% | HF |
| DeepSeek V3 | 685B MoE | 37 | ~1410 | 88.5% | — | HF |
| DeepSeek R1 | 671B MoE | 37 | ~1410 | — | — | HF |
| Mistral Large 3 | 675B MoE | 36 | ~1405 | — | — | HF |
| Llama 4 Maverick | 400B MoE | 36 | ~1400 | 85.5% | — | HF |
| Gemma 4 26B-A4B | 26B MoE | 35 | 1441 (#6) | ~85% | — | |
| Llama 3.1 405B | 405B | 35 | ~1395 | — | — | HF |
| MiniMax M2.5 | 230B MoE | 35 | ~1390 | — | SWE 80.2% | Arena |
| Mistral Medium 4 | 41B MoE | 34 | ~1445 | 84% | SWE-V 70%, agentic | HF |
| GLM-4.7 | 355B | 34 | ~1385 | — | HumanEval 94.2% | HF |
| Step-3.5 Flash | 196B MoE | 33 | ~1380 | — | — | est. |
| Gemma 4.5 27B | 27B | 32 | ~1435 | 86% | 1M context, multimodal | HF |
| MiMo-V2-Flash | 309B MoE | 32 | ~1370 | — | — | est. |
| GPT-oss 120B | 117B | 32 | ~1365 | MMLU-Pro 90% | — | HF |
| Command R+ 2 | 104B | 30 | ~1415 | 82% | Enterprise RAG, CC-BY-NC | HF |
| Grok 4 Open 100B-A20B | 100B MoE | 30 | ~1410 | 82% | xAI's first open weights | HF |
| DeepSeek R1 70B | 70B | 30 | ~1350 | — | — | HF |
| Mistral Small 4 | 119B MoE | 34 | ~1385 | — | — | HF |
| Llama 4 Scout | 109B MoE | 30 | ~1345 | — | — | HF |
| Llama 5 Scout 109B-A17B | 109B MoE | 30 | ~1410 | 82% | multimodal | HF |
| Hermes 4 70B | 70B | 28 | ~1400 | — | exceptional instruction-following | HF |
| Phi-5 Medium 14B | 14B | 28 | ~1400 | 86% | AIME 75%, MIT | HF |
| Gemma 4.5 12B | 12B | 28 | ~1395 | 84% | 1M context | HF |
| Mistral Voyage 24B | 24B | 25 | ~1390 | 80% | balanced | HF |
| Nemotron Cascade 2 | 30B | 30 | — | — | — | HF |
| Mixtral 8x22B | 141B MoE | 28 | ~1320 | — | — | est. |
| Qwen 2.5 72B | 72B | 28 | ~1330 | ~86% | — | HF |
| Qwen3-Coder-Next | 80B MoE | 28 | — | — | SWE 70.6% | HF |
| Qwen 3.5 35B | 35B MoE | 27 | ~1340 | ~83% | — | HF |
| Llama 3.3 70B | 70B | 27 | ~1325 | 86% | — | HF |
| QwQ 32B | 32B | 26 | ~1310 | — | — | HF |
| Qwen 3 32B | 32B | 25 | ~1300 | — | — | HF |
| DeepSeek R1 32B | 32B | 24 | ~1290 | — | — | HF |
| Gemma 3 27B | 27B | 22 | ~1270 | — | — | HF |
| Qwen 3 30B-A3B | 30B MoE | 22 | ~1280 | — | — | HF |
| Qwen 4 4B | 4B | 22 | ~1340 | 80% | beats Phi-5 Mini | HF |
| Qwen 3.5 27B | 27B | 21 | ~1275 | — | — | HF |
| Qwen 3 14B | 14B | 20 | ~1250 | — | — | HF |
| Llama 5 8B | 8B | 20 | ~1370 | 78% | HumanEval 80% | HF |
| Phi-4 14B | 14B | 19 | ~1240 | 84.8% | — | HF |
| Phi-5 Mini 4B | 4B | 18 | ~1310 | 82% | HumanEval 78%, AIME 61% | HF |
| Qwen 3.5 9B | 9B | 18 | ~1220 | — | — | HF |
| Qwen 2.5 14B | 14B | 18 | ~1210 | — | — | HF |
| Ministral 14B | 14B | 18 | ~1215 | — | — | est. |
| Gemma 3 12B | 12B | 17 | ~1200 | — | — | HF |
| DeepSeek R1 8B | 8B | 16 | ~1190 | — | — | HF |
| Gemma 4 E4B | 4B (PLE) | 16 | ~1195 | — | — | |
| Qwen 3 8B | 8B | 15 | ~1180 | — | — | HF |
| Phi-4 Mini | 3.8B | 14 | ~1170 | — | — | HF |
| Ministral 8B | 8B | 14 | ~1165 | — | — | est. |
| Gemma 4 E2B | 2.3B (PLE) | 13 | ~1150 | — | — | |
| Mistral 7B | 7B | 13 | ~1140 | — | — | HF |
| Llama 3.1 8B | 8B | 12 | ~1130 | — | — | HF |
| Qwen 3.5 4B | 4B | 12 | ~1125 | — | — | HF |
| Qwen 3 4B | 4B | 11 | ~1110 | — | — | HF |
| Gemma 3 4B | 4B | 10 | ~1090 | — | — | HF |
| SmolLM3 3B | 3B | 10 | ~1180 | — | edge-optimized | HF |
Dimension 2: Speed on Apple Silicon (25 points) — estimated
Estimated tokens per second on the M5 Max 128 GB reference configuration at Q4_K_M — computed with the estimation model above, not measured in a lab.
Speed points are calculated as: est. tok/s × 0.25, capped at 25 points. Any model estimated at 100+ tok/s on the reference configuration receives full speed points. The estimates assume these reference conditions:
- Quantization: Q4_K_M (most common consumer quantization)
- Reference hardware: M5 Max, 128 GB Unified Memory (600 GB/s bandwidth)
- Runtime class: llama.cpp / MLX-class engines (Ollama, LM Studio) with dense efficiency ~0.6–0.8
- Context: short-context generation — long contexts reduce speed as the KV cache grows
- Provenance: rows are upgraded from estimated to sourced or community as linked third-party tests and user submissions come in
Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. The full dataset — with a provenance label on every row — is available for download at /data/.
Dimension 3: Accessibility (15 points)
How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.
Accessibility is a step function based on minimum RAM required at 4-bit quantization, derived from the RAM model above (weights + KV cache + macOS overhead, within the ~75% unified-memory budget). Since a large share of Mac users have 16 GB or less, accessibility is crucial for real-world impact:
| Min RAM (4-bit) | Points | Example Models | Est. Mac Users |
|---|---|---|---|
| ≤ 8 GB | 15 | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B | ~100% of Apple Silicon |
| ≤ 16 GB | 12 | Qwen 3 14B, Gemma 3 12B | ~85% |
| ≤ 24 GB | 10 | Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B | ~40% |
| ≤ 32 GB | 9 | QwQ 32B, Qwen 3 32B | ~30% |
| ≤ 64 GB | 5 | DeepSeek R1 70B, Llama 3.3 70B | ~10% |
| ≤ 128 GB | 2 | GPT-oss 120B, Mixtral 8x22B | ~3% |
| > 128 GB | 0 | Kimi K2.5, DeepSeek V3, GLM-5.1 | Server only |
Dimension 4: License Openness (10 points)
How freely can you use, modify, and distribute the model? More open = higher score.
| License | Points | Can Modify? | Commercial Use? | Models |
|---|---|---|---|---|
| MIT | 10 | Yes | Unrestricted | Kimi K2.5, DeepSeek, Phi-4 Mini |
| Apache 2.0 | 8 | Yes | Yes (with notice) | Gemma 4, Qwen 3.5, Mistral |
| Gemma | 6 | Yes | Yes (subject to Gemma use restrictions) | Gemma 3 (old license) |
| Meta Custom | 5 | Limited | Yes (<700M users) | Llama 5, Llama 4 |
| xAI Custom | 4 | Limited | Yes (restrictions) | Grok 4 Open |
| CC-BY-NC | 2 | Yes | Non-commercial only | Command R+ 2 |
| Proprietary / N/A | 2 | No | API only | MiniMax M2.5 |
Note: CC-BY-NC (non-commercial) scores 2 — usable for research but not commercial deployment. Meta / xAI community licenses score 5 / 4, reflecting modification rights with commercial caps.
Score Examples
Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed min(25, est. 48 tok/s × 0.25 = 12) + accessibility 10 (24 GB) + license 8 (Apache 2.0) = 65 from the rounded components shown here; the index attributes the gap to the published 67 to rounding. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.
Qwen 3.5 9B (Score: 66) = capScore 18 + speed min(25, est. 100 tok/s × 0.25 = 25) + accessibility 15 (8 GB) + license 8 (Apache 2.0) = 66. Ranks high because maximum speed + accessibility points compensate for lower raw capability.
Kimi K2.5 (Score: 60) = capScore 50 (highest!) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite being the most capable model, it scores lower because no Mac user can run it locally.
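The three examples above can be audited by summing their stated components. The snippet below is a simple check, with the component values copied from this page and the dictionary layout invented for illustration. The Gemma 4 26B-A4B components add up to 65, two points below its published score, a gap the example itself attributes to rounding.

```python
# Component values as stated in the score examples on this page.
EXAMPLES = {
    "Gemma 4 26B-A4B": {
        "capability": 35, "speed": min(25, 48 * 0.25),
        "accessibility": 10, "license": 8,
    },
    "Qwen 3.5 9B": {
        "capability": 18, "speed": min(25, 100 * 0.25),
        "accessibility": 15, "license": 8,
    },
    "Kimi K2.5": {
        "capability": 50, "speed": 0,
        "accessibility": 0, "license": 10,
    },
}

totals = {name: sum(parts.values()) for name, parts in EXAMPLES.items()}

for name, total in totals.items():
    print(f"{name}: {total:g}")
```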
Limitations & Known Issues
- Arena ELO volatility: Ratings shift weekly as new votes come in. The index snapshots monthly and notes the capture date.
- Estimated capScores: ~15% of capScores are extrapolated from model-family performance when official benchmarks are pending. These are marked "est." in the table above.
- Estimates are not measurements: speed figures come from the bandwidth model, which cannot capture thermals, long-context slowdown, or runtime-specific optimizations. Expect real-world variance of 10–20% either way — sometimes more on fanless machines.
- Quantization baseline: All estimates assume Q4_K_M. Some models are commonly run at Q5, Q8, or MLX 4-bit, which shift both size and speed; the index standardizes on one baseline for fair comparison.
- No multimodal scoring: The current score does not reward multimodal capabilities (image/audio input). Models like Gemma 4 E4B have multimodal features not captured in the 0–100 score.
Corrections and real-world benchmark data are always welcome. If you have runs that differ from the index's estimates, submit them through the community benchmark process — verified community rows replace estimates.
Frequently Asked Questions
Does LLMCheck run its own benchmarks?
No. LLMCheck is an independent index, not a benchmark lab. Every figure on the site is one of three things: an estimate from the published LLMCheck estimation model (memory-bandwidth math and quantization arithmetic, fully documented on this page), a sourced number from a linked third-party benchmark, or a community-submitted run. Each row is labeled with its provenance, and no figure is claimed as a first-party lab measurement.
How does LLMCheck calculate its scores?
The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from published third-party evaluations such as Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from estimated tokens/sec on the M5 Max reference configuration, Accessibility (15 pts) as a step function of minimum RAM in which lower requirements score higher, and License Openness (10 pts) where MIT scores 10 and restrictive licenses score lower. The formula is fully transparent and reproducible.
Where does LLMCheck get its capability scores?
Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model. When official benchmarks are unavailable, the index extrapolates from related models in the same family and marks the score as 'estimated'.
How does LLMCheck estimate tokens per second on Apple Silicon?
Token generation on Apple Silicon is memory-bandwidth-bound, so estimated tok/s ≈ (memory bandwidth in GB/s ÷ model size in GB at Q4_K_M) × an efficiency factor of roughly 0.6–0.8 observed for llama.cpp/MLX-class runtimes. Mixture-of-Experts models use active-parameter bytes instead of total size, which is why they are much faster. Scaling across chips follows the bandwidth ratio. The full model is published at llmcheck.net/methodology#estimation, and every estimated figure is labeled as such.
Why does LLMCheck weight capability at 50% of the total score?
Capability receives the highest weight because the primary value of an LLM is the quality of its outputs. A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. However, speed (25%) and accessibility (15%) ensure that models which actually run well on consumer Macs score higher than server-only models with superior capability but no practical local use.
How often are LLMCheck scores updated?
Scores are updated within 48–72 hours of major model releases. The full leaderboard is refreshed monthly with the latest Arena AI ELO ratings and community benchmark submissions. Speed estimates are recomputed as new Apple Silicon hardware specifications become available. All updates are timestamped in the open dataset at llmcheck.net/data/.
Benchmark Sources
- Arena AI / LMSYS Chatbot Arena — Human preference ELO ratings
- HuggingFace Model Hub — Official model cards with MMLU and benchmark data
- LLM Stats — Aggregated benchmark scores (HumanEval, SWE-Bench)
- Onyx Open LLM Leaderboard — Tier rankings and benchmark aggregation
- Google DeepMind Gemma 4 Blog — Official Gemma 4 benchmarks
- GEO: Generative Engine Optimization (Princeton/Georgia Tech) — Research methodology reference
See the Full Leaderboard
79 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability — every figure labeled with its provenance.