LLMCheck Scoring Methodology
How we rank 46 local and frontier LLMs for Mac — fully transparent, independently sourced, and reproducible.
The LLMCheck Score is a 0–100 composite metric combining four dimensions: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from tokens/sec on M5 Max, Accessibility (15 pts) based on minimum RAM, and License Openness (10 pts). Every score is verifiable — sources are linked per model below.
The Formula
Capability (0–50): weighted blend of Arena AI ELO (40%), MMLU (35%), and coding benchmarks (25%), detailed under Dimension 1 below
Speed (0–25): tok/s on M5 Max × 0.25, capped at 25
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, Proprietary = 2
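For readers who want to reproduce the composite, here is a minimal scoring sketch in Python. The Speed, Accessibility, and License rules are taken directly from the list above; capScore is passed in precomputed, since its derivation is covered under Dimension 1. The function and parameter names are illustrative, not LLMCheck's code.

```python
def llmcheck_score(cap_score: float, tok_s: float, min_ram_gb: float, license_name: str) -> float:
    """Composite 0-100 LLMCheck Score from the four dimension rules above."""
    # Speed (0-25): tok/s on M5 Max x 0.25, capped at 25. Server-only models pass tok_s=0.
    speed = min(25.0, tok_s * 0.25)

    # Accessibility (0-15): step function over minimum RAM at INT4.
    ram_tiers = [(8, 15), (16, 12), (24, 10), (32, 9), (64, 5), (128, 2)]
    access = next((pts for limit, pts in ram_tiers if min_ram_gb <= limit), 0)

    # License (0-10): openness tier; unknown or proprietary defaults to 2.
    license_pts = {"MIT": 10, "Apache 2.0": 8, "Gemma": 6, "Meta Custom": 5, "Proprietary": 2}

    return cap_score + speed + access + license_pts.get(license_name, 2)
```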
Dimension 1: Capability (50 points)
Measures raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from three public benchmark systems.
The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. We derive capScore from three public sources, blended as sketched after this list:
- Arena AI ELO Rating (40% weight) — Human preference ranking from lmarena.ai. Measures how often real users prefer a model's response over competitors in blind A/B tests. The gold standard for "does this model feel good to use."
- MMLU Score (35% weight) — Massive Multitask Language Understanding. Tests knowledge across 57 academic subjects. Sourced from official model cards on HuggingFace and arXiv papers.
- Coding Benchmarks (25% weight) — HumanEval (code generation) and SWE-Bench Verified (real bug fixing). Sourced from llm-stats.com and official model publications.
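The 40/35/25 weighting can be written down directly. In the sketch below, each source is assumed to be pre-normalized to a 0–1 scale; the exact normalization (e.g., how an ELO of ~1480 maps onto that scale) is not published on this page, so that step is left as a placeholder.

```python
def cap_score(arena_norm: float, mmlu_norm: float, coding_norm: float) -> float:
    """Blend the three capability sources into the 0-50 capScore.

    Each input is assumed pre-normalized to 0.0-1.0; this page does not
    publish the normalization, so that step is left to the caller.
    """
    blend = 0.40 * arena_norm + 0.35 * mmlu_norm + 0.25 * coding_norm
    return round(50 * blend, 1)
```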
When official benchmarks are unavailable for a model (common for very new releases), we extrapolate from related models in the same family and mark the score as estimated ("est." in the table below). These estimates are replaced with verified scores within 2–4 weeks of release.
Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.
Capability Score Table (All 46 Models)
| Model | Params | capScore | Arena ELO | MMLU | Coding | Sources |
|---|---|---|---|---|---|---|
| Kimi K2.5 | 1T MoE | 50 | ~1480 | — | HumanEval 99% | Arena |
| DeepSeek V3.2 | 685B MoE | 42 | ~1445 | 94.2% | SWE 58.2% | HF |
| Gemma 4 31B | 31B Dense | 40 | 1452 (#3) | ~88% | AIME 89.2% | DeepMind |
| Qwen 3.5 (397B) | 397B MoE | 38 | ~1430 | ~90% | — | HF |
| GLM-5 | 744B | 38 | ~1420 | — | SWE 77.8% | HF |
| DeepSeek V3 | 685B MoE | 37 | ~1410 | 88.5% | — | HF |
| DeepSeek R1 | 671B MoE | 37 | ~1410 | — | — | HF |
| Mistral Large 3 | 675B MoE | 36 | ~1405 | — | — | HF |
| Llama 4 Maverick | 400B MoE | 36 | ~1400 | 85.5% | — | HF |
| Gemma 4 26B-A4B | 26B MoE | 35 | 1441 (#6) | ~85% | — | DeepMind |
| Llama 3.1 405B | 405B | 35 | ~1395 | — | — | HF |
| MiniMax M2.5 | 230B MoE | 35 | ~1390 | — | SWE 80.2% | Arena |
| GLM-4.7 | 355B | 34 | ~1385 | — | HumanEval 94.2% | HF |
| Step-3.5 Flash | 196B MoE | 33 | ~1380 | — | — | est. |
| MiMo-V2-Flash | 309B MoE | 32 | ~1370 | — | — | est. |
| GPT-oss 120B | 117B | 32 | ~1365 | MMLU-Pro 90% | — | HF |
| DeepSeek R1 70B | 70B | 30 | ~1350 | — | — | HF |
| Llama 4 Scout | 109B MoE | 30 | ~1345 | — | — | HF |
| Mixtral 8x22B | 141B MoE | 28 | ~1320 | — | — | est. |
| Qwen 2.5 72B | 72B | 28 | ~1330 | ~86% | — | HF |
| Qwen3-Coder-Next | 80B MoE | 28 | — | — | SWE 70.6% | HF |
| Qwen 3.5 35B | 35B MoE | 27 | ~1340 | ~83% | — | HF |
| Llama 3.3 70B | 70B | 27 | ~1325 | 86% | — | HF |
| QwQ 32B | 32B | 26 | ~1310 | — | — | HF |
| Qwen 3 32B | 32B | 25 | ~1300 | — | — | HF |
| DeepSeek R1 32B | 32B | 24 | ~1290 | — | — | HF |
| Gemma 3 27B | 27B | 22 | ~1270 | — | — | HF |
| Qwen 3 30B-A3B | 30B MoE | 22 | ~1280 | — | — | HF |
| Qwen 3.5 27B | 27B | 21 | ~1275 | — | — | HF |
| Qwen 3 14B | 14B | 20 | ~1250 | — | — | HF |
| Phi-4 14B | 14B | 19 | ~1240 | 84.8% | — | HF |
| Qwen 3.5 9B | 9B | 18 | ~1220 | — | — | HF |
| Qwen 2.5 14B | 14B | 18 | ~1210 | — | — | HF |
| Ministral 14B | 14B | 18 | ~1215 | — | — | est. |
| Gemma 3 12B | 12B | 17 | ~1200 | — | — | HF |
| DeepSeek R1 8B | 8B | 16 | ~1190 | — | — | HF |
| Gemma 4 E4B | 4B (PLE) | 16 | ~1195 | — | — | DeepMind |
| Qwen 3 8B | 8B | 15 | ~1180 | — | — | HF |
| Phi-4 Mini | 3.8B | 14 | ~1170 | — | — | HF |
| Ministral 8B | 8B | 14 | ~1165 | — | — | est. |
| Gemma 4 E2B | 2.3B (PLE) | 13 | ~1150 | — | — | DeepMind |
| Mistral 7B | 7B | 13 | ~1140 | — | — | HF |
| Llama 3.1 8B | 8B | 12 | ~1130 | — | — | HF |
| Qwen 3.5 4B | 4B | 12 | ~1125 | — | — | HF |
| Qwen 3 4B | 4B | 11 | ~1110 | — | — | HF |
| Gemma 3 4B | 4B | 10 | ~1090 | — | — | HF |
Dimension 2: Speed on Apple Silicon (25 points)
Tokens per second on M5 Max 128 GB at Q4_K_M quantization. Directly measures how fast the model generates text on Apple Silicon.
Speed is calculated as tok/s × 0.25, capped at 25 points, so any model generating 100+ tok/s on the M5 Max receives full speed points. Speed measurements follow this protocol (a reproduction sketch follows the list):
- Quantization: Q4_K_M (most common consumer quantization)
- Input: Standardized 256-token prompt
- Output: 512-token generation window
- Hardware: M5 Max, 128 GB Unified Memory, latest macOS
- Engine: Ollama (default) or MLX where noted
- Runs: Average of 3 runs on freshly booted system
- Verification: Community submissions validated against known baselines
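As a rough illustration of the protocol, the sketch below times a generation run against a local Ollama server and computes tok/s from Ollama's own eval counters (eval_count tokens generated over eval_duration nanoseconds). The standardized 256-token prompt is not reproduced on this page, so PROMPT is a placeholder, and the model tag is an assumption.

```python
import statistics
import requests  # third-party: pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "..."  # placeholder for the standardized 256-token prompt

def measure_tok_s(model: str, runs: int = 3) -> float:
    """Average generation speed (tok/s) over several runs, per the protocol."""
    speeds = []
    for _ in range(runs):
        resp = requests.post(OLLAMA_URL, json={
            "model": model,                    # e.g. a Q4_K_M tag of the model
            "prompt": PROMPT,
            "stream": False,
            "options": {"num_predict": 512},   # 512-token generation window
        }, timeout=600)
        data = resp.json()
        # Ollama reports generated token count and generation time in nanoseconds.
        speeds.append(data["eval_count"] / data["eval_duration"] * 1e9)
    return statistics.mean(speeds)
```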
Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. Full benchmark data is available for download at /data/.
Dimension 3: Accessibility (15 points)
How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.
Accessibility is a step function based on the minimum RAM required at INT4 quantization. Our data suggests roughly 60% of Mac users have 16 GB of RAM or less, making accessibility crucial for real-world impact:
| Min RAM (INT4) | Points | Example Models | Est. Mac Users Able to Run |
|---|---|---|---|
| ≤ 8 GB | 15 | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B | ~100% of Apple Silicon |
| ≤ 16 GB | 12 | Qwen 3 14B, Gemma 3 12B | ~85% |
| ≤ 24 GB | 10 | Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B | ~40% |
| ≤ 32 GB | 9 | QwQ 32B, Qwen 3 32B | ~30% |
| ≤ 64 GB | 5 | DeepSeek R1 70B, Llama 3.3 70B | ~10% |
| ≤ 128 GB | 2 | GPT-oss 120B, Mixtral 8x22B | ~3% |
| > 128 GB | 0 | Kimi K2.5, DeepSeek V3, GLM-5 | Server only |
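These tiers roughly track the INT4 weight footprint. As a sanity check, a common rule of thumb (ours, not part of the LLMCheck methodology) is about 0.5 bytes per parameter for INT4 weights, scaled up for KV cache and runtime overhead:

```python
def est_int4_ram_gb(params_billion: float, overhead: float = 1.4) -> float:
    """Rough INT4 memory footprint: ~0.5 bytes/param, scaled for KV cache
    and runtime overhead. A rule of thumb, not LLMCheck's exact method."""
    return params_billion * 0.5 * overhead

# e.g. est_int4_ram_gb(14) ~= 9.8 GB, consistent with the <=16 GB tier above
```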
Dimension 4: License Openness (10 points)
How freely can you use, modify, and distribute the model? More open = higher score.
| License | Points | Can Modify? | Commercial Use? | Models |
|---|---|---|---|---|
| MIT | 10 | Yes | Unrestricted | Kimi K2.5, DeepSeek, Phi-4 Mini |
| Apache 2.0 | 8 | Yes | Yes (with notice) | Gemma 4, Qwen 3.5, Mistral |
| Gemma | 6 | Yes | Yes (use policy applies) | Gemma 3 (old license) |
| Meta Custom | 5 | Limited | Yes (<700M users) | Llama 4, Llama 3 |
| Proprietary / N/A | 2 | No | API only | MiniMax M2.5 |
Score Examples
Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed 12 (48 tok/s × 0.25) + accessibility 10 (24 GB) + license 8 (Apache 2.0) = 65 from the rounded components shown; the published 67 reflects sub-point precision in the components before rounding. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.
Qwen 3.5 9B (Score: 66) = capScore 18 + speed 25 (100 tok/s × 0.25, capped) + accessibility 15 (8 GB) + license 8 (Apache 2.0) = 66. Ranks #2 because maximum speed and accessibility points compensate for lower raw capability.
Kimi K2.5 (Score: 60) = capScore 50 (highest!) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite being the most capable model, it scores lower because no Mac user can run it locally.
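The two integer-exact examples reproduce directly with the llmcheck_score sketch from "The Formula" above:

```python
# Qwen 3.5 9B: capScore 18, 100 tok/s, 8 GB min RAM, Apache 2.0
print(llmcheck_score(18, 100, 8, "Apache 2.0"))  # -> 66.0

# Kimi K2.5: capScore 50, server-only (0 tok/s on Mac; 999 stands in for >128 GB), MIT
print(llmcheck_score(50, 0, 999, "MIT"))         # -> 60.0
```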
Limitations & Known Issues
- Arena ELO volatility: Ratings shift weekly as new votes come in. We snapshot monthly and note the capture date.
- Estimated scores: ~15% of capScores are extrapolated from model family performance when official benchmarks are pending. These are marked "est." in the table above.
- Speed variance: tok/s can vary 10–20% based on prompt content, context length, and background processes. Our measurements use standardized conditions but real-world speeds may differ.
- Quantization bias: All speeds use Q4_K_M. Some models perform better at Q5 or Q8, but we standardize for fair comparison.
- No multimodal scoring: The current score does not reward multimodal capabilities (image/audio input). Models like Gemma 4 E4B have multimodal features not captured in the 0–100 score.
We welcome corrections and updated benchmark data. If you have verified measurements that differ from ours, please submit them through the community benchmark process.
Frequently Asked Questions
How does LLMCheck calculate its scores?
The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) based on tokens/sec on M5 Max, Accessibility (15 pts) decreasing stepwise with minimum RAM, and License Openness (10 pts) where MIT scores 10 and proprietary scores lower.
Where does LLMCheck get its capability scores?
Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model.
How does LLMCheck measure tokens per second?
Speed benchmarks use Q4_K_M quantization, a standardized 256-token input prompt, and 512-token generation window. Results are averaged over 3 runs on a freshly booted system. Reference hardware is M5 Max with 128 GB Unified Memory.
Why does capability get 50% of the total score?
Capability receives the highest weight because the primary value of an LLM is output quality. A fast model with poor answers is less useful than a slower model with excellent reasoning. Speed and accessibility ensure Mac-runnable models score higher than server-only giants.
How often are scores updated?
Scores are updated within 48–72 hours of major model releases. The full leaderboard refreshes monthly with the latest Arena AI ELO ratings and community benchmark submissions.
Benchmark Sources
- Arena AI / LMSYS Chatbot Arena — Human preference ELO ratings
- HuggingFace Model Hub — Official model cards with MMLU and benchmark data
- LLM Stats — Aggregated benchmark scores (HumanEval, SWE-Bench)
- Onyx Open LLM Leaderboard — Tier rankings and benchmark aggregation
- Google DeepMind Gemma 4 Blog — Official Gemma 4 benchmarks
- GEO: Generative Engine Optimization (Princeton/Georgia Tech) — Research methodology reference
See the Full Leaderboard
46 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability.
View Leaderboard →