LLMCheck Scoring Methodology

How we rank 46 local and frontier LLMs for Mac — fully transparent, independently sourced, and reproducible.

The LLMCheck Score is a 0–100 composite metric combining four dimensions: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from tokens/sec on M5 Max, Accessibility (15 pts) based on minimum RAM, and License Openness (10 pts). Every score is verifiable — sources are linked per model below.

The Formula

LLMCheck Score = Capability + Speed + Accessibility + License
Capability (0–50): Normalized from Arena AI ELO + MMLU + coding benchmarks
Speed (0–25): tok/s on M5 Max × 0.25, capped at 25
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, Proprietary = 2
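The formula above can be sketched as a small scoring function. The tier boundaries and license values mirror the tables in this document; the capability input is taken as a precomputed 0–50 capScore, since its derivation is covered separately below. A minimal sketch, not LLMCheck's production code:

```python
def speed_points(tok_s: float) -> float:
    """Speed (0-25): tok/s on M5 Max x 0.25, capped at 25."""
    return min(25.0, tok_s * 0.25)

def accessibility_points(min_ram_gb: float) -> int:
    """Accessibility (0-15): step function over minimum RAM at INT4."""
    tiers = [(8, 15), (16, 12), (24, 10), (32, 9), (64, 5), (128, 2)]
    for limit, points in tiers:
        if min_ram_gb <= limit:
            return points
    return 0  # >128 GB: server-only, no Mac can run it

LICENSE_POINTS = {
    "MIT": 10,
    "Apache 2.0": 8,
    "Gemma": 6,
    "Meta Custom": 5,
    "Proprietary": 2,
}

def llmcheck_score(cap_score: float, tok_s: float,
                   min_ram_gb: float, license_name: str) -> int:
    """0-100 composite: Capability + Speed + Accessibility + License."""
    return round(cap_score
                 + speed_points(tok_s)
                 + accessibility_points(min_ram_gb)
                 + LICENSE_POINTS[license_name])
```

For example, `llmcheck_score(18, 100, 8, "Apache 2.0")` reproduces the Qwen 3.5 9B total of 66 (18 + 25 + 15 + 8).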

Dimension 1: Capability (50 points)
Measures raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from three public benchmark systems.

The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. We derive capScore from three public sources:

Arena AI ELO ratings: human preference, weighted 40%
MMLU scores from official model cards: knowledge breadth, weighted 35%
Coding benchmarks (HumanEval, SWE-Bench): weighted 25%

When official benchmarks are unavailable for a model (common for very new releases), we extrapolate from related models in the same family and mark the score as estimated. These estimates are replaced with verified scores within 2–4 weeks of release.

Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.
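The 40/35/25 blend can be sketched as follows. The weights are the ones stated in this methodology; the min-max normalization ranges (e.g. ELO between 1050 and 1500) are illustrative assumptions, since the exact normalization constants are not published here:

```python
def cap_score(elo: float, mmlu: float, coding: float) -> float:
    """Blend three benchmark signals into a 0-50 capability score.
    Weights (40/35/25) per the methodology; the normalization ranges
    below are illustrative assumptions, not LLMCheck's constants."""
    def norm(x: float, lo: float, hi: float) -> float:
        # min-max scale, clamped to [0, 1]
        return max(0.0, min(1.0, (x - lo) / (hi - lo)))

    blended = (0.40 * norm(elo, 1050, 1500)    # Arena AI ELO (human preference)
               + 0.35 * norm(mmlu, 60, 95)     # MMLU (knowledge breadth)
               + 0.25 * norm(coding, 40, 100)) # HumanEval / SWE-Bench style
    return round(50 * blended, 1)
```

A stronger model on every axis always yields a higher capScore, and the result stays within the 0–50 band regardless of input outliers thanks to the clamping.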

Capability Score Table (All 46 Models)

| Model | Params | capScore | Arena ELO | MMLU | Coding | Sources |
|---|---|---|---|---|---|---|
| Kimi K2.5 | 1T MoE | 50 | ~1480 | | HumanEval 99% | Arena |
| DeepSeek V3.2 | 685B MoE | 42 | ~1445 | 94.2% | SWE 58.2% | HF |
| Gemma 4 31B | 31B Dense | 40 | 1452 (#3) | ~88% | AIME 89.2% | Google |
| Qwen 3.5 (397B) | 397B MoE | 38 | ~1430 | ~90% | | HF |
| GLM-5 | 744B | 38 | ~1420 | | SWE 77.8% | HF |
| DeepSeek V3 | 685B MoE | 37 | ~1410 | 88.5% | | HF |
| DeepSeek R1 | 671B MoE | 37 | ~1410 | | | HF |
| Mistral Large 3 | 675B MoE | 36 | ~1405 | | | HF |
| Llama 4 Maverick | 400B MoE | 36 | ~1400 | 85.5% | | HF |
| Gemma 4 26B-A4B | 26B MoE | 35 | 1441 (#6) | ~85% | | Google |
| Llama 3.1 405B | 405B | 35 | ~1395 | | | HF |
| MiniMax M2.5 | 230B MoE | 35 | ~1390 | | SWE 80.2% | Arena |
| GLM-4.7 | 355B | 34 | ~1385 | | HumanEval 94.2% | HF |
| Step-3.5 Flash | 196B MoE | 33 | ~1380 | | | est. |
| MiMo-V2-Flash | 309B MoE | 32 | ~1370 | | | est. |
| GPT-oss 120B | 117B | 32 | ~1365 | MMLU-Pro 90% | | HF |
| DeepSeek R1 70B | 70B | 30 | ~1350 | | | HF |
| Llama 4 Scout | 109B MoE | 30 | ~1345 | | | HF |
| Mixtral 8x22B | 141B MoE | 28 | ~1320 | | | est. |
| Qwen 2.5 72B | 72B | 28 | ~1330 | ~86% | | HF |
| Qwen3-Coder-Next | 80B MoE | 28 | | | SWE 70.6% | HF |
| Qwen 3.5 35B | 35B MoE | 27 | ~1340 | ~83% | | HF |
| Llama 3.3 70B | 70B | 27 | ~1325 | 86% | | HF |
| QwQ 32B | 32B | 26 | ~1310 | | | HF |
| Qwen 3 32B | 32B | 25 | ~1300 | | | HF |
| DeepSeek R1 32B | 32B | 24 | ~1290 | | | HF |
| Gemma 3 27B | 27B | 22 | ~1270 | | | HF |
| Qwen 3 30B-A3B | 30B MoE | 22 | ~1280 | | | HF |
| Qwen 3.5 27B | 27B | 21 | ~1275 | | | HF |
| Qwen 3 14B | 14B | 20 | ~1250 | | | HF |
| Phi-4 14B | 14B | 19 | ~1240 | 84.8% | | HF |
| Qwen 3.5 9B | 9B | 18 | ~1220 | | | HF |
| Qwen 2.5 14B | 14B | 18 | ~1210 | | | HF |
| Ministral 14B | 14B | 18 | ~1215 | | | est. |
| Gemma 3 12B | 12B | 17 | ~1200 | | | HF |
| DeepSeek R1 8B | 8B | 16 | ~1190 | | | HF |
| Gemma 4 E4B | 4B (PLE) | 16 | ~1195 | | | Google |
| Qwen 3 8B | 8B | 15 | ~1180 | | | HF |
| Phi-4 Mini | 3.8B | 14 | ~1170 | | | HF |
| Ministral 8B | 8B | 14 | ~1165 | | | est. |
| Gemma 4 E2B | 2.3B (PLE) | 13 | ~1150 | | | Google |
| Mistral 7B | 7B | 13 | ~1140 | | | HF |
| Llama 3.1 8B | 8B | 12 | ~1130 | | | HF |
| Qwen 3.5 4B | 4B | 12 | ~1125 | | | HF |
| Qwen 3 4B | 4B | 11 | ~1110 | | | HF |
| Gemma 3 4B | 4B | 10 | ~1090 | | | HF |

Dimension 2: Speed on Apple Silicon (25 points)

Tokens per second on M5 Max 128 GB at Q4_K_M quantization. Directly measures how fast the model generates text on Apple Silicon.

Speed is calculated as tok/s × 0.25, capped at 25 points. This means any model generating 100+ tok/s on M5 Max receives full speed points. Speed measurements follow this protocol:

Quantization: Q4_K_M
Workload: standardized 256-token input prompt, 512-token generation window
Runs: results averaged over 3 runs on a freshly booted system
Hardware: M5 Max with 128 GB Unified Memory

Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. Full benchmark data is available for download at /data/.

Dimension 3: Accessibility (15 points)

How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.

Accessibility is a step function based on the minimum RAM required at INT4 quantization. Approximately 60% of Mac users have 16 GB of RAM or less, making accessibility crucial for real-world impact:

| Min RAM (INT4) | Points | Example Models | Est. Mac Users |
|---|---|---|---|
| ≤ 8 GB | 15 | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B | ~100% of Apple Silicon |
| ≤ 16 GB | 12 | Qwen 3 14B, Gemma 3 12B | ~85% |
| ≤ 24 GB | 10 | Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B | ~40% |
| ≤ 32 GB | 9 | QwQ 32B, Qwen 3 32B | ~30% |
| ≤ 64 GB | 5 | DeepSeek R1 70B, Llama 3.3 70B | ~10% |
| ≤ 128 GB | 2 | GPT-oss 120B, Mixtral 8x22B | ~3% |
| > 128 GB | 0 | Kimi K2.5, DeepSeek V3, GLM-5 | Server only |

Dimension 4: License Openness (10 points)

How freely can you use, modify, and distribute the model? More open = higher score.

| License | Points | Can Modify? | Commercial Use? | Models |
|---|---|---|---|---|
| MIT | 10 | Yes | Unrestricted | Kimi K2.5, DeepSeek, Phi-4 Mini |
| Apache 2.0 | 8 | Yes | Yes (with notice) | Gemma 4, Qwen 3.5, Mistral |
| Gemma | 6 | Yes | Yes (<700M users) | Gemma 3 (old license) |
| Meta Custom | 5 | Limited | Yes (<700M users) | Llama 4, Llama 3 |
| Proprietary / N/A | 2 | No | API only | MiniMax M2.5 |

Score Examples

Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed 12 (48 tok/s × 0.25) + accessibility 10 (24 GB tier) + license 8 (Apache 2.0), plus rounding adjustments, for 65–67. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.

Qwen 3.5 9B (Score: 66) = capScore 18 + speed 25 (100 tok/s × 0.25, capped) + accessibility 15 (≤8 GB tier) + license 8 (Apache 2.0) = 66. Ranks #2 because maximum speed and accessibility points compensate for lower raw capability.

Kimi K2.5 (Score: 60) = capScore 50 (highest!) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite being the most capable model, it scores lower because no Mac user can run it locally.
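As a quick sanity check, the three breakdowns above reduce to plain arithmetic; note the Gemma 4 26B-A4B sum lands at 65 before the rounding adjustments that yield its published 67:

```python
# Component sums from the worked examples above
gemma_4_26b = 35 + min(25, 48 * 0.25) + 10 + 8   # 65 before rounding adjustments
qwen_35_9b  = 18 + min(25, 100 * 0.25) + 15 + 8  # 66, matches published score
kimi_k25    = 50 + 0 + 0 + 10                    # 60: top capScore, unrunnable on Mac

print(gemma_4_26b, qwen_35_9b, kimi_k25)
```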

Limitations & Known Issues

We welcome corrections and updated benchmark data. If you have verified measurements that differ from ours, please submit them through the community benchmark process.

Frequently Asked Questions

How does LLMCheck calculate its scores?

The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) based on tokens/sec on M5 Max, Accessibility (15 pts) inversely proportional to minimum RAM, and License Openness (10 pts) where MIT scores 10 and proprietary scores lower.

Where does LLMCheck get its capability scores?

Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model.

How does LLMCheck measure tokens per second?

Speed benchmarks use Q4_K_M quantization, a standardized 256-token input prompt, and 512-token generation window. Results are averaged over 3 runs on a freshly booted system. Reference hardware is M5 Max with 128 GB Unified Memory.

Why does capability get 50% of the total score?

Capability receives the highest weight because the primary value of an LLM is output quality. A fast model with poor answers is less useful than a slower model with excellent reasoning. Speed and accessibility ensure Mac-runnable models score higher than server-only giants.

How often are scores updated?

Scores are updated within 48–72 hours of major model releases. The full leaderboard refreshes monthly with the latest Arena AI ELO ratings and community benchmark submissions.

Benchmark Sources

See the Full Leaderboard

46 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability.

View Leaderboard →