LLMCheck Scoring Methodology
How we rank 46 local and frontier LLMs for Mac — fully transparent, independently sourced, and reproducible.
The LLMCheck Score is a 0–100 composite metric combining four dimensions: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from tokens/sec on M5 Max, Accessibility (15 pts) based on minimum RAM, and License Openness (10 pts). Every score is verifiable — sources are linked per model below.
The Formula
Capability (0–50): weighted blend of Arena AI ELO (40%), MMLU (35%), and coding benchmarks (25%), detailed under Dimension 1 below
Speed (0–25): tok/s on M5 Max × 0.25, capped at 25
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, Proprietary = 2
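For readers who want to reproduce the composite, here is a minimal scoring sketch in Python. The Speed, Accessibility, and License rules are taken directly from the list above; capScore is passed in precomputed, since its derivation is covered under Dimension 1. The function and parameter names are illustrative, not LLMCheck's code.

```python
def llmcheck_score(cap_score: float, tok_s: float, min_ram_gb: float, license_name: str) -> float:
    """Composite 0-100 LLMCheck Score from the four dimension rules above."""
    # Speed (0-25): tok/s on M5 Max x 0.25, capped at 25. Server-only models pass tok_s=0.
    speed = min(25.0, tok_s * 0.25)

    # Accessibility (0-15): step function over minimum RAM at INT4.
    ram_tiers = [(8, 15), (16, 12), (24, 10), (32, 9), (64, 5), (128, 2)]
    access = next((pts for limit, pts in ram_tiers if min_ram_gb <= limit), 0)

    # License (0-10): openness tier; unknown or proprietary defaults to 2.
    license_pts = {"MIT": 10, "Apache 2.0": 8, "Gemma": 6, "Meta Custom": 5, "Proprietary": 2}

    return cap_score + speed + access + license_pts.get(license_name, 2)
```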
Dimension 1: Capability (50 points)
Measures raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from three public benchmark systems.
The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. We derive capScore from three public sources, blended as sketched after this list:
- Arena AI ELO Rating (40% weight) — Human preference ranking from lmarena.ai. Measures how often real users prefer a model's response over competitors in blind A/B tests. The gold standard for "does this model feel good to use."
- MMLU Score (35% weight) — Massive Multitask Language Understanding. Tests knowledge across 57 academic subjects. Sourced from official model cards on HuggingFace and arXiv papers.
- Coding Benchmarks (25% weight) — HumanEval (code generation) and SWE-Bench Verified (real bug fixing). Sourced from llm-stats.com and official model publications.
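The 40/35/25 weighting can be written down directly. In the sketch below, each source is assumed to be pre-normalized to a 0–1 scale; the exact normalization (e.g., how an ELO of ~1480 maps onto that scale) is not published on this page, so that step is left as a placeholder.

```python
def cap_score(arena_norm: float, mmlu_norm: float, coding_norm: float) -> float:
    """Blend the three capability sources into the 0-50 capScore.

    Each input is assumed pre-normalized to 0.0-1.0; this page does not
    publish the normalization, so that step is left to the caller.
    """
    blend = 0.40 * arena_norm + 0.35 * mmlu_norm + 0.25 * coding_norm
    return round(50 * blend, 1)
```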
When official benchmarks are unavailable for a model (common for very new releases), we extrapolate from related models in the same family and mark the score as estimated ("est." in the table below). These estimates are replaced with verified scores within 2–4 weeks of release.
Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.
Capability Score Table (All 46 Models)
| Model | Params | capScore | Arena ELO | MMLU | Coding | Sources |
|---|---|---|---|---|---|---|
| Kimi K2.5 | 1T MoE | 50 | ~1480 | — | HumanEval 99% | Arena |
| DeepSeek V3.2 | 685B MoE | 42 | ~1445 | 94.2% | SWE 58.2% | HF |
| Gemma 4 31B | 31B Dense | 40 | 1452 (#3) | ~88% | AIME 89.2% | DeepMind |
| Qwen 3.5 (397B) | 397B MoE | 38 | ~1430 | ~90% | — | HF |
| GLM-5 | 744B | 38 | ~1420 | — | SWE 77.8% | HF |
| DeepSeek V3 | 685B MoE | 37 | ~1410 | 88.5% | — | HF |
| DeepSeek R1 | 671B MoE | 37 | ~1410 | — | — | HF |
| Mistral Large 3 | 675B MoE | 36 | ~1405 | — | — | HF |
| Llama 4 Maverick | 400B MoE | 36 | ~1400 | 85.5% | — | HF |
| Gemma 4 26B-A4B | 26B MoE | 35 | 1441 (#6) | ~85% | — | DeepMind |
| Llama 3.1 405B | 405B | 35 | ~1395 | — | — | HF |
| MiniMax M2.5 | 230B MoE | 35 | ~1390 | — | SWE 80.2% | Arena |
| GLM-4.7 | 355B | 34 | ~1385 | — | HumanEval 94.2% | HF |
| Step-3.5 Flash | 196B MoE | 33 | ~1380 | — | — | est. |
| MiMo-V2-Flash | 309B MoE | 32 | ~1370 | — | — | est. |
| GPT-oss 120B | 117B | 32 | ~1365 | MMLU-Pro 90% | — | HF |
| DeepSeek R1 70B | 70B | 30 | ~1350 | — | — | HF |
| Llama 4 Scout | 109B MoE | 30 | ~1345 | — | — | HF |
| Mixtral 8x22B | 141B MoE | 28 | ~1320 | — | — | est. |
| Qwen 2.5 72B | 72B | 28 | ~1330 | ~86% | — | HF |
| Qwen3-Coder-Next | 80B MoE | 28 | — | — | SWE 70.6% | HF |
| Qwen 3.5 35B | 35B MoE | 27 | ~1340 | ~83% | — | HF |
| Llama 3.3 70B | 70B | 27 | ~1325 | 86% | — | HF |
| QwQ 32B | 32B | 26 | ~1310 | — | — | HF |
| Qwen 3 32B | 32B | 25 | ~1300 | — | — | HF |
| DeepSeek R1 32B | 32B | 24 | ~1290 | — | — | HF |
| Gemma 3 27B | 27B | 22 | ~1270 | — | — | HF |
| Qwen 3 30B-A3B | 30B MoE | 22 | ~1280 | — | — | HF |
| Qwen 3.5 27B | 27B | 21 | ~1275 | — | — | HF |
| Qwen 3 14B | 14B | 20 | ~1250 | — | — | HF |
| Phi-4 14B | 14B | 19 | ~1240 | 84.8% | — | HF |
| Qwen 3.5 9B | 9B | 18 | ~1220 | — | — | HF |
| Qwen 2.5 14B | 14B | 18 | ~1210 | — | — | HF |
| Ministral 14B | 14B | 18 | ~1215 | — | — | est. |
| Gemma 3 12B | 12B | 17 | ~1200 | — | — | HF |
| DeepSeek R1 8B | 8B | 16 | ~1190 | — | — | HF |
| Gemma 4 E4B | 4B (PLE) | 16 | ~1195 | — | — | DeepMind |
| Qwen 3 8B | 8B | 15 | ~1180 | — | — | HF |
| Phi-4 Mini | 3.8B | 14 | ~1170 | — | — | HF |
| Ministral 8B | 8B | 14 | ~1165 | — | — | est. |
| Gemma 4 E2B | 2.3B (PLE) | 13 | ~1150 | — | — | DeepMind |
| Mistral 7B | 7B | 13 | ~1140 | — | — | HF |
| Llama 3.1 8B | 8B | 12 | ~1130 | — | — | HF |
| Qwen 3.5 4B | 4B | 12 | ~1125 | — | — | HF |
| Qwen 3 4B | 4B | 11 | ~1110 | — | — | HF |
| Gemma 3 4B | 4B | 10 | ~1090 | — | — | HF |
Dimension 2: Speed on Apple Silicon (25 points)
Tokens per second on M5 Max 128 GB at Q4_K_M quantization. Directly measures how fast the model generates text on Apple Silicon.
Speed is calculated as tok/s × 0.25, capped at 25 points, so any model generating 100+ tok/s on the M5 Max receives full speed points. Speed measurements follow this protocol (a reproduction sketch follows the list):
- Quantization: Q4_K_M (most common consumer quantization)
- Input: Standardized 256-token prompt
- Output: 512-token generation window
- Hardware: M5 Max, 128 GB Unified Memory, latest macOS
- Engine: Ollama (default) or MLX where noted
- Runs: Average of 3 runs on freshly booted system
- Verification: Community submissions validated against known baselines
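As a rough illustration of the protocol, the sketch below times a generation run against a local Ollama server and computes tok/s from Ollama's own eval counters (eval_count tokens generated over eval_duration nanoseconds). The standardized 256-token prompt is not reproduced on this page, so PROMPT is a placeholder, and the model tag is an assumption.

```python
import statistics
import requests  # third-party: pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "..."  # placeholder for the standardized 256-token prompt

def measure_tok_s(model: str, runs: int = 3) -> float:
    """Average generation speed (tok/s) over several runs, per the protocol."""
    speeds = []
    for _ in range(runs):
        resp = requests.post(OLLAMA_URL, json={
            "model": model,                    # e.g. a Q4_K_M tag of the model
            "prompt": PROMPT,
            "stream": False,
            "options": {"num_predict": 512},   # 512-token generation window
        }, timeout=600)
        data = resp.json()
        # Ollama reports generated token count and generation time in nanoseconds.
        speeds.append(data["eval_count"] / data["eval_duration"] * 1e9)
    return statistics.mean(speeds)
```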
Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. Full benchmark data is available for download at /data/.
Dimension 3: Accessibility (15 points)
How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.
Accessibility is a step function based on the minimum RAM required at INT4 quantization. Our data suggests roughly 60% of Mac users have 16 GB of RAM or less, making accessibility crucial for real-world impact:
| Min RAM (INT4) | Points | Example Models | Est. Mac Users Able to Run |
|---|---|---|---|
| ≤ 8 GB | 15 | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B | ~100% of Apple Silicon |
| ≤ 16 GB | 12 | Qwen 3 14B, Gemma 3 12B | ~85% |
| ≤ 24 GB | 10 | Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B | ~40% |
| ≤ 32 GB | 9 | QwQ 32B, Qwen 3 32B | ~30% |
| ≤ 64 GB | 5 | DeepSeek R1 70B, Llama 3.3 70B | ~10% |
| ≤ 128 GB | 2 | GPT-oss 120B, Mixtral 8x22B | ~3% |
| > 128 GB | 0 | Kimi K2.5, DeepSeek V3, GLM-5 | Server only |
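These tiers roughly track the INT4 weight footprint. As a sanity check, a common rule of thumb (ours, not part of the LLMCheck methodology) is about 0.5 bytes per parameter for INT4 weights, scaled up for KV cache and runtime overhead:

```python
def est_int4_ram_gb(params_billion: float, overhead: float = 1.4) -> float:
    """Rough INT4 memory footprint: ~0.5 bytes/param, scaled for KV cache
    and runtime overhead. A rule of thumb, not LLMCheck's exact method."""
    return params_billion * 0.5 * overhead

# e.g. est_int4_ram_gb(14) ~= 9.8 GB, consistent with the <=16 GB tier above
```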
Dimension 4: License Openness (10 points)
How freely can you use, modify, and distribute the model? More open = higher score.
| License | Points | Can Modify? | Commercial Use? | Models |
|---|---|---|---|---|
| MIT | 10 | Yes | Unrestricted | Kimi K2.5, DeepSeek, Phi-4 Mini |
| Apache 2.0 | 8 | Yes | Yes (with notice) | Gemma 4, Qwen 3.5, Mistral |
| Gemma | 6 | Yes | Yes (use policy applies) | Gemma 3 (old license) |
| Meta Custom | 5 | Limited | Yes (<700M users) | Llama 4, Llama 3 |
| Proprietary / N/A | 2 | No | API only | MiniMax M2.5 |
Score Examples
Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed 12 (48 tok/s × 0.25) + accessibility 10 (24 GB) + license 8 (Apache 2.0) = 65 from the rounded components shown; the published 67 reflects sub-point precision in the components before rounding. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.
Qwen 3.5 9B (Score: 66) = capScore 18 + speed 25 (100 tok/s × 0.25, capped) + accessibility 15 (8 GB) + license 8 (Apache 2.0) = 66. Ranks #2 because maximum speed and accessibility points compensate for lower raw capability.
Kimi K2.5 (Score: 60) = capScore 50 (highest!) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite being the most capable model, it scores lower because no Mac user can run it locally.
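The two integer-exact examples reproduce directly with the llmcheck_score sketch from "The Formula" above:

```python
# Qwen 3.5 9B: capScore 18, 100 tok/s, 8 GB min RAM, Apache 2.0
print(llmcheck_score(18, 100, 8, "Apache 2.0"))  # -> 66.0

# Kimi K2.5: capScore 50, server-only (0 tok/s on Mac; 999 stands in for >128 GB), MIT
print(llmcheck_score(50, 0, 999, "MIT"))         # -> 60.0
```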
Limitations & Known Issues
- Arena ELO volatility: Ratings shift weekly as new votes come in. We snapshot monthly and note the capture date.
- Estimated scores: ~15% of capScores are extrapolated from model family performance when official benchmarks are pending. These are marked "est." in the table above.
- Speed variance: tok/s can vary 10–20% based on prompt content, context length, and background processes. Our measurements use standardized conditions but real-world speeds may differ.
- Quantization bias: All speeds use Q4_K_M. Some models perform better at Q5 or Q8, but we standardize for fair comparison.
- No multimodal scoring: The current score does not reward multimodal capabilities (image/audio input). Models like Gemma 4 E4B have multimodal features not captured in the 0–100 score.
We welcome corrections and updated benchmark data. If you have verified measurements that differ from ours, please submit them through the community benchmark process.
Frequently Asked Questions
How does LLMCheck calculate its scores?
The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) based on tokens/sec on M5 Max, Accessibility (15 pts) decreasing stepwise with minimum RAM, and License Openness (10 pts) where MIT scores 10 and proprietary scores lower.
Where does LLMCheck get its capability scores?
Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model.
How does LLMCheck measure tokens per second?
Speed benchmarks use Q4_K_M quantization, a standardized 256-token input prompt, and 512-token generation window. Results are averaged over 3 runs on a freshly booted system. Reference hardware is M5 Max with 128 GB Unified Memory.
Why does capability get 50% of the total score?
Capability receives the highest weight because the primary value of an LLM is output quality. A fast model with poor answers is less useful than a slower model with excellent reasoning. Speed and accessibility ensure Mac-runnable models score higher than server-only giants.
How often are scores updated?
Scores are updated within 48–72 hours of major model releases. The full leaderboard refreshes monthly with the latest Arena AI ELO ratings and community benchmark submissions.
Benchmark Sources
- Arena AI / LMSYS Chatbot Arena — Human preference ELO ratings
- HuggingFace Model Hub — Official model cards with MMLU and benchmark data
- LLM Stats — Aggregated benchmark scores (HumanEval, SWE-Bench)
- Onyx Open LLM Leaderboard — Tier rankings and benchmark aggregation
- Google DeepMind Gemma 4 Blog — Official Gemma 4 benchmarks
- GEO: Generative Engine Optimization (Princeton/Georgia Tech) — Research methodology reference
See the Full Leaderboard
46 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability.
View Leaderboard →