LLMCheck Index Methodology

How the LLMCheck index ranks 79 local and frontier LLMs for Mac — a published estimation model, linked third-party benchmarks, and community submissions. Fully transparent, fully reproducible.

LLMCheck is an independent index of local-LLM performance on Apple Silicon — not a benchmark lab. Every figure is either a transparent estimate from the published model below, a sourced third-party benchmark (linked per row), or a community submission. Each data point is labeled with its provenance, and no figure is claimed as a first-party lab measurement.

Where Every Number Comes From

Every row in the leaderboard, the benchmarks table, and the open dataset carries one of three provenance labels. This mirrors the provenance_policy field published in benchmarks.json:

Estimated

Derived from the LLMCheck estimation model — memory-bandwidth scaling plus quantization arithmetic, documented in full below. Estimates are useful for planning ("will this model fit and feel usable on my Mac?") but are not measurements. Every estimated speed figure on the site links back to this page.

Sourced

A published third-party benchmark, with the source URL attached to the row: Arena AI ELO ratings, MMLU/HumanEval/SWE-Bench results from official model cards and papers, or independent hardware test data. We link; we don't re-host or re-run.

Community

A user-submitted run from a real Mac, with the submission URL as its source. Community numbers are sanity-checked against the estimation model and known baselines before inclusion. Own a Mac? Submit a benchmark — real runs beat estimates every time.
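One way to picture the three labels is as a small record type in which sourced and community rows must carry a source URL. This is only a sketch: the class and field names (`Row`, `source_url`) are hypothetical and are not taken from benchmarks.json, whose actual schema is not shown on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    ESTIMATED = "estimated"
    SOURCED = "sourced"
    COMMUNITY = "community"


@dataclass(frozen=True)
class Row:
    # Hypothetical field names, chosen for illustration only.
    model: str
    value: float
    provenance: Provenance
    source_url: Optional[str] = None


def validate(row: Row) -> None:
    """Raise ValueError if a sourced or community row has no source URL."""
    needs_url = row.provenance in (Provenance.SOURCED, Provenance.COMMUNITY)
    if needs_url and not row.source_url:
        raise ValueError(f"{row.model}: {row.provenance.value} row needs a source URL")


# An estimated row passes without a URL; a sourced row without one would raise.
validate(Row("Llama 3.3 70B", 10.5, Provenance.ESTIMATED))
```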

Why an index and not a lab? A single lab machine can only test one chip, one runtime, one macOS version. An index that publishes its estimation math, links its sources, and accepts community runs covers the entire Apple Silicon range — and you can audit every step. When we're estimating, we say so.

How We Estimate Performance

Local LLM performance on Apple Silicon is unusually predictable, because two hardware numbers dominate everything: unified memory capacity (can the model fit?) and memory bandwidth (how fast can weights stream through the GPU?). The LLMCheck estimation model is built on those two numbers.

Step 1 — Will it fit? (RAM model)

model_size_GB (Q4_K_M) ≈ params_B × 0.57

≈ 4.5 bits/param effective — Q4_K_M is nominally 4-bit but carries scale factors, higher-precision layers, and embedding tables
+ KV cache: grows with context length — ~1–4 GB typical at 4k–32k context
+ macOS overhead: ~2–3 GB for the OS and the inference runtime
Fit rule: total must stay within ~75% of unified memory — macOS lets the GPU address roughly three-quarters of RAM by default

That 75% rule is where the minimum-RAM guidance across the index comes from: a 16 GB Mac has a ~12 GB working budget, a 24 GB Mac ~18 GB, a 64 GB Mac ~48 GB, and so on.
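The fit rule can be sketched in a few lines of Python. The constants come from the RAM model above; the default KV-cache and overhead values (2 GB and 3 GB) are picked from within the stated ranges as an assumption, and the RAM tiers are the ones used by the Accessibility dimension. Function names are illustrative.

```python
from typing import Optional

GB_PER_BILLION_PARAMS_Q4 = 0.57  # ~4.5 bits/param effective at Q4_K_M
GPU_BUDGET_FRACTION = 0.75       # share of unified memory the GPU can address
RAM_TIERS_GB = (8, 16, 24, 32, 64, 128)  # tiers used by the index


def weights_gb(params_b: float) -> float:
    """Approximate Q4_K_M weight size in GB for a model with params_b billion parameters."""
    return params_b * GB_PER_BILLION_PARAMS_Q4


def total_ram_gb(params_b: float, kv_cache_gb: float = 2.0, overhead_gb: float = 3.0) -> float:
    """Weights plus KV cache plus OS/runtime overhead (defaults are assumptions)."""
    return weights_gb(params_b) + kv_cache_gb + overhead_gb


def fits(params_b: float, ram_gb: float) -> bool:
    """True if the total requirement stays within the ~75% working budget."""
    return total_ram_gb(params_b) <= ram_gb * GPU_BUDGET_FRACTION


def min_ram_gb(params_b: float) -> Optional[int]:
    """Smallest tier that fits the model, or None if no tier does."""
    for tier in RAM_TIERS_GB:
        if fits(params_b, tier):
            return tier
    return None


print(min_ram_gb(70))  # dense 70B model
```

Changing the defaults for `kv_cache_gb` and `overhead_gb` shifts borderline models between tiers, which is why long-context use can push a model up one RAM tier.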

Step 2 — How fast? (Speed model)

Token generation is memory-bandwidth-bound: for every token, the runtime must stream essentially all active model weights from RAM through the GPU. Compute is rarely the bottleneck on Apple Silicon. That gives a simple ceiling:

est. tok/s ≈ (bandwidth_GB/s ÷ model_bytes_GB) × efficiency

bandwidth_GB/s: published memory bandwidth of the chip (e.g. M4 Pro 273 GB/s, M5 Max 600 GB/s)
model_bytes_GB: weight bytes read per token — total size for dense models, active-parameter bytes for MoE
efficiency: ~0.6–0.8 empirically for llama.cpp / MLX-class runtimes on dense models; lower for MoE (routing and expert-loading overhead)
Cross-chip scaling: tok/s scales ≈ with the bandwidth ratio — M5 Max (600) vs M4 Pro (273) ≈ 2.2×

This is also why Mixture-of-Experts models dominate the speed rankings: a 32B MoE with 3B active parameters reads ~1.7 GB per token instead of ~18 GB — a 10× smaller memory bill per token, while still storing the full 32B of knowledge in RAM.
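The speed model above reduces to a single division. The sketch below assumes the 0.57 GB-per-billion-parameters factor from the RAM model and takes the efficiency as an explicit argument; the function names are illustrative, and 0.4 is the MoE efficiency this page uses in its worked example.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.57  # from the RAM model: ~4.5 bits/param at Q4_K_M


def est_tok_s(bandwidth_gb_s: float, active_params_b: float, efficiency: float) -> float:
    """Bandwidth ceiling multiplied by an efficiency factor.

    active_params_b is the total parameter count (in billions) for a dense
    model and the active-parameter count for an MoE model.
    """
    bytes_per_token_gb = active_params_b * GB_PER_BILLION_PARAMS_Q4
    return bandwidth_gb_s / bytes_per_token_gb * efficiency


def scale_to_chip(tok_s: float, from_bandwidth: float, to_bandwidth: float) -> float:
    """Cross-chip scaling: tok/s scales roughly with the bandwidth ratio."""
    return tok_s * to_bandwidth / from_bandwidth


dense_70b_m5_max = est_tok_s(600, 70, 0.7)  # dense: all 70B parameters read per token
moe_3b_m4_pro = est_tok_s(273, 3, 0.4)      # MoE: only ~3B active parameters read per token
print(round(dense_70b_m5_max, 1), round(moe_3b_m4_pro, 1))
```

Keeping efficiency as a parameter rather than a constant makes it easy to see how much of an estimate depends on that single empirical factor.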

Step 3 — A worked example

Llama 3.3 70B (dense) on an M5 Max, Q4_K_M

Weights: 70 × 0.57 ≈ 40 GB on disk and in RAM.
Total RAM needed: 40 + ~2 (KV cache) + ~3 (macOS) ≈ 45 GB.
Fit check: a 64 GB Mac budgets 64 × 0.75 = 48 GB — fits. A 48 GB Mac budgets 36 GB — the 40 GB of weights alone don't fit. So the index lists min RAM = 64 GB.
Speed ceiling: 600 GB/s ÷ 40 GB = 15 tok/s theoretical maximum.
Apply efficiency: 15 × ~0.7 ≈ 10–11 tok/s — the estimated figure the index shows.

Contrast with an MoE model: Qwen 4.1 32B-A3B activates ~3B parameters per token, so it reads only 3 × 0.57 ≈ 1.7 GB per token. On an M4 Pro (273 GB/s) the ceiling is 273 ÷ 1.7 ≈ 160 tok/s; a realized MoE efficiency of ~0.4 (routing overhead, expert scatter across memory) puts the estimate at 160 × 0.4 ≈ 64 tok/s, on a mid-range chip. The full 32B of weights (~18 GB) must still be resident in RAM, which is why it's a 24 GB-Mac model despite generating like a 3B.

What the estimates can't capture

Honest limits of the model: thermals (a fanless MacBook Air throttles on long generations; a Mac Studio doesn't), context length (speed degrades as the KV cache grows — long chats get slower), quantization variants (Q5, Q8, and MLX 4-bit all shift size and speed), and runtime differences (Ollama, LM Studio, llama.cpp, and MLX can differ 10–20% on the same hardware). Real numbers will vary from the estimates — that's exactly why every estimated row is labeled, and why community submissions are invited to replace estimates with real runs.

The LLMCheck Score Formula

The estimation model above feeds one of four components in the composite 0–100 LLMCheck Score used to rank the catalog:

LLMCheck Score = Capability + Speed + Accessibility + License

Capability (0–50): normalized from Arena AI ELO + MMLU + coding benchmarks — sourced from published third-party evals
Speed (0–25): est. tok/s on M5 Max × 0.25, capped at 25 — estimated via the model above
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, xAI Custom = 4, CC-BY-NC / N/A = 2
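The four components combine by simple addition, which can be sketched as below. The point tables are copied from the list above; the function names are illustrative, and treating a server-only model as `min_ram_gb=None` (0 speed points and 0 accessibility points) is a convention chosen for this sketch.

```python
from typing import Optional

# (max RAM in GB, points), checked in ascending order
ACCESS_TIERS = ((8, 15), (16, 12), (24, 10), (32, 9), (64, 5), (128, 2))
LICENSE_POINTS = {
    "MIT": 10, "Apache 2.0": 8, "Gemma": 6, "Meta Custom": 5,
    "xAI Custom": 4, "CC-BY-NC": 2, "N/A": 2,
}


def speed_points(est_tok_s: float) -> float:
    """est. tok/s x 0.25, capped at 25."""
    return min(25.0, est_tok_s * 0.25)


def accessibility_points(min_ram_gb: Optional[float]) -> int:
    """Step function over the RAM tiers; None or >128 GB scores 0."""
    if min_ram_gb is None:
        return 0
    for limit, points in ACCESS_TIERS:
        if min_ram_gb <= limit:
            return points
    return 0


def llmcheck_score(cap_score: float, est_tok_s: float,
                   min_ram_gb: Optional[float], license_name: str) -> float:
    """Sum of the four components; server-only models get no speed points."""
    speed = 0.0 if min_ram_gb is None else speed_points(est_tok_s)
    return (cap_score + speed + accessibility_points(min_ram_gb)
            + LICENSE_POINTS[license_name])


print(llmcheck_score(18, 100, 8, "Apache 2.0"))  # a small, fast, Apache-licensed model
```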

Dimension 1: Capability (50 points) — sourced

CAPABILITY: 50 / 100 pts

Raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from published third-party benchmark systems, linked per model.

The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. capScore is sourced, not estimated: the index normalizes it from three public benchmark systems:

Arena AI ELO Rating (40% weight) — Human preference ranking from lmarena.ai. Measures how often real users prefer a model's response over competitors in blind A/B tests. The gold standard for "does this model feel good to use."
MMLU Score (35% weight) — Massive Multitask Language Understanding. Tests knowledge across 57 academic subjects. Sourced from official model cards on HuggingFace and arXiv papers.
Coding Benchmarks (25% weight) — HumanEval (code generation) and SWE-Bench Verified (real bug fixing). Sourced from llm-stats.com and official model publications.

When official benchmarks are unavailable for a model (common for very new releases), the index extrapolates from related models in the same family and marks the score as estimated — same provenance rules as everything else. These estimates are replaced with sourced scores within 2–4 weeks of release.
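The 40/35/25 weighting can be illustrated with a short sketch. This page does not publish the normalization step, so everything beyond the three weights is an assumption: the min-max bounds, the renormalizing of weights when a benchmark is missing, and the function names are all hypothetical, and the output will not reproduce the capScores in the table below.

```python
from typing import Optional

WEIGHTS = {"arena_elo": 0.40, "mmlu": 0.35, "coding": 0.25}

# ASSUMPTION: illustrative min-max bounds, not published by the index.
BOUNDS = {"arena_elo": (1000.0, 1600.0), "mmlu": (0.0, 100.0), "coding": (0.0, 100.0)}


def normalize(name: str, value: float) -> float:
    """Map a raw benchmark value onto 0..1 using the assumed bounds."""
    low, high = BOUNDS[name]
    return min(1.0, max(0.0, (value - low) / (high - low)))


def cap_score(arena_elo: Optional[float] = None,
              mmlu: Optional[float] = None,
              coding: Optional[float] = None) -> Optional[float]:
    """Weighted blend on a 0..50 scale; returns None if no benchmark is available.

    ASSUMPTION: missing benchmarks are dropped and the remaining weights
    are renormalized to sum to 1.
    """
    inputs = {"arena_elo": arena_elo, "mmlu": mmlu, "coding": coding}
    present = {k: v for k, v in inputs.items() if v is not None}
    if not present:
        return None
    total_weight = sum(WEIGHTS[k] for k in present)
    blended = sum(WEIGHTS[k] * normalize(k, v) for k, v in present.items()) / total_weight
    return 50.0 * blended


print(cap_score(arena_elo=1300, mmlu=80, coding=70))
```

Renormalizing over the available benchmarks is only one option; another would be to impute the missing value from related models, which is closer to the family extrapolation described above.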

Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.

Capability Score Table (All 79 Models)

Model	Params	capScore	Arena ELO	MMLU	Coding	Sources
Kimi K2.5	1T MoE	50	~1480	—	HumanEval 99%	Arena
DeepSeek V4 Pro	1.6T MoE	50	~1545	—	SWE-V 80.6%, GPQA 90.1%	HF
DeepSeek R2	671B MoE	50	~1530	—	AIME 91%, MATH 88%	HF
GLM 5.2	744B MoE	50	~1565	92%	SWE-Pro 68.5%, beats GPT-5 & Claude	HF
DeepSeek R3	685B MoE	50	~1558	—	AIME 95%, MATH 92%	HF
Kimi K3	1T MoE	49	~1545	—	Agentic coding leader	HF
Kimi K2.6	1.05T MoE	48	~1525	—	Coding 78.6, Agentic 58.3	HF
Qwen 4.1 32B-A3B	32B MoE	46	~1510	90%	SWE-V 80%, HumanEval 95%	HF
Qwen 4	32B MoE	45	~1505	89%	SWE-V 78%, HumanEval 94%	HF
Llama 5 405B	405B	44	~1500	91%	HumanEval 90%, dense frontier	HF
Qwen 4 Coder 32B-A3B	32B MoE	44	~1495	87%	SWE-V 82%, HumanEval 96%	HF
Qwen 4 Preview 32B-A3B	32B MoE	42	~1490	88%	HumanEval 92%, SWE-V 76%	HF
Devstral Small 24B	24B	24	~1385	79%	Agentic coding specialist	HF
DeepSeek V3.2	685B MoE	42	~1445	94.2%	SWE 58.2%	HF
GLM 5.2 Air	106B MoE	40	~1470	88%	SWE-Pro 58%, Mac-runnable	HF
Gemma 4 31B	31B Dense	40	1452 (#3)	~88%	AIME 89.2%	Google
Qwen 3.5 (397B)	397B MoE	38	~1430	~90%	—	HF
Llama 5 70B	70B	38	~1450	88%	HumanEval 86%, dense	HF
Phi-5 Large 28B	28B	36	~1455	88%	AIME 80%, MIT	HF
Mistral Voyage Pro 70B	70B	36	~1440	85%	SWE-V 68%, agentic	HF
GLM-5.1	744B MoE	48	~1460	—	SWE-Pro 58.4%	HF
Qwen3-235B-A22B	235B MoE	46	~1455	—	—	HF
Qwen 3.6-35B-A3B	35B MoE	38	—	82.6%	SWE 73.4%	HF
DeepSeek V3	685B MoE	37	~1410	88.5%	—	HF
DeepSeek R1	671B MoE	37	~1410	—	—	HF
Mistral Large 3	675B MoE	36	~1405	—	—	HF
Llama 4 Maverick	400B MoE	36	~1400	85.5%	—	HF
Gemma 4 26B-A4B	26B MoE	35	1441 (#6)	~85%	—	Google
Llama 3.1 405B	405B	35	~1395	—	—	HF
MiniMax M2.5	230B MoE	35	~1390	—	SWE 80.2%	Arena
Mistral Medium 4	41B MoE	34	~1445	84%	SWE-V 70%, agentic	HF
GLM-4.7	355B	34	~1385	—	HumanEval 94.2%	HF
Step-3.5 Flash	196B MoE	33	~1380	—	—	est.
Gemma 4.5 27B	27B	32	~1435	86%	1M context, multimodal	HF
MiMo-V2-Flash	309B MoE	32	~1370	—	—	est.
GPT-oss 120B	117B	32	~1365	MMLU-Pro 90%	—	HF
Command R+ 2	104B	30	~1415	82%	Enterprise RAG, CC-BY-NC	HF
Grok 4 Open 100B-A20B	100B MoE	30	~1410	82%	xAI's first open weights	HF
DeepSeek R1 70B	70B	30	~1350	—	—	HF
Mistral Small 4	119B MoE	34	~1385	—	—	HF
Llama 4 Scout	109B MoE	30	~1345	—	—	HF
Llama 5 Scout 109B-A17B	109B MoE	30	~1410	82%	multimodal	HF
Hermes 4 70B	70B	28	~1400	—	exceptional instruction-following	HF
Phi-5 Medium 14B	14B	28	~1400	86%	AIME 75%, MIT	HF
Gemma 4.5 12B	12B	28	~1395	84%	1M context	HF
Mistral Voyage 24B	24B	25	~1390	80%	balanced	HF
Nemotron Cascade 2	30B	30	—	—	—	HF
Mixtral 8x22B	141B MoE	28	~1320	—	—	est.
Qwen 2.5 72B	72B	28	~1330	~86%	—	HF
Qwen3-Coder-Next	80B MoE	28	—	—	SWE 70.6%	HF
Qwen 3.5 35B	35B MoE	27	~1340	~83%	—	HF
Llama 3.3 70B	70B	27	~1325	86%	—	HF
QwQ 32B	32B	26	~1310	—	—	HF
Qwen 3 32B	32B	25	~1300	—	—	HF
DeepSeek R1 32B	32B	24	~1290	—	—	HF
Gemma 3 27B	27B	22	~1270	—	—	HF
Qwen 3 30B-A3B	30B MoE	22	~1280	—	—	HF
Qwen 4 4B	4B	22	~1340	80%	beats Phi-5 Mini	HF
Qwen 3.5 27B	27B	21	~1275	—	—	HF
Qwen 3 14B	14B	20	~1250	—	—	HF
Llama 5 8B	8B	20	~1370	78%	HumanEval 80%	HF
Phi-4 14B	14B	19	~1240	84.8%	—	HF
Phi-5 Mini 4B	4B	18	~1310	82%	HumanEval 78%, AIME 61%	HF
Qwen 3.5 9B	9B	18	~1220	—	—	HF
Qwen 2.5 14B	14B	18	~1210	—	—	HF
Ministral 14B	14B	18	~1215	—	—	est.
Gemma 3 12B	12B	17	~1200	—	—	HF
DeepSeek R1 8B	8B	16	~1190	—	—	HF
Gemma 4 E4B	4B (PLE)	16	~1195	—	—	Google
Qwen 3 8B	8B	15	~1180	—	—	HF
Phi-4 Mini	3.8B	14	~1170	—	—	HF
Ministral 8B	8B	14	~1165	—	—	est.
Gemma 4 E2B	2.3B (PLE)	13	~1150	—	—	Google
Mistral 7B	7B	13	~1140	—	—	HF
Llama 3.1 8B	8B	12	~1130	—	—	HF
Qwen 3.5 4B	4B	12	~1125	—	—	HF
Qwen 3 4B	4B	11	~1110	—	—	HF
Gemma 3 4B	4B	10	~1090	—	—	HF
SmolLM3 3B	3B	10	~1180	—	edge-optimized	HF

Dimension 2: Speed on Apple Silicon (25 points) — estimated

SPEED: 25 / 100 pts

Estimated tokens per second on the M5 Max 128 GB reference configuration at Q4_K_M — computed with the estimation model above, not measured in a lab.

Speed points are calculated as: est. tok/s × 0.25, capped at 25 points. Any model estimated at 100+ tok/s on the reference configuration receives full speed points. The estimates assume these reference conditions:

Quantization: Q4_K_M (most common consumer quantization)
Reference hardware: M5 Max, 128 GB Unified Memory (600 GB/s bandwidth)
Runtime class: llama.cpp / MLX-class engines (Ollama, LM Studio) with dense efficiency ~0.6–0.8
Context: short-context generation — long contexts reduce speed as the KV cache grows
Provenance: rows are upgraded from estimated to sourced or community as linked third-party tests and user submissions come in

Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. The full dataset — with a provenance label on every row — is available for download at /data/.

Dimension 3: Accessibility (15 points)

ACCESSIBILITY: 15 / 100 pts

How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.

Accessibility is a step function based on minimum RAM required at 4-bit quantization, derived from the RAM model above (weights + KV cache + macOS overhead, within the ~75% unified-memory budget). Since a large share of Mac users have 16 GB or less, accessibility is crucial for real-world impact:

Min RAM (4-bit)	Points	Example Models	Est. Share of Mac Users
≤ 8 GB	15	Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B	~100% of Apple Silicon
≤ 16 GB	12	Qwen 3 14B, Gemma 3 12B	~85%
≤ 24 GB	10	Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B	~40%
≤ 32 GB	9	QwQ 32B, Qwen 3 32B	~30%
≤ 64 GB	5	DeepSeek R1 70B, Llama 3.3 70B	~10%
≤ 128 GB	2	GPT-oss 120B, Mixtral 8x22B	~3%
> 128 GB	0	Kimi K2.5, DeepSeek V3, GLM-5.1	Server only

Dimension 4: License Openness (10 points)

LICENSE: 10 / 100 pts

How freely can you use, modify, and distribute the model? More open = higher score.

License	Points	Can Modify?	Commercial Use?	Models
MIT	10	Yes	Unrestricted	Kimi K2.5, DeepSeek, Phi-4 Mini
Apache 2.0	8	Yes	Yes (with notice)	Gemma 4, Qwen 3.5, Mistral
Gemma	6	Yes	Yes (subject to use restrictions)	Gemma 3 (old license)
Meta Custom	5	Limited	Yes (<700M users)	Llama 5, Llama 4
xAI Custom	4	Limited	Yes (restrictions)	Grok 4 Open
CC-BY-NC	2	Yes	Non-commercial only	Command R+ 2
Proprietary / N/A	2	No	API only	MiniMax M2.5

Note: CC-BY-NC (non-commercial) scores 2: usable for research but not for commercial deployment. The Meta and xAI custom licenses score 5 and 4 respectively, reflecting limited modification rights and restrictions on commercial use.

Score Examples

Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed min(25, est. 48 tok/s × 0.25 = 12) + accessibility 10 (24 GB) + license 8 (Apache 2.0). The rounded components shown sum to 65; the index attributes the difference from the published 67 to rounding. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.

Qwen 3.5 9B (Score: 66) = capScore 18 + speed min(25, est. 100 tok/s × 0.25 = 25) + accessibility 15 (8 GB) + license 8 (Apache 2.0) = 66. Ranks high because maximum speed + accessibility points compensate for lower raw capability.

Kimi K2.5 (Score: 60) = capScore 50 (the maximum) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite a maximum capability score, it ranks lower because no Mac user can run it locally.

Limitations & Known Issues

Arena ELO volatility: Ratings shift weekly as new votes come in. The index snapshots monthly and notes the capture date.
Estimated capScores: ~15% of capScores are extrapolated from model-family performance when official benchmarks are pending. These are marked "est." in the table above.
Estimates are not measurements: speed figures come from the bandwidth model, which cannot capture thermals, long-context slowdown, or runtime-specific optimizations. Expect real-world variance of 10–20% either way — sometimes more on fanless machines.
Quantization baseline: All estimates assume Q4_K_M. Some models are commonly run at Q5, Q8, or MLX 4-bit, which shift both size and speed; the index standardizes on one baseline for fair comparison.
No multimodal scoring: The current score does not reward multimodal capabilities (image/audio input). Models like Gemma 4 E4B have multimodal features not captured in the 0–100 score.

Corrections and real-world benchmark data are always welcome. If you have runs that differ from the index's estimates, submit them through the community benchmark process — verified community rows replace estimates.

Frequently Asked Questions

Does LLMCheck run its own benchmarks?

No. LLMCheck is an independent index, not a benchmark lab. Every figure on the site is one of three things: an estimate from the published LLMCheck estimation model (memory-bandwidth math and quantization arithmetic, fully documented on this page), a sourced number from a linked third-party benchmark, or a community-submitted run. Each row is labeled with its provenance, and no figure is claimed as a first-party lab measurement.

How does LLMCheck calculate its scores?

The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from published third-party evaluations such as Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from estimated tokens/sec on the M5 Max reference configuration (est. tok/s × 0.25, capped at 25), Accessibility (15 pts) from a step function of minimum RAM in which lower requirements earn more points, and License Openness (10 pts) where MIT scores 10 and restrictive licenses score lower. The formula is fully transparent and reproducible.

Where does LLMCheck get its capability scores?

Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model. When official benchmarks are unavailable, the index extrapolates from related models in the same family and marks the score as 'estimated'.

How does LLMCheck estimate tokens per second on Apple Silicon?

Token generation on Apple Silicon is memory-bandwidth-bound, so estimated tok/s ≈ (memory bandwidth in GB/s ÷ model size in GB at Q4_K_M) × an efficiency factor of roughly 0.6–0.8 observed for llama.cpp/MLX-class runtimes. Mixture-of-Experts models use active-parameter bytes instead of total size, which is why they are much faster. Scaling across chips follows the bandwidth ratio. The full model is published at llmcheck.net/methodology#estimation, and every estimated figure is labeled as such.

Why does LLMCheck weight capability at 50% of the total score?

Capability receives the highest weight because the primary value of an LLM is the quality of its outputs. A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. However, speed (25%) and accessibility (15%) ensure that models which actually run well on consumer Macs score higher than server-only models with superior capability but no practical local use.

How often are LLMCheck scores updated?

Scores are updated within 48–72 hours of major model releases. The full leaderboard is refreshed monthly with the latest Arena AI ELO ratings and community benchmark submissions. Speed estimates are recomputed as new Apple Silicon hardware specifications become available. All updates are timestamped in the open dataset at llmcheck.net/data/.

Benchmark Sources

Arena AI / LMSYS Chatbot Arena — Human preference ELO ratings
HuggingFace Model Hub — Official model cards with MMLU and benchmark data
LLM Stats — Aggregated benchmark scores (HumanEval, SWE-Bench)
Onyx Open LLM Leaderboard — Tier rankings and benchmark aggregation
Google DeepMind Gemma 4 Blog — Official Gemma 4 benchmarks
GEO: Generative Engine Optimization (Princeton/Georgia Tech) — Research methodology reference

See the Full Leaderboard

79 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability — every figure labeled with its provenance.
