LLMCheck Index Methodology

How the LLMCheck index ranks 79 local and frontier LLMs for Mac — a published estimation model, linked third-party benchmarks, and community submissions. Fully transparent, fully reproducible.

LLMCheck is an independent index of local-LLM performance on Apple Silicon — not a benchmark lab. Every figure is either a transparent estimate from the published model below, a sourced third-party benchmark (linked per row), or a community submission. Each data point is labeled with its provenance, and no figure is claimed as a first-party lab measurement.

Where Every Number Comes From

Every row in the leaderboard, the benchmarks table, and the open dataset carries one of three provenance labels. This mirrors the provenance_policy field published in benchmarks.json:

Estimated

Derived from the LLMCheck estimation model — memory-bandwidth scaling plus quantization arithmetic, documented in full below. Estimates are useful for planning ("will this model fit and feel usable on my Mac?") but are not measurements. Every estimated speed figure on the site links back to this page.

Sourced

A published third-party benchmark, with the source URL attached to the row: Arena AI ELO ratings, MMLU/HumanEval/SWE-Bench results from official model cards and papers, or independent hardware test data. We link to each source; we don't re-host or re-run it.

Community

A user-submitted run from a real Mac, with the submission URL as its source. Community numbers are sanity-checked against the estimation model and known baselines before inclusion. Own a Mac? Submit a benchmark — real runs beat estimates every time.

Why an index and not a lab? A single lab machine can only test one chip, one runtime, one macOS version. An index that publishes its estimation math, links its sources, and accepts community runs covers the entire Apple Silicon range — and you can audit every step. When we're estimating, we say so.

How We Estimate Performance

Local LLM performance on Apple Silicon is unusually predictable, because two hardware numbers dominate everything: unified memory capacity (can the model fit?) and memory bandwidth (how fast can weights stream through the GPU?). The LLMCheck estimation model is built on those two numbers.

Step 1 — Will it fit? (RAM model)

model_size_GB (Q4_K_M) ≈ params_B × 0.57
≈ 4.5 bits/param effective — Q4_K_M is nominally 4-bit but carries scale factors, higher-precision layers, and embedding tables
+ KV cache: grows with context length — ~1–4 GB typical at 4k–32k context
+ macOS overhead: ~2–3 GB for the OS and the inference runtime
Fit rule: total must stay within ~75% of unified memory — macOS lets the GPU address roughly three-quarters of RAM by default

That 75% rule is where the minimum-RAM guidance across the index comes from: a 16 GB Mac has a ~12 GB working budget, a 24 GB Mac ~18 GB, a 64 GB Mac ~48 GB, and so on.
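
The fit rule can be sketched in a few lines of Python. This is a minimal illustration, not the index's own code: the function names, the default KV-cache and overhead values (picked from the ranges above), and the list of Mac RAM configurations are all assumptions.

```python
# Illustrative sketch of the Step 1 RAM model; names and defaults are assumptions.
GB_PER_BILLION_PARAMS_Q4_K_M = 0.57               # from the formula above
GPU_BUDGET_FRACTION = 0.75                        # ~75% of unified memory
MAC_RAM_TIERS_GB = (8, 16, 24, 32, 48, 64, 128)   # assumed set of configurations


def total_ram_needed_gb(params_b, kv_cache_gb=2.0, overhead_gb=3.0):
    """Weights at Q4_K_M plus KV cache plus macOS/runtime overhead, in GB."""
    weights_gb = params_b * GB_PER_BILLION_PARAMS_Q4_K_M
    return weights_gb + kv_cache_gb + overhead_gb


def fits(params_b, unified_memory_gb, **kwargs):
    """True if the total stays within the ~75% working budget."""
    budget_gb = unified_memory_gb * GPU_BUDGET_FRACTION
    return total_ram_needed_gb(params_b, **kwargs) <= budget_gb


def min_ram_tier_gb(params_b, **kwargs):
    """Smallest assumed Mac configuration that fits, or None if none does."""
    for tier in MAC_RAM_TIERS_GB:
        if fits(params_b, tier, **kwargs):
            return tier
    return None
```

With these defaults, min_ram_tier_gb(70) returns 64 and fits(70, 48) is False, in line with the 70B worked example on this page. Longer contexts can be explored by passing a larger kv_cache_gb.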

Step 2 — How fast? (Speed model)

Token generation is memory-bandwidth-bound: for every token, the runtime must stream essentially all active model weights from RAM through the GPU. Compute is rarely the bottleneck on Apple Silicon. That gives a simple ceiling:

est. tok/s ≈ (bandwidth_GB/s ÷ model_bytes_GB) × efficiency
bandwidth_GB/s: published memory bandwidth of the chip (e.g. M4 Pro 273 GB/s, M5 Max 600 GB/s)
model_bytes_GB: weight bytes read per token — total size for dense models, active-parameter bytes for MoE
efficiency: ~0.6–0.8 empirically for llama.cpp / MLX-class runtimes on dense models; lower for MoE (routing and expert-loading overhead)
Cross-chip scaling: tok/s scales ≈ with the bandwidth ratio — M5 Max (600) vs M4 Pro (273) ≈ 2.2×

This is also why Mixture-of-Experts models dominate the speed rankings: a 32B MoE with 3B active parameters reads ~1.7 GB per token instead of ~18 GB — a 10× smaller memory bill per token, while still storing the full 32B of knowledge in RAM.
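
The speed formula and the cross-chip scaling rule translate directly into code. The sketch below is illustrative only; the function names are assumptions, and the efficiency value has to be supplied by the caller from the ranges given above.

```python
# Illustrative sketch of the Step 2 speed model; function names are assumptions.
GB_PER_BILLION_PARAMS_Q4_K_M = 0.57


def est_tok_per_s(bandwidth_gb_s, active_params_b, efficiency):
    """Bandwidth ceiling times an empirical efficiency factor.

    active_params_b is the total parameter count for a dense model and the
    active-parameter count for an MoE model, both in billions.
    """
    if active_params_b <= 0:
        raise ValueError("active_params_b must be positive")
    bytes_per_token_gb = active_params_b * GB_PER_BILLION_PARAMS_Q4_K_M
    return bandwidth_gb_s / bytes_per_token_gb * efficiency


def scale_across_chips(tok_per_s, from_bandwidth_gb_s, to_bandwidth_gb_s):
    """Rescale an estimate from one chip to another by the bandwidth ratio."""
    return tok_per_s * to_bandwidth_gb_s / from_bandwidth_gb_s
```

For instance, est_tok_per_s(600, 70, 0.7) evaluates a dense 70B model on a 600 GB/s chip, and scale_across_chips(x, 273, 600) applies the roughly 2.2× M4 Pro to M5 Max ratio to an existing estimate x.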

Step 3 — A worked example

Llama 3.3 70B (dense) on an M5 Max, Q4_K_M

  1. Weights: 70 × 0.57 ≈ 40 GB on disk and in RAM.
  2. Total RAM needed: 40 + ~2 (KV cache) + ~3 (macOS) ≈ 45 GB.
  3. Fit check: a 64 GB Mac budgets 64 × 0.75 = 48 GB — fits. A 48 GB Mac budgets 36 GB — the 40 GB of weights alone don't fit. So the index lists min RAM = 64 GB.
  4. Speed ceiling: 600 GB/s ÷ 40 GB = 15 tok/s theoretical maximum.
  5. Apply efficiency: 15 × ~0.7 ≈ 10–11 tok/s — the estimated figure the index shows.

Contrast with an MoE model: Qwen 4.1 32B-A3B activates ~3B parameters per token, so it reads only 3 × 0.57 ≈ 1.7 GB per token. On an M4 Pro (273 GB/s) the ceiling is ~160 tok/s; realized MoE efficiency of ~0.4 (routing overhead, expert scatter across memory) lands the estimate near ~62 tok/s — on a mid-range chip. The full 32B of weights still need ~18 GB of RAM to be resident, which is why it's a 24 GB-Mac model despite generating like a 3B.
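
The dense-versus-MoE contrast comes down to keeping two sizes apart: the weights that must be resident in RAM and the weights streamed per token. The following sketch reproduces both worked cases under that split. The Estimate class and the estimate function are hypothetical names, and the efficiency values are the approximate ones quoted above.

```python
from dataclasses import dataclass

GB_PER_BILLION_PARAMS_Q4_K_M = 0.57


@dataclass
class Estimate:
    resident_gb: float     # weights that must sit in RAM
    per_token_gb: float    # weights streamed per generated token
    ceiling_tok_s: float   # bandwidth-only upper bound
    est_tok_s: float       # ceiling times efficiency


def estimate(total_params_b, active_params_b, bandwidth_gb_s, efficiency):
    """Resident size uses total parameters; speed uses active parameters."""
    resident_gb = total_params_b * GB_PER_BILLION_PARAMS_Q4_K_M
    per_token_gb = active_params_b * GB_PER_BILLION_PARAMS_Q4_K_M
    ceiling = bandwidth_gb_s / per_token_gb
    return Estimate(resident_gb, per_token_gb, ceiling, ceiling * efficiency)


dense = estimate(70, 70, 600, 0.7)   # Llama 3.3 70B on an M5 Max
moe = estimate(32, 3, 273, 0.4)      # Qwen 4.1 32B-A3B on an M4 Pro
print(f"dense: {dense.resident_gb:.1f} GB resident, {dense.est_tok_s:.1f} tok/s")
print(f"moe:   {moe.resident_gb:.1f} GB resident, {moe.est_tok_s:.1f} tok/s")
```

With unrounded inputs the MoE case lands near 64 tok/s, in the same range as the ~62 tok/s quoted above; the small gap comes from how the efficiency factor is rounded.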

What the estimates can't capture

Honest limits of the model: thermals (a fanless MacBook Air throttles on long generations; a Mac Studio doesn't), context length (speed degrades as the KV cache grows — long chats get slower), quantization variants (Q5, Q8, and MLX 4-bit all shift size and speed), and runtime differences (Ollama, LM Studio, llama.cpp, and MLX can differ 10–20% on the same hardware). Real numbers will vary from the estimates — that's exactly why every estimated row is labeled, and why community submissions are invited to replace estimates with real runs.

The LLMCheck Score Formula

The estimation model above feeds one of four components in the composite 0–100 LLMCheck Score used to rank the catalog:

LLMCheck Score = Capability + Speed + Accessibility + License
Capability (0–50): normalized from Arena AI ELO + MMLU + coding benchmarks — sourced from published third-party evals
Speed (0–25): est. tok/s on M5 Max × 0.25, capped at 25 — estimated via the model above
Accessibility (0–15): RAM tier: ≤8 GB = 15, ≤16 GB = 12, ≤24 GB = 10, ≤32 GB = 9, ≤64 GB = 5, ≤128 GB = 2, >128 GB = 0
License (0–10): MIT = 10, Apache 2.0 = 8, Gemma = 6, Meta Custom = 5, xAI Custom = 4, CC-BY-NC / N/A = 2
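
As a rough sketch, the formula can be written out as follows. The point values are taken from the lines above; the function names, constant names, and the choice of a license lookup keyed by name are assumptions made for illustration.

```python
import bisect

# Illustrative sketch of the composite score. Point values come from the
# formula above; the function and constant names are assumptions.
RAM_TIER_CEILINGS_GB = [8, 16, 24, 32, 64, 128]
RAM_TIER_POINTS = [15, 12, 10, 9, 5, 2]   # anything above 128 GB scores 0
LICENSE_POINTS = {
    "MIT": 10, "Apache 2.0": 8, "Gemma": 6, "Meta Custom": 5,
    "xAI Custom": 4, "CC-BY-NC": 2, "N/A": 2,
}


def speed_points(est_tok_s_m5_max):
    """est. tok/s on the M5 Max reference x 0.25, capped at 25."""
    return min(25.0, est_tok_s_m5_max * 0.25)


def accessibility_points(min_ram_gb):
    """Step function: points for the smallest tier that covers min_ram_gb."""
    idx = bisect.bisect_left(RAM_TIER_CEILINGS_GB, min_ram_gb)
    return RAM_TIER_POINTS[idx] if idx < len(RAM_TIER_POINTS) else 0


def llmcheck_score(cap_score, est_tok_s_m5_max, min_ram_gb, license_name):
    """Capability + Speed + Accessibility + License, on a 0-100 scale."""
    if not 0 <= cap_score <= 50:
        raise ValueError("cap_score must be within 0-50")
    return (cap_score
            + speed_points(est_tok_s_m5_max)
            + accessibility_points(min_ram_gb)
            + LICENSE_POINTS[license_name])
```

Calling llmcheck_score(18, 100, 8, "Apache 2.0") gives 66, matching the Qwen 3.5 9B entry under Score Examples. A server-only model can be expressed with an estimated speed of 0 and a minimum RAM above 128 GB.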

Dimension 1: Capability (50 points) — sourced

CAPABILITY: 50 / 100 pts

Raw model intelligence — reasoning, knowledge, coding, instruction following. Sourced from published third-party benchmark systems, linked per model.

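The capability score (capScore) is the most heavily weighted dimension because the primary value of an LLM is output quality. capScore is sourced, not estimated: the index normalizes it from three public benchmark systems: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks such as HumanEval and SWE-Bench (weighted 25%).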

When official benchmarks are unavailable for a model (common for very new releases), the index extrapolates from related models in the same family and marks the score as estimated — same provenance rules as everything else. These estimates are replaced with sourced scores within 2–4 weeks of release.

Why 50 points? A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. Capability is weighted highest because it determines whether the model actually solves your problem. Speed and RAM determine whether you can run it — but there's no point running a model that can't help you.
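
One way to picture the capability blend is a weighted average scaled to 50 points. The 40/35/25 weights are the ones given in the FAQ on this page; everything else in the sketch below is an assumption, including the min-max bounds used for normalization and the choice to re-weight over whichever benchmarks a model actually has. It is not expected to reproduce the capScore values in the table.

```python
# Illustrative sketch only. The 40/35/25 weights are from this page's FAQ;
# the normalization bounds and missing-benchmark handling are assumptions.
WEIGHTS = {"elo": 0.40, "mmlu": 0.35, "coding": 0.25}
ASSUMED_BOUNDS = {
    "elo": (1000.0, 1600.0),   # hypothetical ELO range
    "mmlu": (0.0, 100.0),      # percent
    "coding": (0.0, 100.0),    # percent
}
MAX_CAP_SCORE = 50.0


def normalize(value, low, high):
    """Min-max scale to 0..1, clamped to the assumed bounds."""
    return max(0.0, min(1.0, (value - low) / (high - low)))


def cap_score(elo=None, mmlu=None, coding=None):
    """Weighted blend of whichever benchmarks are available, scaled to 0-50."""
    inputs = {"elo": elo, "mmlu": mmlu, "coding": coding}
    available = {k: v for k, v in inputs.items() if v is not None}
    if not available:
        raise ValueError("at least one benchmark is required")
    weight_sum = sum(WEIGHTS[k] for k in available)
    blended = sum(
        WEIGHTS[k] * normalize(v, *ASSUMED_BOUNDS[k]) for k, v in available.items()
    ) / weight_sum
    return MAX_CAP_SCORE * blended
```

Re-weighting over the available benchmarks keeps a model with only an ELO rating on the same 0–50 scale, which is one plausible way to handle the many rows that list a single source.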

Capability Score Table (All 79 Models)

Model | Params | capScore | Arena ELO | MMLU | Coding / notes | Source
Kimi K2.5 | 1T MoE | 50 | ~1480 | - | HumanEval 99% | Arena
DeepSeek V4 Pro | 1.6T MoE | 50 | ~1545 | - | SWE-V 80.6%, GPQA 90.1% | HF
DeepSeek R2 | 671B MoE | 50 | ~1530 | - | AIME 91%, MATH 88% | HF
GLM 5.2 | 744B MoE | 50 | ~1565 | 92% | SWE-Pro 68.5%, beats GPT-5 & Claude | HF
DeepSeek R3 | 685B MoE | 50 | ~1558 | - | AIME 95%, MATH 92% | HF
Kimi K3 | 1T MoE | 49 | ~1545 | - | Agentic coding leader | HF
Kimi K2.6 | 1.05T MoE | 48 | ~1525 | - | Coding 78.6, Agentic 58.3 | HF
Qwen 4.1 32B-A3B | 32B MoE | 46 | ~1510 | 90% | SWE-V 80%, HumanEval 95% | HF
Qwen 4 | 32B MoE | 45 | ~1505 | 89% | SWE-V 78%, HumanEval 94% | HF
Llama 5 405B | 405B | 44 | ~1500 | 91% | HumanEval 90%, dense frontier | HF
Qwen 4 Coder 32B-A3B | 32B MoE | 44 | ~1495 | 87% | SWE-V 82%, HumanEval 96% | HF
Qwen 4 Preview 32B-A3B | 32B MoE | 42 | ~1490 | 88% | HumanEval 92%, SWE-V 76% | HF
Devstral Small 24B | 24B | 24 | ~1385 | 79% | Agentic coding specialist | HF
DeepSeek V3.2 | 685B MoE | 42 | ~1445 | 94.2% | SWE 58.2% | HF
GLM 5.2 Air | 106B MoE | 40 | ~1470 | 88% | SWE-Pro 58%, Mac-runnable | HF
Gemma 4 31B | 31B Dense | 40 | 1452 (#3) | ~88% | AIME 89.2% | Google
Qwen 3.5 (397B) | 397B MoE | 38 | ~1430 | ~90% | - | HF
Llama 5 70B | 70B | 38 | ~1450 | 88% | HumanEval 86%, dense | HF
Phi-5 Large 28B | 28B | 36 | ~1455 | 88% | AIME 80%, MIT | HF
Mistral Voyage Pro 70B | 70B | 36 | ~1440 | 85% | SWE-V 68%, agentic | HF
GLM-5.1 | 744B MoE | 48 | ~1460 | - | SWE-Pro 58.4% | HF
Qwen3-235B-A22B | 235B MoE | 46 | ~1455 | - | - | HF
Qwen 3.6-35B-A3B | 35B MoE | 38 | - | 82.6% | SWE 73.4% | HF
DeepSeek V3 | 685B MoE | 37 | ~1410 | 88.5% | - | HF
DeepSeek R1 | 671B MoE | 37 | ~1410 | - | - | HF
Mistral Large 3 | 675B MoE | 36 | ~1405 | - | - | HF
Llama 4 Maverick | 400B MoE | 36 | ~1400 | 85.5% | - | HF
Gemma 4 26B-A4B | 26B MoE | 35 | 1441 (#6) | ~85% | - | Google
Llama 3.1 405B | 405B | 35 | ~1395 | - | - | HF
MiniMax M2.5 | 230B MoE | 35 | ~1390 | - | SWE 80.2% | Arena
Mistral Medium 4 | 41B MoE | 34 | ~1445 | 84% | SWE-V 70%, agentic | HF
GLM-4.7 | 355B | 34 | ~1385 | - | HumanEval 94.2% | HF
Step-3.5 Flash | 196B MoE | 33 | ~1380 | - | - | est.
Gemma 4.5 27B | 27B | 32 | ~1435 | 86% | 1M context, multimodal | HF
MiMo-V2-Flash | 309B MoE | 32 | ~1370 | - | - | est.
GPT-oss 120B | 117B | 32 | ~1365 | MMLU-Pro 90% | - | HF
Command R+ 2 | 104B | 30 | ~1415 | 82% | Enterprise RAG, CC-BY-NC | HF
Grok 4 Open 100B-A20B | 100B MoE | 30 | ~1410 | 82% | xAI's first open weights | HF
DeepSeek R1 70B | 70B | 30 | ~1350 | - | - | HF
Mistral Small 4 | 119B MoE | 34 | ~1385 | - | - | HF
Llama 4 Scout | 109B MoE | 30 | ~1345 | - | - | HF
Llama 5 Scout 109B-A17B | 109B MoE | 30 | ~1410 | 82% | multimodal | HF
Hermes 4 70B | 70B | 28 | ~1400 | - | exceptional instruction-following | HF
Phi-5 Medium 14B | 14B | 28 | ~1400 | 86% | AIME 75%, MIT | HF
Gemma 4.5 12B | 12B | 28 | ~1395 | 84% | 1M context | HF
Mistral Voyage 24B | 24B | 25 | ~1390 | 80% | balanced | HF
Nemotron Cascade 2 | 30B | 30 | - | - | - | HF
Mixtral 8x22B | 141B MoE | 28 | ~1320 | - | - | est.
Qwen 2.5 72B | 72B | 28 | ~1330 | ~86% | - | HF
Qwen3-Coder-Next | 80B MoE | 28 | - | - | SWE 70.6% | HF
Qwen 3.5 35B | 35B MoE | 27 | ~1340 | ~83% | - | HF
Llama 3.3 70B | 70B | 27 | ~1325 | 86% | - | HF
QwQ 32B | 32B | 26 | ~1310 | - | - | HF
Qwen 3 32B | 32B | 25 | ~1300 | - | - | HF
DeepSeek R1 32B | 32B | 24 | ~1290 | - | - | HF
Gemma 3 27B | 27B | 22 | ~1270 | - | - | HF
Qwen 3 30B-A3B | 30B MoE | 22 | ~1280 | - | - | HF
Qwen 4 4B | 4B | 22 | ~1340 | 80% | beats Phi-5 Mini | HF
Qwen 3.5 27B | 27B | 21 | ~1275 | - | - | HF
Qwen 3 14B | 14B | 20 | ~1250 | - | - | HF
Llama 5 8B | 8B | 20 | ~1370 | 78% | HumanEval 80% | HF
Phi-4 14B | 14B | 19 | ~1240 | 84.8% | - | HF
Phi-5 Mini 4B | 4B | 18 | ~1310 | 82% | HumanEval 78%, AIME 61% | HF
Qwen 3.5 9B | 9B | 18 | ~1220 | - | - | HF
Qwen 2.5 14B | 14B | 18 | ~1210 | - | - | HF
Ministral 14B | 14B | 18 | ~1215 | - | - | est.
Gemma 3 12B | 12B | 17 | ~1200 | - | - | HF
DeepSeek R1 8B | 8B | 16 | ~1190 | - | - | HF
Gemma 4 E4B | 4B (PLE) | 16 | ~1195 | - | - | Google
Qwen 3 8B | 8B | 15 | ~1180 | - | - | HF
Phi-4 Mini | 3.8B | 14 | ~1170 | - | - | HF
Ministral 8B | 8B | 14 | ~1165 | - | - | est.
Gemma 4 E2B | 2.3B (PLE) | 13 | ~1150 | - | - | Google
Mistral 7B | 7B | 13 | ~1140 | - | - | HF
Llama 3.1 8B | 8B | 12 | ~1130 | - | - | HF
Qwen 3.5 4B | 4B | 12 | ~1125 | - | - | HF
Qwen 3 4B | 4B | 11 | ~1110 | - | - | HF
Gemma 3 4B | 4B | 10 | ~1090 | - | - | HF
SmolLM3 3B | 3B | 10 | ~1180 | - | edge-optimized | HF

Dimension 2: Speed on Apple Silicon (25 points) — estimated

SPEED: 25 / 100 pts

Estimated tokens per second on the M5 Max 128 GB reference configuration at Q4_K_M — computed with the estimation model above, not measured in a lab.

Speed points are calculated as: est. tok/s × 0.25, capped at 25 points. Any model estimated at 100+ tok/s on the reference configuration receives full speed points. The estimates assume the reference conditions named above: an M5 Max with 128 GB of unified memory, running Q4_K_M quantization.

Models that cannot run on any Mac (server-only, >128 GB RAM) receive 0 speed points. The full dataset — with a provenance label on every row — is available for download at /data/.

Dimension 3: Accessibility (15 points)

ACCESSIBILITY: 15 / 100 pts

How many Mac users can actually run this model? Lower RAM requirements = higher accessibility score.

Accessibility is a step function based on minimum RAM required at 4-bit quantization, derived from the RAM model above (weights + KV cache + macOS overhead, within the ~75% unified-memory budget). Since a large share of Mac users have 16 GB or less, accessibility is crucial for real-world impact:

Min RAM (4-bit) | Points | Example Models | Est. Mac Users
≤ 8 GB | 15 | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 9B | ~100% of Apple Silicon
≤ 16 GB | 12 | Qwen 3 14B, Gemma 3 12B | ~85%
≤ 24 GB | 10 | Gemma 4 26B-A4B, Gemma 4 31B, Qwen 3.5 35B | ~40%
≤ 32 GB | 9 | QwQ 32B, Qwen 3 32B | ~30%
≤ 64 GB | 5 | DeepSeek R1 70B, Llama 3.3 70B | ~10%
≤ 128 GB | 2 | GPT-oss 120B, Mixtral 8x22B | ~3%
> 128 GB | 0 | Kimi K2.5, DeepSeek V3, GLM-5.1 | Server only

Dimension 4: License Openness (10 points)

LICENSE: 10 / 100 pts

How freely can you use, modify, and distribute the model? More open = higher score.

License | Points | Can Modify? | Commercial Use? | Models
MIT | 10 | Yes | Unrestricted | Kimi K2.5, DeepSeek, Phi-4 Mini
Apache 2.0 | 8 | Yes | Yes (with notice) | Gemma 4, Qwen 3.5, Mistral
Gemma | 6 | Yes | Yes (<700M users) | Gemma 3 (old license)
Meta Custom | 5 | Limited | Yes (<700M users) | Llama 5, Llama 4
xAI Custom | 4 | Limited | Yes (restrictions) | Grok 4 Open
CC-BY-NC | 2 | Yes | Non-commercial only | Command R+ 2
Proprietary / N/A | 2 | No | API only | MiniMax M2.5

Note: CC-BY-NC (non-commercial) scores 2 — usable for research but not commercial deployment. Meta / xAI community licenses score 5 / 4, reflecting modification rights with commercial caps.

Score Examples

Gemma 4 26B-A4B (Score: 67) = capScore 35 + speed min(25, est. 48 tok/s × 0.25 = 12) + accessibility 10 (24 GB) + license 8 (Apache 2.0) + rounding = 65–67. Top-ranked because it combines Arena AI #6 quality with fast MoE inference on a 24 GB Mac.

Qwen 3.5 9B (Score: 66) = capScore 18 + speed min(25, est. 100 tok/s × 0.25 = 25) + accessibility 15 (8 GB) + license 8 (Apache 2.0) = 66. Ranks high because maximum speed + accessibility points compensate for lower raw capability.

Kimi K2.5 (Score: 60) = capScore 50 (highest!) + speed 0 (server only) + accessibility 0 (>128 GB) + license 10 (MIT) = 60. Despite being the most capable model, it scores lower because no Mac user can run it locally.

Limitations & Known Issues

Corrections and real-world benchmark data are always welcome. If you have runs that differ from the index's estimates, submit them through the community benchmark process — verified community rows replace estimates.

Frequently Asked Questions

Does LLMCheck run its own benchmarks?

No. LLMCheck is an independent index, not a benchmark lab. Every figure on the site is one of three things: an estimate from the published LLMCheck estimation model (memory-bandwidth math and quantization arithmetic, fully documented on this page), a sourced number from a linked third-party benchmark, or a community-submitted run. Each row is labeled with its provenance, and no figure is claimed as a first-party lab measurement.

How does LLMCheck calculate its scores?

The LLMCheck Score is a 0–100 composite metric: Capability (50 pts) sourced from published third-party evaluations such as Arena AI ELO ratings and MMLU/coding benchmarks, Speed (25 pts) from estimated tokens/sec on the M5 Max reference configuration, Accessibility (15 pts) as a step function of minimum RAM, with lower requirements scoring higher, and License Openness (10 pts) where MIT scores 10 and restrictive licenses score lower. The formula is fully transparent and reproducible.

Where does LLMCheck get its capability scores?

Capability scores are derived from three public benchmark sources: Arena AI ELO ratings (human preference, weighted 40%), MMLU scores from official model cards (knowledge breadth, weighted 35%), and coding benchmarks like HumanEval and SWE-Bench (weighted 25%). All sources are linked per model. When official benchmarks are unavailable, the index extrapolates from related models in the same family and marks the score as 'estimated'.

How does LLMCheck estimate tokens per second on Apple Silicon?

Token generation on Apple Silicon is memory-bandwidth-bound, so estimated tok/s ≈ (memory bandwidth in GB/s ÷ model size in GB at Q4_K_M) × an efficiency factor of roughly 0.6–0.8 observed for llama.cpp/MLX-class runtimes. Mixture-of-Experts models use active-parameter bytes instead of total size, which is why they are much faster. Scaling across chips follows the bandwidth ratio. The full model is published at llmcheck.net/methodology#estimation, and every estimated figure is labeled as such.

Why does LLMCheck weight capability at 50% of the total score?

Capability receives the highest weight because the primary value of an LLM is the quality of its outputs. A very fast model that gives poor answers is less useful than a slower model with excellent reasoning. However, speed (25%) and accessibility (15%) ensure that models which actually run well on consumer Macs score higher than server-only models with superior capability but no practical local use.

How often are LLMCheck scores updated?

Scores are updated within 48–72 hours of major model releases. The full leaderboard is refreshed monthly with the latest Arena AI ELO ratings and community benchmark submissions. Speed estimates are recomputed as new Apple Silicon hardware specifications become available. All updates are timestamped in the open dataset at llmcheck.net/data/.

See the Full Leaderboard

79 models ranked by LLMCheck Score. Filter by your Mac's RAM, sort by speed or capability — every figure labeled with its provenance.
