🔬 Updated March 2026 · 180+ Benchmarks

Apple Silicon LLM Benchmarks

Real tokens-per-second measurements across 42 models, 12 Apple Silicon chips, and 3 inference engines. Find out exactly how fast your model runs on your Mac.


Methodology

All benchmarks measure tokens per second (tok/s) during the generation phase, excluding prompt processing time. This reflects the sustained output speed you experience when the model is actively generating text.

Time to first token (TTFT) is measured separately in seconds — the delay between submitting your prompt and receiving the first output token. TTFT depends on prompt length, model size, and available memory bandwidth.

Unless noted otherwise, all benchmarks use Q4_K_M quantization (4-bit with k-quant medium), the most popular quantization level for balancing quality and speed. Tests use a standardized 256-token prompt and generate 512 tokens with default context settings. Results are averaged over 3 runs on a freshly booted system.
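The timing split described above (TTFT vs. generation-phase tok/s) can be sketched in a few lines of Python. This is an illustrative harness, not the site's actual benchmark code; `stream_tokens` stands in for any streaming call (e.g. a wrapper around an Ollama or MLX client) and is an assumed interface.

```python
import time

def benchmark(stream_tokens, n_runs=3):
    """Average generation tok/s and TTFT over n_runs.

    `stream_tokens` is any callable that yields output tokens one at
    a time (a hypothetical wrapper around your engine's streaming API).
    """
    tok_s_samples, ttft_samples = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _tok in stream_tokens():
            if first_token_at is None:
                # TTFT: delay from prompt submission to first token
                first_token_at = time.perf_counter()
            n_tokens += 1
        end = time.perf_counter()
        ttft_samples.append(first_token_at - start)
        # Generation-phase speed only: the clock starts at the first
        # token, so prompt processing (TTFT) is excluded.
        tok_s_samples.append((n_tokens - 1) / (end - first_token_at))
    return sum(tok_s_samples) / n_runs, sum(ttft_samples) / n_runs
```

The key detail is that the tok/s clock starts at the *first* token, not at prompt submission, which is why a long prompt inflates TTFT but leaves the reported generation speed unchanged.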

Benchmarks are sourced from community submissions and verified against known baselines. Chip names refer to the full SoC variant (e.g., "M4 Pro" means the M4 Pro chip specifically, not the base M4). RAM indicates the total unified memory of the test system.

Frequently Asked Questions

How are these benchmarks measured?

Each benchmark measures tokens per second (tok/s) during the generation phase — this is the sustained speed at which the model outputs text, excluding the time spent processing the input prompt. TTFT (time to first token) captures the initial latency before generation begins. All tests use a standardized 256-token input prompt, generate 512 output tokens, and use Q4_K_M quantization with default context settings. Results are averaged over 3 consecutive runs.

Why does tok/s vary between Ollama, LM Studio, and MLX?

Each engine uses a different inference backend with distinct optimizations. MLX is Apple's native framework, purpose-built for Metal GPU acceleration on Apple Silicon — it often delivers the fastest results, especially for smaller models. Ollama uses llama.cpp with Metal support and provides reliable, consistent performance. LM Studio also wraps llama.cpp but adds a GUI layer that can introduce minor overhead. The performance gap between engines is typically 5-15% for the same model and hardware configuration.

Which Apple Silicon chip is best for local AI?

It depends on your target model size. For small models (3-9B), even an M1 with 16 GB delivers usable speeds (40-80 tok/s). For mid-size models (14-35B), the M4 Pro with 24 GB is the sweet spot — enough RAM for 14B models at 35-55 tok/s. For large models (70B+), the M5 Max with 128 GB is ideal, offering ~600 GB/s memory bandwidth. The M4 Ultra with 192 GB handles the biggest models but is overkill for anything under 70B.
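A useful rule of thumb behind these recommendations: decoding on Apple Silicon is largely memory-bandwidth-bound, since each generated token must read every model weight once. A back-of-envelope upper bound is therefore bandwidth divided by model size. The sketch below assumes roughly 4.5 bits per weight for Q4_K_M (4-bit weights plus quantization scales) and uses an assumed bandwidth figure purely for illustration; real throughput comes in below this bound.

```python
def est_tok_s(bandwidth_gb_s, params_b, bits_per_weight=4.5):
    """Rough upper bound on generation tok/s for a quantized model.

    bits_per_weight ~= 4.5 approximates Q4_K_M (4-bit weights plus
    per-block scales). Treat the result as a ceiling, not a prediction.
    """
    model_gb = params_b * bits_per_weight / 8  # weight bytes, in GB
    return bandwidth_gb_s / model_gb

# Example: a 14B model at Q4_K_M on a chip with ~273 GB/s of memory
# bandwidth (assumed figure for illustration):
print(round(est_tok_s(273, 14)))  # → 35
```

This is why bandwidth, not raw compute, is the headline spec for large-model decoding: doubling bandwidth roughly doubles the tok/s ceiling for a given model size.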

Can I submit my own benchmarks?

Yes, we welcome community submissions. Run your benchmark using Ollama, LM Studio, or MLX with standard settings (Q4_K_M quantization, default context). Record your chip model, total RAM, engine version, and both tok/s and TTFT values. Submit via our GitHub repository or by email. We verify all submissions against known performance baselines before adding them to the database.
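The fields requested above can be collected into a simple record before submitting. The structure below is hypothetical (the project may define its own schema); all values are example placeholders, and the check only enforces that the fields named in this FAQ are present.

```python
import json

# Hypothetical submission record; every value here is an example.
submission = {
    "model": "example-model-8b",
    "quant": "Q4_K_M",          # standard setting for this database
    "chip": "M4 Pro",
    "ram_gb": 24,
    "engine": "ollama 0.6.0",   # engine name + version (example value)
    "tok_s": 52.3,              # generation-phase tokens per second
    "ttft_s": 0.41,             # time to first token, in seconds
}

required = {"model", "quant", "chip", "ram_gb", "engine", "tok_s", "ttft_s"}
assert required <= submission.keys(), "missing required fields"
print(json.dumps(submission, indent=2))
```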