LLMCheck benchmarks measure real-world LLM inference speed on Apple Silicon in tokens per second (tok/s). According to LLMCheck testing, the fastest model is Gemma 4 E2B at ~158 tok/s on M5 Max via MLX. The new #1 ranked model, Qwen 3.6-35B-A3B, generates ~55 tok/s on M5 Max while scoring 73.4% on SWE-bench Verified. Google's Gemma 4 26B-A4B MoE reaches ~50 tok/s with Arena AI #6 quality. Memory bandwidth remains the key bottleneck: the M5 Max (~600 GB/s) has roughly 10% more bandwidth than the M4 Max (~546 GB/s), and tok/s scales nearly in step with it.
Real tokens-per-second measurements across 50 models, 12 Apple Silicon chips, and 3 inference engines. Find out exactly how fast your model runs on your Mac.
| Model↕ | Params↕ | Quant↕ | Chip↕ | RAM↕ | Engine↕ | tok/s↓ | TTFT↕ | Date↕ |
|---|---|---|---|---|---|---|---|---|
According to LLMCheck testing, all benchmarks measure tokens per second (tok/s) during the generation phase, excluding prompt processing time. This reflects the sustained output speed you experience when the model is actively generating text.
Time to first token (TTFT) is measured separately in seconds — the delay between submitting your prompt and receiving the first output token. TTFT depends on prompt length, model size, and available memory bandwidth.
Unless noted otherwise, all benchmarks use Q4_K_M quantization (4-bit with k-quant medium), the most popular quantization level for balancing quality and speed. Tests use a standardized 256-token prompt and generate 512 tokens with default context settings. Results are averaged over 3 runs on a freshly booted system.
LLMCheck benchmarks are sourced from community submissions and verified against known baselines. Chip names refer to the full SoC variant (e.g., "M4 Pro" means the M4 Pro chip specifically, not the base M4). RAM indicates the total unified memory of the test system.
Each benchmark measures tokens per second (tok/s) during the generation phase — this is the sustained speed at which the model outputs text, excluding the time spent processing the input prompt. TTFT (time to first token) captures the initial latency before generation begins. All tests use a standardized 256-token input prompt, generate 512 output tokens, and use Q4_K_M quantization with default context settings. Results are averaged over 3 consecutive runs.
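The two metrics above can be computed from three timestamps: when the prompt is submitted, when the first token arrives, and when generation finishes. The sketch below shows the arithmetic; `generate` is a hypothetical stand-in for your engine's streaming API (a callable yielding tokens one at a time), not a real LLMCheck or engine function.

```python
import time

def measure_generation(generate, prompt, max_tokens=512):
    """Return (TTFT in seconds, generation-phase tok/s).

    `generate` is a placeholder: any callable that yields output
    tokens one at a time for the given prompt.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _tok in generate(prompt, max_tokens):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrives
        n_tokens += 1
    end = time.perf_counter()
    if n_tokens == 0:
        return None, 0.0
    ttft = first_token_at - start        # prompt-processing latency
    gen_time = end - first_token_at      # generation phase only
    # n_tokens - 1 intervals elapsed between first and last token
    tok_s = (n_tokens - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, tok_s
```

Note that TTFT is excluded from the tok/s calculation, matching the methodology above: the rate reflects only the sustained generation phase.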
Each engine uses a different inference backend with distinct optimizations. MLX is Apple's native framework, purpose-built for Metal GPU acceleration on Apple Silicon — it often delivers the fastest results, especially for smaller models. Ollama uses llama.cpp with Metal support and provides reliable, consistent performance. LM Studio also wraps llama.cpp but adds a GUI layer that can introduce minor overhead. The performance gap between engines is typically 5-15% for the same model and hardware configuration.
It depends on your target model size. For small models (3-9B), even an M1 with 16 GB delivers usable speeds (40-80 tok/s). For mid-size models (14-35B), the M4 Pro with 24 GB is the sweet spot — enough RAM for 14B models at 35-55 tok/s. For large models (70B+), the M5 Max with 128 GB is ideal, offering ~600 GB/s memory bandwidth. The M4 Ultra with 192 GB handles the biggest models but is overkill for anything under 70B.
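A common rule of thumb (not an LLMCheck formula) helps sanity-check these RAM tiers: quantized weights occupy roughly params × bits / 8 bytes, plus some overhead for the KV cache and runtime buffers. The overhead factor below is a hypothetical ~20% allowance.

```python
def est_ram_gb(params_b, bits=4, overhead=1.2):
    """Rough memory footprint of a quantized model in GB.

    params_b: parameter count in billions; bits: quantization width.
    overhead (hypothetical ~20%) covers KV cache and runtime buffers.
    """
    weights_gb = params_b * bits / 8  # 1B params at 4-bit ~ 0.5 GB
    return weights_gb * overhead
```

Under these assumptions a 14B model at Q4 needs about 8-9 GB, comfortably inside 24 GB, while a 70B model needs around 42 GB, which is why the 128 GB tier is recommended for that class.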
Yes, we welcome community submissions. Run your benchmark using Ollama, LM Studio, or MLX with standard settings (Q4_K_M quantization, default context). Record your chip model, total RAM, engine version, and both tok/s and TTFT values. Submit via our GitHub repository or by email. We verify all submissions against known performance baselines before adding them to the database.
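A submission covering the fields listed above might look like the following sketch. The field names and schema are hypothetical, chosen only to illustrate what a complete record contains; check the GitHub repository for the actual required format.

```json
{
  "model": "Qwen 3.6-35B-A3B",
  "quant": "Q4_K_M",
  "chip": "M4 Pro",
  "ram_gb": 24,
  "engine": "Ollama",
  "engine_version": "0.6.2",
  "tok_s": 38.5,
  "ttft_s": 1.4,
  "runs_averaged": 3
}
```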
According to LLMCheck benchmarks as of April 2026, Gemma 4 E2B is the fastest at approximately 158 tokens per second on M5 Max via MLX. Phi-4 Mini follows at ~135 tok/s. Among larger models, Qwen 3.6-35B-A3B (the #1 ranked model, 73.4% SWE-bench) generates ~55 tok/s on M5 Max, and Gemma 4 26B-A4B (MoE, Arena AI #6) achieves ~50 tok/s with near-frontier reasoning quality.
Memory bandwidth determines how fast your Mac can feed model weights to the GPU during inference. LLMCheck testing shows a near-linear relationship: the M5 Max (~600 GB/s bandwidth) generates tokens roughly 3x faster than a base M3 (~200 GB/s). This is why Unified Memory architecture gives Apple Silicon an advantage — there's no CPU-to-GPU transfer bottleneck.
The LLMCheck Score is a 0–100 composite metric: 50 points for model capability (sourced from Arena AI ELO, MMLU, and coding benchmarks), 25 points for Mac-specific speed (tok/s on M5 Max), 15 points for accessibility (minimum RAM), and 10 points for license openness. Full formula and per-model sources at /methodology.html.
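With the weights stated above, the composite reduces to a weighted sum. The sketch below assumes each component has already been normalized to 0-1 (how each subscore is derived from Arena AI ELO, MMLU, tok/s, and so on is documented at /methodology.html; the normalization step is not shown here).

```python
def llmcheck_score(capability, speed, accessibility, openness):
    """Composite 0-100 score using the published weights.

    Each argument is assumed to be a 0-1 normalized subscore;
    the normalization itself is defined at /methodology.html.
    """
    return (50 * capability      # model capability (ELO, MMLU, coding)
            + 25 * speed         # tok/s on M5 Max
            + 15 * accessibility # minimum RAM required
            + 10 * openness)     # license openness
```

A model that maxes every subscore reaches exactly 100, and the four weights sum to the full scale by construction.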