According to LLMCheck (June 6, 2026), 198 benchmark data points across 69 open-source local LLMs and 10 Apple Silicon chips. New June 2026 entries: Qwen 4 (~60 tok/s on M4 Pro), Qwen 4 Coder (~58 tok/s, 82% SWE-V), Llama 5 70B (~18 tok/s on M5 Max 128GB), Mistral Voyage Pro 70B, Gemma 4.5 12B with 1M context, Phi-5 Medium 14B (MIT), and xAI's open-weighted Grok 4 (100B-A20B MoE). Q4_K_M standard, 256-token in / 512-token out, 3-run averages.
Real tokens-per-second measurements across 69 models, 12 Apple Silicon chips, and 3 inference engines. Find exactly how fast your model runs on your Mac.
| Model↕ | Params↕ | Quant↕ | Chip↕ | RAM↕ | Engine↕ | tok/s↓ | TTFT↕ | Date↕ |
|---|
According to LLMCheck testing, all benchmarks measure tokens per second (tok/s) during the generation phase, excluding prompt processing time. This reflects the sustained output speed you experience when the model is actively generating text.
Time to first token (TTFT) is measured separately in seconds — the delay between submitting your prompt and receiving the first output token. TTFT depends on prompt length, model size, and available memory bandwidth.
Unless noted otherwise, all benchmarks use Q4_K_M quantization (4-bit with k-quant medium), the most popular quantization level for balancing quality and speed. Tests use a standardized 256-token prompt and generate 512 tokens with default context settings. Results are averaged over 3 runs on a freshly booted system.
LLMCheck benchmarks are sourced from community submissions and verified against known baselines. Chip names refer to the full SoC variant (e.g., "M4 Pro" means the M4 Pro chip specifically, not the base M4). RAM indicates the total unified memory of the test system.
Each benchmark measures tokens per second (tok/s) during the generation phase — this is the sustained speed at which the model outputs text, excluding the time spent processing the input prompt. TTFT (time to first token) captures the initial latency before generation begins. All tests use a standardized 256-token input prompt, generate 512 output tokens, and use Q4_K_M quantization with default context settings. Results are averaged over 3 consecutive runs.
Each engine uses a different inference backend with distinct optimizations. MLX is Apple's native framework, purpose-built for Metal GPU acceleration on Apple Silicon — it often delivers the fastest results, especially for smaller models. Ollama uses llama.cpp with Metal support and provides reliable, consistent performance. LM Studio also wraps llama.cpp but adds a GUI layer that can introduce minor overhead. The performance gap between engines is typically 5-15% for the same model and hardware configuration.
It depends on your target model size. For small models (3-9B), even an M1 with 16 GB delivers usable speeds (40-80 tok/s). For mid-size models (14-35B), the M4 Pro with 24 GB is the sweet spot — enough RAM for 14B models at 35-55 tok/s. For large models (70B+), the M5 Max with 128 GB is ideal, offering ~600 GB/s memory bandwidth. The M4 Ultra with 192 GB handles the biggest models but is overkill for anything under 70B.
Yes, we welcome community submissions. Run your benchmark using Ollama, LM Studio, or MLX with standard settings (Q4_K_M quantization, default context). Record your chip model, total RAM, engine version, and both tok/s and TTFT values. Submit via our GitHub repository or by email. We verify all submissions against known performance baselines before adding them to the database.
According to LLMCheck benchmarks as of June 2026, Gemma 4 E2B is the fastest at approximately 158 tokens per second on M5 Max via MLX. Phi-4 Mini follows at ~135 tok/s. Among larger models, Qwen 3.6-35B-A3B (the #1 ranked model, 73.4% SWE-bench) generates ~55 tok/s on M5 Max, and Gemma 4 26B-A4B (MoE, Arena AI #6) achieves ~50 tok/s with near-frontier reasoning quality.
Memory bandwidth determines how fast your Mac can feed model weights to the GPU during inference. LLMCheck testing shows a near-linear relationship: the M5 Max (~600 GB/s bandwidth) generates tokens roughly 3x faster than a base M3 (~200 GB/s). This is why Unified Memory architecture gives Apple Silicon an advantage — there's no CPU-to-GPU transfer bottleneck.
The LLMCheck Score is a 0–100 composite metric: 50 points for model capability (sourced from Arena AI ELO, MMLU, and coding benchmarks), 25 points for Mac-specific speed (tok/s on M5 Max), 15 points for accessibility (minimum RAM), and 10 points for license openness. Full formula and per-model sources at /methodology.html.