🔬 Updated March 2026 · 180+ Benchmarks

Apple Silicon LLM Benchmarks

Real tokens-per-second measurements across 42 models, 12 Apple Silicon chips, and 3 inference engines. Find out exactly how fast your model runs on your Mac.


Methodology

All benchmarks measure tokens per second (tok/s) during the generation phase, excluding prompt processing time. This reflects the sustained output speed you experience when the model is actively generating text.

Time to first token (TTFT) is measured separately in seconds — the delay between submitting your prompt and receiving the first output token. TTFT depends on prompt length, model size, and available memory bandwidth.

Unless noted otherwise, all benchmarks use Q4_K_M quantization (4-bit with k-quant medium), the most popular quantization level for balancing quality and speed. Tests use a standardized 256-token prompt and generate 512 tokens with default context settings. Results are averaged over 3 runs on a freshly booted system.
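The timing split described above (TTFT vs. generation-phase tok/s) can be sketched in a few lines of Python. This is an illustrative harness, not the site's actual benchmark code; `stream_tokens` stands in for any streaming call (e.g. a wrapper around an Ollama or MLX client) and is an assumed interface.

```python
import time

def benchmark(stream_tokens, n_runs=3):
    """Average generation tok/s and TTFT over n_runs.

    `stream_tokens` is any callable that yields output tokens one at
    a time (a hypothetical wrapper around your engine's streaming API).
    """
    tok_s_samples, ttft_samples = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _tok in stream_tokens():
            if first_token_at is None:
                # TTFT: delay from prompt submission to first token
                first_token_at = time.perf_counter()
            n_tokens += 1
        end = time.perf_counter()
        ttft_samples.append(first_token_at - start)
        # Generation-phase speed only: the clock starts at the first
        # token, so prompt processing (TTFT) is excluded.
        tok_s_samples.append((n_tokens - 1) / (end - first_token_at))
    return sum(tok_s_samples) / n_runs, sum(ttft_samples) / n_runs
```

The key detail is that the tok/s clock starts at the *first* token, not at prompt submission, which is why a long prompt inflates TTFT but leaves the reported generation speed unchanged.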

Benchmarks are sourced from community submissions and verified against known baselines. Chip names refer to the full SoC variant (e.g., "M4 Pro" means the M4 Pro chip specifically, not the base M4). RAM indicates the total unified memory of the test system.

Frequently Asked Questions

How are these benchmarks measured?

Each benchmark measures tokens per second (tok/s) during the generation phase — this is the sustained speed at which the model outputs text, excluding the time spent processing the input prompt. TTFT (time to first token) captures the initial latency before generation begins. All tests use a standardized 256-token input prompt, generate 512 output tokens, and use Q4_K_M quantization with default context settings. Results are averaged over 3 consecutive runs.

Why does tok/s vary between Ollama, LM Studio, and MLX?

Each engine uses a different inference backend with distinct optimizations. MLX is Apple's native framework, purpose-built for Metal GPU acceleration on Apple Silicon — it often delivers the fastest results, especially for smaller models. Ollama uses llama.cpp with Metal support and provides reliable, consistent performance. LM Studio also wraps llama.cpp but adds a GUI layer that can introduce minor overhead. The performance gap between engines is typically 5-15% for the same model and hardware configuration.

Which Apple Silicon chip is best for local AI?

It depends on your target model size. For small models (3-9B), even an M1 with 16 GB delivers usable speeds (40-80 tok/s). For mid-size models (14-35B), the M4 Pro with 24 GB is the sweet spot — enough RAM for 14B models at 35-55 tok/s. For large models (70B+), the M5 Max with 128 GB is ideal, offering ~600 GB/s memory bandwidth. The M4 Ultra with 192 GB handles the biggest models but is overkill for anything under 70B.
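A useful rule of thumb behind these recommendations: decoding on Apple Silicon is largely memory-bandwidth-bound, since each generated token must read every model weight once. A back-of-envelope upper bound is therefore bandwidth divided by model size. The sketch below assumes roughly 4.5 bits per weight for Q4_K_M (4-bit weights plus quantization scales) and uses an assumed bandwidth figure purely for illustration; real throughput comes in below this bound.

```python
def est_tok_s(bandwidth_gb_s, params_b, bits_per_weight=4.5):
    """Rough upper bound on generation tok/s for a quantized model.

    bits_per_weight ~= 4.5 approximates Q4_K_M (4-bit weights plus
    per-block scales). Treat the result as a ceiling, not a prediction.
    """
    model_gb = params_b * bits_per_weight / 8  # weight bytes, in GB
    return bandwidth_gb_s / model_gb

# Example: a 14B model at Q4_K_M on a chip with ~273 GB/s of memory
# bandwidth (assumed figure for illustration):
print(round(est_tok_s(273, 14)))  # → 35
```

This is why bandwidth, not raw compute, is the headline spec for large-model decoding: doubling bandwidth roughly doubles the tok/s ceiling for a given model size.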

Can I submit my own benchmarks?

Yes, we welcome community submissions. Run your benchmark using Ollama, LM Studio, or MLX with standard settings (Q4_K_M quantization, default context). Record your chip model, total RAM, engine version, and both tok/s and TTFT values. Submit via our GitHub repository or by email. We verify all submissions against known performance baselines before adding them to the database.
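The fields requested above can be collected into a simple record before submitting. The structure below is hypothetical (the project may define its own schema); all values are example placeholders, and the check only enforces that the fields named in this FAQ are present.

```python
import json

# Hypothetical submission record; every value here is an example.
submission = {
    "model": "example-model-8b",
    "quant": "Q4_K_M",          # standard setting for this database
    "chip": "M4 Pro",
    "ram_gb": 24,
    "engine": "ollama 0.6.0",   # engine name + version (example value)
    "tok_s": 52.3,              # generation-phase tokens per second
    "ttft_s": 0.41,             # time to first token, in seconds
}

required = {"model", "quant", "chip", "ram_gb", "engine", "tok_s", "ttft_s"}
assert required <= submission.keys(), "missing required fields"
print(json.dumps(submission, indent=2))
```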