What is Quantization?
When a large language model is trained, its weights are stored as 16-bit floating point numbers (FP16). Each weight takes 2 bytes of memory. A 7 billion parameter model at FP16 requires roughly 14GB of RAM — and that is just for the weights, before accounting for context and overhead.
Quantization reduces the precision of these numbers. Instead of using 16 bits per weight, you can use 8, 6, 5, or even 4 bits. This means the same 7B model can fit in 4GB instead of 14GB. The trade-off is a small reduction in output quality — but modern quantization techniques have gotten remarkably good at minimizing this loss.
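The arithmetic behind these numbers is simple enough to sketch in a couple of lines of shell. This is a rough estimate of weight memory only; the 7B figures are illustrative:

```shell
# Rough weight-memory estimate: parameters (in billions) x bits per
# weight / 8 gives gigabytes for the weights alone. Real GGUF files
# run somewhat larger due to metadata and mixed-precision layers.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16   # FP16: 14.0 GB
estimate_gb 7 4    # 4-bit: 3.5 GB (actual file closer to ~4 GB)
```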
The key insight is that most model weights cluster near zero and do not need 16 bits of precision. By spending fewer bits on the majority of weights and preserving higher precision for the most important ones (this mixed-precision approach is what the "K" in Q4_K_M refers to), you can achieve near-FP16 quality at a fraction of the memory cost.
Key takeaway: Quantization lets you run models that would otherwise not fit in your Mac's RAM. A 32GB Mac running Q4_K_M can handle models that would need 64GB+ at full precision.
Quantization Levels Compared
Here is how the most common quantization levels compare for a typical 7B parameter model:
| Level | Bits | Size (7B) | Quality Loss | Speed Impact |
|---|---|---|---|---|
| Q4_K_M | 4-bit | ~4.0 GB | Minimal | Fastest |
| Q5_K_M | 5-bit | ~4.8 GB | Very low | Fast |
| Q6_K | 6-bit | ~5.5 GB | Negligible | Moderate |
| Q8_0 | 8-bit | ~7.2 GB | Virtually none | Slower |
| FP16 | 16-bit | ~14 GB | Baseline | Slowest |
The "K_M" suffix stands for "K-quant Medium" — a quantization method that uses different bit widths for different layers of the model. The most important layers (attention layers) get higher precision, while less critical layers get lower precision. This approach preserves quality much better than naive uniform quantization.
Other variants you may encounter include Q4_K_S (K-quant Small, slightly worse quality but slightly smaller) and Q4_0 (older uniform quantization, noticeably worse quality — avoid this one if K-quant versions are available).
How to Choose for Your Mac
Your Mac's unified memory is the primary constraint. Here is a practical decision guide based on your RAM:
8GB Mac (MacBook Air base model)
You are limited to Q4_K_M only, and only for models up to about 7B parameters. After macOS takes its share of memory (roughly 4-5GB), you have about 3-4GB left for the model. Stick with models like Qwen 3.5 4B or Phi-4 Mini 3.8B for the best experience. Larger models will swap to disk and become unusably slow.
16GB Mac
You can comfortably run 7-9B models at Q4_K_M or Q5_K_M. Q5_K_M is worth the extra ~800MB for noticeably better output on reasoning and creative tasks. You can also squeeze in a 14B model at Q4_K_M, though you will want to close other apps first.
32GB+ Mac
This is the sweet spot. Run 7-9B models at Q5_K_M or Q6_K for near-perfect quality, or step up to 32-35B models at Q4_K_M. A 32B model at Q4 typically uses about 18-20GB, leaving plenty of headroom. If quality matters more than speed, Q6_K on a 9B model delivers output nearly indistinguishable from FP16.
64GB+ Mac
You have the luxury of choice. Run smaller models at Q8_0 or even FP16 for perfect quality, or run massive 70B+ models at Q4_K_M. At 64GB, you can run Llama 4 Scout at Q4_K_M (see our Llama 4 guide). At 96GB+, consider Q5_K_M for 70B models for the best balance of quality and capability.
Rule of thumb: Always leave at least 4-6GB of free RAM for macOS and other processes. If a model barely fits, you will experience memory pressure and performance will degrade significantly.
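As a quick sanity check, you can read total memory with the built-in sysctl and compare it against a model's estimated size. This is a sketch: the 32GB fallback value, model size, and reserve figure below are illustrative placeholders:

```shell
# Read total unified memory (sysctl hw.memsize is built into macOS;
# the fallback value of 32 GB is just an illustrative default).
total_bytes=$(sysctl -n hw.memsize 2>/dev/null || echo 34359738368)
total_gb=$(awk -v b="$total_bytes" 'BEGIN { printf "%.0f", b / (1024 ^ 3) }')

model_gb=4      # e.g. a 7B model at Q4_K_M
reserve_gb=5    # headroom for macOS and other apps

if [ $((total_gb - reserve_gb)) -ge "$model_gb" ]; then
  echo "OK: a ~${model_gb}GB model should fit in ${total_gb}GB"
else
  echo "Tight: expect memory pressure and swapping"
fi
```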
Quality vs Speed Benchmarks
The relationship between quantization and quality is not linear. The jump from FP16 to Q8_0 has virtually zero quality impact. The jump from Q8_0 to Q5_K_M is still very small. The most noticeable quality drop happens below Q4_K_M — which is why Q4_K_M is generally considered the floor for usable quality.
Speed, on the other hand, scales more linearly with quantization level. Lower bit counts mean smaller model files, which means faster memory reads, which means more tokens per second. On Apple Silicon, where memory bandwidth is the primary bottleneck, this relationship is very consistent.
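That bandwidth bound gives a useful back-of-envelope ceiling: each generated token must read essentially every weight once, so tokens per second cannot exceed memory bandwidth divided by model size. The bandwidth figure below is a rough assumption for illustration, not a measurement:

```shell
# Decode-speed ceiling on bandwidth-bound hardware:
# tokens/sec <= memory bandwidth (GB/s) / model size (GB).
tps_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.0f\n", bw / gb }'
}

tps_ceiling 400 4.0    # ~400 GB/s class chip, 7B at Q4_K_M: ~100 tok/s
tps_ceiling 400 14.0   # same chip, same model at FP16: ~29 tok/s
```

This is why the same model at Q4_K_M generates tokens roughly 3-4x faster than at FP16 on the same machine.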
For real-world speed comparisons across different models and Mac configurations, visit our benchmarks page. You can filter by model, quantization level, and Mac hardware to find the exact configuration you are considering.
For model quality rankings and head-to-head comparisons, check the leaderboard.
How to Download Specific Quants
Most tools default to Q4_K_M, but you can request any quantization level:
With Ollama
Ollama uses tags to specify quantization. Add the quant level after a colon:
```shell
# Default (usually Q4_K_M)
ollama pull qwen3.5:9b

# Specific quantization
ollama pull qwen3.5:9b-q5_K_M
ollama pull qwen3.5:9b-q8_0
ollama pull qwen3.5:9b-fp16
```
Run `ollama show qwen3.5:9b` to see what quantization level a downloaded model is using. Not sure how to set up Ollama? Follow our installation guide.
From Hugging Face (GGUF files)
Hugging Face hosts GGUF files directly. Search for any model name followed by "GGUF" to find quantized versions. Each file is typically named with its quantization level:
```shell
# Example GGUF filenames you will see:
model-Q4_K_M.gguf   # 4-bit, recommended default
model-Q5_K_M.gguf   # 5-bit, higher quality
model-Q6_K.gguf     # 6-bit, near-lossless
model-Q8_0.gguf     # 8-bit, lossless for most tasks
model-F16.gguf      # Full precision, largest file
```
Download the GGUF file you want and load it directly in LM Studio, llama.cpp, or create a custom Ollama Modelfile to import it.
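The Ollama import path can be sketched in a few commands, assuming Ollama is installed; the model name and GGUF filename below are placeholders for whatever you downloaded:

```shell
# Point a one-line Modelfile at the downloaded GGUF, then register it.
# "my-model" and the .gguf filename are placeholders.
cat > Modelfile <<'EOF'
FROM ./model-Q4_K_M.gguf
EOF

ollama create my-model -f Modelfile
ollama run my-model
```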
Tip: When in doubt, start with Q4_K_M. It is the community default for good reason — it offers the best balance of speed, size, and quality for the vast majority of users and tasks. Only move to higher quants if you have the RAM and notice quality issues in your specific use case.