The 75% Rule

Not all of your Mac's RAM is available for AI models. macOS itself, system processes, the inference engine, and any open apps consume memory. According to LLMCheck testing, approximately 25% of total RAM is used by the system even with minimal apps running.

The formula: Available RAM for models ≈ Total RAM × 0.75

How to check: Open Activity Monitor → Memory tab. Look at "Memory Used" and "Memory Pressure." If pressure is green, you have headroom. Yellow means you are close to the limit. Red means you are swapping — the model is too large.
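The 75% rule is easy to turn into a quick check. The sketch below is illustrative (the function names and the fixed 25% overhead figure are assumptions based on the LLMCheck estimate above, not a macOS API):

```python
# Sketch of the 75% rule: estimate how much RAM is left for model
# weights after macOS and background processes take their share.

SYSTEM_OVERHEAD = 0.25  # ~25% of total RAM used by the system (per LLMCheck)

def available_for_models(total_ram_gb: float) -> float:
    """Available RAM for models = Total RAM x 0.75."""
    return total_ram_gb * (1 - SYSTEM_OVERHEAD)

def fits(model_size_gb: float, total_ram_gb: float) -> bool:
    """True if the model should fit without swapping."""
    return model_size_gb <= available_for_models(total_ram_gb)

print(available_for_models(32))  # 24.0 GB usable on a 32 GB Mac
print(fits(40, 32))              # False: a 40 GB model will swap
```

If `fits` returns False for the model you want, the solutions below are your options, in roughly the order you should try them.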

Solution 1: Use Smaller Quantization

Quantization reduces model precision from 16-bit floating point to 4-bit or even 2-bit integers. This shrinks the model dramatically with surprisingly little quality loss. According to LLMCheck benchmarks, Q4_K_M is the optimal choice — it reduces size by 70-75% with only 1-3% quality degradation.

| Quantization | 70B Model Size | Fits In | Quality Loss |
|---|---|---|---|
| F16 (full) | 140 GB | 192+ GB | None (baseline) |
| Q8_0 | 75 GB | 128 GB | ~0.5% |
| Q4_K_M | 40 GB | 64 GB | ~2% |
| Q4_K_S | 38 GB | 64 GB | ~3% |
| Q2_K | 27 GB | 36 GB | ~8-10% |
# Pull a specific quantization in Ollama
ollama pull llama3.1:70b-q4_K_M

# Check model size
ollama show llama3.1:70b-q4_K_M
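You can approximate these sizes yourself from parameter count and bits per weight. The sketch below is my own back-of-the-envelope estimate, not an Ollama calculation; the effective bits-per-weight figures are rough assumptions (K-quants mix precisions, so Q4_K_M averages a bit above 4 bits):

```python
# Rough size estimate: parameters x effective bits per weight.
# Bits-per-weight values are approximate assumptions for illustration.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 2.6}

def est_size_gb(params_billions: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # GB (1B params x 1 byte = 1 GB)

print(round(est_size_gb(70, "F16")))     # ~140 GB, matches the table
print(round(est_size_gb(70, "Q4_K_M")))  # ~42 GB, close to the 40 GB above
```

The estimate ignores the KV cache and runtime overhead, which is why the "Fits In" column leaves extra headroom beyond the raw model size.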

Solution 2: Switch to MoE Models

Mixture of Experts (MoE) models route each token through only a small subset of their parameters. All of the weights still need to fit in RAM, but because only a few billion parameters are active per token, inference runs at the speed of a much smaller dense model. In practice, this means better quality per gigabyte of RAM than a dense model of comparable speed.
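The trade-off can be sketched in two lines: RAM scales with total parameters, speed with active parameters. The numbers below are illustrative assumptions (the ~4.8 bits/weight Q4 figure and the inverse-speed model are simplifications, not benchmarks):

```python
# MoE trade-off sketch: RAM is driven by TOTAL parameters (all experts
# must be resident), speed by ACTIVE parameters per token.

def q4_ram_gb(total_params_b: float) -> float:
    return total_params_b * 4.8 / 8  # ~4.8 effective bits/weight at Q4

def relative_speed(active_params_b: float) -> float:
    return 1.0 / active_params_b  # tokens/sec roughly inverse in active params

dense_26b = (q4_ram_gb(26), relative_speed(26))   # same RAM, slow decode
moe_26b_a4b = (q4_ram_gb(26), relative_speed(4))  # same RAM, ~6x faster
```

This is why the table below shows a 26B MoE needing about as much RAM as a 26B dense model, while behaving like a 4B model at decode time.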

| Model | Type | RAM Needed (Q4) | Quality (MMLU) |
|---|---|---|---|
| Qwen 3.5 26B | Dense | ~16 GB | 82% |
| Gemma 4 26B-A4B | MoE (4B active) | ~18 GB | 79% |
| Llama 4 Scout | MoE (17B active) | ~65 GB | 84% |

According to LLMCheck, Gemma 4 26B-A4B is the best MoE option for 24-32 GB Macs. It needs only 18 GB of RAM at Q4 while delivering quality that rivals dense models twice its active parameter count.

ollama pull gemma4:26b-a4b

Solution 3: Use a Smaller Model in the Same Family

Model families like Qwen 3.5 and Gemma 4 offer multiple sizes. Dropping to a smaller variant in the same family preserves the model's training quality while fitting in less RAM.

Rule of thumb: A smaller model that fits entirely in RAM will always outperform a larger model that requires swapping. According to LLMCheck, Qwen 3.5 9B at Q4 (5.5 GB, 28 tok/s) is faster and more responsive than Qwen 3.5 35B at Q4 (20 GB) on a 16 GB Mac where the larger model causes swapping.

Solution 4: Partial GPU Offloading

If a model is slightly too large for full GPU offloading, you can split it between GPU and CPU. This is slower than full GPU but much faster than full swapping.

In Ollama, this happens automatically — when a model cannot fully fit in GPU memory, Ollama places as many layers as possible on GPU and the remainder on CPU. You can see the split with ollama ps:

ollama ps

# Partial offload example:
# NAME              SIZE    PROCESSOR      UNTIL
# qwen3.5:35b      20 GB   60% GPU/40% CPU  4 minutes from now

In LM Studio, set the GPU layers slider to a specific number rather than "max" to control how many layers go to GPU.
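If you want to predict the split before loading, a rough estimate follows the same logic Ollama uses: fit as many layers on the GPU as the budget allows, put the rest on CPU. The layer count and memory budget below are illustrative assumptions, not real model metadata:

```python
# Estimate a GPU/CPU layer split: as many layers as fit go to GPU.
# Assumes layers are roughly equal in size (true for most transformer LLMs).

def split_layers(model_gb: float, n_layers: int, gpu_budget_gb: float):
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int(gpu_budget_gb / per_layer_gb))
    return gpu_layers, n_layers - gpu_layers

gpu, cpu = split_layers(model_gb=20, n_layers=80, gpu_budget_gb=12)
print(f"{gpu} layers on GPU, {cpu} on CPU")  # 48 on GPU, 32 on CPU
```

With these numbers the split lands at 60% GPU / 40% CPU, matching the `ollama ps` output shown above.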

RAM Tier Guide: What Fits in Your Mac

According to LLMCheck benchmarks, here is what you can comfortably run at each RAM tier with Q4_K_M quantization:

| Total RAM | Available (~75%) | Best Models | Max Dense Size |
|---|---|---|---|
| 8 GB | ~6 GB | Gemma 4 E4B, Phi-4 Mini, Qwen 3.5 4B | Up to 4B |
| 16 GB | ~12 GB | Qwen 3.5 9B, Gemma 4 E4B, Llama 3.1 8B | Up to 9B |
| 24 GB | ~18 GB | Qwen 3.5 14B, Gemma 4 26B-A4B (MoE) | Up to 14B |
| 32 GB | ~24 GB | Qwen 3.5 35B, Gemma 4 31B Dense | Up to 32B |
| 64 GB | ~48 GB | Llama 3.1 70B, Qwen 3.5 72B, Llama 4 Scout (MoE) | Up to 70B |
| 128 GB | ~96 GB | Any model at any quantization | Up to 120B+ |
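The tier table above reduces to a simple lookup. This sketch encodes it as data (the tier boundaries mirror the table; the function name is my own):

```python
# RAM tier lookup: largest dense model size that comfortably fits at Q4_K_M.
TIERS = [(8, "4B"), (16, "9B"), (24, "14B"),
         (32, "32B"), (64, "70B"), (128, "120B+")]

def max_dense_size(total_ram_gb: int) -> str:
    """Return the max dense size for the highest tier you meet."""
    best = "none"
    for tier_ram, size in TIERS:
        if total_ram_gb >= tier_ram:
            best = size
    return best

print(max_dense_size(24))  # 14B
print(max_dense_size(64))  # 70B
```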

When to Accept You Need More RAM

Sometimes optimization is not enough. If the models you need consistently push Activity Monitor's memory pressure into yellow or red even after aggressive quantization, the hardware is the limit.

According to LLMCheck, 24 GB is the sweet spot for most local AI users in 2026. It comfortably runs 14B dense models and MoE models up to 26B, which covers the vast majority of use cases.

Sources