Can you fine-tune an LLM on a Mac?

Yes. MLX-LM supports LoRA and QLoRA fine-tuning natively on Apple Silicon via the mlx_lm.lora command. On a Mac with 32GB or more of unified memory you can fine-tune small-to-mid open models such as Qwen 4 4B, Phi-5 Mini, or Gemma 4.5 12B entirely on-device; once the base model is downloaded, no network connection is needed. The smallest models can be fine-tuned on 16GB Macs.

LoRA vs full fine-tuning — which on a Mac?

Use LoRA on a Mac. LoRA trains a small set of low-rank adapter weights instead of all of the model's parameters, which slashes memory use and training time while keeping the base weights frozen. Full fine-tuning updates every parameter and needs far more memory than a typical Mac offers. For a full fine-tune of a large model, rent a datacenter GPU instead.

How much RAM do I need to fine-tune?

For LoRA, a 4B model fine-tunes on 16GB, a 12B model wants 32GB, and larger models need 48GB or more. According to LLMCheck, 32GB of unified memory is the practical sweet spot for fine-tuning useful mid-size models with comfortable headroom. Reducing batch size and sequence length lowers the requirement if you are tight on memory.

Which model should I fine-tune locally?

Start with the smallest model that can plausibly do your task. Qwen 4 4B and Phi-5 Mini are excellent low-memory choices, while Gemma 4.5 12B is a strong mid-size option on a 32GB Mac. Smaller models train faster, fit more easily in memory, and are often all a focused, single-domain task requires.

Is MLX fine-tuning faster than the alternatives on Apple Silicon?

Yes. MLX is built specifically for Apple Silicon and uses Metal GPU acceleration with unified memory, so mlx_lm.lora is the fastest practical way to fine-tune on a Mac. Cross-platform stacks like PyTorch run on Apple Silicon but were not designed for it and are generally slower for this workload. For very large models or full fine-tunes, a rented datacenter GPU is faster still.

How to Fine-Tune a Local LLM on Mac with MLX (LoRA) — 2026 Guide

You do not need a cloud GPU to teach an open model your own data. With Apple's MLX-LM and LoRA, you can fine-tune a small-to-mid LLM entirely on Apple Silicon — prepare a dataset, train an adapter with mlx_lm.lora, fuse it into the weights, and run your custom model in Ollama. This guide walks the full pipeline on a Mac.

Why LoRA, and why MLX

Full fine-tuning updates every weight in a model, often billions of parameters, which demands far more memory than most Macs have. LoRA (Low-Rank Adaptation) takes a different approach: it freezes the base weights and trains a small set of low-rank "adapter" matrices alongside them. You end up training a tiny fraction of the parameters, which sharply reduces both the memory footprint and the training time. That trade is what makes fine-tuning feasible on a laptop.
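To make the "tiny fraction" concrete, here is a minimal sketch of the standard LoRA parameter count for a single weight matrix. The layer size and rank are hypothetical values chosen for illustration, not the settings of any particular model.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of a d_out x d_in weight matrix's parameter count that LoRA trains."""
    full_params = d_out * d_in            # the frozen base weight W
    lora_params = rank * (d_out + d_in)   # adapters: B is d_out x rank, A is rank x d_in
    return lora_params / full_params

# Hypothetical layer size and rank, for illustration only.
fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{fraction:.2%} of this layer's parameter count is trainable")
```

Under these assumed numbers the adapters add well under 1% of the layer's parameter count, and gradients and optimizer state are only needed for that small part.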

MLX-LM is Apple's package for running and fine-tuning LLMs, built on MLX, Apple's array framework for Apple Silicon. It ships a first-class LoRA (and QLoRA) trainer in the mlx_lm.lora command. Because MLX is built for Metal and unified memory, it is the fastest practical way to fine-tune on a Mac. For the framework background, see our MLX framework guide.

Step 1: Hardware Requirements

Fine-tuning is more memory-hungry than inference because gradients and optimizer state live in memory alongside the model. As a rule of thumb for LoRA on MLX:

Base Model	Min RAM (LoRA)	Comfortable RAM
Qwen 4 4B	16 GB	24 GB
Phi-5 Mini	16 GB	24 GB
Gemma 4.5 12B	32 GB	48 GB
20B+ models	48 GB	64 GB+

According to LLMCheck, 32 GB of unified memory is the sweet spot — enough to fine-tune a genuinely useful 12B model with headroom for a reasonable batch size. If you are not sure what your Mac can do, the LLMCheck hardware checker maps your exact chip and RAM to a recommended base model.
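As a rough cross-check on the table, the sketch below computes the memory taken by the model weights alone at 16-bit and 4-bit precision. It is a lower bound under simple assumptions: it ignores activations, gradients, adapter optimizer state, and whatever macOS itself needs, all of which sit on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Parameter counts are nominal model sizes, not exact figures.
for name, size_b in [("4B", 4), ("12B", 12), ("20B", 20)]:
    fp16 = weight_memory_gb(size_b, 16)
    q4 = weight_memory_gb(size_b, 4)
    print(f"{name}: {fp16:.0f} GB at 16-bit, {q4:.0f} GB at 4-bit")
```

The weights of a nominal 12B model at 16-bit already take about 24 GB, which is why a quantized base and a small batch size matter on a 32 GB machine.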

Too big for your Mac? For very large models or a full (non-LoRA) fine-tune, the practical path is to rent a datacenter GPU by the hour rather than fight your Mac's memory ceiling. We cover that option in "When to rent a GPU instead" below.

Step 2: Install MLX-LM

Work inside a virtual environment to keep dependencies clean, then install the package:

python3 -m venv ft-env && source ft-env/bin/activate
pip install mlx-lm

That single package pulls in MLX itself plus the LoRA trainer, the fuse utility, and the generation tools. Verify it imports:

python3 -c "import mlx_lm; print('mlx-lm ready')"

All of the commands below — mlx_lm.lora, mlx_lm.generate, mlx_lm.fuse — are installed by this one step.

Step 3: Prepare Your Dataset

MLX-LM reads a folder containing train.jsonl and valid.jsonl (a test.jsonl is optional). Each line is one training example. The simplest format is chat, where each line is a JSON object with a messages array:

{"messages": [{"role": "user", "content": "What is our refund window?"}, {"role": "assistant", "content": "Our refund window is 30 days from the delivery date."}]}
{"messages": [{"role": "user", "content": "Do you ship internationally?"}, {"role": "assistant", "content": "Yes — we ship to 40 countries with tracked delivery."}]}

MLX-LM also accepts a prompt/completion format if that fits your data better:

{"prompt": "Summarize: quarterly revenue rose 12%.", "completion": "Revenue grew 12% this quarter."}

Lay the files out like this:

data/
  train.jsonl   # the bulk of your examples
  valid.jsonl   # a small held-out set to watch for overfitting
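One way to produce that layout from a list of question/answer pairs is sketched below. The function name, the 10% validation share, and the fixed seed are illustrative choices, not anything MLX-LM requires.

```python
import json
import random
from pathlib import Path

def write_split(pairs, out_dir="data", valid_fraction=0.1, seed=0):
    """Shuffle (question, answer) pairs and write train.jsonl / valid.jsonl in chat format."""
    rows = [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(rows) * valid_fraction))
    splits = {"valid.jsonl": rows[:n_valid], "train.jsonl": rows[n_valid:]}

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, subset in splits.items():
        with open(out / filename, "w", encoding="utf-8") as handle:
            for row in subset:
                # One JSON object per line; keep non-ASCII text readable.
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(splits["train.jsonl"]), n_valid
```

Calling write_split(pairs) with a list such as [("What is our refund window?", "Our refund window is 30 days from the delivery date."), ...] writes both files into ./data and returns the number of training and validation rows.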

You do not need many rows. A few hundred high-quality, consistent examples often beat thousands of noisy ones for a focused single-domain task. Keep the valid.jsonl set genuinely separate so the validation loss means something.
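Before training, it can help to confirm that every line parses and that no validation example also appears in the training set. The helpers below are a small sketch using only the two formats shown above; the function names are hypothetical.

```python
import json

def example_format(line):
    """Return 'chat' or 'completion' for a valid JSONL line, else raise ValueError."""
    record = json.loads(line)  # malformed JSON raises a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for message in messages:
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                raise ValueError("each message needs 'role' and 'content'")
        return "chat"
    if {"prompt", "completion"} <= record.keys():
        return "completion"
    raise ValueError("expected 'messages' or 'prompt'/'completion' keys")

def leaked_examples(train_lines, valid_lines):
    """Return validation lines that also appear verbatim in the training lines."""
    seen = {line.strip() for line in train_lines if line.strip()}
    return [line.strip() for line in valid_lines if line.strip() in seen]
```

Pass each file's lines through example_format, then call leaked_examples(train_lines, valid_lines); an empty list means the two sets share no identical rows. Near-duplicates with different wording are not caught by this check.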

Tip: Match the assistant style you actually want. The model learns tone and format as much as facts — if every answer in your dataset is one tight sentence, your fine-tuned model will tend to answer in one tight sentence.

Step 4: Run the LoRA Fine-Tune

Point the trainer at a base model on Hugging Face (or a local path) and your data folder. This trains a LoRA adapter — the base weights stay frozen:

mlx_lm.lora \
  --model Qwen/Qwen4-4B-Instruct \
  --train \
  --data ./data \
  --iters 600 \
  --batch-size 4 \
  --num-layers 16 \
  --adapter-path ./adapters

What the key flags do:

--iters 600 — number of training iterations. Start in the 300–600 range and watch the validation loss; more is not always better.
--batch-size 4 — lower this to 2 or 1 if you hit memory pressure on a smaller Mac.
--num-layers 16 — how many layers get LoRA adapters. Fewer layers means less memory and faster training; more can capture harder tasks.
--adapter-path ./adapters — where the trained adapter weights are written.
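If you expect to try several settings, a small helper that assembles the command can keep runs consistent. This sketch uses only the flags described above; the lighter values and the ./adapters-light path are hypothetical examples for a smaller Mac.

```python
import shlex

def lora_command(model, data_dir, adapter_dir, iters=600, batch_size=4, num_layers=16):
    """Build the mlx_lm.lora argument list from the flags described above."""
    return [
        "mlx_lm.lora",
        "--model", model,
        "--train",
        "--data", data_dir,
        "--iters", str(iters),
        "--batch-size", str(batch_size),
        "--num-layers", str(num_layers),
        "--adapter-path", adapter_dir,
    ]

# Two variants: the guide's settings and a lighter, illustrative run.
default_run = lora_command("Qwen/Qwen4-4B-Instruct", "./data", "./adapters")
light_run = lora_command("Qwen/Qwen4-4B-Instruct", "./data", "./adapters-light",
                         batch_size=1, num_layers=8)
for argv in (default_run, light_run):
    print(shlex.join(argv))  # printable form you can paste into a terminal
```

To launch a run from Python instead of printing it, pass the list to subprocess.run(argv, check=True). Writing each variant to its own adapter path keeps one run from overwriting another.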

You will see training and validation loss print as it runs. A steadily falling validation loss is healthy; if it bottoms out and starts climbing, you are beginning to overfit — stop or reduce iterations. According to LLMCheck testing, a few-hundred-example LoRA run on a 4B model finishes in minutes on an M3-class Mac.
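The "bottoms out and starts climbing" rule can be written down as a simple patience check. The sketch below assumes you have copied (iteration, validation loss) pairs out of the training log by hand; the numbers are made up for illustration and are not measurements.

```python
def best_checkpoint(val_losses, patience=2):
    """Return (best_iter, stop_iter) from a list of (iteration, validation_loss).

    stop_iter is the first iteration at which the loss has failed to improve
    `patience` evaluations in a row, or None if that never happens.
    """
    best_iter, best_loss = None, float("inf")
    bad_evals = 0
    for iteration, loss in val_losses:
        if loss < best_loss:
            best_iter, best_loss = iteration, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_iter, iteration
    return best_iter, None

# Made-up loss values, for illustration only.
history = [(100, 2.10), (200, 1.62), (300, 1.41), (400, 1.45), (500, 1.52), (600, 1.60)]
print(best_checkpoint(history))  # (300, 500)
```

Here the loss is lowest at iteration 300 and has risen twice in a row by iteration 500, which suggests rerunning with --iters closer to 300.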

Low on memory? Add --grad-checkpoint to trade some speed for a smaller memory footprint, and shrink --batch-size. For 4-bit base weights (QLoRA-style), load a quantized base model — MLX trains the adapter on top of it.

Step 5: Test & Fuse the Adapter

Before committing, test the adapter on top of the base model. mlx_lm.generate accepts the adapter path directly:

mlx_lm.generate \
  --model Qwen/Qwen4-4B-Instruct \
  --adapter-path ./adapters \
  --prompt "What is our refund window?"
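For more than a handful of prompts, a small spot-check loop is easier than reading answers one by one. The sketch below takes any callable that maps a prompt to an answer; fake_model is a stand-in so the example runs anywhere, and you would replace it with a function that calls your adapter-loaded model.

```python
def spot_check(generate_fn, cases):
    """Run each prompt through generate_fn and check that expected phrases appear.

    generate_fn: any callable mapping a prompt string to an answer string.
    cases: list of (prompt, [phrases that should appear in the answer]).
    Returns a list of (prompt, missing_phrases) for the cases that failed.
    """
    failures = []
    for prompt, expected_phrases in cases:
        answer = generate_fn(prompt).lower()
        missing = [p for p in expected_phrases if p.lower() not in answer]
        if missing:
            failures.append((prompt, missing))
    return failures

# Stand-in for the real model (hypothetical); it only knows one answer.
def fake_model(prompt):
    canned = {"What is our refund window?": "Our refund window is 30 days from the delivery date."}
    return canned.get(prompt, "I am not sure.")

cases = [
    ("What is our refund window?", ["30 days"]),
    ("Do you ship internationally?", ["40 countries"]),
]
print(spot_check(fake_model, cases))
```

Substring matching is a crude test: it catches missing facts but says nothing about tone or format, so it complements reading a few answers rather than replacing it.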

If the answers reflect your training data, you are ready to fuse — merge the adapter into the base weights to produce one standalone model with no separate adapter to carry around:

mlx_lm.fuse \
  --model Qwen/Qwen4-4B-Instruct \
  --adapter-path ./adapters \
  --save-path ./qwen4-4b-custom

The fused model in ./qwen4-4b-custom behaves exactly like the adapter-plus-base combination, but loads as a single model. This is what you will convert for Ollama next.

Step 6: Convert to GGUF & Run in Ollama

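Ollama runs GGUF models, so convert your fused MLX model to GGUF. The llama.cpp converter writes a 16-bit GGUF file (it does not emit q4_k_m directly), and llama.cpp's llama-quantize tool then compresses that file to 4-bit. If you trained on a 4-bit base, the fused weights are still quantized and must be dequantized before conversion; check mlx_lm.fuse --help for the option.

# From a llama.cpp checkout: export a 16-bit GGUF first
python convert_hf_to_gguf.py ./qwen4-4b-custom \
  --outfile qwen4-4b-custom-f16.gguf \
  --outtype f16

# Then quantize to Q4_K_M (the llama-quantize path depends on your build)
llama-quantize qwen4-4b-custom-f16.gguf qwen4-4b-custom.gguf Q4_K_M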

Now write a tiny Modelfile that points Ollama at the GGUF file:

# Modelfile
FROM ./qwen4-4b-custom.gguf
PARAMETER temperature 0.7

ollama create qwen4-custom -f Modelfile
ollama run qwen4-custom

That's the whole loop, entirely on your Mac: your data never left the machine, and you now have a custom model you can use anywhere Ollama runs. New to Ollama? Our Ollama setup guide covers the basics.
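Once the model is registered, other programs can reach it through Ollama's local HTTP API. The sketch below assumes Ollama is running on its default port (11434) and that the model is named qwen4-custom as created above; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address

def build_request(prompt, model="qwen4-custom"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="qwen4-custom"):
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With the Ollama server running, print(ask("What is our refund window?")) returns the model's answer as a string. The request goes to localhost only, so the prompt stays on the machine.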

When to rent a GPU instead

LoRA on MLX is excellent for small-to-mid models, but it has a ceiling. If you want to run a full fine-tune, train a very large model (30B+), or run many experiments quickly, a Mac's unified memory and single-machine throughput become the bottleneck. In that case, renting a datacenter GPU by the hour is the pragmatic move: you get far more VRAM and compute, and you only pay for the run. Services like Vast.ai let you spin up a GPU instance for a few dollars an hour for exactly these jobs.

Disclosure: the Vast.ai link is a referral — if you sign up through it, LLMCheck may earn a small credit at no extra cost to you. We only mention it because GPU rental is the genuine answer when a job outgrows your Mac.

For most people, though, a focused LoRA on a model that already fits your Mac is all you need — and it keeps everything private and free. To pick a base model that fits your hardware, browse the LLMCheck leaderboard.
