Why LoRA, and why MLX

Full fine-tuning updates every weight in a model, often billions of parameters, and the gradients and optimizer state that go with them demand far more memory than most Macs have. LoRA (Low-Rank Adaptation) takes a different approach: it freezes the base weights and trains a small set of low-rank "adapter" matrices alongside them. You end up training a tiny fraction of the parameters, which sharply cuts both the memory footprint and the training time. That trade is what makes fine-tuning feasible on a laptop.
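To make "a tiny fraction" concrete, here is a minimal arithmetic sketch. It assumes the standard LoRA setup, where a frozen weight matrix of shape d_out x d_in gets two trainable matrices of shapes d_out x rank and rank x d_in. The 4096 x 4096 layer and the rank of 8 are illustrative values, not settings taken from any particular model.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of one weight matrix's parameter count that LoRA trains."""
    full = d_out * d_in               # parameters in the frozen weight W
    adapter = rank * (d_in + d_out)   # parameters in A (rank x d_in) plus B (d_out x rank)
    return adapter / full

# Hypothetical 4096 x 4096 projection with rank-8 adapters.
fraction = lora_trainable_fraction(4096, 4096, 8)
print(f"{fraction:.4%} of this layer's parameters are trainable")  # 0.3906%
```

Only layers that receive adapters contribute trainable parameters at all, so the share across the whole model is usually smaller still.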

MLX is Apple's machine-learning framework for Apple Silicon, and MLX-LM is the companion package for running and fine-tuning language models on top of it. MLX-LM ships a first-class LoRA (and QLoRA) trainer in the mlx_lm.lora command. Because MLX is built for Metal and unified memory, it is the fastest practical way to fine-tune on a Mac. For the framework background, see our MLX framework guide.

Step 1: Hardware Requirements

Fine-tuning is more memory-hungry than inference because gradients and optimizer state live in memory alongside the model. As a rule of thumb for LoRA on MLX:

Base Model       Min RAM (LoRA)   Comfortable
Qwen 4 4B        16 GB            24 GB
Phi-5 Mini       16 GB            24 GB
Gemma 4.5 12B    32 GB            48 GB
20B+ models      48 GB            64 GB+
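The table can also be read as a simple lookup. The sketch below copies its rows into a list and reports which model classes a given amount of unified memory covers. The function name and structure are illustrative only; this is not the LLMCheck hardware checker, and it knows nothing about your chip.

```python
# Rule-of-thumb rows copied from the table: (model class, minimum GB, comfortable GB).
LORA_RAM_TABLE = [
    ("Qwen 4 4B", 16, 24),
    ("Phi-5 Mini", 16, 24),
    ("Gemma 4.5 12B", 32, 48),
    ("20B+ models", 48, 64),
]

def models_that_fit(ram_gb: int) -> list[tuple[str, str]]:
    """Return (model, 'comfortable' or 'minimum') for every row the RAM satisfies."""
    fits = []
    for name, minimum, comfortable in LORA_RAM_TABLE:
        if ram_gb >= comfortable:
            fits.append((name, "comfortable"))
        elif ram_gb >= minimum:
            fits.append((name, "minimum"))
    return fits

print(models_that_fit(32))
```

With 32 GB, the two small models land in the comfortable column and the 12B model meets its minimum.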

According to LLMCheck, 32 GB of unified memory is the sweet spot — enough to fine-tune a genuinely useful 12B model with headroom for a reasonable batch size. If you are not sure what your Mac can do, the LLMCheck hardware checker maps your exact chip and RAM to a recommended base model.

Too big for your Mac? For very large models or a full (non-LoRA) fine-tune, the practical path is to rent a datacenter GPU by the hour rather than fight your Mac's memory ceiling. We cover that option in "When to rent a GPU instead" at the end of this guide.

Step 2: Install MLX-LM

Work inside a virtual environment to keep dependencies clean, then install the package:

python3 -m venv ft-env && source ft-env/bin/activate
pip install mlx-lm

That single package pulls in MLX itself plus the LoRA trainer, the fuse utility, and the generation tools. Verify it imports:

python3 -c "import mlx_lm; print('mlx-lm ready')"

All of the commands below — mlx_lm.lora, mlx_lm.generate, mlx_lm.fuse — are installed by this one step.

Step 3: Prepare Your Dataset

MLX-LM reads a folder containing train.jsonl and valid.jsonl (a test.jsonl is optional). Each line is one training example. The simplest format is chat, where each line is a JSON object with a messages array:

{"messages": [{"role": "user", "content": "What is our refund window?"}, {"role": "assistant", "content": "Our refund window is 30 days from the delivery date."}]}
{"messages": [{"role": "user", "content": "Do you ship internationally?"}, {"role": "assistant", "content": "Yes — we ship to 40 countries with tracked delivery."}]}

MLX-LM also accepts a prompt/completion format if that fits your data better:

{"prompt": "Summarize: quarterly revenue rose 12%.", "completion": "Revenue grew 12% this quarter."}
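Because a single malformed line can derail a run, it may be worth checking each line before training. The following is a minimal sketch of such a check for the two formats shown above. The function name check_example is hypothetical, and MLX-LM's own loader may apply different or stricter rules.

```python
import json

def check_example(line: str) -> str:
    """Return 'chat' or 'completion' for a valid line; raise ValueError otherwise."""
    record = json.loads(line)  # bad JSON raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for message in messages:
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                raise ValueError("each message needs 'role' and 'content'")
        return "chat"
    if {"prompt", "completion"} <= record.keys():
        return "completion"
    raise ValueError("expected 'messages' or 'prompt'/'completion' keys")

line = '{"prompt": "Summarize: quarterly revenue rose 12%.", "completion": "Revenue grew 12% this quarter."}'
print(check_example(line))  # completion
```

Running it over every line of a file and reporting the line number of the first failure is usually enough to catch copy-paste damage.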

Lay the files out like this:

data/
  train.jsonl   # the bulk of your examples
  valid.jsonl   # a small held-out set to watch for overfitting

You do not need many rows. A few hundred high-quality, consistent examples often beat thousands of noisy ones for a focused single-domain task. Keep the valid.jsonl set genuinely separate so the validation loss means something.
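If your examples start life as question-and-answer pairs, a short script can produce both files. This sketch assumes the chat format from earlier in this step and a 10% held-out share; the function name write_split, the fraction, and the seed are arbitrary choices for illustration.

```python
import json
import random
from pathlib import Path

def write_split(pairs, out_dir="data", valid_fraction=0.1, seed=0):
    """Shuffle (question, answer) pairs and write train.jsonl and valid.jsonl."""
    rows = [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]
    random.Random(seed).shuffle(rows)                  # fixed seed gives a repeatable split
    n_valid = max(1, int(len(rows) * valid_fraction))  # keep at least one held-out row
    splits = {"valid.jsonl": rows[:n_valid], "train.jsonl": rows[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, subset in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in subset:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows) - n_valid, n_valid  # (train count, valid count)
```

Calling write_split(pairs) with a list of (question, answer) tuples creates the data/ folder described above. The sketch does not remove near-duplicates, so paraphrases of the same question can still land on both sides of the split; deduplicate first if that matters for your data.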

Tip: Match the assistant style you actually want. The model learns tone and format as much as facts — if every answer in your dataset is one tight sentence, your fine-tuned model will tend to answer in one tight sentence.

Step 4: Run the LoRA Fine-Tune

Point the trainer at a base model on Hugging Face (or a local path) and your data folder. This trains a LoRA adapter — the base weights stay frozen:

mlx_lm.lora \
  --model Qwen/Qwen4-4B-Instruct \
  --train \
  --data ./data \
  --iters 600 \
  --batch-size 4 \
  --num-layers 16 \
  --adapter-path ./adapters

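What the key flags do:

--model: the base model, given as a Hugging Face repo ID or a local path.
--train: run training.
--data: the folder holding train.jsonl and valid.jsonl.
--iters: the number of training iterations.
--batch-size: how many examples are processed per iteration; lower it to save memory.
--num-layers: how many of the model's layers are fine-tuned with LoRA.
--adapter-path: where the trained adapter weights are saved.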

You will see training and validation loss print as it runs. A steadily falling validation loss is healthy; if it bottoms out and starts climbing, you are beginning to overfit — stop or reduce iterations. According to LLMCheck testing, a few-hundred-example LoRA run on a 4B model finishes in minutes on an M3-class Mac.
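The "bottoms out and starts climbing" rule can be written down as a small check. This sketch assumes you have copied the reported validation losses into a list by hand; the loss values are made up for illustration, and the patience threshold is an arbitrary choice rather than anything MLX-LM defines.

```python
def best_checkpoint(val_losses, patience=2):
    """val_losses: list of (iteration, validation_loss) in the order reported.

    Returns (best_iteration, overfitting). overfitting is True when the most
    recent `patience` or more reports all failed to beat the lowest loss seen.
    """
    best_iter, best_loss = val_losses[0]
    since_best = 0
    for iteration, loss in val_losses[1:]:
        if loss < best_loss:
            best_iter, best_loss, since_best = iteration, loss, 0
        else:
            since_best += 1
    return best_iter, since_best >= patience

# Hypothetical log: the loss falls until iteration 600, then climbs.
history = [(200, 1.90), (400, 1.62), (600, 1.55), (800, 1.61), (1000, 1.70)]
print(best_checkpoint(history))  # (600, True)
```

In this made-up run, iteration 600 is the point to stop at or roll back to.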

Low on memory? Add --grad-checkpoint to trade some speed for a smaller memory footprint, and shrink --batch-size. For 4-bit base weights (QLoRA-style), load a quantized base model — MLX trains the adapter on top of it.
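A rough calculation shows why a quantized base helps. The sketch below counts only the memory for the base weights; it ignores the small per-group overhead that quantized formats add, as well as gradients, optimizer state, and activations, so treat the results as a lower bound rather than a prediction.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the base weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

params = 4e9  # a 4B-parameter model
print(weight_memory_gb(params, 16))  # 8.0 (16-bit weights)
print(weight_memory_gb(params, 4))   # 2.0 (4-bit quantized weights)
```

Dropping from 16-bit to 4-bit weights cuts this part of the footprint to a quarter, which is the headroom QLoRA-style training relies on.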

Step 5: Test & Fuse the Adapter

Before committing, test the adapter on top of the base model. mlx_lm.generate accepts the adapter path directly:

mlx_lm.generate \
  --model Qwen/Qwen4-4B-Instruct \
  --adapter-path ./adapters \
  --prompt "What is our refund window?"
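One prompt is a thin test. A slightly more systematic approach is to run a handful of held-out questions and look for a phrase you expect in each answer. The sketch below is model-agnostic: generate_fn is a hypothetical callable that you would wire to your model yourself, and fake_model is a stand-in so the example runs without one.

```python
def spot_check(generate_fn, cases):
    """Run each prompt through generate_fn and look for an expected phrase.

    generate_fn: any callable mapping a prompt string to answer text (hypothetical).
    cases: list of (prompt, expected_substring) pairs.
    Returns the prompts whose answer lacked the expected phrase.
    """
    failures = []
    for prompt, expected in cases:
        answer = generate_fn(prompt)
        if expected.lower() not in answer.lower():
            failures.append(prompt)
    return failures

# Stand-in for the real model so the sketch runs anywhere.
def fake_model(prompt):
    canned = {"What is our refund window?": "Our refund window is 30 days from the delivery date."}
    return canned.get(prompt, "I'm not sure.")

cases = [
    ("What is our refund window?", "30 days"),
    ("Do you ship internationally?", "40 countries"),
]
print(spot_check(fake_model, cases))  # ['Do you ship internationally?']
```

Substring matching is crude, but it is often enough to tell whether the adapter has picked up the facts and phrasing in your training data.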

If the answers reflect your training data, you are ready to fuse — merge the adapter into the base weights to produce one standalone model with no separate adapter to carry around:

mlx_lm.fuse \
  --model Qwen/Qwen4-4B-Instruct \
  --adapter-path ./adapters \
  --save-path ./qwen4-4b-custom

The fused model in ./qwen4-4b-custom behaves exactly like the adapter-plus-base combination, but loads as a single model. This is what you will convert for Ollama next.
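The reason fusing preserves behavior is plain linear algebra: adding the adapter's low-rank product into the weight gives the same output as applying the adapter separately. Here is a toy sketch using small integer matrices, assuming the usual LoRA formulation W + scale * B * A. MLX's internal layout and scaling convention may differ in detail, and with real floating-point weights the two paths agree only up to rounding.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scaled(m, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in m]

W = [[1, 2], [3, 4]]   # frozen base weight (2 x 2)
B = [[1], [2]]         # adapter matrix (2 x 1)
A = [[3, -1]]          # adapter matrix (1 x 2), so the rank is 1
scale = 2              # illustrative LoRA scaling factor
x = [[5], [7]]         # input column vector

# Adapter kept separate: base output plus the scaled low-rank correction.
adapter_out = add(matmul(W, x), scaled(matmul(B, matmul(A, x)), scale))

# Adapter fused: fold the correction into the weight once, then apply it.
W_fused = add(W, scaled(matmul(B, A), scale))
fused_out = matmul(W_fused, x)

print(adapter_out, fused_out)  # [[35], [75]] [[35], [75]]
```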

Step 6: Convert to GGUF & Run in Ollama

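Ollama runs GGUF models, so convert your fused MLX model to GGUF. This takes two llama.cpp tools: the converter writes a 16-bit GGUF, and llama-quantize then compresses it to 4-bit Q4_K_M (the converter itself does not produce K-quant types):

# From a llama.cpp checkout
python convert_hf_to_gguf.py ./qwen4-4b-custom \
  --outfile qwen4-4b-custom-f16.gguf \
  --outtype f16
# llama-quantize is built with llama.cpp; adjust the path to your build directory
./llama-quantize qwen4-4b-custom-f16.gguf qwen4-4b-custom.gguf Q4_K_M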

Now write a tiny Modelfile that points Ollama at the GGUF file:

# Modelfile
FROM ./qwen4-4b-custom.gguf
PARAMETER temperature 0.7

Register it with Ollama and run it like any other model:

ollama create qwen4-custom -f Modelfile
ollama run qwen4-custom
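Once the model is registered, you can also call it from code through Ollama's local HTTP API. The sketch below assumes the default server address (http://localhost:11434) and the model name created above; the helper names build_generate_request and ask are my own, not part of Ollama.

```python
import json
import urllib.request

def build_generate_request(prompt, model="qwen4-custom", host="http://localhost:11434"):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

# Example, with the Ollama server running:
# print(ask("What is our refund window?"))
```

Setting "stream" to False asks for the whole answer in one JSON object, which keeps the client code short at the cost of waiting for generation to finish.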

That's the whole loop, entirely on your Mac: your data never left the machine, and you now have a custom model you can use anywhere Ollama runs. New to Ollama? Our Ollama setup guide covers the basics.

When to rent a GPU instead

LoRA on MLX is excellent for small-to-mid models, but it has a ceiling. If you want to run a full fine-tune, train a very large model (30B+), or run many experiments quickly, a Mac's unified memory and single-machine throughput become the bottleneck. In that case, renting a datacenter GPU by the hour is the pragmatic move: you get far more VRAM and compute, and you only pay for the run. Services like Vast.ai let you spin up a GPU instance for a few dollars an hour for exactly these jobs.

Disclosure: the Vast.ai link is a referral — if you sign up through it, LLMCheck may earn a small credit at no extra cost to you. We only mention it because GPU rental is the genuine answer when a job outgrows your Mac.

For most people, though, a focused LoRA on a model that already fits your Mac is all you need — and it keeps everything private and free. To pick a base model that fits your hardware, browse the LLMCheck leaderboard.