How RAG Works

Retrieval-Augmented Generation (RAG) adds knowledge to your LLM without changing its weights. The process:

  1. Index — convert your documents into vector embeddings and store them in a vector database (like ChromaDB)
  2. Retrieve — when a user asks a question, embed the query and find the most similar documents
  3. Generate — inject the retrieved documents into the prompt and let the LLM generate an answer with that context

The model itself never changes. You are simply giving it better information to work with at inference time. For a complete setup walkthrough, see our RAG setup guide.
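The three steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `embed` function is a keyword-counting stand-in for a real embedding model, and the documents and query are invented for the example.

```python
import math

# Toy stand-in for a real embedding model (in practice you would call
# an embedding API or library; here we just count a few keywords).
VOCAB = ["refund", "shipping", "warranty", "battery"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: embed each document once and store the vectors.
docs = [
    "Refund requests are handled within 14 days.",
    "Standard shipping takes 3-5 business days.",
    "The warranty covers battery defects for two years.",
]
index = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve: embed the query and rank documents by similarity.
query = "How long does the battery warranty last?"
qvec = embed(query)
ranked = sorted(index, key=lambda pair: cosine(qvec, pair[1]), reverse=True)
context = ranked[0][0]

# 3. Generate: inject the retrieved text into the prompt for the LLM.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
```

A real setup replaces `embed` with a proper embedding model and the list with a vector database such as ChromaDB, but the index/retrieve/generate shape stays the same.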

How Fine-Tuning Works on Mac

Fine-tuning modifies the model's actual weights by training it on your data. On Apple Silicon, this is done through LoRA (Low-Rank Adaptation) using the MLX framework — Apple's machine learning library optimized for Unified Memory.

# Install MLX fine-tuning tools
pip install mlx-lm

# Prepare training data (JSONL format)
# Each line: {"text": "your training example"}

# Run LoRA fine-tuning
mlx_lm.lora \
  --model mlx-community/Qwen2.5-7B-4bit \
  --data ./training_data \
  --train \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16
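The JSONL training data mentioned above is easy to produce with the standard library. A minimal sketch, assuming your raw examples live in a Python list; the directory name matches the `--data ./training_data` flag above, and the example texts are invented (mlx-lm may also expect a `valid.jsonl` alongside `train.jsonl` for evaluation).

```python
import json
from pathlib import Path

# Illustrative training examples; replace with your own data.
examples = [
    "Q: What is our refund window? A: 14 days from delivery.",
    "Q: Do we ship internationally? A: Yes, to 40+ countries.",
]

# One JSON object per line, each with a "text" field, as described above.
out_dir = Path("training_data")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "train.jsonl", "w") as f:
    for text in examples:
        f.write(json.dumps({"text": text}) + "\n")
```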

LoRA does not retrain the entire model. It adds small trainable adapter layers (typically 0.1-1% of total parameters) that modify the model's behavior. This makes fine-tuning feasible on consumer hardware — you do not need server-grade GPUs.
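The adapter idea can be shown numerically. A toy sketch in pure Python, with invented dimensions: the layer output becomes Wx + B(Ax), where W is frozen and only the small A and B matrices train. B starts at zero, so the adapter is initially a no-op.

```python
import random

random.seed(0)

d, r = 64, 4  # toy hidden size and LoRA rank

def matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = matrix(d, d)                    # frozen base weight; never updated
A = matrix(r, d)                    # trainable down-projection (r x d)
B = [[0.0] * r for _ in range(d)]   # trainable up-projection, zero-initialized

x = [1.0] * d
# LoRA forward pass: base output plus the low-rank correction B(Ax).
base = matvec(W, x)
delta = matvec(B, matvec(A, x))
h = [b0 + b1 for b0, b1 in zip(base, delta)]

# Adapter overhead: r*(d_in + d_out) trainable parameters versus
# d_in*d_out frozen ones. At a more realistic scale (a 4096-wide
# layer with rank 8) the ratio is about 0.4% for that layer, in the
# same ballpark as the 0.1-1% figure above.
lora_params = r * d + d * r
full_params = d * d
realistic = (8 * 4096 + 4096 * 8) / (4096 * 4096)
```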

Key difference: RAG changes what the model knows at query time. Fine-tuning changes how the model behaves permanently. This is the fundamental distinction that should guide your decision.

The 8-Dimension Comparison

According to LLMCheck testing on Apple Silicon, here is how RAG and fine-tuning compare across every dimension that matters:[LLMCheck]

| Dimension | RAG | Fine-Tuning (LoRA) |
| --- | --- | --- |
| Setup time | 30 minutes | 2-6 hours (including training) |
| Additional RAM | ~2 GB (embeddings + vector DB) | 2x model size during training |
| Knowledge quality | Exact — retrieves source text | Approximate — encoded in weights |
| Data freshness | Instant — add new docs anytime | Requires retraining |
| Cost | Free (runs on existing hardware) | Free but time-intensive |
| ML expertise needed | None — just Python basics | Moderate — data prep, hyperparameters |
| Data size supported | Thousands of documents | Hundreds of examples ideal |
| Style/tone consistency | Follows base model style | Learns your specific style |

When RAG Wins

RAG is the better choice when:

  - Your knowledge changes frequently and you need new documents available immediately
  - You have a large document collection (thousands of documents)
  - You need exact answers with source attribution
  - You want minimal setup and have no ML experience

When Fine-Tuning Wins

Fine-tuning is the better choice when:

  - You need a consistent style, tone, or output format the base model does not produce
  - Your knowledge is stable and does not change often
  - You have hundreds of high-quality examples of the desired behavior
  - You are comfortable with data preparation and basic hyperparameter tuning

The Hybrid Approach: Fine-Tune + RAG

The most powerful setup combines both techniques. According to LLMCheck testing, the hybrid approach outperforms either method alone:[LLMCheck]

  1. Fine-tune the base model on your domain's style, terminology, and output format using LoRA via MLX
  2. Use RAG to inject specific knowledge at query time from your document collection

The fine-tuned model understands your domain's language better, which improves how it interprets and uses the RAG-retrieved context. For example:

# Hybrid workflow on Mac:
# 1. Fine-tune with MLX (one-time, ~2 hours)
mlx_lm.lora --model mlx-community/Qwen2.5-7B-4bit \
  --data ./style_examples --train --iters 500

# 2. Convert and load into Ollama
mlx_lm.fuse --model mlx-community/Qwen2.5-7B-4bit \
  --adapter-path ./adapters --save-path ./fused_model

# 3. Create Ollama model from fused weights
ollama create my-custom-model -f Modelfile

# 4. Use with RAG pipeline (ongoing)
# Your RAG code now uses "my-custom-model" instead of the base model
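Step 4 is just a model-name swap in the RAG code. A minimal sketch, assuming the pipeline talks to Ollama's /api/generate endpoint; the helper only builds the request payload (actually sending it requires a running Ollama server with the custom model created above).

```python
def build_rag_request(query: str, retrieved_docs: list[str],
                      model: str = "my-custom-model") -> dict:
    """Assemble an Ollama /api/generate payload that injects
    retrieved documents into the prompt as context."""
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return {"model": model, "prompt": prompt, "stream": False}

# The only change from a plain RAG pipeline is the model name.
payload = build_rag_request(
    "What is the refund window?",
    ["Refund requests are handled within 14 days."],
)
```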

RAM Requirements

Here is what each approach costs in terms of unified memory on your Mac:

| Configuration | Training RAM | Inference RAM | Min Mac |
| --- | --- | --- | --- |
| RAG only (7B model) | N/A | ~6 GB (LLM) + ~2 GB (RAG) | 16 GB |
| RAG only (32B model) | N/A | ~20 GB (LLM) + ~2 GB (RAG) | 32 GB |
| Fine-tune (7B LoRA) | ~8 GB | ~4.5 GB (same as base) | 16 GB |
| Fine-tune (13B LoRA) | ~16 GB | ~8 GB (same as base) | 24 GB |
| Hybrid (7B fine-tuned + RAG) | ~8 GB (training) | ~6.5 GB (inference) | 16 GB |

Important: Fine-tuning RAM is only needed during the training phase. Once training is complete, the fine-tuned model uses the same RAM as the base model during inference. According to LLMCheck, RAG adds a consistent ~2 GB overhead regardless of model size.
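Those two rules of thumb (training needs roughly 2x the model size, RAG adds a flat ~2 GB) can be turned into a quick estimator. A sketch using the article's heuristics; the constants are rough rules of thumb, not measurements, and a 4-bit 7B model is assumed to occupy about 4.5 GB.

```python
def estimate_ram_gb(model_size_gb: float, training: bool = False,
                    rag: bool = False) -> float:
    """Rough unified-memory estimate from the rules of thumb above:
    training needs ~2x the model size; RAG adds ~2 GB regardless."""
    ram = model_size_gb * 2 if training else model_size_gb
    if rag:
        ram += 2.0
    return ram

estimate_ram_gb(4.5, training=True)  # ~9 GB during fine-tuning
estimate_ram_gb(4.5, rag=True)       # ~6.5 GB for hybrid inference
```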

Verdict by Use Case

Company Knowledge Base / Internal Wiki

Use RAG. Your documents change frequently, you have many of them, and you need source attribution. RAG handles this perfectly with minimal setup.

Customer Support Bot with Brand Voice

Use hybrid. Fine-tune on 200-500 examples of your ideal support responses for tone and format, then use RAG to retrieve relevant help docs and policies at query time.

Personal Research Assistant

Use RAG. You are constantly adding new papers and notes. RAG lets you index them immediately without any training step.

Legal Document Drafting

Use hybrid. Fine-tune on your firm's document templates and legal writing style. Use RAG to retrieve relevant case law and precedents during drafting.

Code Generation in Your Framework

Use fine-tuning. Train on examples of your codebase's patterns, conventions, and APIs. The model needs to internalize your coding style, not just reference docs.

Structured Data Extraction

Use fine-tuning. Train the model to output data in your exact JSON schema consistently. RAG cannot teach output formatting — that requires weight changes.