How RAG Works
Retrieval-Augmented Generation (RAG) adds knowledge to your LLM without changing its weights. The process has three steps:
- Index — convert your documents into vector embeddings and store them in a vector database (like ChromaDB)
- Retrieve — when a user asks a question, embed the query and find the most similar documents
- Generate — inject the retrieved documents into the prompt and let the LLM generate an answer with that context
The model itself never changes. You are simply giving it better information to work with at inference time. For a complete setup walkthrough, see our RAG setup guide.
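The retrieve step can be made concrete with a minimal sketch in plain Python. It stands in a toy character-frequency "embedding" and brute-force cosine similarity for a real embedding model and a vector database such as ChromaDB; the documents and query are invented for illustration:

```python
import math

def embed(text):
    # Toy "embedding": a 26-dimensional character-frequency vector.
    # A real pipeline would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank every document by similarity to the query; a vector DB
    # does the same thing, just with an index instead of brute force.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Refunds are processed to the original payment method.",
]
context = retrieve("how do refunds work", docs)
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\n\nQuestion: how do refunds work")
```

The generate step is then just sending `prompt` to the model; the retrieval logic never touches the model's weights.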
How Fine-Tuning Works on Mac
Fine-tuning modifies the model's actual weights by training it on your data. On Apple Silicon, the standard approach is LoRA (Low-Rank Adaptation) via the MLX framework, Apple's machine learning library optimized for unified memory.
# Install MLX fine-tuning tools
pip install mlx-lm
# Prepare training data (JSONL format)
# Each line: {"text": "your training example"}
# Run LoRA fine-tuning
mlx_lm.lora \
  --model mlx-community/Qwen2.5-7B-4bit \
  --data ./training_data \
  --train \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16
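The data-prep step above can be scripted in a few lines. A sketch, assuming the one-JSON-object-per-line layout with a "text" field shown in the comments (the example pairs and filename are placeholders; mlx-lm typically expects a train.jsonl, and a valid.jsonl, inside the --data directory):

```python
import json

# Hypothetical training pairs; each JSONL line is one JSON object
# with a single "text" field, the format the LoRA trainer reads.
examples = [
    {"text": "Q: What is our refund window?\nA: Returns are accepted within 30 days."},
    {"text": "Q: Do we ship internationally?\nA: Yes, to over 40 countries."},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```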
LoRA does not retrain the entire model. It adds small trainable adapter layers (typically 0.1-1% of total parameters) that modify the model's behavior. This makes fine-tuning feasible on consumer hardware — you do not need server-grade GPUs.
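The parameter savings are easy to verify with back-of-the-envelope arithmetic. For one d x d weight matrix W, LoRA freezes W and learns a low-rank update BA, where A is r x d and B is d x r (the hidden size and rank below are illustrative, not tied to any specific model):

```python
d = 4096  # hidden size of one projection matrix (illustrative)
r = 8     # LoRA rank; typical values run from 4 to 64

full_params = d * d      # parameters if you trained W directly
lora_params = 2 * d * r  # the A and B adapter matrices combined

print(full_params)       # 16777216
print(lora_params)       # 65536
print(f"{lora_params / full_params:.2%}")  # 0.39% of the layer
```

Only those adapter parameters receive gradients, which is why training fits in consumer-level unified memory.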
Key difference: RAG changes what the model knows at query time. Fine-tuning changes how the model behaves permanently. This is the fundamental distinction that should guide your decision.
The 8-Dimension Comparison
According to LLMCheck testing on Apple Silicon, here is how RAG and fine-tuning compare across every dimension that matters:[LLMCheck]
| Dimension | RAG | Fine-Tuning (LoRA) |
|---|---|---|
| Setup time | 30 minutes | 2-6 hours (including training) |
| Additional RAM | ~2 GB (embeddings + vector DB) | 2x model size during training |
| Knowledge quality | Exact — retrieves source text | Approximate — encoded in weights |
| Data freshness | Instant — add new docs anytime | Requires retraining |
| Cost | Free (runs on existing hardware) | Free but time-intensive |
| ML expertise needed | None — just Python basics | Moderate — data prep, hyperparameters |
| Data size supported | Thousands of documents | Hundreds of examples ideal |
| Style/tone consistency | Follows base model style | Learns your specific style |
When RAG Wins
RAG is the better choice when:
- Your data changes frequently — add or update documents without retraining. A company wiki that is updated weekly is a perfect RAG use case.
- You have a large knowledge base — RAG handles thousands of documents. Fine-tuning struggles to absorb large volumes of factual content.
- You need source attribution — RAG retrieves specific document chunks, so you can trace every answer back to its source. Fine-tuning provides no attribution.
- You need quick setup — a working RAG pipeline takes 30 minutes. Fine-tuning requires data preparation, training, and evaluation.
- Factual accuracy is critical — RAG gives the model the exact text to reference. Fine-tuned models can hallucinate "learned" facts that are slightly wrong.
When Fine-Tuning Wins
Fine-tuning is the better choice when:
- You need a consistent voice — train the model to write in your company's tone, your personal style, or a specific format. RAG cannot change how the model writes, only what it writes about.
- You have a specialized domain — medical terminology, legal language, or technical jargon that the base model handles poorly. Fine-tuning teaches the model to reason within your domain.
- Your dataset is small and stable — 100-1,000 high-quality examples is the sweet spot for LoRA. The data should not change frequently.
- You need faster inference — a fine-tuned model does not need the retrieval step, so responses come slightly faster (no embedding + search overhead).
- You want specific output formats — train the model to always produce JSON in a specific schema, or to follow a strict report template.
The Hybrid Approach: Fine-Tune + RAG
The most powerful setup combines both techniques. According to LLMCheck testing, the hybrid approach outperforms either method alone:[LLMCheck]
- Fine-tune the base model on your domain's style, terminology, and output format using LoRA via MLX
- Use RAG to inject specific knowledge at query time from your document collection
The fine-tuned model understands your domain's language better, which improves how it interprets and uses the RAG-retrieved context. For example:
- Fine-tune a model on your company's writing style + RAG your company docs = an AI that sounds like your team and knows your data
- Fine-tune on medical terminology + RAG patient records = a model that reasons medically and references specific cases
- Fine-tune on JSON output format + RAG your API documentation = structured answers grounded in real specs
# Hybrid workflow on Mac:
# 1. Fine-tune with MLX (one-time, ~2 hours)
mlx_lm.lora --model mlx-community/Qwen2.5-7B-4bit \
  --data ./style_examples --train --iters 500
# 2. Convert and load into Ollama
mlx_lm.fuse --model mlx-community/Qwen2.5-7B-4bit \
  --adapter-path ./adapters --save-path ./fused_model
# 3. Create Ollama model from fused weights
ollama create my-custom-model -f Modelfile
# 4. Use with RAG pipeline (ongoing)
# Your RAG code now uses "my-custom-model" instead of the base model
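The prompt-assembly half of step 4 might look like the sketch below; the template, model name, and example documents are illustrative, and the actual generation request would go to Ollama's local HTTP API:

```python
def build_rag_prompt(question, retrieved_docs):
    # Number each retrieved chunk so answers can cite their source.
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

MODEL = "my-custom-model"  # the fused model created in step 3
prompt = build_rag_prompt(
    "What is the refund window?",
    ["Returns are accepted within 30 days.",
     "Refunds go to the original payment method."],
)
# The generation request would then POST {"model": MODEL, "prompt": prompt}
# to Ollama's local endpoint at http://localhost:11434/api/generate
```

Nothing else in the pipeline changes: retrieval is identical, and only the model name differs.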
RAM Requirements
Here is what each approach costs in terms of unified memory on your Mac:
| Configuration | Training RAM | Inference RAM | Min Mac |
|---|---|---|---|
| RAG only (7B model) | N/A | ~6 GB (LLM) + ~2 GB (RAG) | 16 GB |
| RAG only (32B model) | N/A | ~20 GB (LLM) + ~2 GB (RAG) | 32 GB |
| Fine-tune (7B LoRA) | ~8 GB | ~4.5 GB (same as base) | 16 GB |
| Fine-tune (13B LoRA) | ~16 GB | ~8 GB (same as base) | 24 GB |
| Hybrid (7B fine-tuned + RAG) | ~8 GB (training) | ~6.5 GB (inference) | 16 GB |
Important: Fine-tuning RAM is only needed during the training phase. Once training is complete, the fine-tuned model uses the same RAM as the base model during inference. According to LLMCheck, RAG adds a consistent ~2 GB overhead regardless of model size.
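These numbers reduce to a rough rule of thumb: a 4-bit model needs about half a byte per parameter, plus overhead for the KV cache and activations, roughly doubled during LoRA training, plus a flat ~2 GB for RAG. A sketch of that estimate (the constants are approximations fitted to the table above, not measurements):

```python
def estimate_ram_gb(params_billion, bits=4, rag=False, training=False):
    # Weights: bits/8 bytes per parameter -> GB per billion parameters.
    weights_gb = params_billion * bits / 8
    total = weights_gb * 1.2          # +20% for KV cache and activations (rough)
    if training:
        total *= 2                    # table: ~2x model size while training
    if rag:
        total += 2.0                  # flat overhead for embeddings + vector DB
    return round(total, 1)

print(estimate_ram_gb(7, rag=True))       # RAG-only, 7B model
print(estimate_ram_gb(7, training=True))  # 7B LoRA training
print(estimate_ram_gb(32, rag=True))      # RAG-only, 32B model
```

Treat the output as a sizing sanity check, not a guarantee; longer contexts and higher quantization bits push the real numbers up.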
Verdict by Use Case
Company Knowledge Base / Internal Wiki
Use RAG. Your documents change frequently, you have many of them, and you need source attribution. RAG handles this perfectly with minimal setup.
Customer Support Bot with Brand Voice
Use hybrid. Fine-tune on 200-500 examples of your ideal support responses for tone and format, then use RAG to retrieve relevant help docs and policies at query time.
Personal Research Assistant
Use RAG. You are constantly adding new papers and notes. RAG lets you index them immediately without any training step.
Legal Document Drafting
Use hybrid. Fine-tune on your firm's document templates and legal writing style. Use RAG to retrieve relevant case law and precedents during drafting.
Code Generation in Your Framework
Use fine-tuning. Train on examples of your codebase's patterns, conventions, and APIs. The model needs to internalize your coding style, not just reference docs.
Structured Data Extraction
Use fine-tuning. Train the model to output data in your exact JSON schema consistently. RAG cannot teach output formatting — that requires weight changes.
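For illustration, a single training example for this use case might pair free text with the exact target schema; the invoice fields below are invented:

```python
import json

# One hypothetical JSONL training example: free text in, the exact
# target schema out. Hundreds of these teach the model the format itself.
example = {
    "text": (
        "Extract: Invoice #1042 from Acme Corp, due 2024-03-01, total $250.\n"
        '{"invoice_id": 1042, "vendor": "Acme Corp", '
        '"due_date": "2024-03-01", "total_usd": 250}'
    )
}
line = json.dumps(example)  # one line of the training JSONL file
```

Because every completion follows the same schema, the model internalizes the structure rather than needing it restated in each prompt.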