What RAG Is and Why It Matters for Local AI

Large language models are trained on general knowledge, but they do not know about your private documents, company data, or recent files. Retrieval-Augmented Generation (RAG) solves this by fetching relevant context from your own documents and injecting it into the LLM prompt before generation.

The result: your local LLM can answer questions about your specific data — without fine-tuning, without sending anything to the cloud, and with far fewer of the hallucinations that come from asking a model about information it was never trained on.

RAG is particularly powerful when running locally on Mac because:

  - Privacy: your documents never leave your machine.
  - Cost: no per-query API fees once the hardware is paid for.
  - Offline use: the entire pipeline works without an internet connection.
  - Speed: Apple Silicon's Metal acceleration keeps embedding and generation fast.

RAG Architecture Overview

A local RAG system has three core components that work together:

  1. Embedding model (nomic-embed-text via Ollama) — converts text into numerical vectors that capture semantic meaning. Similar documents produce similar vectors.
  2. Vector store (ChromaDB) — a database optimized for storing and searching embedding vectors. When you query, it finds the most semantically similar documents in milliseconds.
  3. LLM (any Ollama model) — generates the final answer using your query plus the retrieved context as input.

The pipeline flow is: Query → Embed query → Search vector store → Retrieve top-K documents → Inject into prompt → LLM generates answer.
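The "search vector store" step boils down to cosine similarity between embedding vectors. A minimal sketch with toy 3-dimensional vectors (real nomic-embed-text vectors have 768 dimensions, and the filenames here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes; 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings
docs = {
    "refund_policy.txt": [0.9, 0.1, 0.0],
    "shipping_info.txt": [0.1, 0.8, 0.2],
    "api_reference.txt": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "What is your refund policy?"

# Rank documents by similarity to the query; this is what the vector store does internally
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked[0])
```

With these toy vectors, the refund-policy document ranks first because its vector points in nearly the same direction as the query vector.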

Step 1: Install Ollama + Pull Embedding Model

If you do not already have Ollama installed, grab it from ollama.com or follow our Ollama installation guide. Then pull the embedding model and a chat model:

# Pull the embedding model (~270 MB)
ollama pull nomic-embed-text

# Pull a chat model for generation (pick one for your RAM)
ollama pull qwen3.5:9b      # 16 GB Mac
ollama pull qwen3.5:32b     # 32 GB+ Mac

Verify both models are available:

ollama list

You should see both nomic-embed-text and your chosen chat model in the output.

Why nomic-embed-text? It produces 768-dimensional embeddings, runs natively through Ollama with Metal acceleration, and uses only ~1 GB of RAM. According to LLMCheck benchmarks, it embeds text at over 5,000 tokens per second on M4 Max.

Step 2: Install ChromaDB

ChromaDB is a lightweight, open-source vector database that runs entirely in-process. No separate server needed — it embeds directly into your Python script.

# Create a virtual environment (recommended)
python3 -m venv rag-env
source rag-env/bin/activate

# Install ChromaDB and the Ollama Python client
pip install chromadb ollama

ChromaDB stores its data in a local SQLite database by default, so your vector index persists between sessions. On a typical document collection (10,000 files), the database uses about 500 MB of disk space.

Step 3: Index Your Documents

This script loads text files from a directory, generates embeddings using Ollama, and stores everything in ChromaDB:

import os
import ollama
import chromadb

# Initialize ChromaDB with persistent storage
client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}
)

# Load and index documents
docs_dir = "./documents"
for filename in os.listdir(docs_dir):
    if filename.endswith((".txt", ".md")):
        filepath = os.path.join(docs_dir, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()

        # Generate embedding via Ollama
        response = ollama.embed(
            model="nomic-embed-text",
            input=content
        )

        # Store in ChromaDB
        collection.add(
            ids=[filename],
            documents=[content],
            embeddings=[response["embeddings"][0]]
        )
        print(f"Indexed: {filename}")

print(f"Total documents indexed: {collection.count()}")

Tip: For large documents, split them into chunks of 500-1000 tokens before embedding. This improves retrieval precision because the vector store can return the most relevant section rather than an entire long document.
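A simple chunker along the lines of that tip might look like this (an illustrative sketch: token counts are approximated as whitespace-delimited words, since exact counts depend on the model's tokenizer, and the `chunk_text` name and overlap value are choices made here, not part of the script above):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks, with overlap so context spans chunk boundaries."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and stored separately, using an id such as `f"{filename}:{i}"` so retrieval points back to the source file.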

Step 4: Query with Context Injection

Now build the query pipeline. This retrieves the most relevant documents and feeds them to the LLM as context:

import ollama
import chromadb

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_collection("my_documents")

def rag_query(question, n_results=3):
    # Embed the question
    query_embedding = ollama.embed(
        model="nomic-embed-text",
        input=question
    )["embeddings"][0]

    # Retrieve relevant documents
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    # Build context from retrieved docs
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with context injection
    prompt = f"""Use the following context to answer the question.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{"role": "user", "content": prompt}]
    )

    return response["message"]["content"]

# Example usage
answer = rag_query("What is our refund policy?")
print(answer)

The n_results parameter controls how many documents to retrieve. According to LLMCheck testing, 3-5 chunks is the sweet spot — enough context for accurate answers without overwhelming the model's context window.
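One way to sanity-check a retrieval setting before sending the prompt is to estimate the token footprint of the retrieved chunks against the model's context window. A rough sketch; the 4-characters-per-token heuristic, the 8,192-token window, and the `fits_context` helper are all assumptions for illustration, not measured values:

```python
def fits_context(chunks, question, context_window=8192, reserve_for_answer=1024):
    """Roughly estimate whether retrieved chunks plus the question fit the context window."""
    # Crude heuristic: roughly 4 characters per token for English text,
    # plus ~200 characters for the prompt template itself
    prompt_chars = sum(len(c) for c in chunks) + len(question) + 200
    estimated_tokens = prompt_chars // 4
    return estimated_tokens <= context_window - reserve_for_answer

chunks = ["lorem ipsum " * 200] * 3   # three chunks of roughly 600 tokens each
print(fits_context(chunks, "What is our refund policy?"))
```

If the check fails, either lower n_results or shrink the chunk size used at indexing time.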

Step 5: Connect to Open WebUI for a Chat Interface

If you prefer a visual chat interface over Python scripts, Open WebUI has built-in RAG support that connects directly to Ollama:

# Install and run Open WebUI via Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Once running, open http://localhost:3000 in your browser. Navigate to Settings → Documents and upload your files. Open WebUI will automatically embed them using Ollama and provide a chat interface where you can query your documents naturally.

Performance and RAM Requirements

According to LLMCheck benchmarks on Apple Silicon, here is what to expect from each component:

| Component | RAM Usage | Performance (M4 Max) |
| --- | --- | --- |
| nomic-embed-text | ~1 GB | 5,200 tokens/sec embedding |
| ChromaDB (10K docs) | ~500 MB | <50ms retrieval |
| Qwen 3.5 9B (LLM) | ~6 GB | 62 tok/s generation |
| Qwen 3.5 32B (LLM) | ~20 GB | 38 tok/s generation |
| Total (with 9B) | ~7.5 GB | End-to-end: ~2s per query |

A 16 GB Mac can comfortably run the full RAG pipeline with a 9B model. For 32B models, you will want 32 GB or more of unified memory.
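That rule of thumb can be turned into a quick check. A sketch, assuming roughly 0.65 GB of RAM per billion parameters (which matches the ~6 GB / 9B and ~20 GB / 32B figures at 4-bit quantization) plus ~1.5 GB of overhead for the embedding model and ChromaDB; the function name, the 0.65 constant, and the 75%-of-RAM budget are assumptions, not measurements:

```python
def rag_ram_estimate_gb(model_params_b, overhead_gb=1.5):
    """Estimate total RAM for the RAG pipeline in GB.

    Assumes ~0.65 GB per billion parameters (4-bit quantization)
    plus overhead for the embedding model and ChromaDB.
    """
    return model_params_b * 0.65 + overhead_gb

# Budget ~75% of unified memory for the pipeline, leaving room for macOS itself
for params, mac_ram in [(9, 16), (32, 32)]:
    needed = rag_ram_estimate_gb(params)
    print(f"{params}B model: ~{needed:.1f} GB needed, fits on {mac_ram} GB Mac: {needed < mac_ram * 0.75}")
```

By this estimate, both pairings in the table fit with headroom to spare, which matches the 16 GB / 9B and 32 GB / 32B recommendations.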

Local RAG vs Cloud RAG

How does running RAG locally compare to using cloud services like OpenAI + Pinecone?

| Dimension | Local RAG (Ollama + ChromaDB) | Cloud RAG (OpenAI + Pinecone) |
| --- | --- | --- |
| Privacy | Complete — nothing leaves Mac | Data sent to third parties |
| Cost | Free after hardware | $20-200+/month API fees |
| Latency | ~2s end-to-end | 3-8s (network + API) |
| Quality (small corpus) | Good with 9B+ models | Excellent with GPT-4 |
| Quality (large corpus) | Good retrieval, decent generation | Better generation quality |
| Offline support | Fully offline | Requires internet |
| Setup complexity | 30 minutes | 15 minutes |
| Scalability | Limited by Mac RAM | Virtually unlimited |

For personal knowledge bases, company docs under 50,000 pages, and privacy-sensitive use cases, local RAG is the clear winner. Cloud RAG makes more sense when you need GPT-4-level generation quality or are indexing millions of documents.