What RAG Is and Why It Matters for Local AI

Large language models are trained on general knowledge, but they do not know about your private documents, company data, or recent files. Retrieval-Augmented Generation (RAG) solves this by fetching relevant context from your own documents and injecting it into the LLM prompt before generation.

The result: your local LLM can answer questions about your specific data — without fine-tuning, without sending anything to the cloud, and with far fewer of the hallucinations that come from asking a model about information it was never trained on.

RAG is particularly powerful when running locally on Mac because:

  - Privacy: your documents never leave your machine.
  - Cost: no per-query API fees once the hardware is paid for.
  - Offline use: the entire pipeline works without an internet connection.
  - Speed: Apple Silicon's Metal acceleration keeps embedding and generation fast.

RAG Architecture Overview

A local RAG system has three core components that work together:

  1. Embedding model (nomic-embed-text via Ollama) — converts text into numerical vectors that capture semantic meaning. Similar documents produce similar vectors.
  2. Vector store (ChromaDB) — a database optimized for storing and searching embedding vectors. When you query, it finds the most semantically similar documents in milliseconds.
  3. LLM (any Ollama model) — generates the final answer using your query plus the retrieved context as input.

The pipeline flow is: Query → Embed query → Search vector store → Retrieve top-K documents → Inject into prompt → LLM generates answer.
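The "search vector store" step boils down to cosine similarity between embedding vectors. A minimal sketch with toy 3-dimensional vectors (real nomic-embed-text vectors have 768 dimensions, and the filenames here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes; 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings
docs = {
    "refund_policy.txt": [0.9, 0.1, 0.0],
    "shipping_info.txt": [0.1, 0.8, 0.2],
    "api_reference.txt": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "What is your refund policy?"

# Rank documents by similarity to the query; this is what the vector store does internally
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranked[0])
```

With these toy vectors, the refund-policy document ranks first because its vector points in nearly the same direction as the query vector.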

Step 1: Install Ollama + Pull Embedding Model

If you do not already have Ollama installed, grab it from ollama.com or follow our Ollama installation guide. Then pull the embedding model and a chat model:

# Pull the embedding model (~270 MB)
ollama pull nomic-embed-text

# Pull a chat model for generation (pick one for your RAM)
ollama pull qwen3.5:9b      # 16 GB Mac
ollama pull qwen3.5:32b     # 32 GB+ Mac

Verify both models are available:

ollama list

You should see both nomic-embed-text and your chosen chat model in the output.

Why nomic-embed-text? It produces 768-dimensional embeddings, runs natively through Ollama with Metal acceleration, and uses only ~1 GB of RAM. According to LLMCheck benchmarks, it embeds text at over 5,000 tokens per second on M4 Max.

Step 2: Install ChromaDB

ChromaDB is a lightweight, open-source vector database that runs entirely in-process. No separate server needed — it embeds directly into your Python script.

# Create a virtual environment (recommended)
python3 -m venv rag-env
source rag-env/bin/activate

# Install ChromaDB and the Ollama Python client
pip install chromadb ollama

ChromaDB stores its data in a local SQLite database by default, so your vector index persists between sessions. On a typical document collection (10,000 files), the database uses about 500 MB of disk space.

Step 3: Index Your Documents

This script loads text files from a directory, generates embeddings using Ollama, and stores everything in ChromaDB:

import os
import ollama
import chromadb

# Initialize ChromaDB with persistent storage
client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}
)

# Load and index documents
docs_dir = "./documents"
for filename in os.listdir(docs_dir):
    if filename.endswith((".txt", ".md")):
        filepath = os.path.join(docs_dir, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read()

        # Generate embedding via Ollama
        response = ollama.embed(
            model="nomic-embed-text",
            input=content
        )

        # Store in ChromaDB
        collection.add(
            ids=[filename],
            documents=[content],
            embeddings=[response["embeddings"][0]]
        )
        print(f"Indexed: {filename}")

print(f"Total documents indexed: {collection.count()}")

Tip: For large documents, split them into chunks of 500-1000 tokens before embedding. This improves retrieval precision because the vector store can return the most relevant section rather than an entire long document.
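A simple chunker along the lines of that tip might look like this (an illustrative sketch: token counts are approximated as whitespace-delimited words, since exact counts depend on the model's tokenizer, and the `chunk_text` name and overlap value are choices made here, not part of the script above):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks, with overlap so context spans chunk boundaries."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and stored separately, using an id such as `f"{filename}:{i}"` so retrieval points back to the source file.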

Step 4: Query with Context Injection

Now build the query pipeline. This retrieves the most relevant documents and feeds them to the LLM as context:

import ollama
import chromadb

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_collection("my_documents")

def rag_query(question, n_results=3):
    # Embed the question
    query_embedding = ollama.embed(
        model="nomic-embed-text",
        input=question
    )["embeddings"][0]

    # Retrieve relevant documents
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    # Build context from retrieved docs
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with context injection
    prompt = f"""Use the following context to answer the question.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{"role": "user", "content": prompt}]
    )

    return response["message"]["content"]

# Example usage
answer = rag_query("What is our refund policy?")
print(answer)

The n_results parameter controls how many documents to retrieve. According to LLMCheck testing, 3-5 chunks is the sweet spot — enough context for accurate answers without overwhelming the model's context window.
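One way to sanity-check a retrieval setting before sending the prompt is to estimate the token footprint of the retrieved chunks against the model's context window. A rough sketch; the 4-characters-per-token heuristic, the 8,192-token window, and the `fits_context` helper are all assumptions for illustration, not measured values:

```python
def fits_context(chunks, question, context_window=8192, reserve_for_answer=1024):
    """Roughly estimate whether retrieved chunks plus the question fit the context window."""
    # Crude heuristic: roughly 4 characters per token for English text,
    # plus ~200 characters for the prompt template itself
    prompt_chars = sum(len(c) for c in chunks) + len(question) + 200
    estimated_tokens = prompt_chars // 4
    return estimated_tokens <= context_window - reserve_for_answer

chunks = ["lorem ipsum " * 200] * 3   # three chunks of roughly 600 tokens each
print(fits_context(chunks, "What is our refund policy?"))
```

If the check fails, either lower n_results or shrink the chunk size used at indexing time.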

Step 5: Connect to Open WebUI for a Chat Interface

If you prefer a visual chat interface over Python scripts, Open WebUI has built-in RAG support that connects directly to Ollama:

# Install and run Open WebUI via Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Once running, open http://localhost:3000 in your browser. Navigate to Settings → Documents and upload your files. Open WebUI will automatically embed them using Ollama and provide a chat interface where you can query your documents naturally.

Performance and RAM Requirements

According to LLMCheck benchmarks on Apple Silicon, here is what to expect from each component:

| Component | RAM Usage | Performance (M4 Max) |
| --- | --- | --- |
| nomic-embed-text | ~1 GB | 5,200 tokens/sec embedding |
| ChromaDB (10K docs) | ~500 MB | <50ms retrieval |
| Qwen 3.5 9B (LLM) | ~6 GB | 62 tok/s generation |
| Qwen 3.5 32B (LLM) | ~20 GB | 38 tok/s generation |
| Total (with 9B) | ~7.5 GB | End-to-end: ~2s per query |

A 16 GB Mac can comfortably run the full RAG pipeline with a 9B model. For 32B models, you will want 32 GB or more of unified memory.
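That rule of thumb can be turned into a quick check. A sketch, assuming roughly 0.65 GB of RAM per billion parameters (which matches the ~6 GB / 9B and ~20 GB / 32B figures at 4-bit quantization) plus ~1.5 GB of overhead for the embedding model and ChromaDB; the function name, the 0.65 constant, and the 75%-of-RAM budget are assumptions, not measurements:

```python
def rag_ram_estimate_gb(model_params_b, overhead_gb=1.5):
    """Estimate total RAM for the RAG pipeline in GB.

    Assumes ~0.65 GB per billion parameters (4-bit quantization)
    plus overhead for the embedding model and ChromaDB.
    """
    return model_params_b * 0.65 + overhead_gb

# Budget ~75% of unified memory for the pipeline, leaving room for macOS itself
for params, mac_ram in [(9, 16), (32, 32)]:
    needed = rag_ram_estimate_gb(params)
    print(f"{params}B model: ~{needed:.1f} GB needed, fits on {mac_ram} GB Mac: {needed < mac_ram * 0.75}")
```

By this estimate, both pairings in the table fit with headroom to spare, which matches the 16 GB / 9B and 32 GB / 32B recommendations.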

Local RAG vs Cloud RAG

How does running RAG locally compare to using cloud services like OpenAI + Pinecone?

| Dimension | Local RAG (Ollama + ChromaDB) | Cloud RAG (OpenAI + Pinecone) |
| --- | --- | --- |
| Privacy | Complete — nothing leaves Mac | Data sent to third parties |
| Cost | Free after hardware | $20-200+/month API fees |
| Latency | ~2s end-to-end | 3-8s (network + API) |
| Quality (small corpus) | Good with 9B+ models | Excellent with GPT-4 |
| Quality (large corpus) | Good retrieval, decent generation | Better generation quality |
| Offline support | Fully offline | Requires internet |
| Setup complexity | 30 minutes | 15 minutes |
| Scalability | Limited by Mac RAM | Virtually unlimited |

For personal knowledge bases, company docs under 50,000 pages, and privacy-sensitive use cases, local RAG is the clear winner. Cloud RAG makes more sense when you need GPT-4-level generation quality or are indexing millions of documents.