
Local RAG with Ollama

Sometimes you don't want your documents leaving your machine. Maybe it's proprietary code, financial records, or client data with strict residency requirements. Whatever the reason, running RAG locally means your data stays on your hardware. No API calls to OpenAI, no cloud vector databases, no third-party data processing agreements.

This guide walks through building a local RAG server using Ollama for the LLM and embeddings, ChromaDB for vector storage, and LangChain to wire it together.

What is Ollama?

Ollama lets you run LLMs locally. You pull a model (Llama 3, Mistral, Gemma, etc.) and run it on your machine -- similar to how Docker works for containers. It handles model downloading, GPU detection, and exposes a REST API for your applications.

What you get:

  • CLI for model management: ollama pull, ollama run, ollama list
  • REST API: Built in, no extra setup
  • Embedding support: Models like nomic-embed-text generate embeddings locally
  • GPU acceleration: Automatically uses your GPU if available
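
The REST API means any local process can talk to Ollama over HTTP on port 11434 (the default). A minimal sketch of the request body the /api/generate endpoint expects -- the model name is a placeholder for whatever you've pulled:

```python
import json

# Shape of a non-streaming request to Ollama's local REST API,
# which listens on http://localhost:11434 by default.
payload = {
    "model": "llama3.2:3b",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return one JSON object instead of a token stream
}
body = json.dumps(payload)

# To actually send it (requires a running Ollama server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

With `"stream": False` the response is a single JSON object whose `response` field holds the generated text; omit it and you get a stream of newline-delimited JSON chunks instead.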

Why go local?

  • Privacy: Documents never leave your machine
  • No API costs: After setup, inference is free (you're paying in electricity and GPU time)
  • No rate limits: Process as fast as your hardware allows
  • Offline capable: Works without internet once models are downloaded
  • Full control: Choose your model, tune parameters, swap components

Prerequisites

  • 8GB RAM minimum (16GB+ recommended)
  • macOS, Linux, or Windows with WSL2
  • Python 3.9+ (recent LangChain releases require it)
  • ~10GB free disk space for the smaller models (the 70B model alone is a ~40GB download)

Step 1: Install Ollama

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from ollama.com/download.

Verify:

ollama --version

Step 2: Pull models

You need two models: one for text generation, one for embeddings.

LLM (pick one based on your hardware):

# 8GB RAM
ollama pull llama3.2:3b

# 16GB+ RAM
ollama pull llama3.1:8b

# 64GB+ RAM
ollama pull llama3.1:70b

# Alternative
ollama pull mistral:7b

Embedding model:

# Higher quality
ollama pull nomic-embed-text

# Faster, smaller
ollama pull all-minilm

Check what's installed:

ollama list

Step 3: Set up Python environment

mkdir local-rag-server
cd local-rag-server
python -m venv venv

# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate

Install dependencies:

pip install langchain langchain-community chromadb ollama pypdf python-dotenv

Step 4: Build the RAG server

Create rag_server.py:

import os
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

class LocalRAGServer:
    def __init__(self, model_name="llama3.1:8b", embedding_model="nomic-embed-text"):
        """Initialize the RAG server with Ollama models."""
        print(f"Initializing RAG server with {model_name}...")

        # Initialize the LLM
        self.llm = Ollama(model=model_name)

        # Initialize embeddings
        self.embeddings = OllamaEmbeddings(model=embedding_model)

        # Initialize vector store
        self.vector_store = None
        self.qa_chain = None

        print("RAG server initialized successfully!")

    def load_documents(self, file_paths):
        """Load documents from various file formats."""
        documents = []

        for file_path in file_paths:
            print(f"Loading {file_path}...")

            if file_path.endswith('.pdf'):
                loader = PyPDFLoader(file_path)
            elif file_path.endswith('.txt'):
                loader = TextLoader(file_path)
            else:
                print(f"Unsupported file type: {file_path}")
                continue

            documents.extend(loader.load())

        print(f"Loaded {len(documents)} document(s)")
        return documents

    def create_vector_store(self, documents):
        """Create vector store from documents."""
        print("Splitting documents into chunks...")

        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )

        chunks = text_splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks")

        # Create vector store
        print("Creating vector embeddings (this may take a while)...")
        self.vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory="./chroma_db"
        )

        print("Vector store created successfully!")
        return self.vector_store

    def setup_qa_chain(self):
        """Set up the question-answering chain."""
        if self.vector_store is None:
            raise ValueError("Vector store not initialized. Load documents first.")

        # Create a custom prompt template
        template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""

        QA_CHAIN_PROMPT = PromptTemplate(
            input_variables=["context", "question"],
            template=template,
        )

        # Create retrieval QA chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": 3}
            ),
            return_source_documents=True,
            chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
        )

        print("QA chain configured!")

    def query(self, question):
        """Query the RAG system."""
        if self.qa_chain is None:
            raise ValueError("QA chain not initialized. Run setup_qa_chain() first.")

        print(f"\nProcessing query: {question}")
        result = self.qa_chain.invoke({"query": question})

        return {
            "answer": result["result"],
            "source_documents": result["source_documents"]
        }

# Example usage
if __name__ == "__main__":
    # Initialize RAG server
    rag = LocalRAGServer(model_name="llama3.1:8b")

    # Load your documents
    documents = rag.load_documents([
        "path/to/your/document1.pdf",
        "path/to/your/document2.txt"
    ])

    # Create vector store
    rag.create_vector_store(documents)

    # Setup QA chain
    rag.setup_qa_chain()

    # Query the system
    while True:
        user_question = input("\nAsk a question (or 'quit' to exit): ")

        if user_question.lower() == 'quit':
            break

        response = rag.query(user_question)
        print(f"\nAnswer: {response['answer']}")
        print(f"\nSources: {len(response['source_documents'])} document chunks used")

Step 5: Add a web API (optional)

If you want HTTP access, add a Flask layer. Install Flask:

pip install flask flask-cors

Create api_server.py:

from flask import Flask, request, jsonify
from flask_cors import CORS
from rag_server import LocalRAGServer

app = Flask(__name__)
CORS(app)

# Initialize RAG server (do this once at startup)
rag = None

@app.route('/initialize', methods=['POST'])
def initialize():
    """Initialize the RAG server with documents."""
    global rag

    data = request.json
    file_paths = data.get('file_paths', [])
    model_name = data.get('model_name', 'llama3.1:8b')

    try:
        rag = LocalRAGServer(model_name=model_name)
        documents = rag.load_documents(file_paths)
        rag.create_vector_store(documents)
        rag.setup_qa_chain()

        return jsonify({
            "status": "success",
            "message": "RAG server initialized successfully"
        })
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 500

@app.route('/query', methods=['POST'])
def query():
    """Query the RAG system."""
    if rag is None:
        return jsonify({
            "status": "error",
            "message": "RAG server not initialized"
        }), 400

    data = request.json
    question = data.get('question', '')

    if not question:
        return jsonify({
            "status": "error",
            "message": "No question provided"
        }), 400

    try:
        response = rag.query(question)
        return jsonify({
            "status": "success",
            "answer": response['answer'],
            "num_sources": len(response['source_documents'])
        })
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 500

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint."""
    return jsonify({
        "status": "healthy",
        "initialized": rag is not None
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)

Run it:

python api_server.py

Test it

# Initialize
curl -X POST http://localhost:5000/initialize \
  -H "Content-Type: application/json" \
  -d '{
    "file_paths": ["./documents/sample.pdf"],
    "model_name": "llama3.1:8b"
  }'

# Query
curl -X POST http://localhost:5000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic of the document?"
  }'

# Health check
curl http://localhost:5000/health
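
The same endpoints can be hit from Python. A minimal client sketch using only the standard library (assuming the Flask server above is running on localhost:5000):

```python
import json
import urllib.request

API_URL = "http://localhost:5000"  # the Flask server from Step 5

def build_query_request(question):
    """Build the POST request for the /query endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question):
    """Send the question (requires the API server to be running)."""
    with urllib.request.urlopen(build_query_request(question)) as resp:
        return json.loads(resp.read())

# Example (with the server running):
# print(ask("What is the main topic of the document?")["answer"])
```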

Tuning performance

  • GPU: Ollama auto-detects CUDA GPUs. Make sure drivers are installed.
  • Chunk size: 500-1500 characters depending on your documents. Smaller chunks = more precise retrieval, larger chunks = more context per result.
  • Retrieval count (k): Start with 3-5. More chunks = more context but slower and noisier.
  • Model size: 3B models are fast but imprecise. 7-8B is a good balance. 70B is slow but noticeably better for complex questions.
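
The chunk-size/overlap interaction is easy to see with a toy splitter. This is a simplified stand-in for RecursiveCharacterTextSplitter (which also tries to break on separators like paragraphs and sentences), just to illustrate how overlap duplicates text across chunk boundaries:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-width splitter: each chunk starts where the
    previous one ended, minus the overlap."""
    assert chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_text("abcdefghij" * 5, chunk_size=20, chunk_overlap=5)
# 50 characters -> 3 chunks of 20; each chunk repeats the last 5
# characters of the previous one, so text near a boundary isn't
# cut off from its surrounding context.
```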

Swapping ChromaDB

ChromaDB is fine for getting started. For production:

  • PostgreSQL + pgvector if you already run Postgres
  • Weaviate for advanced filtering and multi-tenancy
  • Milvus for large-scale deployments
  • FAISS if you want an in-memory option without a server
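
Whichever store you pick, the core operation is the same: embed the query, score it against the stored vectors, return the top k. A brute-force sketch of that loop in plain Python (real stores add approximate indexing so this scales past a few thousand vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, index, k=3):
    """index: list of (doc_id, embedding) pairs.
    Returns the k closest doc ids by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
index = [("doc_a", [1.0, 0.0]), ("doc_b", [0.0, 1.0]), ("doc_c", [0.9, 0.1])]
print(top_k([1.0, 0.0], index, k=2))  # -> ['doc_a', 'doc_c']
```

This is essentially what `vector_store.as_retriever(search_kwargs={"k": 3})` does under the hood, with the embedding model producing the vectors.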

Going to production

If you're deploying this beyond your laptop:

  • Containerize the whole stack with Docker
  • Add authentication to the API endpoints
  • Rate limit requests to prevent resource exhaustion
  • Cache repeated queries
  • Monitor query latency and error rates
  • Log queries and retrieved chunks for debugging retrieval quality
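
The query-caching point can be as simple as a dict keyed on the normalized question. A sketch that wraps the LocalRAGServer from Step 4 (the wrapper class is illustrative, not part of the guide's code):

```python
class CachedRAG:
    """Wraps any object with a .query(question) method and caches
    answers for repeated questions."""

    def __init__(self, rag):
        self.rag = rag
        self._cache = {}
        self.hits = 0

    def query(self, question):
        # Treat trivial variants (case, surrounding whitespace)
        # as the same question.
        key = question.strip().lower()
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = self.rag.query(question)
        return self._cache[key]
```

In the Flask layer you'd wrap `rag` once after initialization. For anything long-running, bound the cache size (an LRU eviction policy, for instance) so memory doesn't grow without limit.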

Wrapping up

This setup gives you a fully private RAG pipeline. No data leaves your machine, no API bills, no rate limits. The tradeoff is hardware requirements -- you need a decent GPU and enough RAM for your chosen model.

As open-source models keep improving (Llama 3 is already competitive with GPT-3.5 for many tasks), the gap between local and cloud RAG narrows. For use cases where data privacy matters, local-first is increasingly the practical choice.