Local RAG with Ollama

Sometimes you don't want your documents leaving your machine. Maybe it's proprietary code, financial records, or client data with strict residency requirements. Whatever the reason, running RAG locally means your data stays on your hardware. No API calls to OpenAI, no cloud vector databases, no third-party data processing agreements.
This guide walks through building a local RAG server using Ollama for the LLM and embeddings, ChromaDB for vector storage, and LangChain to wire it together.
What is Ollama?
Ollama lets you run LLMs locally. You pull a model (Llama 3, Mistral, Gemma, etc.) and run it on your machine -- similar to how Docker works for containers. It handles model downloading, GPU detection, and exposes a REST API for your applications.
What you get:
- CLI for model management: ollama pull, ollama run, ollama list
- REST API: Built in, no extra setup
- Embedding support: Models like nomic-embed-text generate embeddings locally
- GPU acceleration: Automatically uses your GPU if available
Why go local?
- Privacy: Documents never leave your machine
- No API costs: After setup, inference is free (you're paying in electricity and GPU time)
- No rate limits: Process as fast as your hardware allows
- Offline capable: Works without internet once models are downloaded
- Full control: Choose your model, tune parameters, swap components
Prerequisites
- 8GB RAM minimum (16GB+ recommended)
- macOS, Linux, or Windows with WSL2
- Python 3.8+
- ~10GB free disk space for models
Step 1: Install Ollama
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download from ollama.com/download.
Verify:
ollama --version
Step 2: Pull models
You need two models: one for text generation, one for embeddings.
LLM (pick one based on your hardware):
# 8GB RAM
ollama pull llama3.2:3b
# 16GB+ RAM
ollama pull llama3.1:8b
# 64GB+ RAM (even quantized, a 70B model needs ~40GB)
ollama pull llama3.1:70b
# Alternative
ollama pull mistral:7b
Embedding model:
# Higher quality
ollama pull nomic-embed-text
# Faster, smaller
ollama pull all-minilm
Check what's installed:
ollama list
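With models pulled, you can also hit Ollama's REST API directly -- it listens on localhost:11434 by default. This is a useful sanity check before wiring up Python; adjust the model names to whatever you pulled:

```shell
# Generate a completion (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Generate an embedding vector for a piece of text
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'
```

If both return JSON (a response field for the first, an embedding array for the second), Ollama is ready.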
Step 3: Set up Python environment
mkdir local-rag-server
cd local-rag-server
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
venv\Scripts\activate
Install dependencies:
pip install langchain langchain-community chromadb ollama pypdf python-dotenv
Step 4: Build the RAG server
Create rag_server.py:
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


class LocalRAGServer:
    def __init__(self, model_name="llama3.1:8b", embedding_model="nomic-embed-text"):
        """Initialize the RAG server with Ollama models."""
        print(f"Initializing RAG server with {model_name}...")

        # Initialize the LLM
        self.llm = Ollama(model=model_name)

        # Initialize embeddings
        self.embeddings = OllamaEmbeddings(model=embedding_model)

        # Built later, once documents are loaded
        self.vector_store = None
        self.qa_chain = None

        print("RAG server initialized successfully!")

    def load_documents(self, file_paths):
        """Load documents from various file formats."""
        documents = []
        for file_path in file_paths:
            print(f"Loading {file_path}...")
            if file_path.endswith('.pdf'):
                loader = PyPDFLoader(file_path)
            elif file_path.endswith('.txt'):
                loader = TextLoader(file_path)
            else:
                print(f"Unsupported file type: {file_path}")
                continue
            documents.extend(loader.load())
        print(f"Loaded {len(documents)} document(s)")
        return documents

    def create_vector_store(self, documents):
        """Create vector store from documents."""
        print("Splitting documents into chunks...")

        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks")

        # Create vector store
        print("Creating vector embeddings (this may take a while)...")
        self.vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory="./chroma_db"
        )
        print("Vector store created successfully!")
        return self.vector_store

    def setup_qa_chain(self):
        """Set up the question-answering chain."""
        if self.vector_store is None:
            raise ValueError("Vector store not initialized. Load documents first.")

        # Create a custom prompt template
        template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Context: {context}
Question: {question}
Helpful Answer:"""
        QA_CHAIN_PROMPT = PromptTemplate(
            input_variables=["context", "question"],
            template=template,
        )

        # Create retrieval QA chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": 3}
            ),
            return_source_documents=True,
            chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
        )
        print("QA chain configured!")

    def query(self, question):
        """Query the RAG system."""
        if self.qa_chain is None:
            raise ValueError("QA chain not initialized. Run setup_qa_chain() first.")
        print(f"\nProcessing query: {question}")
        result = self.qa_chain.invoke({"query": question})
        return {
            "answer": result["result"],
            "source_documents": result["source_documents"]
        }


# Example usage
if __name__ == "__main__":
    # Initialize RAG server
    rag = LocalRAGServer(model_name="llama3.1:8b")

    # Load your documents
    documents = rag.load_documents([
        "path/to/your/document1.pdf",
        "path/to/your/document2.txt"
    ])

    # Create vector store
    rag.create_vector_store(documents)

    # Set up QA chain
    rag.setup_qa_chain()

    # Query the system
    while True:
        user_question = input("\nAsk a question (or 'quit' to exit): ")
        if user_question.lower() == 'quit':
            break
        response = rag.query(user_question)
        print(f"\nAnswer: {response['answer']}")
        print(f"\nSources: {len(response['source_documents'])} document chunks used")
Step 5: Add a web API (optional)
If you want HTTP access, add a Flask layer. Install Flask:
pip install flask flask-cors
Create api_server.py:
from flask import Flask, request, jsonify
from flask_cors import CORS
from rag_server import LocalRAGServer

app = Flask(__name__)
CORS(app)

# Initialize RAG server (do this once at startup)
rag = None


@app.route('/initialize', methods=['POST'])
def initialize():
    """Initialize the RAG server with documents."""
    global rag
    data = request.json
    file_paths = data.get('file_paths', [])
    model_name = data.get('model_name', 'llama3.1:8b')
    try:
        rag = LocalRAGServer(model_name=model_name)
        documents = rag.load_documents(file_paths)
        rag.create_vector_store(documents)
        rag.setup_qa_chain()
        return jsonify({
            "status": "success",
            "message": "RAG server initialized successfully"
        })
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 500


@app.route('/query', methods=['POST'])
def query():
    """Query the RAG system."""
    if rag is None:
        return jsonify({
            "status": "error",
            "message": "RAG server not initialized"
        }), 400
    data = request.json
    question = data.get('question', '')
    if not question:
        return jsonify({
            "status": "error",
            "message": "No question provided"
        }), 400
    try:
        response = rag.query(question)
        return jsonify({
            "status": "success",
            "answer": response['answer'],
            "num_sources": len(response['source_documents'])
        })
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 500


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint."""
    return jsonify({
        "status": "healthy",
        "initialized": rag is not None
    })


if __name__ == '__main__':
    # debug=True is for local development only; turn it off in production
    app.run(host='0.0.0.0', port=5000, debug=True)
Run it:
python api_server.py
Test it
# Initialize
curl -X POST http://localhost:5000/initialize \
-H "Content-Type: application/json" \
-d '{
"file_paths": ["./documents/sample.pdf"],
"model_name": "llama3.1:8b"
}'
# Query
curl -X POST http://localhost:5000/query \
-H "Content-Type: application/json" \
-d '{
"question": "What is the main topic of the document?"
}'
# Health check
curl http://localhost:5000/health
Tuning performance
- GPU: Ollama auto-detects supported GPUs (NVIDIA CUDA, AMD ROCm, Apple Metal). Make sure drivers are installed.
- Chunk size: 500-1500 characters depending on your documents. Smaller chunks = more precise retrieval, larger chunks = more context per result.
- Retrieval count (k): Start with 3-5. More chunks = more context but slower and noisier.
- Model size: 3B models are fast but imprecise. 7-8B is a good balance. 70B is slow but noticeably better for complex questions.
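The chunk-size tradeoff is easier to see with a toy splitter. This is plain Python, not LangChain's RecursiveCharacterTextSplitter (which prefers splitting on separators like newlines), but the size/overlap arithmetic is the same:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Naive fixed-window chunker: step forward by (size - overlap) each time."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 3000  # stand-in for a 3000-character document

small = chunk_text(text, chunk_size=500, chunk_overlap=100)   # more, finer-grained chunks
large = chunk_text(text, chunk_size=1500, chunk_overlap=200)  # fewer, broader chunks

print(len(small), len(large))  # → 8 3
```

Smaller chunks give the retriever more shots at an exact match but each carries less surrounding context; the overlap exists so a sentence straddling a boundary appears whole in at least one chunk.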
Swapping ChromaDB
ChromaDB is fine for getting started. For production:
- PostgreSQL + pgvector if you already run Postgres
- Weaviate for advanced filtering and multi-tenancy
- Milvus for large-scale deployments
- FAISS if you want an in-memory option without a server
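Whichever store you pick, the core operation is the same: embed the query and return the nearest chunks by cosine similarity. A stdlib-only sketch of that retrieval step (the vectors here are made up; real embeddings have hundreds of dimensions, and real stores index them so search stays fast at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunks, k=3):
    """Brute-force nearest-neighbor search over (text, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return [text for score, text in sorted(scored, reverse=True)[:k]]

# Toy 3-dimensional "embeddings"
chunks = [
    ("chunk about invoices", [0.9, 0.1, 0.0]),
    ("chunk about holidays", [0.0, 0.9, 0.1]),
    ("chunk about payments", [0.8, 0.2, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

This brute-force scan is essentially what FAISS's flat index does; the other stores add approximate indexes, filtering, and persistence on top.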
Going to production
If you're deploying this beyond your laptop:
- Containerize the whole stack with Docker
- Add authentication to the API endpoints
- Rate limit requests to prevent resource exhaustion
- Cache repeated queries
- Monitor query latency and error rates
- Log queries and retrieved chunks for debugging retrieval quality
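Of these, caching is the cheapest win: for a fixed index, the expensive retrieval-plus-generation step is a pure function of the question. A minimal sketch (class and attribute names are hypothetical; a real deployment would bound the cache size and invalidate it when documents change):

```python
class CachedRAG:
    """Wrap any RAG object exposing .query(question) with an exact-match cache."""
    def __init__(self, rag):
        self.rag = rag
        self.cache = {}
        self.hits = 0

    def query(self, question):
        key = question.strip().lower()  # normalize trivially different queries
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.rag.query(question)
        return self.cache[key]

# Usage with a stub standing in for LocalRAGServer
class StubRAG:
    def __init__(self):
        self.calls = 0
    def query(self, question):
        self.calls += 1
        return {"answer": f"answer to {question!r}"}

cached = CachedRAG(StubRAG())
cached.query("What is the total?")
cached.query("what is the total?  ")   # served from cache, no second model call
print(cached.rag.calls, cached.hits)   # → 1 1
```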
Wrapping up
This setup gives you a fully private RAG pipeline. No data leaves your machine, no API bills, no rate limits. The tradeoff is hardware requirements -- you need a decent GPU and enough RAM for your chosen model.
As open-source models keep improving (Llama 3 is already competitive with GPT-3.5 for many tasks), the gap between local and cloud RAG narrows. For use cases where data privacy matters, local-first is increasingly the practical choice.


