
RAG Without Vector Search

If you've built RAG systems before, you know the drill: chunk your documents, generate vector embeddings, store them in a vector database, and hope that cosine similarity retrieves the right context for your LLM. It works, but it's fundamentally a brute-force approach. PageIndex throws out the entire vector pipeline and replaces it with something far more intuitive --- LLM reasoning over a hierarchical document index.

The result? A system that achieved a reported 98.7% accuracy on FinanceBench (through VectifyAI's Mafin 2.5 product), outperforming vector-based approaches on professional document analysis.

The Problem with Vector-Based RAG

Traditional RAG has several well-known pain points that anyone who's built production systems will recognize:

  • Chunking destroys context: Splitting a 200-page financial report into 500-token chunks loses the relationship between sections. A table on page 45 might reference definitions on page 3, but your chunks don't know that.
  • Semantic similarity is not understanding: Vector search finds text that "sounds similar" to your query, not text that actually answers it. Ask "What was the YoY revenue growth?" and you might get chunks mentioning revenue and growth separately, but not the actual comparison.
  • No reasoning over structure: Documents have structure --- chapters, sections, subsections, tables, appendices. Vector search ignores all of this and treats every chunk as an independent island.
  • Opaque retrieval: When your RAG system gives a wrong answer, good luck debugging which chunks were retrieved and why. Vector similarity scores don't explain reasoning.
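
The second failure mode can be made concrete with a toy sketch. The "embeddings" below are hand-picked, hypothetical 3-d vectors rather than the output of a real model, chosen to show how cosine similarity can rank a chunk that merely echoes the query's keywords above the chunk that actually contains the answer:

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings" whose axes loosely mean
# (mentions revenue, mentions growth, contains an actual YoY comparison).
query   = [0.9, 0.9, 0.3]   # "What was the YoY revenue growth?"
chunk_a = [0.9, 0.9, 0.1]   # talks about revenue and growth separately
chunk_b = [0.3, 0.3, 0.9]   # the table with the actual comparison

# chunk_a "sounds similar" and wins, even though chunk_b holds the answer.
print(round(cosine(query, chunk_a), 3), round(cosine(query, chunk_b), 3))
```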

How PageIndex Works

PageIndex takes a fundamentally different approach built on two steps:

Step 1: Build a Hierarchical Tree Index

Instead of chunking your document into flat pieces, PageIndex analyzes the document and builds a table-of-contents-style tree structure. Each node in the tree represents a logical section of the document with a title, summary, and page range references. This preserves the natural hierarchy and relationships within the document.

{
  "title": "Annual Financial Report 2025",
  "summary": "Comprehensive financial report covering...",
  "start_index": 1,
  "end_index": 200,
  "children": [
    {
      "title": "Executive Summary",
      "summary": "Key highlights including 15% revenue growth...",
      "start_index": 1,
      "end_index": 5
    },
    {
      "title": "Financial Statements",
      "summary": "Detailed income statement, balance sheet...",
      "start_index": 40,
      "end_index": 85,
      "children": [
        {
          "title": "Income Statement",
          "start_index": 40,
          "end_index": 55
        },
        {
          "title": "Balance Sheet",
          "start_index": 56,
          "end_index": 70
        }
      ]
    }
  ]
}
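
Because the index is plain JSON, downstream code can traverse it directly. A minimal sketch, assuming only the node fields shown above, that flattens the tree into an indented outline with page ranges:

```python
import json

index_json = """
{"title": "Annual Financial Report 2025", "start_index": 1, "end_index": 200,
 "children": [
   {"title": "Executive Summary", "start_index": 1, "end_index": 5},
   {"title": "Financial Statements", "start_index": 40, "end_index": 85,
    "children": [
      {"title": "Income Statement", "start_index": 40, "end_index": 55},
      {"title": "Balance Sheet", "start_index": 56, "end_index": 70}]}]}
"""

def walk(node, depth=0):
    # Depth-first traversal yielding (depth, title, page range) per node.
    yield depth, node["title"], (node["start_index"], node["end_index"])
    for child in node.get("children", []):
        yield from walk(child, depth + 1)

tree = json.loads(index_json)
for depth, title, (start, end) in walk(tree):
    print("  " * depth + f"{title}: pages {start}-{end}")
```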

Step 2: Reasoning-Driven Retrieval

When you ask a question, instead of computing vector similarity, PageIndex uses an LLM to reason over the tree index. The LLM reads the node summaries, decides which branches are relevant, navigates deeper into those branches, and retrieves the actual content from the right pages. This mimics how a domain expert would navigate a document --- scanning the table of contents, going to the relevant section, and reading the specific pages that answer the question.
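
The navigation loop can be sketched in a few lines. In PageIndex the branch choice is an actual LLM call over node titles and summaries; here it is stubbed with keyword overlap (a deliberate simplification) so the example runs end to end:

```python
def relevant_children(question, node):
    # Stand-in for the LLM: keep children whose title/summary shares a
    # word with the question.
    words = set(question.lower().split())
    return [c for c in node.get("children", [])
            if words & set((c["title"] + " " + c.get("summary", "")).lower().split())]

def retrieve(question, node):
    # Descend relevant branches; when none match (or at a leaf), fall
    # back to the current node's own page range.
    children = relevant_children(question, node)
    if not children:
        return [(node["start_index"], node["end_index"])]
    pages = []
    for child in children:
        pages.extend(retrieve(question, child))
    return pages

tree = {
    "title": "Annual Financial Report 2025", "start_index": 1, "end_index": 200,
    "children": [
        {"title": "Executive Summary",
         "summary": "Key highlights including 15% revenue growth",
         "start_index": 1, "end_index": 5},
        {"title": "Financial Statements",
         "summary": "Detailed income statement, balance sheet, cash flows",
         "start_index": 40, "end_index": 85,
         "children": [
             {"title": "Income Statement", "start_index": 40, "end_index": 55},
             {"title": "Balance Sheet", "start_index": 56, "end_index": 70},
         ]},
    ],
}

print(retrieve("Summarize the balance sheet", tree))   # [(56, 70)]
```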

Open Source or Paid? Understanding the Licensing

PageIndex is released under the MIT License, making it fully open source. You can self-host, modify, and use it commercially without restrictions. However, there are cost considerations depending on how you use it:

  • Self-hosted (open source): The core PageIndex library is free. You clone the repo and run it locally. However, it uses OpenAI by default (the model is selectable via the --model flag), so you'll pay for the LLM API calls made during both indexing and querying. The indexing step runs your entire document through the LLM to build the tree structure, which can consume significant tokens for large documents.
  • Cloud service: VectifyAI offers a hosted version at chat.pageindex.ai with a free tier available, plus premium plans for enterprise use. The cloud service also provides MCP integration and a REST API.
  • Enterprise: Private and on-premises arrangements are available by contacting the team directly.

Bottom line: The software itself is free and open source, but you'll need to pay for the underlying LLM API calls (OpenAI by default) when self-hosting. This is similar to how LangChain is free but you still pay for the models you use through it.

Getting Started: Self-Hosted Setup

Let's walk through setting up PageIndex locally and indexing your first document.

Prerequisites

  • Python 3.8+
  • An OpenAI API key
  • A PDF or Markdown document to index

Installation

# Clone the repository
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

# Install dependencies
pip3 install --upgrade -r requirements.txt

Configuration

Create a .env file in the project root with your OpenAI API key:

CHATGPT_API_KEY=your_openai_api_key_here

Building Your First Index

Index a PDF document with a single command:

# Index a PDF document
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

# Or index a Markdown file
python3 run_pageindex.py --md_path /path/to/your/document.md

This command reads your document, sends it through the LLM to analyze its structure, and produces a hierarchical tree index as JSON output.

Customization Options

PageIndex provides several flags to control how your index is built:

# Use a different OpenAI model
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o-2024-11-20

# Control how many pages per tree node (default: 10)
python3 run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 5

# Control token limits per node (default: 20000)
python3 run_pageindex.py --pdf_path doc.pdf --max-tokens-per-node 15000

# Disable section summaries for faster indexing
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-summary no

The --max-pages-per-node flag is particularly important. Smaller values create more granular tree nodes (better precision but more LLM calls), while larger values create broader nodes (faster indexing but coarser retrieval).
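
That tradeoff can be put in back-of-envelope terms, assuming (simplistically) that sections fill each node right up to the page limit of a 200-page document:

```python
# More granular nodes mean more nodes to summarize, hence more LLM calls.
total_pages = 200

for max_pages in (20, 10, 5):
    nodes = -(-total_pages // max_pages)   # ceiling division
    print(f"--max-pages-per-node {max_pages:2d} -> ~{nodes} leaf nodes to summarize")
```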

Using the Cloud API

If you prefer not to self-host, PageIndex offers a cloud API. Here are examples of interacting with it:

Upload and Index a Document

import requests

# Upload a document for indexing
url = "https://api.pageindex.ai/v1/documents"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

with open("report.pdf", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f}
    )

document_id = response.json()["document_id"]
print(f"Document indexed: {document_id}")

Query the Indexed Document

# Query the document using reasoning-based retrieval
query_url = f"https://api.pageindex.ai/v1/documents/{document_id}/query"

response = requests.post(
    query_url,
    headers=headers,
    json={"question": "What was the year-over-year revenue growth?"}
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: Pages {result['pages']}")

MCP Integration

One of PageIndex's standout features is its Model Context Protocol (MCP) integration. This allows AI assistants like Claude to directly interact with your indexed documents as a tool. You can configure the MCP server at pageindex.ai/mcp and give your AI assistant the ability to search and reason over your document library.

PageIndex vs Traditional Vector RAG: A Practical Comparison

Feature                 | Vector RAG              | PageIndex
------------------------|-------------------------|--------------------------
Document Processing     | Chunk + embed           | Build tree index
Retrieval Method        | Cosine similarity       | LLM reasoning
Infrastructure          | Vector DB required      | No vector DB needed
Explainability          | Similarity scores       | Page & section references
Structural Awareness    | None (flat chunks)      | Full hierarchy preserved
Accuracy (FinanceBench) | ~70-85%                 | 98.7% (Mafin 2.5)
Cost Model              | Embedding + DB hosting  | LLM API calls
Best For                | Short docs, high volume | Long, structured docs

When to Use PageIndex

PageIndex shines in specific scenarios. Here's when it makes sense and when traditional RAG might still be the better choice:

Use PageIndex when:

  • You're working with long, structured documents (financial reports, legal contracts, technical manuals, research papers)
  • Accuracy matters more than speed --- the reasoning approach is slower per query but significantly more accurate
  • You need explainable retrieval with exact page and section references
  • Your questions require cross-referencing multiple sections of a document
  • You want to avoid managing vector database infrastructure

Stick with vector RAG when:

  • You need sub-second retrieval latency at scale
  • You're searching across thousands of short documents (e.g., support tickets, FAQs)
  • You want to minimize per-query LLM costs --- vector search only uses the LLM for generation, not retrieval
  • Your documents are unstructured with no clear hierarchy

Cost Considerations

Understanding the cost profile of PageIndex is important because it's different from traditional RAG:

  • Indexing cost: Building the tree index requires processing the entire document through an LLM. For a 100-page PDF using GPT-4o, this could cost $1-5 depending on content density. This is a one-time cost per document.
  • Query cost: Each query uses LLM reasoning to navigate the tree, so every retrieval involves LLM API calls. This is more expensive per query than vector similarity search but typically more accurate.
  • No infrastructure cost: You don't need to host a vector database (Pinecone, Weaviate, Qdrant, etc.), which can save significant monthly costs for production deployments.
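
A rough model of the one-time indexing cost. Every number below is an assumption for illustration (token density, pass count, and per-token price vary by document and model; output tokens for summaries and any retries add more):

```python
# Rough indexing-cost model; all parameters are illustrative assumptions.
pages = 100
tokens_per_page = 800        # assumed density for a text-heavy PDF
passes = 2                   # assume a structure pass plus a summary pass
usd_per_1k_input = 0.0025    # assumed input price, $/1K tokens

input_tokens = pages * tokens_per_page * passes
cost = input_tokens / 1000 * usd_per_1k_input
print(f"~{input_tokens:,} input tokens, ~${cost:.2f} before output tokens")
```

Denser documents, more passes, and the output tokens for node summaries push real runs toward the $1-5 range cited above.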

A Complete Example: Analyzing a Financial Report

Let's walk through a realistic example of using PageIndex to analyze a company's annual report:

# Step 1: Clone and set up PageIndex
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip3 install --upgrade -r requirements.txt

# Step 2: Configure your API key
echo "CHATGPT_API_KEY=sk-your-key-here" > .env

# Step 3: Index the annual report
python3 run_pageindex.py \
  --pdf_path ./annual_report_2025.pdf \
  --model gpt-4o-2024-11-20 \
  --max-pages-per-node 5 \
  --if-add-node-summary yes

# This produces a tree index JSON file that looks like:
# {
#   "title": "Annual Report 2025",
#   "children": [
#     {"title": "Letter to Shareholders", "pages": "1-3"},
#     {"title": "Business Overview", "pages": "4-20", "children": [...]},
#     {"title": "Financial Statements", "pages": "45-90", "children": [
#       {"title": "Consolidated Income Statement", "pages": "45-52"},
#       {"title": "Balance Sheet", "pages": "53-65"},
#       ...
#     ]}
#   ]
# }

When you query "What was the YoY revenue growth?", the LLM reasons: "Revenue growth would be in the Financial Statements section, specifically the Income Statement. Let me navigate there." It then reads the actual pages (45-52) and gives you a precise answer with page references. No vector similarity guessing involved.

The Bigger Picture

PageIndex is part of a shift in the RAG space: using LLM reasoning for retrieval, not just generation. As model costs drop, the economics tilt further in favor of this approach. You're trading more LLM calls per query for significantly better accuracy on structured documents.

For long, structured documents where getting the right answer matters --- financial reports, legal contracts, technical manuals --- the reasoning-based approach is worth trying. The MIT license and simple setup mean you can test it on your own docs in an afternoon.
