PageIndex Deep Dive: The Good, The Bad, and The Ugly of Vectorless RAG
What if everything we know about RAG is built on a flawed assumption?
That’s the provocative claim behind PageIndex, a project from VectifyAI that’s accumulated 19,000+ GitHub stars in under a year. Their argument: similarity is not relevance, and the entire chunk-embed-retrieve pipeline that powers most RAG systems today is fundamentally limited. Instead, PageIndex builds hierarchical tree structures from documents and uses LLM reasoning to navigate them — no vectors, no chunking, no embedding databases.
I spent some time digging through the codebase, the benchmarks, and the competitive landscape. Here’s what I found.
What PageIndex Actually Does
At its core, PageIndex is a two-stage system:
Stage 1 — Index Generation: Given a PDF or Markdown document, PageIndex uses an LLM (GPT-4o by default) to build a hierarchical JSON tree. It detects the table of contents, maps sections to physical page numbers, recursively splits oversized nodes, and generates summaries. The output is a structured representation of the document — think of it as a machine-readable table of contents with metadata.
Stage 2 — Reasoning-Based Retrieval: When you query the system, instead of searching a vector database for “similar” chunks, an LLM reasons over the tree structure. It reads section titles and summaries, decides which branch to explore, drills down, extracts information, and follows cross-references. This mimics how a human expert actually navigates a 200-page annual report.
The pipeline has a self-correcting mechanism too — it verifies its own TOC extraction, and if accuracy drops below 60%, it cascades through three fallback modes (TOC with page numbers → TOC without page numbers → pure LLM segmentation).
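That cascade is easy to picture in code. Here is a minimal sketch of the fallback logic — the strategy functions are stand-ins I invented for illustration (PageIndex's real extractors call GPT-4o); only the cascade shape and the 60% verification cutoff come from the project's description:

```python
# Sketch of the self-correcting fallback cascade. The three strategy
# functions are hypothetical stand-ins; the cascade order and threshold
# mirror the behavior described above.

ACCURACY_THRESHOLD = 0.6  # verification cutoff from the project docs

def toc_with_pages(pages):
    # Stand-in: succeeds only if the document has a TOC with page numbers.
    if pages.get("toc") and pages.get("page_numbers"):
        return {"mode": "toc+pages"}, 0.95
    return None, 0.0

def toc_without_pages(pages):
    # Stand-in: TOC text exists but page numbers don't map cleanly.
    if pages.get("toc"):
        return {"mode": "toc-only"}, 0.75
    return None, 0.0

def llm_segmentation(pages):
    # Last resort: always produces something; accuracy varies by document.
    return {"mode": "llm-segmentation"}, 0.65

def build_tree(pages):
    """Cascade through progressively weaker strategies until one verifies."""
    for strategy in (toc_with_pages, toc_without_pages, llm_segmentation):
        tree, accuracy = strategy(pages)
        if tree is not None and accuracy >= ACCURACY_THRESHOLD:
            return tree
    raise ValueError("no strategy passed verification")

# A document with a TOC but unmappable page numbers falls through to mode 2:
print(build_tree({"toc": True})["mode"])  # toc-only
```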
In code, Stage 1 looks like this:

```python
from pageindex import page_index

result = page_index(
    doc="/path/to/annual-report.pdf",
    model="gpt-4o-2024-11-20",
    max_pages_per_node=10,
    if_add_node_summary="yes"
)
# Returns: hierarchical JSON tree with section summaries
```
The output looks like this:
```json
{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {
      "title": "Risk Management",
      "node_id": "0005",
      "start_index": 42,
      "end_index": 67,
      "summary": "Covers market risk, credit risk, and operational risk...",
      "nodes": [
        {
          "title": "Interest Rate Risk",
          "node_id": "0006",
          "start_index": 45,
          "end_index": 52,
          "summary": "Details hedging strategies and sensitivity analysis..."
        }
      ]
    }
  ]
}
```
Every retrieved answer traces back to specific page numbers and section paths. No opacity.
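Stage 2 boils down to a loop: at each node, the LLM reads titles and summaries, picks a branch (or stops), and descends. Here is an illustrative sketch over the tree above, with the LLM call stubbed out as keyword overlap — `choose_branch` is my stand-in, not PageIndex's API:

```python
# Sketch of reasoning-based retrieval as tree descent. `choose_branch`
# stands in for the LLM call; it scores children by keyword overlap with
# the query, purely for illustration.

def choose_branch(query, children):
    def score(node):
        text = (node["title"] + " " + node.get("summary", "")).lower()
        return sum(word in text for word in query.lower().split())
    best = max(children, key=score)
    return best if score(best) > 0 else None

def navigate(tree, query, path=()):
    """Descend until no child looks relevant; return the section path."""
    node = choose_branch(query, tree)
    if node is None:
        return path
    new_path = path + (node["title"],)
    children = node.get("nodes", [])
    return navigate(children, query, new_path) if children else new_path

structure = [{
    "title": "Risk Management",
    "summary": "Covers market risk, credit risk, and operational risk",
    "nodes": [{
        "title": "Interest Rate Risk",
        "summary": "Details hedging strategies and sensitivity analysis",
    }],
}]

print(navigate(structure, "interest rate risk hedging"))
# ('Risk Management', 'Interest Rate Risk')
```

The returned path is exactly the audit trail the section above describes: a sequence of section titles from root to answer.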
The Problem It’s Solving
If you’ve built a RAG system, you’ve fought the chunking war. The core tension is irreconcilable:
- Small chunks (256-512 tokens) embed cleanly and match precisely, but lose all surrounding context. “Revenue grew 3%” — which company? Which quarter?
- Large chunks (1024+ tokens) preserve context but dilute semantic focus, making similarity search less reliable.
The community has layered increasingly sophisticated fixes on top: Anthropic’s Contextual Retrieval prepends per-chunk context (reducing retrieval failure by 49%). Jina AI’s Late Chunking embeds the full document first, then extracts chunk vectors from token-level representations. Semantic chunking groups sentences by embedding similarity.
But these are all patches on the same architecture. PageIndex asks: what if we just skip the chunking entirely?
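The context-loss problem is easy to reproduce. A naive fixed-size chunker, sketched below with made-up text, severs a fact from the entity it describes:

```python
# Naive fixed-size chunking, to make the trade-off above concrete.
# Splitting on a fixed word budget strands facts away from their context.

def chunk_words(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = ("Acme Corp, Q3 2024 results. "
       "Revenue grew 3% year over year driven by cloud services.")

chunks = chunk_words(doc, 5)
for c in chunks:
    print(repr(c))
# The middle chunk reads "Revenue grew 3% year over" -- no company,
# no quarter. An embedding of that chunk inherits exactly that ambiguity.
```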
The Good
A Genuinely Novel Architecture
This isn’t another “improved chunking strategy” paper. PageIndex replaces the entire vector similarity paradigm with tree-based reasoning. The AlphaGo inspiration is apt — just as AlphaGo uses Monte Carlo Tree Search to reason through game states rather than pattern-matching board positions, PageIndex navigates document hierarchies through explicit LLM reasoning rather than embedding similarity.
The ~2,500 lines of Python are lean and readable. Dependencies are minimal: OpenAI SDK, PyMuPDF, tiktoken, and that’s essentially it. No PyTorch, no FAISS, no vector database. The entire system is an LLM talking to a JSON tree.
The FinanceBench Numbers Are Real
VectifyAI’s Mafin 2.5 system (built on PageIndex) achieved 98.7% accuracy on FinanceBench — a benchmark of financial Q&A over SEC filings and earnings reports. For context, traditional vector-based RAG solutions typically score 60-80% on this dataset. That’s not a marginal improvement; it’s a different tier of performance.
This makes sense when you think about it. Financial documents are highly structured. When an analyst asks “What was Microsoft’s operating margin in Q3 2024?”, the answer lives in a specific section of a specific table in a specific filing. Vector similarity search might surface chunks that mention operating margins. Tree-based reasoning can navigate directly to Financials → Income Statement → Q3 → Operating Margin.
Explainability by Design
Every answer comes with a traceable path through the document tree: “I found this on pages 45-52, under Risk Management → Interest Rate Risk.” For regulated industries — finance, healthcare, legal — this isn’t a nice-to-have. It’s a compliance requirement.
Vector RAG gives you “here are the top-5 most similar chunks” with cosine similarity scores. That’s not an explanation anyone can audit.
Cross-Reference Handling
When a document says “see Appendix G for methodology details,” chunk-based systems are blind to this. The chunk containing the reference and the chunk containing Appendix G are separate islands in vector space. PageIndex’s tree navigation can follow these references because it understands document structure, not just text similarity.
Unlimited Document Length
The recursive node splitting is clever: any section exceeding 10 pages or 20,000 tokens gets automatically decomposed into sub-sections via LLM segmentation. This means PageIndex can handle documents of arbitrary length without hitting context window limits — a genuine advantage over approaches that try to stuff everything into a single LLM call.
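The splitting rule is a simple recursion. A sketch, using the 10-page default from above — `llm_segment` is my stand-in for the LLM segmentation call and just bisects the page range for illustration:

```python
# Sketch of recursive node decomposition. `llm_segment` stands in for the
# LLM segmentation step; here it naively bisects the page range.

MAX_PAGES = 10  # default threshold mentioned above

def llm_segment(node):
    mid = (node["start"] + node["end"]) // 2
    return [
        {"title": node["title"] + " (part 1)", "start": node["start"], "end": mid},
        {"title": node["title"] + " (part 2)", "start": mid + 1, "end": node["end"]},
    ]

def split_node(node):
    """Recursively decompose any node spanning more than MAX_PAGES pages."""
    if node["end"] - node["start"] + 1 <= MAX_PAGES:
        return node
    node["nodes"] = [split_node(child) for child in llm_segment(node)]
    return node

# A 25-page section decomposes into four leaves, each within budget.
tree = split_node({"title": "Risk Management", "start": 1, "end": 25})
```

Because the recursion bottoms out as soon as a node fits the budget, no single LLM call ever has to see more than one node's worth of pages — which is why document length stops being bounded by the context window.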
The Bad
Complete LLM Dependency
Every operation in PageIndex requires GPT-4o API calls. TOC detection? LLM call. Physical page mapping? LLM call. Summary generation? LLM call. Error correction? LLM call. For a 100-page document, you’re looking at 50-200 API calls during indexing.
There’s no offline mode, no local model support out of the box, and no way to run this without an OpenAI API key. The CHATGPT_API_KEY environment variable name tells you everything about where this system’s allegiance lies. While you could theoretically swap in another OpenAI-compatible API, the prompts are tuned for GPT-4o.
Cost Scales Uncomfortably
Indexing a typical 100-page PDF costs roughly $0.50-$5.00 at current GPT-4o pricing. That’s fine for a handful of important documents. It’s not fine for ingesting 50,000 documents into a knowledge base.
But here’s the real cost concern: PageIndex requires LLM calls at retrieval time too, not just indexing time. Every query triggers a reasoning chain through the tree. Traditional vector RAG pays once at embedding time, then retrieval is essentially free (a cosine similarity lookup). PageIndex pays at both ends.
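Back-of-envelope arithmetic makes the asymmetry concrete. All figures below are illustrative assumptions (call counts, token averages, and per-token prices), not measurements:

```python
# Illustrative cost model, not measured data. Prices are USD per million
# tokens, roughly GPT-4o-class at time of writing (assumption).

PRICE_IN, PRICE_OUT = 2.50, 10.00

def call_cost(tokens_in, tokens_out):
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

# Indexing: assume 100 calls averaging 3k tokens in / 500 out per call.
index_cost = 100 * call_cost(3_000, 500)

# Retrieval: assume 4 reasoning steps averaging 2k in / 300 out per query.
query_cost = 4 * call_cost(2_000, 300)

print(f"index once: ${index_cost:.2f}, per query: ${query_cost:.3f}")
# index once: $1.25, per query: $0.032
```

Three cents a query sounds small until you multiply by a high-traffic chat application; vector retrieval's marginal cost, by contrast, rounds to zero.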
The Table of Contents Dependency
PageIndex works best when documents have clear, well-structured tables of contents. The system has fallback modes for documents without a TOC, but these rely on pure LLM segmentation — slower and potentially less accurate. If you’re processing messy, unstructured documents (think: email threads, Slack exports, support tickets), the tree metaphor starts breaking down.
Only One Benchmark
The 98.7% FinanceBench number is impressive, but it’s the only public benchmark. FinanceBench specifically tests financial document Q&A — a domain where hierarchical, structured documents are the norm. We have no data on how PageIndex performs on:
- Technical documentation with deep cross-referencing
- Academic papers with non-standard structures
- Legal contracts with heavily nested clauses
- Mixed-media documents (text, images, charts)
- Documents in languages other than English
One stellar benchmark does not prove general superiority.
The Ugly
The Scalability Question Nobody’s Answering
Vector databases scale to billions of documents. Pinecone, Weaviate, Qdrant — they’re built for exactly this. An approximate-nearest-neighbor lookup (HNSW and similar indexes) runs in roughly O(log n).
PageIndex’s tree-based reasoning requires an LLM to read and reason over the tree structure for every query. What happens when you have 10,000 documents, each with their own tree? 100,000? The project doesn’t address multi-document search at all. There’s no mechanism for deciding which document tree to even start searching.
For single-document or small-collection use cases, this is fine. For enterprise knowledge bases with millions of documents, this is an unsolved problem.
Latency Is Inherently Higher
A vector similarity search returns results in milliseconds. PageIndex’s multi-step reasoning chain — read tree, select branch, drill down, extract, assess, possibly repeat — takes seconds. For interactive chat applications where users expect sub-second responses, this is a meaningful trade-off.
The “Reasoning” Is an Abstraction
PageIndex describes its retrieval as “reasoning-based” and draws the AlphaGo parallel. But let’s be precise about what’s happening: an LLM is reading a JSON tree and making selection decisions. This is closer to structured prompting than to Monte Carlo Tree Search. AlphaGo’s tree search involved value networks trained on millions of games and explicit exploration-exploitation trade-offs. PageIndex’s “reasoning” is GPT-4o following instructions.
This isn’t a criticism of the approach — it works well. But the framing occasionally oversells the sophistication of what’s happening under the hood.
The OpenAI Lock-In
The codebase is built around OpenAI’s API, OpenAI’s tokenizer (tiktoken), and OpenAI’s model names. While you could point it at an OpenAI-compatible endpoint (like a local vLLM instance), the prompts and temperature settings are tuned for GPT-4o. Running this with Llama, Mistral, or Claude would require non-trivial prompt engineering.
In a world where model providers are commoditizing and multi-model architectures are becoming the norm, this is a limiting design choice.
How Does It Compare?
Here’s where PageIndex sits in the broader landscape:
vs. Traditional Chunk + Vector RAG: PageIndex wins on accuracy for structured documents and explainability. Vector RAG wins on cost, latency, scalability, and versatility across document types.
vs. Anthropic’s Contextual Retrieval / Jina’s Late Chunking: These are fixes layered on top of the vector paradigm — they acknowledge the context-loss problem and compensate. PageIndex sidesteps the problem entirely. But these approaches maintain the cost and speed advantages of vector search.
vs. Layout-Aware Parsers (Unstructured, LlamaParse, Docling): These tools solve document parsing — extracting clean, structured text from complex PDFs. They’re complementary to both vector RAG and PageIndex. In fact, PageIndex would benefit from better PDF parsing (it currently uses PyMuPDF/PyPDF2, which struggle with complex layouts).
vs. Vision-Based Approaches (ColPali, Vision-Guided Chunking): A June 2025 paper showed vision-guided chunking improved RAG accuracy by 14% by treating PDF pages as images. PageIndex also offers a vision RAG mode. These approaches are converging on the same insight: text-only processing loses too much information.
vs. Agentic RAG: PageIndex is essentially a specialized form of agentic RAG — an LLM agent with a structured index to navigate. LlamaIndex and LangChain offer similar agent-based retrieval patterns, but without PageIndex’s pre-built tree structure. The tree is what makes the reasoning practical rather than exploratory.
When Should You Actually Use This?
Use PageIndex when:
- You’re working with well-structured, long-form documents (financial reports, regulatory filings, technical manuals)
- Accuracy matters more than speed or cost
- You need auditable, traceable retrieval paths
- Your document collection is small to medium (dozens to hundreds, not millions)
- Cross-referencing within documents is important
Stick with vector RAG when:
- You’re processing large volumes of diverse, unstructured content
- Sub-second latency matters
- Cost per query needs to be near-zero
- You need to scale to millions of documents
- Your documents lack clear hierarchical structure
Consider a hybrid approach:
- Use vector search to select the right document(s) from a large collection
- Then use PageIndex-style tree reasoning for precise extraction within those documents
This hybrid pattern — coarse retrieval via vectors, fine retrieval via reasoning — may be where the field is heading.
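A minimal sketch of that hybrid, with toy three-dimensional vectors and the tree-reasoning stage stubbed out (none of this is PageIndex's actual API):

```python
# Hybrid retrieval sketch: cosine similarity picks the document,
# then per-document tree reasoning (stubbed) extracts within it.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy document embeddings; in reality these come from an embedding model.
docs = {
    "annual-report.pdf": [0.9, 0.1, 0.0],
    "employee-handbook.pdf": [0.1, 0.9, 0.2],
}

def hybrid_retrieve(query_vec, trees):
    # Stage 1: coarse -- vector similarity selects the document.
    doc = max(docs, key=lambda d: cosine(query_vec, docs[d]))
    # Stage 2: fine -- reason over that document's tree (stubbed here).
    return doc, trees[doc]["root"]

trees = {
    "annual-report.pdf": {"root": "Risk Management"},
    "employee-handbook.pdf": {"root": "Benefits"},
}
doc, section = hybrid_retrieve([0.8, 0.2, 0.1], trees)
print(doc, "->", section)  # annual-report.pdf -> Risk Management
```

The division of labor is the point: vectors do the one thing they are unambiguously good at (cheap candidate selection over millions of items), and tree reasoning spends its expensive LLM calls only inside the one document that matters.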
Getting Started
If you want to try it yourself:
```shell
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip3 install -r requirements.txt
export CHATGPT_API_KEY=sk-your-key-here

# Index a PDF
python3 run_pageindex.py --pdf_path /path/to/document.pdf
# Output lands in ./results/{pdf_name}_structure.json
```
There’s also a cloud-hosted version at chat.pageindex.ai if you want to skip the setup, and an MCP server for Claude/Cursor integration.
The Bigger Picture
PageIndex is interesting not just for what it does, but for what it represents. The RAG ecosystem is moving from “chunk and search” toward “reason and navigate.” Anthropic’s contextual retrieval, Jina’s late chunking, vision-based approaches, and now fully vectorless systems like PageIndex — they’re all symptoms of the same realization: vector similarity is a proxy for relevance, and proxies have limits.
Whether PageIndex specifically becomes the standard is less important than the architectural shift it embodies. The next generation of document retrieval systems will likely combine the scalability of vectors with the precision of structured reasoning.
For now, if you’re building RAG for financial analysis, regulatory compliance, or any domain where “close enough” isn’t good enough — PageIndex is worth a serious look. Just go in with your eyes open about the cost, latency, and scalability trade-offs.
References:
- PageIndex GitHub Repository
- VectifyAI Launches Mafin 2.5 and PageIndex — MarkTechPost
- Anthropic Contextual Retrieval
- Late Chunking in Long-Context Embedding Models — Jina AI
- NVIDIA Chunking Strategy Benchmark
- Vision-Guided Chunking — arXiv 2506.16035
- FinanceBench Dataset — arXiv 2311.11944
- Stack Overflow: Breaking Up Is Hard to Do — Chunking in RAG