We Rebuilt Our RAG System 3 Times: Lessons from Building Document Chat
Building RAG looks easy in tutorials. In production, everything breaks. Here's what we learned rebuilding NovaKit's document processing system from scratch—three times.
The first version took two weeks. The second version took a month. The third version took three months.
Each time, we thought we'd figured it out. Each time, production taught us otherwise.
Building RAG looks deceptively simple. Chunk documents, create embeddings, vector search, done. The tutorials make it look like a weekend project.
It's not.
This is the story of building NovaKit's document chat system—including every mistake we made and what we learned.
Version 1: The Tutorial Implementation
We followed the standard recipe:
Architecture
PDF Upload → Text Extraction → Fixed Chunks (500 tokens)
→ OpenAI Embeddings → Pinecone → Done
Simple. Clean. By the book.
What Worked
For a demo, it was impressive:
- Upload a PDF
- Ask questions
- Get answers with sources
Our internal testing looked great. We shipped it.
What Broke in Production
Day 1: Users uploaded scanned PDFs. Zero text extracted. System returned "I don't have any information about that" for every question.
Day 3: A user uploaded a 500-page technical manual. We hit Pinecone's metadata limits. Chunks started failing silently. Random documents became unsearchable.
Day 7: Someone asked a question spanning two chunks. The answer was in the document, split across chunk boundaries. System found one half, generated a wrong answer from incomplete information.
Day 14: Cost explosion. A single user uploaded 200 documents. Embedding costs: $47. Their monthly plan: $9.
Day 21: "Why does it keep citing the wrong page numbers?"
The Humbling Realization
Our tutorial implementation worked for:
- Clean, text-native PDFs
- Small document sets
- Simple questions
- Demos
It failed for:
- Real user documents (scanned, messy, diverse)
- Real scale (hundreds of documents)
- Real questions (complex, multi-part)
- Real budgets (need cost efficiency)
Time to rebuild.
Version 2: The Over-Engineered Rebuild
The pendulum swung hard.
Architecture
Upload → Format Detection → OCR Pipeline → Layout Analysis
→ Intelligent Chunking → Multiple Embedding Models
→ Hybrid Vector + Keyword Index → Query Expansion
→ Re-ranking → Response Generation
We added everything:
- Tesseract OCR for scanned documents
- LayoutLM for document structure
- Semantic chunking based on topic shifts
- Three different embedding models (compared at query time)
- BM25 keyword index alongside vectors
- HyDE for query expansion
- Cross-encoder re-ranking
- Custom chunk metadata extraction
What Worked
Accuracy improved. Significantly.
The complex queries that broke V1 now worked. Scanned PDFs processed correctly. Page numbers were accurate.
What Broke in Production
Week 1: Processing time. A 50-page PDF took 8 minutes. Users refreshed the page, assumed it failed, uploaded again. Processing queue exploded.
Week 2: OCR costs. Tesseract was free but slow. Google Vision was fast but $1.50 per 1000 pages. A user uploaded 10,000 pages of legal documents. Cost: $15. Their plan: $29/month.
Week 3: "Why is search so slow?" Query latency ballooned to 4-6 seconds. Three embedding models × re-ranking × query expansion = slow.
Week 4: Debugging nightmares. When answers were wrong, which component failed? Chunking? Wrong embedding model? Re-ranking messed up? Hours spent on each bug.
Week 5: Cold starts. Serverless functions loading three embedding models: 12-second cold starts. Users thought the site was down.
The Second Humbling Realization
We solved V1's problems by creating new ones:
- Accuracy ↑ but speed ↓
- Features ↑ but complexity ↑↑↑
- Costs ↑
- Maintainability ↓↓
We'd over-corrected. Time for V3.
Version 3: The Right Balance
Third time, we had scars. We knew what mattered.
Principles
- Fail gracefully, not silently
- Optimize the common case, handle the edge cases
- Complexity must pay for itself
- Speed is a feature
- Costs must be predictable
Architecture
Upload → Quick Classification → Appropriate Pipeline
→ Smart Chunking → Single Good Embedding Model
→ Hybrid Index → Agentic Retrieval → Response
Simpler than V2. More robust than V1.
Key Decisions
Decision 1: Tiered Processing
Not all documents need the same treatment:
If: Native PDF (has text layer)
→ Direct text extraction (fast, free)
If: Scanned PDF (no text layer)
→ OCR with user warning about time
→ Background processing with notification
If: Word/Markdown/Text
→ Direct parsing (instant)
If: Image
→ Vision model description OR OCR
→ User choice
Processing time for 80% of documents: under 10 seconds.
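A minimal sketch of the routing step, assuming pdfplumber is installed; the pipeline names and the "sample the first few pages" heuristic are illustrative, not our exact production classifier.

```python
from pathlib import Path
import pdfplumber

def has_text_layer(pdf_path: str, sample_pages: int = 3) -> bool:
    """Check the first few pages for an extractable text layer."""
    with pdfplumber.open(pdf_path) as pdf:
        return any((page.extract_text() or "").strip()
                   for page in pdf.pages[:sample_pages])

def route_document(path: str) -> str:
    """Return the name of the processing pipeline for this upload."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "native_pdf" if has_text_layer(path) else "ocr_background"
    if suffix in (".docx", ".md", ".txt"):
        return "direct_parse"
    if suffix in (".png", ".jpg", ".jpeg", ".tiff"):
        return "image_choice"   # user picks vision-model description or OCR
    return "unsupported"        # fail loudly, not silently
```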
Decision 2: Better Chunking, Not More
We settled on a single chunking strategy that handles most cases:
def smart_chunk(document):
    # Start with structural boundaries (headings, paragraphs)
    chunks = split_by_structure(document)

    # Merge tiny chunks, split huge chunks
    chunks = normalize_chunk_sizes(chunks, min=200, max=800)

    # Add overlap for context continuity
    chunks = add_overlap(chunks, tokens=50)

    # Preserve hierarchy metadata
    chunks = add_hierarchy_context(chunks)

    return chunks
Not fancy. But robust.
Decision 3: One Embedding Model, Chosen Carefully
We benchmarked extensively and chose one model:
- Good accuracy across document types
- Reasonable speed
- Cost-effective
- Reliable API
More models didn't improve results enough to justify complexity.
Decision 4: Hybrid Search, Simple Implementation
Vector search (semantic) + Keyword search (BM25)
→ Reciprocal Rank Fusion
→ Top results
No re-ranking. No query expansion. These added latency without proportional accuracy gains for our use case.
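The fusion step really is this small. A sketch of Reciprocal Rank Fusion, assuming each search leg returns an ordered list of chunk IDs; k=60 is the constant commonly used in the RRF literature.

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked result lists (e.g. vector and BM25) with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: rrf_fuse([vector_hits, keyword_hits]) returns the fused top-10 chunk IDs
```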
Decision 5: Agentic Retrieval
Instead of fixed retrieval:
def agentic_retrieve(query, document_collection):
    # Understand what we're looking for
    analysis = analyze_query(query)

    # First retrieval attempt
    results = search(query, document_collection, k=5)

    # Check if sufficient
    if not sufficient(query, results):
        # Reformulate and try again
        refined_query = refine_query(query, analysis, results)
        more_results = search(refined_query, document_collection, k=5)
        results = combine_and_dedupe(results, more_results)

        # Check again
        if still_not_sufficient(query, results):
            # Return partial results with an honest "incomplete" flag
            return results, "partial"

    return results, "high"
This handles complex queries without over-engineering simple ones.
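The sufficient() check in the sketch above is deliberately cheap. One plausible implementation (an assumption on our part, not the exact production heuristic) is a score threshold plus a keyword-coverage test; the thresholds below are illustrative.

```python
def sufficient(query: str, results: list[dict],
               min_score: float = 0.55, min_hits: int = 3) -> bool:
    """Cheap heuristic: enough reasonably-scored chunks that mention the query terms.

    `results` is assumed to be a list of {"text": str, "score": float} dicts.
    """
    strong = [r for r in results if r["score"] >= min_score]
    if len(strong) < min_hits:
        return False
    query_terms = {t.lower() for t in query.split() if len(t) > 3}
    covered = {t for r in strong for t in query_terms if t in r["text"].lower()}
    return len(covered) >= max(1, len(query_terms) // 2)
```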
Production Results
After V3 launched:
| Metric | V1 | V2 | V3 |
|---|---|---|---|
| Processing time (median) | 3s | 8min | 5s |
| Query latency (p50) | 1.2s | 4.5s | 1.8s |
| Answer accuracy | 62% | 78% | 75% |
| User satisfaction | 3.2/5 | 3.1/5 | 4.1/5 |
| Bugs per week | 12 | 23 | 4 |
| Embedding cost/doc | $0.002 | $0.008 | $0.003 |
V3 has slightly lower accuracy than V2 but much higher satisfaction. Speed matters more than we realized.
What We Learned
Lesson 1: Chunking Is Not Solved
Everyone thinks chunking is a solved problem. "Just split every 500 tokens."
It's not solved. Different documents need different approaches:
- Legal contracts: Don't split clauses
- Technical docs: Keep code blocks together
- Research papers: Section boundaries matter
- Chat logs: Speaker turns are natural breaks
We settled on structural chunking with size normalization. Not perfect for any document type, good enough for most.
Lesson 2: OCR Is Harder Than It Looks
"Just use Tesseract" they said.
Tesseract struggles with:
- Multi-column layouts
- Tables
- Handwriting
- Poor quality scans
- Non-English text
We use Tesseract as default (free) with option to upgrade to Google Vision for complex documents. Users who need better OCR can choose it.
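A sketch of that default path, assuming pytesseract and pdf2image are installed: prefer the native text layer, and only rasterize and OCR when there isn't one. The Google Vision upgrade would slot in where Tesseract is called.

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text(pdf_path: str) -> str:
    """Prefer the native text layer; fall back to Tesseract OCR page by page."""
    with pdfplumber.open(pdf_path) as pdf:
        native = "\n".join((page.extract_text() or "") for page in pdf.pages)
    if native.strip():
        return native  # text-native PDF: fast and free

    # Scanned PDF: rasterize and OCR (slow, so run in the background and notify the user)
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in pages)
```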
Lesson 3: Embeddings Are Commoditized
We spent weeks comparing embedding models. The differences were marginal for our use case.
What matters more:
- Chunking quality (garbage in, garbage out)
- Search strategy (hybrid beats vector-only)
- Result evaluation (did we find what we needed?)
Pick a good embedding model and move on.
Lesson 4: Hybrid Search Is Worth It
Vector search alone misses:
- Exact phrases ("error code 0x8007045D")
- Proper nouns ("John Smith's proposal")
- Numbers and codes
Keyword search alone misses:
- Conceptual questions ("What's our refund policy?")
- Paraphrased queries
- Synonyms
Hybrid catches both. The fusion logic doesn't need to be complex—simple RRF works.
Lesson 5: Speed Beats Accuracy (Sometimes)
Our V2 was more accurate than V3. But users preferred V3.
Why? A 75% accurate answer in two seconds is more useful than a 78% accurate answer in five. Users iterate. They ask follow-ups. Speed enables that workflow.
The right tradeoff depends on use case. For enterprise compliance, accuracy wins. For quick document lookup, speed wins.
Lesson 6: Graceful Degradation > Silent Failure
V1 failed silently. V3 fails loudly and gracefully:
❌ V1: "I don't have information about that"
(Document failed to process, user doesn't know)
✅ V3: "This document is still processing (2 min remaining).
For faster results, try uploading text-based PDFs."
(User knows what's happening and how to help)
Tell users what's happening. Give them options.
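In practice that means returning a status object the frontend can render instead of a fake "no information" answer. A minimal sketch; the field names and the `doc` attributes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentStatus:
    state: str                        # "ready" | "processing" | "failed"
    detail: str                       # human-readable message shown to the user
    eta_seconds: Optional[int] = None
    suggestion: Optional[str] = None

def status_for(doc) -> DocumentStatus:
    # `doc` is assumed to expose processing/failed flags, an ETA, and a name
    if doc.processing:
        return DocumentStatus(
            state="processing",
            detail="This document is still processing.",
            eta_seconds=doc.eta_seconds,
            suggestion="For faster results, try uploading text-based PDFs.",
        )
    if doc.failed:
        return DocumentStatus(
            state="failed",
            detail=f"We couldn't extract text from {doc.name}.",
            suggestion="Try re-exporting it as a text-based PDF.",
        )
    return DocumentStatus(state="ready", detail="Ready to answer questions.")
```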
Lesson 7: Logging Saves Lives
In V2, bugs were nightmares. We didn't know where in the 8-component pipeline things failed.
V3 logs everything:
- Processing steps with timing
- Retrieval results with scores
- Chunks considered and selected
- Confidence assessments
When something goes wrong, we can trace exactly what happened.
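A sketch of that per-step instrumentation using only the standard logging module; the step names and context fields are illustrative.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag.pipeline")

@contextmanager
def step(name: str, **context):
    """Log the duration and outcome of one pipeline step."""
    start = time.perf_counter()
    try:
        yield
        logger.info("step=%s status=ok duration_ms=%.0f ctx=%s",
                    name, (time.perf_counter() - start) * 1000, context)
    except Exception:
        logger.exception("step=%s status=error duration_ms=%.0f ctx=%s",
                         name, (time.perf_counter() - start) * 1000, context)
        raise

# Usage:
# with step("retrieval", query_id=qid, k=5):
#     results = hybrid_search(query)
```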
Lesson 8: Start Simple, Complicate Only When Data Proves Necessity
We added re-ranking to V2 because papers said it helped. It added 2 seconds of latency. When we measured, it improved accuracy by 3%.
Not worth it for our use case. We removed it.
Every component must justify its existence with real production data.
Lesson 9: User Feedback > Benchmarks
We optimized for MTEB benchmarks. Users didn't care.
What users actually complained about:
- "It's slow"
- "Why can't I upload Word docs?"
- "The page numbers are wrong"
- "It ignores my PDFs with images"
Benchmark-driven and user-driven development produce different systems.
Lesson 10: Documents Are Messy
Academic papers on RAG use clean datasets. Real documents are:
- Scanned at weird angles
- Missing pages
- Password protected (surprise!)
- Corrupted
- In unexpected formats (PDF that's actually a ZIP)
- Too large
- Too small (single-page images)
Your system needs to handle everything users throw at it.
Our Current Stack
For those who want specifics:
Text Extraction:
- pdfplumber for native PDFs
- Tesseract for OCR (default)
- Google Vision for complex OCR (optional)
- mammoth for Word docs
- marked for Markdown
Chunking:
- Custom semantic chunker
- 400 token target, 200-800 range
- 50 token overlap
- Structure-aware splitting
Embeddings:
- OpenAI text-embedding-3-small
- 1536 dimensions
- Batched processing
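Batched processing looks roughly like this, assuming the official openai Python client (v1+); the batch size is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed chunk texts in batches to cut request overhead and stay under size limits."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",  # 1536 dimensions
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```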
Vector Store:
- PostgreSQL with pgvector
- HNSW index for speed
- Metadata filtering
Keyword Search:
- PostgreSQL full-text search
- Trigram similarity for fuzzy matching
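The Postgres side of this is mostly DDL plus two queries. A sketch using psycopg; the table and column names are illustrative rather than our exact schema, and the two result lists get fused with RRF as described above.

```python
import psycopg

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- enables similarity() for fuzzy matching
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    content text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS chunks_fts_idx
    ON chunks USING gin (to_tsvector('english', content));
"""

def search_candidates(conn: psycopg.Connection, query: str,
                      query_embedding: list[float], k: int = 20):
    """Run the vector and keyword legs separately; fuse the two ranked ID lists with RRF."""
    emb_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    vector_hits = conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (emb_literal, k),
    ).fetchall()
    keyword_hits = conn.execute(
        """
        SELECT id FROM chunks
        WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
        ORDER BY ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', %s)) DESC
        LIMIT %s
        """,
        (query, query, k),
    ).fetchall()
    return [row[0] for row in vector_hits], [row[0] for row in keyword_hits]
```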
Retrieval:
- Hybrid search with RRF fusion
- Agentic iteration for complex queries
- Top 5-10 results depending on query complexity
Generation:
- Claude or GPT-4 depending on user preference
- Structured prompts with retrieved chunks
- Source citation required
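A sketch of how retrieved chunks get assembled into the generation prompt, with citation markers the model is told to use; the instruction wording and chunk fields are illustrative.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt; each chunk carries its source so citations can be checked.

    `chunks` is assumed to be a list of {"text", "source", "page"} dicts.
    """
    context = "\n\n".join(
        f"[{i + 1}] (source: {c['source']}, page {c['page']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite sources as [n] after each claim. If the context is insufficient, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```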
The Ongoing Journey
V3 isn't perfect. We're still improving:
Current focus:
- Better table extraction
- Multi-document queries ("Compare these two contracts")
- Image understanding within documents
- Faster processing for large uploads
Future exploration:
- Fine-tuned embedding models for specific domains
- Graph-based relationships between document sections
- Caching for common queries
- Predictive pre-retrieval
RAG is not a solved problem. It's an evolving practice.
Advice for Builders
If you're building a RAG system:
- Start with the simplest thing that could work. Add complexity only when you have evidence it helps.
- Build robust error handling from day one. Users will upload things you didn't expect.
- Instrument everything. You'll need the data to debug and improve.
- Talk to users. Benchmarks lie. User feedback doesn't.
- Speed matters more than you think. A fast, good-enough system beats a slow, perfect one.
- Plan for cost. Embedding API calls add up. Build in limits and monitoring.
- Chunk carefully. This is where most RAG systems fail. Don't take shortcuts.
- Test with real documents. Not just the clean ones. The messy, scanned, weird ones.
- Be honest about uncertainty. Don't generate confident-sounding wrong answers.
- Iterate. Your first version will be wrong. Plan for V2 and V3.
Want to see the result of three rebuilds? NovaKit's Document Chat processes your documents and answers questions—with the speed, accuracy, and honesty we learned the hard way.