
We Rebuilt Our RAG System 3 Times: Lessons from Building Document Chat

Building RAG looks easy in tutorials. In production, everything breaks. Here's what we learned rebuilding NovaKit's document processing system from scratch—three times.

The first version took two weeks. The second version took a month. The third version took three months.

Each time, we thought we'd figured it out. Each time, production taught us otherwise.

Building RAG looks deceptively simple. Chunk documents, create embeddings, vector search, done. The tutorials make it look like a weekend project.

It's not.

This is the story of building NovaKit's document chat system—including every mistake we made and what we learned.

Version 1: The Tutorial Implementation

We followed the standard recipe:

Architecture

PDF Upload → Text Extraction → Fixed Chunks (500 tokens)
    → OpenAI Embeddings → Pinecone → Done

Simple. Clean. By the book.
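That "fixed chunks" step really is as blunt as it sounds. A minimal sketch, using tiktoken for the token counting (illustrative, not our exact code):

import tiktoken

def fixed_chunks(text: str, size: int = 500) -> list[str]:
    # V1-style chunking: split on a raw token count, ignoring document structure entirely
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), size)]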

What Worked

For a demo, it was impressive:

  • Upload a PDF
  • Ask questions
  • Get answers with sources

Our internal testing looked great. We shipped it.

What Broke in Production

Day 1: Users uploaded scanned PDFs. Zero text extracted. System returned "I don't have any information about that" for every question.

Day 3: A user uploaded a 500-page technical manual. We hit Pinecone's metadata limits. Chunks started failing silently. Random documents became unsearchable.

Day 7: Someone asked a question spanning two chunks. The answer was in the document, split across chunk boundaries. System found one half, generated a wrong answer from incomplete information.

Day 14: Memory explosion. A single user uploaded 200 documents. Embedding costs: $47. Their monthly plan: $9.

Day 21: "Why does it keep citing the wrong page numbers?"

The Humbling Realization

Our tutorial implementation worked for:

  • Clean, text-native PDFs
  • Small document sets
  • Simple questions
  • Demos

It failed for:

  • Real user documents (scanned, messy, diverse)
  • Real scale (hundreds of documents)
  • Real questions (complex, multi-part)
  • Real budgets (need cost efficiency)

Time to rebuild.

Version 2: The Over-Engineered Rebuild

The pendulum swung hard.

Architecture

Upload → Format Detection → OCR Pipeline → Layout Analysis
    → Intelligent Chunking → Multiple Embedding Models
    → Hybrid Vector + Keyword Index → Query Expansion
    → Re-ranking → Response Generation

We added everything:

  • Tesseract OCR for scanned documents
  • LayoutLM for document structure
  • Semantic chunking based on topic shifts
  • Three different embedding models (compared at query time)
  • BM25 keyword index alongside vectors
  • HyDE for query expansion
  • Cross-encoder re-ranking
  • Custom chunk metadata extraction
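For a flavor of what one of those components involves, here is roughly what cross-encoder re-ranking looks like with sentence-transformers (an illustrative model choice, not necessarily the one we used):

from sentence_transformers import CrossEncoder

# Loading a model like this at startup is exactly the kind of thing
# that makes serverless cold starts painful
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, passage) pair, then keep the highest-scoring passages
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]

Accurate, yes. But it's an extra model pass over every candidate, on every query.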

What Worked

Accuracy improved. Significantly.

The complex queries that broke V1 now worked. Scanned PDFs processed correctly. Page numbers were accurate.

What Broke in Production

Week 1: Processing time. A 50-page PDF took 8 minutes. Users refreshed the page, assumed it failed, uploaded again. Processing queue exploded.

Week 2: OCR costs. Tesseract was free but slow. Google Vision was fast but $1.50 per 1000 pages. A user uploaded 10,000 pages of legal documents. Cost: $15. Their plan: $29/month.

Week 3: "Why is search so slow?" Query latency ballooned to 4-6 seconds. Three embedding models × re-ranking × query expansion = slow.

Week 4: Debugging nightmares. When answers were wrong, which component failed? Chunking? Wrong embedding model? Re-ranking messed up? Hours spent on each bug.

Week 5: Cold starts. Serverless functions loading three embedding models: 12-second cold starts. Users thought the site was down.

The Second Humbling Realization

We solved V1's problems by creating new ones:

  • Accuracy ↑ but speed ↓
  • Features ↑ but complexity ↑↑↑
  • Costs ↑
  • Maintainability ↓↓

We'd over-corrected. Time for V3.

Version 3: The Right Balance

Third time, we had scars. We knew what mattered.

Principles

  1. Fail gracefully, not silently
  2. Optimize the common case, handle the edge cases
  3. Complexity must pay for itself
  4. Speed is a feature
  5. Costs must be predictable

Architecture

Upload → Quick Classification → Appropriate Pipeline
    → Smart Chunking → Single Good Embedding Model
    → Hybrid Index → Agentic Retrieval → Response

Simpler than V2. More robust than V1.

Key Decisions

Decision 1: Tiered Processing

Not all documents need the same treatment:

If: Native PDF (has text layer)
    → Direct text extraction (fast, free)

If: Scanned PDF (no text layer)
    → OCR with user warning about time
    → Background processing with notification

If: Word/Markdown/Text
    → Direct parsing (instant)

If: Image
    → Vision model description OR OCR
    → User choice

Processing time for 80% of documents: under 10 seconds.
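The classification step itself is cheap. A rough sketch of the native-vs-scanned check using pdfplumber (the sampling and threshold are illustrative, not our tuned values):

import pdfplumber

def classify_pdf(path: str) -> str:
    # Sample the first few pages; if they yield almost no text, assume it's a scan
    with pdfplumber.open(path) as pdf:
        sample = pdf.pages[:3]
        text = "".join((page.extract_text() or "") for page in sample)
    return "native" if len(text.strip()) > 50 else "scanned"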

Decision 2: Better Chunking, Not More

We settled on a single chunking strategy that handles most cases:

def smart_chunk(document):
    # Start with structural boundaries (headings, paragraphs)
    chunks = split_by_structure(document)

    # Merge tiny chunks, split huge chunks
    chunks = normalize_chunk_sizes(chunks, min=200, max=800)

    # Add overlap for context continuity
    chunks = add_overlap(chunks, tokens=50)

    # Preserve hierarchy metadata
    chunks = add_hierarchy_context(chunks)

    return chunks

Not fancy. But robust.
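The helpers are where the actual work happens. Here's a simplified sketch of the size-normalization step, using a whitespace token count as a stand-in for the real tokenizer:

def normalize_chunk_sizes(chunks, min=200, max=800):
    # Rough token proxy; the real version counts tokens with the embedding model's tokenizer
    def size(chunk):
        return len(chunk.split())

    # Pass 1: merge chunks that fall below the floor into their predecessor
    merged = []
    for chunk in chunks:
        if merged and size(merged[-1]) < min:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)

    # Pass 2: split chunks that exceed the ceiling
    normalized = []
    for chunk in merged:
        words = chunk.split()
        if len(words) <= max:
            normalized.append(chunk)
        else:
            for i in range(0, len(words), max):
                normalized.append(" ".join(words[i:i + max]))
    return normalized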

Decision 3: One Embedding Model, Chosen Carefully

We benchmarked extensively and chose one model:

  • Good accuracy across document types
  • Reasonable speed
  • Cost-effective
  • Reliable API

More models didn't improve results enough to justify complexity.

Decision 4: Hybrid Search, Simple Implementation

Vector search (semantic) + Keyword search (BM25)
    → Reciprocal Rank Fusion
    → Top results

No re-ranking. No query expansion. These added latency without proportional accuracy gains for our use case.
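Reciprocal Rank Fusion itself is only a few lines. A sketch, with the conventional k = 60 constant:

def rrf_fuse(result_lists, k=60):
    # Each list is ordered best-first; a result's fused score is the sum of
    # 1 / (k + rank) across every list it appears in
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([vector_results, keyword_results])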

Decision 5: Agentic Retrieval

Instead of a fixed, single-pass retrieval:

def agentic_retrieve(query, document_collection):
    # Understand what we're looking for (intent, expected answer shape)
    analysis = analyze_query(query)

    # First retrieval attempt
    results = search(query, document_collection, k=5)

    # Check whether the results actually answer the question
    if not sufficient(query, results):
        # Reformulate and try again
        refined_query = refine_query(query, results, analysis)
        more_results = search(refined_query, document_collection, k=5)
        results = combine_and_dedupe(results, more_results)

    # Check again
    if not sufficient(query, results):
        # Return partial results with an honest "incomplete" flag
        return results, "partial"

    return results, "high"

This handles complex queries without over-engineering simple ones.
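The sufficient() check can start out embarrassingly simple: a heuristic over scores and term coverage before reaching for anything fancier. A sketch, assuming each result is a dict with "text" and "score" keys (the thresholds are illustrative):

def sufficient(query, results, min_score=0.5, min_coverage=0.5):
    # No results, or uniformly weak scores, is an easy "no"
    if not results:
        return False
    if max(r["score"] for r in results) < min_score:
        return False

    # Do the distinctive query terms show up anywhere in the retrieved text?
    terms = {t.lower() for t in query.split() if len(t) > 3}
    if not terms:
        return True
    text = " ".join(r["text"] for r in results).lower()
    covered = sum(1 for t in terms if t in text)
    return covered / len(terms) >= min_coverage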

Production Results

After V3 launched:

Metric                     V1       V2       V3
Processing time (median)   3s       8 min    5s
Query latency (p50)        1.2s     4.5s     1.8s
Answer accuracy            62%      78%      75%
User satisfaction          3.2/5    3.1/5    4.1/5
Bugs per week              12       23       4
Embedding cost/doc         $0.002   $0.008   $0.003

V3 has slightly lower accuracy than V2 but much higher satisfaction. Speed matters more than we realized.

What We Learned

Lesson 1: Chunking Is Not Solved

Everyone thinks chunking is a solved problem. "Just split every 500 tokens."

It's not solved. Different documents need different approaches:

  • Legal contracts: Don't split clauses
  • Technical docs: Keep code blocks together
  • Research papers: Section boundaries matter
  • Chat logs: Speaker turns are natural breaks

We settled on structural chunking with size normalization. Not perfect for any document type, good enough for most.

Lesson 2: OCR Is Harder Than It Looks

"Just use Tesseract" they said.

Tesseract struggles with:

  • Multi-column layouts
  • Tables
  • Handwriting
  • Poor quality scans
  • Non-English text

We use Tesseract as the default (free), with the option to upgrade to Google Vision for complex documents. Users who need better OCR can choose it.

Lesson 3: Embeddings Are Commoditized

We spent weeks comparing embedding models. The differences were marginal for our use case.

What matters more:

  • Chunking quality (garbage in, garbage out)
  • Search strategy (hybrid beats vector-only)
  • Result evaluation (did we find what we needed?)

Pick a good embedding model and move on.

Lesson 4: Hybrid Search Is Worth It

Vector search alone misses:

  • Exact phrases ("error code 0x8007045D")
  • Proper nouns ("John Smith's proposal")
  • Numbers and codes

Keyword search alone misses:

  • Conceptual questions ("What's our refund policy?")
  • Paraphrased queries
  • Synonyms

Hybrid catches both. The fusion logic doesn't need to be complex—simple RRF works.

Lesson 5: Speed Beats Accuracy (Sometimes)

Our V2 was more accurate than V3. But users preferred V3.

Why? A 75% accurate answer in 2 seconds is more useful than an 80% accurate answer in 6 seconds. Users iterate. They ask follow-ups. Speed enables that workflow.

The right tradeoff depends on use case. For enterprise compliance, accuracy wins. For quick document lookup, speed wins.

Lesson 6: Graceful Degradation > Silent Failure

V1 failed silently. V3 fails loudly and gracefully:

❌ V1: "I don't have information about that"
   (Document failed to process, user doesn't know)

✅ V3: "This document is still processing (2 min remaining).
        For faster results, try uploading text-based PDFs."
   (User knows what's happening and how to help)

Tell users what's happening. Give them options.

Lesson 7: Logging Saves Lives

In V2, bugs were nightmares. We didn't know where in the 8-component pipeline things failed.

V3 logs everything:

  • Processing steps with timing
  • Retrieval results with scores
  • Chunks considered and selected
  • Confidence assessments

When something goes wrong, we can trace exactly what happened.
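Most of it can be as simple as a timing wrapper that emits one structured record per pipeline stage. A minimal sketch using the standard library:

import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag_pipeline")

@contextmanager
def log_step(step, **context):
    # Time a pipeline stage and emit one structured log line when it finishes
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps({"step": step, "ms": elapsed_ms, **context}))

# Usage:
# with log_step("retrieval", query_id=query_id, k=5):
#     results = search(query, document_collection, k=5)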

Lesson 8: Start Simple, Complicate Only When Data Proves Necessity

We added re-ranking to V2 because papers said it helped. It added 2 seconds of latency. When we measured, it improved accuracy by 3%.

Not worth it for our use case. We removed it.

Every component must justify its existence with real production data.

Lesson 9: User Feedback > Benchmarks

We optimized for MTEB benchmarks. Users didn't care.

What users actually complained about:

  • "It's slow"
  • "Why can't I upload Word docs?"
  • "The page numbers are wrong"
  • "It ignores my PDFs with images"

Benchmark-driven development and user-driven development produce very different systems.

Lesson 10: Documents Are Messy

Academic papers on RAG use clean datasets. Real documents are:

  • Scanned at weird angles
  • Missing pages
  • Password protected (surprise!)
  • Corrupted
  • In unexpected formats (PDF that's actually a ZIP)
  • Too large
  • Too small (single-page images)

Your system needs to handle everything users throw at it.

Our Current Stack

For those who want specifics:

Text Extraction:

  • pdfplumber for native PDFs
  • Tesseract for OCR (default)
  • Google Vision for complex OCR (optional)
  • mammoth for Word docs
  • marked for Markdown

Chunking:

  • Custom semantic chunker
  • 400 token target, 200-800 range
  • 50 token overlap
  • Structure-aware splitting

Embeddings:

  • OpenAI text-embedding-3-small
  • 1536 dimensions
  • Batched processing
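The batching is nothing clever; a minimal sketch with the OpenAI Python SDK (the batch size is illustrative):

from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks, batch_size=100):
    # The embeddings endpoint accepts a list of inputs, so send chunks in batches
    vectors = []
    for i in range(0, len(chunks), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunks[i:i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors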

Vector Store:

  • PostgreSQL with pgvector
  • HNSW index for speed
  • Metadata filtering

Keyword Search:

  • PostgreSQL full-text search
  • Trigram similarity for fuzzy matching
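Both halves of the hybrid search are plain Postgres queries. Illustrative versions (the table and column names are made up for the example, not our schema); the two result lists then go through RRF fusion as shown earlier:

# Vector side: pgvector cosine distance, accelerated by an HNSW index
# (CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);)
VECTOR_QUERY = """
    SELECT id, content, 1 - (embedding <=> %(query_vec)s::vector) AS score
    FROM chunks
    WHERE document_id = ANY(%(doc_ids)s)
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT 20;
"""

# Keyword side: built-in full-text search with ts_rank scoring
KEYWORD_QUERY = """
    SELECT id, content,
           ts_rank(tsv, plainto_tsquery('english', %(query)s)) AS score
    FROM chunks
    WHERE tsv @@ plainto_tsquery('english', %(query)s)
      AND document_id = ANY(%(doc_ids)s)
    ORDER BY score DESC
    LIMIT 20;
"""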

Retrieval:

  • Hybrid search with RRF fusion
  • Agentic iteration for complex queries
  • Top 5-10 results depending on query complexity

Generation:

  • Claude or GPT-4 depending on user preference
  • Structured prompts with retrieved chunks
  • Source citation required

The Ongoing Journey

V3 isn't perfect. We're still improving:

Current focus:

  • Better table extraction
  • Multi-document queries ("Compare these two contracts")
  • Image understanding within documents
  • Faster processing for large uploads

Future exploration:

  • Fine-tuned embedding models for specific domains
  • Graph-based relationships between document sections
  • Caching for common queries
  • Predictive pre-retrieval

RAG is not a solved problem. It's an evolving practice.

Advice for Builders

If you're building a RAG system:

  1. Start with the simplest thing that could work. Add complexity only when you have evidence it helps.

  2. Build robust error handling from day one. Users will upload things you didn't expect.

  3. Instrument everything. You'll need the data to debug and improve.

  4. Talk to users. Benchmarks lie. User feedback doesn't.

  5. Speed matters more than you think. A fast, good-enough system beats a slow, perfect one.

  6. Plan for cost. Embedding API calls add up. Build in limits and monitoring.

  7. Chunk carefully. This is where most RAG systems fail. Don't take shortcuts.

  8. Test with real documents. Not just the clean ones. The messy, scanned, weird ones.

  9. Be honest about uncertainty. Don't generate confident-sounding wrong answers.

  10. Iterate. Your first version will be wrong. Plan for V2 and V3.


Want to see the result of three rebuilds? NovaKit's Document Chat processes your documents and answers questions—with the speed, accuracy, and honesty we learned the hard way.
