We Rebuilt Our RAG System 3 Times: Lessons from Building Document Chat
Building RAG looks easy in tutorials. In production, everything breaks. Here's what we learned rebuilding NovaKit's document processing system from scratch—three times.
The first version took two weeks. The second version took a month. The third version took three months.
Each time, we thought we'd figured it out. Each time, production taught us otherwise.
Building RAG looks deceptively simple. Chunk documents, create embeddings, vector search, done. The tutorials make it look like a weekend project.
It's not.
This is the story of building NovaKit's document chat system—including every mistake we made and what we learned.
Version 1: The Tutorial Implementation
We followed the standard recipe:
Architecture
PDF Upload → Text Extraction → Fixed Chunks (500 tokens)
→ OpenAI Embeddings → Pinecone → Done
Simple. Clean. By the book.
What Worked
For a demo, it was impressive:
- Upload a PDF
- Ask questions
- Get answers with sources
Our internal testing looked great. We shipped it.
What Broke in Production
Day 1: Users uploaded scanned PDFs. Zero text extracted. System returned "I don't have any information about that" for every question.
Day 3: A user uploaded a 500-page technical manual. We hit Pinecone's metadata limits. Chunks started failing silently. Random documents became unsearchable.
Day 7: Someone asked a question spanning two chunks. The answer was in the document, split across chunk boundaries. System found one half, generated a wrong answer from incomplete information.
Day 14: Cost explosion. A single user uploaded 200 documents. Embedding costs: $47. Their monthly plan: $9.
Day 21: "Why does it keep citing the wrong page numbers?"
The Humbling Realization
Our tutorial implementation worked for:
- Clean, text-native PDFs
- Small document sets
- Simple questions
- Demos
It failed for:
- Real user documents (scanned, messy, diverse)
- Real scale (hundreds of documents)
- Real questions (complex, multi-part)
- Real budgets (need cost efficiency)
Time to rebuild.
Version 2: The Over-Engineered Rebuild
The pendulum swung hard.
Architecture
Upload → Format Detection → OCR Pipeline → Layout Analysis
→ Intelligent Chunking → Multiple Embedding Models
→ Hybrid Vector + Keyword Index → Query Expansion
→ Re-ranking → Response Generation
We added everything:
- Tesseract OCR for scanned documents
- LayoutLM for document structure
- Semantic chunking based on topic shifts
- Three different embedding models (compared at query time)
- BM25 keyword index alongside vectors
- HyDE for query expansion
- Cross-encoder re-ranking
- Custom chunk metadata extraction
What Worked
Accuracy improved. Significantly.
The complex queries that broke V1 now worked. Scanned PDFs processed correctly. Page numbers were accurate.
What Broke in Production
Week 1: Processing time. A 50-page PDF took 8 minutes. Users refreshed the page, assumed it failed, uploaded again. Processing queue exploded.
Week 2: OCR costs. Tesseract was free but slow. Google Vision was fast but $1.50 per 1000 pages. A user uploaded 10,000 pages of legal documents. Cost: $15. Their plan: $29/month.
Week 3: "Why is search so slow?" Query latency ballooned to 4-6 seconds. Three embedding models × re-ranking × query expansion = slow.
Week 4: Debugging nightmares. When answers were wrong, which component failed? Chunking? Wrong embedding model? Re-ranking messed up? Hours spent on each bug.
Week 5: Cold starts. Serverless functions loading three embedding models: 12-second cold starts. Users thought the site was down.
The Second Humbling Realization
We solved V1's problems by creating new ones:
- Accuracy ↑ but speed ↓
- Features ↑ but complexity ↑↑↑
- Costs ↑
- Maintainability ↓↓
We'd over-corrected. Time for V3.
Version 3: The Right Balance
Third time, we had scars. We knew what mattered.
Principles
- Fail gracefully, not silently
- Optimize the common case, handle the edge cases
- Complexity must pay for itself
- Speed is a feature
- Costs must be predictable
Architecture
Upload → Quick Classification → Appropriate Pipeline
→ Smart Chunking → Single Good Embedding Model
→ Hybrid Index → Agentic Retrieval → Response
Simpler than V2. More robust than V1.
Key Decisions
Decision 1: Tiered Processing
Not all documents need the same treatment:
If: Native PDF (has text layer)
→ Direct text extraction (fast, free)
If: Scanned PDF (no text layer)
→ OCR with user warning about time
→ Background processing with notification
If: Word/Markdown/Text
→ Direct parsing (instant)
If: Image
→ Vision model description OR OCR
→ User choice
Processing time for 80% of documents: under 10 seconds.
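A minimal sketch of the routing step, assuming pdfplumber is installed; the pipeline names and the "sample the first few pages" heuristic are illustrative, not our exact production classifier.

```python
from pathlib import Path
import pdfplumber

def has_text_layer(pdf_path: str, sample_pages: int = 3) -> bool:
    """Check the first few pages for an extractable text layer."""
    with pdfplumber.open(pdf_path) as pdf:
        return any((page.extract_text() or "").strip()
                   for page in pdf.pages[:sample_pages])

def route_document(path: str) -> str:
    """Return the name of the processing pipeline for this upload."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "native_pdf" if has_text_layer(path) else "ocr_background"
    if suffix in (".docx", ".md", ".txt"):
        return "direct_parse"
    if suffix in (".png", ".jpg", ".jpeg", ".tiff"):
        return "image_choice"   # user picks vision-model description or OCR
    return "unsupported"        # fail loudly, not silently
```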
Decision 2: Better Chunking, Not More
We settled on a single chunking strategy that handles most cases:
def smart_chunk(document):
    # Start with structural boundaries (headings, paragraphs)
    chunks = split_by_structure(document)

    # Merge tiny chunks, split huge chunks
    chunks = normalize_chunk_sizes(chunks, min=200, max=800)

    # Add overlap for context continuity
    chunks = add_overlap(chunks, tokens=50)

    # Preserve hierarchy metadata
    chunks = add_hierarchy_context(chunks)

    return chunks
Not fancy. But robust.
Decision 3: One Embedding Model, Chosen Carefully
We benchmarked extensively and chose one model:
- Good accuracy across document types
- Reasonable speed
- Cost-effective
- Reliable API
More models didn't improve results enough to justify complexity.
Decision 4: Hybrid Search, Simple Implementation
Vector search (semantic) + Keyword search (BM25)
→ Reciprocal Rank Fusion
→ Top results
No re-ranking. No query expansion. These added latency without proportional accuracy gains for our use case.
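The fusion step really is this small. A sketch of Reciprocal Rank Fusion, assuming each search leg returns an ordered list of chunk IDs; k=60 is the constant commonly used in the RRF literature.

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked result lists (e.g. vector and BM25) with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: rrf_fuse([vector_hits, keyword_hits]) returns the fused top-10 chunk IDs
```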
Decision 5: Agentic Retrieval
Instead of fixed retrieval:
def agentic_retrieve(query, document_collection):
    # Understand what we're looking for
    analysis = analyze_query(query)

    # First retrieval attempt
    results = search(query, document_collection, k=5)

    # Check if sufficient
    if not sufficient(query, results):
        # Reformulate and try again
        refined_query = refine_query(query, analysis, results)
        more_results = search(refined_query, document_collection, k=5)
        results = combine_and_dedupe(results, more_results)

        # Check again
        if still_not_sufficient(query, results):
            # Return partial results with an honest "incomplete" flag
            return results, "partial"

    return results, "high"
This handles complex queries without over-engineering simple ones.
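The sufficient() check in the sketch above is deliberately cheap. One plausible implementation (an assumption on our part, not the exact production heuristic) is a score threshold plus a keyword-coverage test; the thresholds below are illustrative.

```python
def sufficient(query: str, results: list[dict],
               min_score: float = 0.55, min_hits: int = 3) -> bool:
    """Cheap heuristic: enough reasonably-scored chunks that mention the query terms.

    `results` is assumed to be a list of {"text": str, "score": float} dicts.
    """
    strong = [r for r in results if r["score"] >= min_score]
    if len(strong) < min_hits:
        return False
    query_terms = {t.lower() for t in query.split() if len(t) > 3}
    covered = {t for r in strong for t in query_terms if t in r["text"].lower()}
    return len(covered) >= max(1, len(query_terms) // 2)
```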
Production Results
After V3 launched:
| Metric | V1 | V2 | V3 |
|---|---|---|---|
| Processing time (median) | 3s | 8min | 5s |
| Query latency (p50) | 1.2s | 4.5s | 1.8s |
| Answer accuracy | 62% | 78% | 75% |
| User satisfaction | 3.2/5 | 3.1/5 | 4.1/5 |
| Bugs per week | 12 | 23 | 4 |
| Embedding cost/doc | $0.002 | $0.008 | $0.003 |
V3 has slightly lower accuracy than V2 but much higher satisfaction. Speed matters more than we realized.
What We Learned
Lesson 1: Chunking Is Not Solved
Everyone thinks chunking is a solved problem. "Just split every 500 tokens."
It's not solved. Different documents need different approaches:
- Legal contracts: Don't split clauses
- Technical docs: Keep code blocks together
- Research papers: Section boundaries matter
- Chat logs: Speaker turns are natural breaks
We settled on structural chunking with size normalization. Not perfect for any document type, good enough for most.
Lesson 2: OCR Is Harder Than It Looks
"Just use Tesseract" they said.
Tesseract struggles with:
- Multi-column layouts
- Tables
- Handwriting
- Poor quality scans
- Non-English text
We use Tesseract as default (free) with option to upgrade to Google Vision for complex documents. Users who need better OCR can choose it.
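A sketch of that default path, assuming pytesseract and pdf2image are installed: prefer the native text layer, and only rasterize and OCR when there isn't one. The Google Vision upgrade would slot in where Tesseract is called.

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

def extract_text(pdf_path: str) -> str:
    """Prefer the native text layer; fall back to Tesseract OCR page by page."""
    with pdfplumber.open(pdf_path) as pdf:
        native = "\n".join((page.extract_text() or "") for page in pdf.pages)
    if native.strip():
        return native  # text-native PDF: fast and free

    # Scanned PDF: rasterize and OCR (slow, so run in the background and notify the user)
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in pages)
```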
Lesson 3: Embeddings Are Commoditized
We spent weeks comparing embedding models. The differences were marginal for our use case.
What matters more:
- Chunking quality (garbage in, garbage out)
- Search strategy (hybrid beats vector-only)
- Result evaluation (did we find what we needed?)
Pick a good embedding model and move on.
Lesson 4: Hybrid Search Is Worth It
Vector search alone misses:
- Exact phrases ("error code 0x8007045D")
- Proper nouns ("John Smith's proposal")
- Numbers and codes
Keyword search alone misses:
- Conceptual questions ("What's our refund policy?")
- Paraphrased queries
- Synonyms
Hybrid catches both. The fusion logic doesn't need to be complex—simple RRF works.
Lesson 5: Speed Beats Accuracy (Sometimes)
Our V2 was more accurate than V3. But users preferred V3.
Why? A 75% accurate answer in two seconds is more useful than a 78% accurate answer in five. Users iterate. They ask follow-ups. Speed enables that workflow.
The right tradeoff depends on use case. For enterprise compliance, accuracy wins. For quick document lookup, speed wins.
Lesson 6: Graceful Degradation > Silent Failure
V1 failed silently. V3 fails loudly and gracefully:
❌ V1: "I don't have information about that"
(Document failed to process, user doesn't know)
✅ V3: "This document is still processing (2 min remaining).
For faster results, try uploading text-based PDFs."
(User knows what's happening and how to help)
Tell users what's happening. Give them options.
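In practice that means returning a status object the frontend can render instead of a fake "no information" answer. A minimal sketch; the field names and the `doc` attributes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentStatus:
    state: str                        # "ready" | "processing" | "failed"
    detail: str                       # human-readable message shown to the user
    eta_seconds: Optional[int] = None
    suggestion: Optional[str] = None

def status_for(doc) -> DocumentStatus:
    # `doc` is assumed to expose processing/failed flags, an ETA, and a name
    if doc.processing:
        return DocumentStatus(
            state="processing",
            detail="This document is still processing.",
            eta_seconds=doc.eta_seconds,
            suggestion="For faster results, try uploading text-based PDFs.",
        )
    if doc.failed:
        return DocumentStatus(
            state="failed",
            detail=f"We couldn't extract text from {doc.name}.",
            suggestion="Try re-exporting it as a text-based PDF.",
        )
    return DocumentStatus(state="ready", detail="Ready to answer questions.")
```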
Lesson 7: Logging Saves Lives
In V2, bugs were nightmares. We didn't know where in the 8-component pipeline things failed.
V3 logs everything:
- Processing steps with timing
- Retrieval results with scores
- Chunks considered and selected
- Confidence assessments
When something goes wrong, we can trace exactly what happened.
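A sketch of that per-step instrumentation using only the standard logging module; the step names and context fields are illustrative.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag.pipeline")

@contextmanager
def step(name: str, **context):
    """Log the duration and outcome of one pipeline step."""
    start = time.perf_counter()
    try:
        yield
        logger.info("step=%s status=ok duration_ms=%.0f ctx=%s",
                    name, (time.perf_counter() - start) * 1000, context)
    except Exception:
        logger.exception("step=%s status=error duration_ms=%.0f ctx=%s",
                         name, (time.perf_counter() - start) * 1000, context)
        raise

# Usage:
# with step("retrieval", query_id=qid, k=5):
#     results = hybrid_search(query)
```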
Lesson 8: Start Simple, Complicate Only When Data Proves Necessity
We added re-ranking to V2 because papers said it helped. It added 2 seconds of latency. When we measured, it improved accuracy by 3%.
Not worth it for our use case. We removed it.
Every component must justify its existence with real production data.
Lesson 9: User Feedback > Benchmarks
We optimized for MTEB benchmarks. Users didn't care.
What users actually complained about:
- "It's slow"
- "Why can't I upload Word docs?"
- "The page numbers are wrong"
- "It ignores my PDFs with images"
Benchmark-driven and user-driven development produce different systems.
Lesson 10: Documents Are Messy
Academic papers on RAG use clean datasets. Real documents are:
- Scanned at weird angles
- Missing pages
- Password protected (surprise!)
- Corrupted
- In unexpected formats (PDF that's actually a ZIP)
- Too large
- Too small (single-page images)
Your system needs to handle everything users throw at it.
Our Current Stack
For those who want specifics:
Text Extraction:
- pdfplumber for native PDFs
- Tesseract for OCR (default)
- Google Vision for complex OCR (optional)
- mammoth for Word docs
- marked for Markdown
Chunking:
- Custom semantic chunker
- 400 token target, 200-800 range
- 50 token overlap
- Structure-aware splitting
Embeddings:
- OpenAI text-embedding-3-small
- 1536 dimensions
- Batched processing
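Batched processing looks roughly like this, assuming the official openai Python client (v1+); the batch size is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed chunk texts in batches to cut request overhead and stay under size limits."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",  # 1536 dimensions
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```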
Vector Store:
- PostgreSQL with pgvector
- HNSW index for speed
- Metadata filtering
Keyword Search:
- PostgreSQL full-text search
- Trigram similarity for fuzzy matching
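The Postgres side of this is mostly DDL plus two queries. A sketch using psycopg; the table and column names are illustrative rather than our exact schema, and the two result lists get fused with RRF as described above.

```python
import psycopg

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- enables similarity() for fuzzy matching
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    content text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS chunks_fts_idx
    ON chunks USING gin (to_tsvector('english', content));
"""

def search_candidates(conn: psycopg.Connection, query: str,
                      query_embedding: list[float], k: int = 20):
    """Run the vector and keyword legs separately; fuse the two ranked ID lists with RRF."""
    emb_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    vector_hits = conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (emb_literal, k),
    ).fetchall()
    keyword_hits = conn.execute(
        """
        SELECT id FROM chunks
        WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
        ORDER BY ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', %s)) DESC
        LIMIT %s
        """,
        (query, query, k),
    ).fetchall()
    return [row[0] for row in vector_hits], [row[0] for row in keyword_hits]
```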
Retrieval:
- Hybrid search with RRF fusion
- Agentic iteration for complex queries
- Top 5-10 results depending on query complexity
Generation:
- Claude or GPT-4 depending on user preference
- Structured prompts with retrieved chunks
- Source citation required
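A sketch of how retrieved chunks get assembled into the generation prompt, with citation markers the model is told to use; the instruction wording and chunk fields are illustrative.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt; each chunk carries its source so citations can be checked.

    `chunks` is assumed to be a list of {"text", "source", "page"} dicts.
    """
    context = "\n\n".join(
        f"[{i + 1}] (source: {c['source']}, page {c['page']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite sources as [n] after each claim. If the context is insufficient, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```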
The Ongoing Journey
V3 isn't perfect. We're still improving:
Current focus:
- Better table extraction
- Multi-document queries ("Compare these two contracts")
- Image understanding within documents
- Faster processing for large uploads
Future exploration:
- Fine-tuned embedding models for specific domains
- Graph-based relationships between document sections
- Caching for common queries
- Predictive pre-retrieval
RAG is not a solved problem. It's an evolving practice.
Advice for Builders
If you're building a RAG system:
- Start with the simplest thing that could work. Add complexity only when you have evidence it helps.
- Build robust error handling from day one. Users will upload things you didn't expect.
- Instrument everything. You'll need the data to debug and improve.
- Talk to users. Benchmarks lie. User feedback doesn't.
- Speed matters more than you think. A fast, good-enough system beats a slow, perfect one.
- Plan for cost. Embedding API calls add up. Build in limits and monitoring.
- Chunk carefully. This is where most RAG systems fail. Don't take shortcuts.
- Test with real documents. Not just the clean ones. The messy, scanned, weird ones.
- Be honest about uncertainty. Don't generate confident-sounding wrong answers.
- Iterate. Your first version will be wrong. Plan for V2 and V3.
Want to see the result of three rebuilds? NovaKit's Document Chat processes your documents and answers questions—with the speed, accuracy, and honesty we learned the hard way.