RAG Explained: How AI Answers Questions from Your Documents
A deep dive into Retrieval-Augmented Generation—the technology that lets AI answer questions accurately from your own documents without hallucinating.
When you ask ChatGPT a question, it answers from its training data—knowledge frozen at a point in time. But what if you want AI to answer questions about your documents? Documents it has never seen?
This is where Retrieval-Augmented Generation (RAG) comes in. RAG is the technology that powers document chat, knowledge bases, and enterprise AI assistants. Let's break down how it actually works.
The Problem RAG Solves
Large Language Models (LLMs) have two fundamental limitations:
1. Knowledge Cutoff
Models are trained on data up to a specific date. GPT-4 doesn't know about events after its training cutoff. Your company's internal docs? Never seen them.
2. Hallucination
When LLMs don't know something, they sometimes make things up—confidently. Ask about a document it hasn't seen, and it might fabricate plausible-sounding but false information.
RAG solves both problems by giving the AI access to external knowledge at query time, grounding its responses in actual source material.
How RAG Works: The Pipeline
RAG operates in two phases: indexing (preparation) and retrieval + generation (query time).
Phase 1: Indexing
Before you can chat with a document, it needs to be processed and stored.
Document → Chunking → Embedding → Vector Database
Step 1: Chunking
Documents are split into smaller, semantically meaningful pieces called "chunks."
Why chunk?
- LLMs have context limits (can't process entire books at once)
- Smaller chunks = more precise retrieval
- Each chunk should contain a complete thought
Chunking strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N characters | Simple, predictable pipelines |
| Sentence-based | Split on sentence boundaries | Prose with natural breaks |
| Semantic | Split on topic changes | Accuracy-critical retrieval |
| Recursive | Hierarchical splitting | Long, structured documents |
Example: A 50-page PDF (roughly 30,000-40,000 tokens of text) might become 60-100 chunks of ~500 tokens each, more if the chunks overlap.
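To make the idea concrete, here is a minimal sketch of sentence-based chunking; the 500-character budget is an illustrative assumption, and production chunkers usually count tokens rather than characters.

```python
# A minimal sentence-based chunker: split on sentence boundaries, then
# pack sentences into chunks up to a rough size budget.
import re

def chunk_by_sentences(text: str, max_chars: int = 500) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```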
Step 2: Embedding
Each chunk is converted into a vector embedding: a list of numbers (1536 dimensions for OpenAI's text-embedding-3-small; other models vary) that represents its semantic meaning.
# Conceptual example
chunk = "The quarterly revenue increased by 35%..."
embedding = embed(chunk)
# → [0.023, -0.156, 0.089, ..., 0.042] # 1536 numbers
Key insight: Similar meanings produce similar vectors. "Revenue grew" and "Sales increased" will have embeddings close together in vector space, even though the words are different.
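In real systems, the embed() call maps to an embedding API. A minimal sketch using the OpenAI Python SDK, assuming the text-embedding-3-small model mentioned later in this post and an OPENAI_API_KEY in the environment:

```python
# Illustrative embedding helper using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

vector = embed("The quarterly revenue increased by 35%...")
print(len(vector))  # 1536
```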
Step 3: Vector Storage
Embeddings are stored in a vector database optimized for similarity search.
Popular options:
- Pinecone
- Weaviate
- Qdrant
- Supabase pgvector (what NovaKit uses)
- Chroma
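To make the storage step concrete, here is a minimal sketch of writing chunks into Postgres with the pgvector extension via psycopg2. The table schema, connection string, and sample chunks are illustrative assumptions, not NovaKit's actual setup.

```python
# Illustrative sketch: store each chunk's text and embedding in a
# pgvector column. embed() is the embedding helper sketched above.
import psycopg2

conn = psycopg2.connect("dbname=rag_demo")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding VECTOR(1536)
    );
""")

sample_chunks = [
    "The quarterly revenue increased by 35%, reaching $2.8M...",
    "Growth was primarily driven by enterprise expansion...",
]
for chunk in sample_chunks:
    vector_literal = "[" + ",".join(str(x) for x in embed(chunk)) + "]"
    cur.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
        (chunk, vector_literal),
    )
conn.commit()
```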
Phase 2: Retrieval + Generation
When you ask a question, the magic happens:
Question → Embed → Search → Retrieve Chunks → Generate Answer
Step 1: Embed the Query
Your question is converted to a vector using the same embedding model:
question = "What was the revenue growth?"
query_embedding = embed(question)
Step 2: Semantic Search
The query vector is compared against all chunk vectors to find the most similar ones.
Similarity metrics:
- Cosine similarity: Measures angle between vectors (most common)
- Euclidean distance: Measures straight-line distance
- Dot product: Faster, works well with normalized vectors
-- Simplified pgvector query
SELECT content, 1 - (embedding <=> query_embedding) as similarity
FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;
This returns the 5 most semantically similar chunks—even if they don't share exact keywords with your question.
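For intuition, the most common metric is easy to write out by hand. A short numpy sketch; pgvector's <=> operator returns the corresponding cosine distance, which is why the SQL above computes 1 - (a <=> b):

```python
# Cosine similarity: 1.0 means the vectors point the same way,
# values near 0 mean they are unrelated.
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```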
Step 3: Context Augmentation
Retrieved chunks are inserted into the LLM prompt as context:
System: You are a helpful assistant. Answer questions based on the
following context. Cite sources using [1], [2], etc.
Context:
[1] "The quarterly revenue increased by 35%, reaching $2.8M..."
[2] "Growth was primarily driven by enterprise expansion..."
[3] "CEO stated targets remain aggressive but achievable..."
User: What was the revenue growth?
Step 4: Generation
The LLM generates an answer grounded in the provided context:
Based on the quarterly report, revenue grew by 35%, reaching $2.8M [1].
This growth was primarily driven by enterprise expansion [2].
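Wiring steps 3 and 4 together takes only a few lines. A minimal sketch assuming the OpenAI chat API; the model name, prompt wording, and helper name are illustrative, not NovaKit's exact implementation:

```python
# Illustrative augmentation + generation step: number the retrieved
# chunks, put them in the system prompt, and ask the model to cite them.
from openai import OpenAI

client = OpenAI()

def answer(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer questions based on "
                    "the following context. Cite sources using [1], [2], etc.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```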
Why RAG Works Better Than Fine-Tuning
You might wonder: "Why not just train the model on my documents?"
| Approach | RAG | Fine-Tuning |
|---|---|---|
| Setup time | Minutes | Hours/days |
| Cost | Low (just embedding) | High (GPU training) |
| Updates | Instant (re-embed) | Requires retraining |
| Accuracy | Cites sources | May hallucinate |
| Multiple sources | Easy to add | Complex |
| Data privacy | Stays in your DB | Goes to training |
RAG is generally preferred for knowledge bases because:
- Documents change frequently
- You need verifiable sources
- You want to keep data separate from the model
The Anatomy of Great RAG
Not all RAG implementations are equal. Here's what separates good from great:
1. Smart Chunking
Bad: Fixed 500-character splits that cut sentences off mid-word.
Good: Semantic splits that preserve complete thoughts, with overlap between chunks.
# Overlap ensures context isn't lost at boundaries
chunks = split_with_overlap(text, chunk_size=500, overlap=50)
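split_with_overlap is not a library function; a minimal character-based version might look like the sketch below. Production chunkers usually count tokens and snap to sentence boundaries instead.

```python
# Simple sliding-window splitter: each chunk starts (chunk_size - overlap)
# characters after the previous one, so neighbouring chunks share text.
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    step = chunk_size - overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```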
2. Quality Embeddings
Bad: Bag-of-words or simple TF-IDF vectors.
Good: State-of-the-art embedding models.
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| text-embedding-3-small | 1536 | Good | Fast |
| text-embedding-3-large | 3072 | Better | Medium |
| voyage-3 | 1024 | Excellent | Fast |
| Cohere embed-v3 | 1024 | Excellent | Fast |
3. Hybrid Search
Pure vector search misses exact matches. Hybrid search combines:
- Semantic search: Finds similar meaning
- Keyword search: Finds exact terms (BM25)
# Hybrid approach
semantic_results = vector_search(query)
keyword_results = bm25_search(query)
final_results = rerank(semantic_results + keyword_results)
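One common way to merge the two result lists is reciprocal rank fusion (RRF), which only needs rank positions, so the semantic and keyword scores never have to be made comparable. A minimal sketch; the inputs are assumed to be ranked lists of chunk IDs:

```python
# Reciprocal rank fusion: each list contributes 1 / (k + rank) per
# document; k = 60 is the conventional smoothing constant.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# final_results = reciprocal_rank_fusion([semantic_results, keyword_results])
```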
4. Reranking
Initial retrieval casts a wide net. Reranking uses a cross-encoder to precisely score relevance:
# Retrieved 20 chunks, rerank to top 5
candidates = retrieve(query, limit=20)
reranked = cross_encoder.rerank(query, candidates)
top_chunks = reranked[:5]
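With the sentence-transformers library, a reranker can be sketched as below. The model choice is an assumption; any cross-encoder trained for passage ranking works.

```python
# Illustrative reranking step: the cross-encoder scores each
# (query, chunk) pair jointly, which is slower but more accurate than
# comparing precomputed embeddings.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```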
5. Citation Tracking
Great RAG doesn't just use context—it tracks exactly which chunks informed each claim:
Revenue grew 35% [1] → chunk_id: abc123, page: 12, paragraph: 3
Common RAG Pitfalls
1. Chunking Too Small
Tiny chunks lack context. The AI sees "35%" but not what it refers to.
Fix: Use 300-800 token chunks with overlap.
2. Chunking Too Large
Huge chunks dilute relevance. Irrelevant content confuses the model.
Fix: Split on semantic boundaries, not arbitrary limits.
3. Ignoring Metadata
Documents have structure—titles, sections, page numbers.
Fix: Include metadata in chunks: [Page 5, Section: Financial Results] Revenue grew...
4. No Fallback
What if no relevant chunks are found?
Fix: Detect low-confidence retrievals and respond: "I couldn't find information about that in your documents."
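A simple guard is to check the best similarity score before calling the LLM. A sketch, assuming cosine similarities in [0, 1]; the 0.7 threshold is illustrative and should be tuned against your own documents:

```python
# Refuse to answer when retrieval confidence is low, instead of letting
# the model improvise from weak context.
NO_ANSWER = "I couldn't find information about that in your documents."

def answer_or_fallback(question, results, threshold=0.7):
    # results: list of (chunk_text, similarity) pairs from the search step
    relevant = [chunk for chunk, similarity in results if similarity >= threshold]
    if not relevant:
        return NO_ANSWER
    return answer(question, relevant)  # answer() as sketched earlier
```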
5. Context Window Overflow
Too many chunks exceed the LLM's limit.
Fix: Summarize or truncate lower-ranked chunks.
RAG in Production: NovaKit's Architecture
Here's how NovaKit implements RAG for Document Chat:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │ ──▶ │ Chunking │ ──▶ │ Embedding │
│ (PDF/URL) │ │ (Semantic) │ │ (OpenAI) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Answer │ ◀── │ LLM │ ◀── │ pgvector │
│ + Citations │ │ (GPT/Claude)│ │ Search │
└─────────────┘ └─────────────┘ └─────────────┘
Key implementation details:
- Semantic chunking with 500-token targets
- OpenAI text-embedding-3-small for vectors
- Supabase pgvector for storage and search
- Cosine similarity with 0.7 threshold
- Top-5 chunk retrieval by default
- Citation extraction from model output
The Future of RAG
RAG is evolving rapidly:
Agentic RAG
AI agents that decide when to retrieve, what to search for, and how to combine results.
Multi-Modal RAG
Retrieve not just text but images, tables, and charts from documents.
Self-Correcting RAG
Systems that detect when retrieval failed and try alternative strategies.
Graph RAG
Combining vector search with knowledge graphs for relationship-aware retrieval.
Try It Yourself
Understanding RAG is one thing—experiencing it is another.
- Upload a document to NovaKit Document Chat
- Ask a specific question about the content
- Check the citations to see retrieval in action
- Try edge cases to feel the boundaries
The best way to understand RAG is to break it. Ask questions you know aren't in the document. See how the system responds.
RAG represents a fundamental shift in how we interact with AI. Instead of relying solely on training data, we can ground AI responses in our own knowledge—verified, cited, and trustworthy.
The technology will only get better. But the core insight remains: the best AI answers come from combining general intelligence with specific knowledge.
Ready to see RAG in action? Try NovaKit Document Chat →