RAG Explained: How AI Answers Questions from Your Documents
A deep dive into Retrieval-Augmented Generation—the technology that lets AI answer questions accurately from your own documents without hallucinating.
When you ask ChatGPT a question, it answers from its training data—knowledge frozen at a point in time. But what if you want AI to answer questions about your documents? Documents it has never seen?
This is where Retrieval-Augmented Generation (RAG) comes in. RAG is the technology that powers document chat, knowledge bases, and enterprise AI assistants. Let's break down how it actually works.
The Problem RAG Solves
Large Language Models (LLMs) have two fundamental limitations:
1. Knowledge Cutoff
Models are trained on data up to a specific date. GPT-4 doesn't know about events after its training cutoff. Your company's internal docs? Never seen them.
2. Hallucination
When LLMs don't know something, they sometimes make things up—confidently. Ask about a document it hasn't seen, and it might fabricate plausible-sounding but false information.
RAG solves both problems by giving the AI access to external knowledge at query time, grounding its responses in actual source material.
How RAG Works: The Pipeline
RAG operates in two phases: indexing (preparation) and retrieval + generation (query time).
Phase 1: Indexing
Before you can chat with a document, it needs to be processed and stored.
Document → Chunking → Embedding → Vector Database
Step 1: Chunking
Documents are split into smaller, semantically meaningful pieces called "chunks."
Why chunk?
- LLMs have context limits (can't process entire books at once)
- Smaller chunks = more precise retrieval
- Each chunk should contain a complete thought
Chunking strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N characters | Simple, predictable pipelines |
| Sentence-based | Split on sentence boundaries | Prose with natural breaks |
| Semantic | Split on topic changes | Accuracy-critical retrieval |
| Recursive | Hierarchical splitting | Long, structured documents |
Example: A 50-page PDF (roughly 30,000-40,000 tokens of text) might become 60-100 chunks of ~500 tokens each, more if the chunks overlap.
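To make the idea concrete, here is a minimal sketch of sentence-based chunking; the 500-character budget is an illustrative assumption, and production chunkers usually count tokens rather than characters.

```python
# A minimal sentence-based chunker: split on sentence boundaries, then
# pack sentences into chunks up to a rough size budget.
import re

def chunk_by_sentences(text: str, max_chars: int = 500) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```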
Step 2: Embedding
Each chunk is converted into a vector embedding: a list of numbers (1536 dimensions for OpenAI's text-embedding-3-small; other models vary) that represents its semantic meaning.
# Conceptual example
chunk = "The quarterly revenue increased by 35%..."
embedding = embed(chunk)
# → [0.023, -0.156, 0.089, ..., 0.042] # 1536 numbers
Key insight: Similar meanings produce similar vectors. "Revenue grew" and "Sales increased" will have embeddings close together in vector space, even though the words are different.
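In real systems, the embed() call maps to an embedding API. A minimal sketch using the OpenAI Python SDK, assuming the text-embedding-3-small model mentioned later in this post and an OPENAI_API_KEY in the environment:

```python
# Illustrative embedding helper using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

vector = embed("The quarterly revenue increased by 35%...")
print(len(vector))  # 1536
```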
Step 3: Vector Storage
Embeddings are stored in a vector database optimized for similarity search.
Popular options:
- Pinecone
- Weaviate
- Qdrant
- Supabase pgvector (what NovaKit uses)
- Chroma
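To make the storage step concrete, here is a minimal sketch of writing chunks into Postgres with the pgvector extension via psycopg2. The table schema, connection string, and sample chunks are illustrative assumptions, not NovaKit's actual setup.

```python
# Illustrative sketch: store each chunk's text and embedding in a
# pgvector column. embed() is the embedding helper sketched above.
import psycopg2

conn = psycopg2.connect("dbname=rag_demo")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding VECTOR(1536)
    );
""")

sample_chunks = [
    "The quarterly revenue increased by 35%, reaching $2.8M...",
    "Growth was primarily driven by enterprise expansion...",
]
for chunk in sample_chunks:
    vector_literal = "[" + ",".join(str(x) for x in embed(chunk)) + "]"
    cur.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
        (chunk, vector_literal),
    )
conn.commit()
```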
Phase 2: Retrieval + Generation
When you ask a question, the magic happens:
Question → Embed → Search → Retrieve Chunks → Generate Answer
Step 1: Embed the Query
Your question is converted to a vector using the same embedding model:
question = "What was the revenue growth?"
query_embedding = embed(question)
Step 2: Semantic Search
The query vector is compared against all chunk vectors to find the most similar ones.
Similarity metrics:
- Cosine similarity: Measures angle between vectors (most common)
- Euclidean distance: Measures straight-line distance
- Dot product: Faster, works well with normalized vectors
-- Simplified pgvector query
SELECT content, 1 - (embedding <=> query_embedding) as similarity
FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;
This returns the 5 most semantically similar chunks—even if they don't share exact keywords with your question.
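For intuition, the most common metric is easy to write out by hand. A short numpy sketch; pgvector's <=> operator returns the corresponding cosine distance, which is why the SQL above computes 1 - (a <=> b):

```python
# Cosine similarity: 1.0 means the vectors point the same way,
# values near 0 mean they are unrelated.
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```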
Step 3: Context Augmentation
Retrieved chunks are inserted into the LLM prompt as context:
System: You are a helpful assistant. Answer questions based on the
following context. Cite sources using [1], [2], etc.
Context:
[1] "The quarterly revenue increased by 35%, reaching $2.8M..."
[2] "Growth was primarily driven by enterprise expansion..."
[3] "CEO stated targets remain aggressive but achievable..."
User: What was the revenue growth?
Step 4: Generation
The LLM generates an answer grounded in the provided context:
Based on the quarterly report, revenue grew by 35%, reaching $2.8M [1].
This growth was primarily driven by enterprise expansion [2].
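Wiring steps 3 and 4 together takes only a few lines. A minimal sketch assuming the OpenAI chat API; the model name, prompt wording, and helper name are illustrative, not NovaKit's exact implementation:

```python
# Illustrative augmentation + generation step: number the retrieved
# chunks, put them in the system prompt, and ask the model to cite them.
from openai import OpenAI

client = OpenAI()

def answer(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer questions based on "
                    "the following context. Cite sources using [1], [2], etc.\n\n"
                    f"Context:\n{context}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```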
Why RAG Works Better Than Fine-Tuning
You might wonder: "Why not just train the model on my documents?"
| Approach | RAG | Fine-Tuning |
|---|---|---|
| Setup time | Minutes | Hours/days |
| Cost | Low (just embedding) | High (GPU training) |
| Updates | Instant (re-embed) | Requires retraining |
| Accuracy | Cites sources | May hallucinate |
| Multiple sources | Easy to add | Complex |
| Data privacy | Stays in your DB | Goes to training |
RAG is generally preferred for knowledge bases because:
- Documents change frequently
- You need verifiable sources
- You want to keep data separate from the model
The Anatomy of Great RAG
Not all RAG implementations are equal. Here's what separates good from great:
1. Smart Chunking
Bad: Fixed 500-character splits that cut sentences off mid-word.
Good: Semantic splits that preserve complete thoughts, with overlap between chunks.
# Overlap ensures context isn't lost at boundaries
chunks = split_with_overlap(text, chunk_size=500, overlap=50)
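split_with_overlap is not a library function; a minimal character-based version might look like the sketch below. Production chunkers usually count tokens and snap to sentence boundaries instead.

```python
# Simple sliding-window splitter: each chunk starts (chunk_size - overlap)
# characters after the previous one, so neighbouring chunks share text.
def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    step = chunk_size - overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```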
2. Quality Embeddings
Bad: Bag-of-words or simple TF-IDF vectors.
Good: State-of-the-art embedding models.
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| text-embedding-3-small | 1536 | Good | Fast |
| text-embedding-3-large | 3072 | Better | Medium |
| voyage-3 | 1024 | Excellent | Fast |
| Cohere embed-v3 | 1024 | Excellent | Fast |
3. Hybrid Search
Pure vector search misses exact matches. Hybrid search combines:
- Semantic search: Finds similar meaning
- Keyword search: Finds exact terms (BM25)
# Hybrid approach
semantic_results = vector_search(query)
keyword_results = bm25_search(query)
final_results = rerank(semantic_results + keyword_results)
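One common way to merge the two result lists is reciprocal rank fusion (RRF), which only needs rank positions, so the semantic and keyword scores never have to be made comparable. A minimal sketch; the inputs are assumed to be ranked lists of chunk IDs:

```python
# Reciprocal rank fusion: each list contributes 1 / (k + rank) per
# document; k = 60 is the conventional smoothing constant.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# final_results = reciprocal_rank_fusion([semantic_results, keyword_results])
```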
4. Reranking
Initial retrieval casts a wide net. Reranking uses a cross-encoder to precisely score relevance:
# Retrieved 20 chunks, rerank to top 5
candidates = retrieve(query, limit=20)
reranked = cross_encoder.rerank(query, candidates)
top_chunks = reranked[:5]
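With the sentence-transformers library, a reranker can be sketched as below. The model choice is an assumption; any cross-encoder trained for passage ranking works.

```python
# Illustrative reranking step: the cross-encoder scores each
# (query, chunk) pair jointly, which is slower but more accurate than
# comparing precomputed embeddings.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```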
5. Citation Tracking
Great RAG doesn't just use context—it tracks exactly which chunks informed each claim:
Revenue grew 35% [1] → chunk_id: abc123, page: 12, paragraph: 3
Common RAG Pitfalls
1. Chunking Too Small
Tiny chunks lack context. The AI sees "35%" but not what it refers to.
Fix: Use 300-800 token chunks with overlap.
2. Chunking Too Large
Huge chunks dilute relevance. Irrelevant content confuses the model.
Fix: Split on semantic boundaries, not arbitrary limits.
3. Ignoring Metadata
Documents have structure—titles, sections, page numbers.
Fix: Include metadata in chunks: [Page 5, Section: Financial Results] Revenue grew...
4. No Fallback
What if no relevant chunks are found?
Fix: Detect low-confidence retrievals and respond: "I couldn't find information about that in your documents."
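A simple guard is to check the best similarity score before calling the LLM. A sketch, assuming cosine similarities in [0, 1]; the 0.7 threshold is illustrative and should be tuned against your own documents:

```python
# Refuse to answer when retrieval confidence is low, instead of letting
# the model improvise from weak context.
NO_ANSWER = "I couldn't find information about that in your documents."

def answer_or_fallback(question, results, threshold=0.7):
    # results: list of (chunk_text, similarity) pairs from the search step
    relevant = [chunk for chunk, similarity in results if similarity >= threshold]
    if not relevant:
        return NO_ANSWER
    return answer(question, relevant)  # answer() as sketched earlier
```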
5. Context Window Overflow
Too many chunks exceed the LLM's limit.
Fix: Summarize or truncate lower-ranked chunks.
RAG in Production: NovaKit's Architecture
Here's how NovaKit implements RAG for Document Chat:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │ ──▶ │ Chunking │ ──▶ │ Embedding │
│ (PDF/URL) │ │ (Semantic) │ │ (OpenAI) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Answer │ ◀── │ LLM │ ◀── │ pgvector │
│ + Citations │ │ (GPT/Claude)│ │ Search │
└─────────────┘ └─────────────┘ └─────────────┘
Key implementation details:
- Semantic chunking with 500-token targets
- OpenAI text-embedding-3-small for vectors
- Supabase pgvector for storage and search
- Cosine similarity with 0.7 threshold
- Top-5 chunk retrieval by default
- Citation extraction from model output
The Future of RAG
RAG is evolving rapidly:
Agentic RAG
AI agents that decide when to retrieve, what to search for, and how to combine results.
Multi-Modal RAG
Retrieve not just text but images, tables, and charts from documents.
Self-Correcting RAG
Systems that detect when retrieval failed and try alternative strategies.
Graph RAG
Combining vector search with knowledge graphs for relationship-aware retrieval.
Try It Yourself
Understanding RAG is one thing—experiencing it is another.
- Upload a document to NovaKit Document Chat
- Ask a specific question about the content
- Check the citations to see retrieval in action
- Try edge cases to feel the boundaries
The best way to understand RAG is to break it. Ask questions you know aren't in the document. See how the system responds.
RAG represents a fundamental shift in how we interact with AI. Instead of relying solely on training data, we can ground AI responses in our own knowledge—verified, cited, and trustworthy.
The technology will only get better. But the core insight remains: the best AI answers come from combining general intelligence with specific knowledge.
Ready to see RAG in action? Try NovaKit Document Chat →