
RAG Explained: How AI Answers Questions from Your Documents

A deep dive into Retrieval-Augmented Generation—the technology that lets AI answer questions accurately from your own documents without hallucinating.

15 min read · NovaKit Team

When you ask ChatGPT a question, it answers from its training data—knowledge frozen at a point in time. But what if you want AI to answer questions about your documents? Documents it has never seen?

This is where Retrieval-Augmented Generation (RAG) comes in. RAG is the technology that powers document chat, knowledge bases, and enterprise AI assistants. Let's break down how it actually works.

The Problem RAG Solves

Large Language Models (LLMs) have two fundamental limitations:

1. Knowledge Cutoff

Models are trained on data up to a specific date. GPT-4 doesn't know about events after its training cutoff. Your company's internal docs? Never seen them.

2. Hallucination

When LLMs don't know something, they sometimes make things up—confidently. Ask about a document it hasn't seen, and it might fabricate plausible-sounding but false information.

RAG solves both problems by giving the AI access to external knowledge at query time, grounding its responses in actual source material.

How RAG Works: The Pipeline

RAG operates in two phases: indexing (preparation) and retrieval + generation (query time).

Phase 1: Indexing

Before you can chat with a document, it needs to be processed and stored.

Document → Chunking → Embedding → Vector Database

Step 1: Chunking

Documents are split into smaller, semantically meaningful pieces called "chunks."

Why chunk?

  • LLMs have context limits (can't process entire books at once)
  • Smaller chunks = more precise retrieval
  • Each chunk should contain a complete thought

Chunking strategies:

Strategy         Description                    Best For
Fixed-size       Split every N characters       Simple, predictable
Sentence-based   Split on sentence boundaries   Natural breaks
Semantic         Split on topic changes         Highest quality
Recursive        Hierarchical splitting         Long documents

Example: A 50-page PDF might become 200 chunks of ~500 tokens each.

Step 2: Embedding

Each chunk is converted into a vector embedding—a list of numbers (1536 dimensions for OpenAI's text-embedding-3-small, for example) that represents its semantic meaning.

# Conceptual example
chunk = "The quarterly revenue increased by 35%..."
embedding = embed(chunk)
# → [0.023, -0.156, 0.089, ..., 0.042]  # 1536 numbers

Key insight: Similar meanings produce similar vectors. "Revenue grew" and "Sales increased" will have embeddings close together in vector space, even though the words are different.
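
To make this concrete, here's a minimal sketch that embeds a few phrasings and compares them with cosine similarity. It assumes the openai Python package (v1+) with an OPENAI_API_KEY in the environment; any embedding model would work the same way.

# Embed a few phrasings and compare them (sketch).
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Revenue grew", "Sales increased", "The weather was sunny"],
)
vectors = [np.array(item.embedding) for item in resp.data]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))  # high: similar meaning
print(cosine_similarity(vectors[0], vectors[2]))  # lower: unrelated meaning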

Step 3: Vector Storage

Embeddings are stored in a vector database optimized for similarity search.

Popular options:

  • Pinecone
  • Weaviate
  • Qdrant
  • Supabase pgvector (what NovaKit uses)
  • Chroma
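
Under the hood, all of these support the same two operations: store a vector alongside its text, and return the vectors nearest to a query. A toy in-memory version (NumPy only, purely illustrative) shows the idea:

# Toy in-memory vector store: what a real vector database does, minus scale and persistence.
import numpy as np

class TinyVectorStore:
    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        # Normalize on insert so dot product equals cosine similarity
        v = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(v / np.linalg.norm(v))

    def search(self, query_vector, k=5):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q
        top = np.argsort(sims)[::-1][:k]
        return [(self.texts[i], float(sims[i])) for i in top]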

Phase 2: Retrieval + Generation

When you ask a question, the magic happens:

Question → Embed → Search → Retrieve Chunks → Generate Answer

Step 1: Embed the Query

Your question is converted to a vector using the same embedding model:

question = "What was the revenue growth?"
query_embedding = embed(question)

Step 2: Semantic Search

The query vector is compared against all chunk vectors to find the most similar ones.

Similarity metrics:

  • Cosine similarity: Measures angle between vectors (most common)
  • Euclidean distance: Measures straight-line distance
  • Dot product: Faster, works well with normalized vectors

-- Simplified pgvector query
SELECT content, 1 - (embedding <=> query_embedding) as similarity
FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 5;

This returns the 5 most semantically similar chunks—even if they don't share exact keywords with your question.

Step 3: Context Augmentation

Retrieved chunks are inserted into the LLM prompt as context:

System: You are a helpful assistant. Answer questions based on the
following context. Cite sources using [1], [2], etc.

Context:
[1] "The quarterly revenue increased by 35%, reaching $2.8M..."
[2] "Growth was primarily driven by enterprise expansion..."
[3] "CEO stated targets remain aggressive but achievable..."

User: What was the revenue growth?
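
Assembling that prompt is ordinary string formatting. A minimal sketch, assuming the retrieved chunk texts are already sorted by similarity (build_prompt is a hypothetical helper, not a library call):

# Build the system prompt from retrieved chunk texts (hypothetical helper).
def build_prompt(chunks):
    context = "\n".join(f'[{i + 1}] "{text}"' for i, text in enumerate(chunks))
    return (
        "You are a helpful assistant. Answer questions based on the "
        "following context. Cite sources using [1], [2], etc.\n\n"
        f"Context:\n{context}"
    )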

Step 4: Generation

The LLM generates an answer grounded in the provided context:

Based on the quarterly report, revenue grew by 35%, reaching $2.8M [1].
This growth was primarily driven by enterprise expansion [2].
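
End to end, the generation step is a normal chat-completion call with the retrieved context baked into the system message. Here's a sketch using the OpenAI Python client; the model name is just an example, and any capable chat model works:

# Generate a grounded answer from retrieved context (sketch).
from openai import OpenAI

client = OpenAI()
context = (
    '[1] "The quarterly revenue increased by 35%, reaching $2.8M..."\n'
    '[2] "Growth was primarily driven by enterprise expansion..."'
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "Answer from the context below and cite "
         "sources using [1], [2], etc.\n\nContext:\n" + context},
        {"role": "user", "content": "What was the revenue growth?"},
    ],
)
print(response.choices[0].message.content)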

Why RAG Works Better Than Fine-Tuning

You might wonder: "Why not just train the model on my documents?"

Approach           RAG                    Fine-Tuning
Setup time         Minutes                Hours/days
Cost               Low (just embedding)   High (GPU training)
Updates            Instant (re-embed)     Requires retraining
Accuracy           Cites sources          May hallucinate
Multiple sources   Easy to add            Complex
Data privacy       Stays in your DB       Goes to training

RAG is generally preferred for knowledge bases because:

  1. Documents change frequently
  2. You need verifiable sources
  3. You want to keep data separate from the model

The Anatomy of Great RAG

Not all RAG implementations are equal. Here's what separates good from great:

1. Smart Chunking

Bad: Fixed 500-character splits that cut sentences mid-word
Good: Semantic splits that preserve complete thoughts with overlap

# Overlap ensures context isn't lost at boundaries
chunks = split_with_overlap(text, chunk_size=500, overlap=50)
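
split_with_overlap above is pseudocode. A minimal character-based version might look like the sketch below; a production splitter would count real tokens and prefer sentence or semantic boundaries:

# Naive character-based splitter with overlap (sketch, not production chunking).
def split_with_overlap(text, chunk_size=500, overlap=50):
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks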

2. Quality Embeddings

Bad: Bag-of-words or simple TF-IDF
Good: State-of-the-art embedding models

Model                    Dimensions   Quality     Speed
text-embedding-3-small   1536         Good        Fast
text-embedding-3-large   3072         Better      Medium
voyage-3                 1024         Excellent   Fast
Cohere embed-v3          1024         Excellent   Fast

3. Hybrid Search

Pure vector search misses exact matches. Hybrid search combines:

  • Semantic search: Finds similar meaning
  • Keyword search: Finds exact terms (BM25)

# Hybrid approach
semantic_results = vector_search(query)
keyword_results = bm25_search(query)
final_results = rerank(semantic_results + keyword_results)
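
A common way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which rewards documents that rank highly in either list. A sketch, assuming both searches return ranked document IDs:

# Reciprocal Rank Fusion: merge two ranked lists of document IDs (sketch).
def rrf_merge(semantic_ids, keyword_ids, k=60):
    scores = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)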

4. Reranking

Initial retrieval casts a wide net. Reranking uses a cross-encoder to precisely score relevance:

# Retrieved 20 chunks, rerank to top 5
candidates = retrieve(query, limit=20)
reranked = cross_encoder.rerank(query, candidates)
top_chunks = reranked[:5]
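
With the sentence-transformers library this is only a few lines; the model below is one popular public cross-encoder, not a requirement:

# Cross-encoder reranking sketch using sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def rerank_top(query, candidates, top_k=5):
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]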

5. Citation Tracking

Great RAG doesn't just use context—it tracks exactly which chunks informed each claim:

Revenue grew 35% [1] → chunk_id: abc123, page: 12, paragraph: 3

Common RAG Pitfalls

1. Chunking Too Small

Tiny chunks lack context. The AI sees "35%" but not what it refers to.

Fix: Use 300-800 token chunks with overlap.

2. Chunking Too Large

Huge chunks dilute relevance. Irrelevant content confuses the model.

Fix: Split on semantic boundaries, not arbitrary limits.

3. Ignoring Metadata

Documents have structure—titles, sections, page numbers.

Fix: Include metadata in chunks: [Page 5, Section: Financial Results] Revenue grew...
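
One lightweight way to do this is to prefix each chunk with its metadata before embedding; the field names here are illustrative:

# Prepend document metadata to a chunk before embedding (sketch).
def with_metadata(chunk_text, page, section):
    return f"[Page {page}, Section: {section}] {chunk_text}"

# with_metadata("Revenue grew...", page=5, section="Financial Results")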

4. No Fallback

What if no relevant chunks are found?

Fix: Detect low-confidence retrievals and respond: "I couldn't find information about that in your documents."
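
In code, this is a similarity-threshold check before the generation step; the 0.7 cutoff below is just an example and should be tuned per embedding model:

# Refuse to answer when nothing retrieved is similar enough (sketch).
FALLBACK = "I couldn't find information about that in your documents."

def filter_relevant(results, threshold=0.7):
    # results: (chunk_text, similarity) pairs from the search step
    relevant = [text for text, score in results if score >= threshold]
    # Hand relevant chunks to generation, or bail out with the fallback message
    return relevant if relevant else FALLBACK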

5. Context Window Overflow

Too many chunks exceed the LLM's limit.

Fix: Summarize or truncate lower-ranked chunks.
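
A simple guard is to add chunks in rank order until a token budget is reached. The sketch below approximates tokens by word count; a real implementation would use the model's tokenizer:

# Keep the highest-ranked chunks that fit a rough token budget (sketch).
def fit_to_budget(ranked_chunks, max_tokens=3000):
    kept, used = [], 0
    for chunk in ranked_chunks:
        approx_tokens = len(chunk.split())  # crude word-count approximation
        if used + approx_tokens > max_tokens:
            break
        kept.append(chunk)
        used += approx_tokens
    return kept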

RAG in Production: NovaKit's Architecture

Here's how NovaKit implements RAG for Document Chat:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Upload    │ ──▶ │  Chunking   │ ──▶ │  Embedding  │
│  (PDF/URL)  │     │ (Semantic)  │     │ (OpenAI)    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Answer    │ ◀── │     LLM     │ ◀── │  pgvector   │
│ + Citations │     │ (GPT/Claude)│     │   Search    │
└─────────────┘     └─────────────┘     └─────────────┘

Key implementation details:

  • Semantic chunking with 500-token targets
  • OpenAI text-embedding-3-small for vectors
  • Supabase pgvector for storage and search
  • Cosine similarity with 0.7 threshold
  • Top-5 chunk retrieval by default
  • Citation extraction from model output

The Future of RAG

RAG is evolving rapidly:

Agentic RAG

AI agents that decide when to retrieve, what to search for, and how to combine results.

Multi-Modal RAG

Retrieve not just text but images, tables, and charts from documents.

Self-Correcting RAG

Systems that detect when retrieval failed and try alternative strategies.

Graph RAG

Combining vector search with knowledge graphs for relationship-aware retrieval.

Try It Yourself

Understanding RAG is one thing—experiencing it is another.

  1. Upload a document to NovaKit Document Chat
  2. Ask a specific question about the content
  3. Check the citations to see retrieval in action
  4. Try edge cases to feel the boundaries

The best way to understand RAG is to break it. Ask questions you know aren't in the document. See how the system responds.


RAG represents a fundamental shift in how we interact with AI. Instead of relying solely on training data, we can ground AI responses in our own knowledge—verified, cited, and trustworthy.

The technology will only get better. But the core insight remains: the best AI answers come from combining general intelligence with specific knowledge.

Ready to see RAG in action? Try NovaKit Document Chat →
