Why Your RAG Chatbot Sucks (And How to Fix It)
Your RAG chatbot gives wrong answers, misses obvious information, and hallucinates sources. Here's why—and a practical guide to fixing the most common problems.
You built a RAG chatbot. You uploaded your documents. You asked it a question that's clearly answered on page 3.
It confidently gave you the wrong answer. Or it said "I don't have that information" when you're staring at it in the source document.
You're not alone. Most RAG implementations are broken in predictable ways.
This guide will help you diagnose why your RAG chatbot sucks and show you how to fix it.
Problem 1: Wrong Chunks Retrieved
Symptom
The chatbot retrieves chunks that seem related but don't actually answer the question.
User: "What's the cancellation policy for annual plans?"
Retrieved chunk: "Our plans include monthly and annual options.
Annual plans offer a 20% discount compared to monthly billing."
Answer: "Annual plans offer a 20% discount!"
(Completely missed the actual cancellation policy)
Why It Happens
Semantic similarity ≠ relevance. The embedding model found text about "annual plans" but not about "cancellation."
Vector search optimizes for topical similarity, not answer relevance.
How to Fix It
Fix 1: Hybrid search
Add keyword search alongside vector search:
# Vector results
vector_results = vector_search(query, k=10)
# Keyword results
keyword_results = bm25_search(query, k=10)
# Combine with reciprocal rank fusion
final_results = rrf_combine(vector_results, keyword_results)
Keyword search catches exact term matches that vectors miss.
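Reciprocal rank fusion itself is simple: score each chunk by the inverse of its rank in every result list it appears in, then sum. A minimal sketch of an `rrf_combine` helper, assuming each result object exposes a stable `id`:
def rrf_combine(vector_results, keyword_results, k=60):
    # RRF score: sum over result lists of 1 / (k + rank); k=60 is the usual default
    scores = {}
    by_id = {}
    for results in (vector_results, keyword_results):
        for rank, result in enumerate(results, start=1):
            scores[result.id] = scores.get(result.id, 0.0) + 1.0 / (k + rank)
            by_id[result.id] = result
    # Best fused score first
    return [by_id[i] for i in sorted(scores, key=scores.get, reverse=True)]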
Fix 2: Query decomposition
Break complex queries into sub-queries:
def decompose_query(query):
    # "cancellation policy for annual plans" becomes:
    return [
        "cancellation policy",
        "annual plan terms",
        "refund policy annual",
    ]

# Search for each sub-query, then combine the results (sketch below)
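One way to do that combination: run a retrieval pass per sub-query and keep the best-scoring hit for each chunk, so chunks that match several sub-queries aren't duplicated. A rough sketch, assuming each result carries an `id` and a `score`:
def search_decomposed(query, k=10):
    best = {}  # chunk id -> best-scoring result seen so far
    for sub_query in decompose_query(query):
        for result in vector_search(sub_query, k=k):
            seen = best.get(result.id)
            if seen is None or result.score > seen.score:
                best[result.id] = result
    # Return the strongest hits across all sub-queries
    return sorted(best.values(), key=lambda r: r.score, reverse=True)[:k]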
Fix 3: Better metadata filtering
If your documents have categories:
# Filter to relevant sections first
results = vector_search(
    query,
    filter={"category": "policies"}
)
Problem 2: Chunk Boundaries Split Information
Symptom
The answer exists but spans two chunks. The chatbot retrieves one chunk and generates a partial or wrong answer.
Document text:
"...The API rate limit is 100 requests per minute for free tier users.
[CHUNK BOUNDARY]
For paid users, the limit increases to 1000 requests per minute..."
User: "What's the rate limit for paid users?"
Retrieved: First chunk only
Answer: "The rate limit is 100 requests per minute."
(Wrong! That's free tier.)
Why It Happens
Fixed-size chunking doesn't respect semantic boundaries. Information gets split at arbitrary points.
How to Fix It
Fix 1: Increase chunk overlap
# Add overlap so context spans chunks
chunks = split_text(
    document,
    chunk_size=500,
    overlap=100  # 100 tokens of overlap
)
Fix 2: Semantic chunking
Split at natural boundaries:
def semantic_chunk(text, max_chunk_size=500):
    # Split at paragraph breaks
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            chunks.append(current_chunk)
            current_chunk = para
    if current_chunk:
        chunks.append(current_chunk)  # don't lose the final chunk
    return chunks
Fix 3: Parent-child retrieval
Retrieve small chunks for precision, return larger context:
# Index small chunks for search
small_chunks = split_text(doc, chunk_size=200)

# Keep a mapping from each small chunk to its larger parent section
for chunk in small_chunks:
    chunk.parent_section = get_parent_section(chunk)

# At query time: search over small chunks...
retrieved_chunks = search(query)

# ...but return the parent sections, not just the chunks
# (deduplicate if several chunks share the same parent)
parent_sections = [chunk.parent_section for chunk in retrieved_chunks]
Problem 3: Hallucinated Sources
Symptom
The chatbot cites sources that don't exist or attributes information to the wrong document.
User: "What's our refund policy?"
Answer: "According to section 4.2 of the Terms of Service,
refunds are available within 30 days."
Reality: Section 4.2 doesn't exist. Refund policy is in
a completely different document.
Why It Happens
The LLM generates plausible-sounding citations based on patterns in training data, not the actual retrieved content.
How to Fix It
Fix 1: Explicit citation requirements
prompt = """
Answer based ONLY on the following sources.
For EVERY claim, cite the exact source in brackets [Source: filename, page X].
If information isn't in the sources, say "This isn't covered in the provided documents."
Sources:
{retrieved_chunks}
Question: {query}
"""
Fix 2: Validate citations post-generation
def validate_citations(answer, sources):
    # Extract all citations from the answer
    citations = extract_citations(answer)
    for citation in citations:
        # Check whether the cited text actually exists in the sources
        if not verify_in_sources(citation.text, sources):
            # Flag or remove the hallucinated citation
            answer = flag_unverified(answer, citation)
    return answer
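The helpers can start out very simple. If you use the bracketed citation format from the prompt above, a regex plus a membership check against the retrieved sources is a reasonable first pass. A simplified sketch that checks cited filenames rather than quoted text (`source.filename` is an assumed attribute on your retrieved sources):
import re

CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")

def extract_citations(answer):
    # Pull out every "[Source: filename, page X]" style citation
    return CITATION_PATTERN.findall(answer)

def verify_in_sources(cited, sources):
    # Accept a citation only if it names a document we actually retrieved
    known_files = {source.filename for source in sources}
    cited_file = cited.split(",")[0].strip()
    return cited_file in known_files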
Fix 3: Structured output
Force the model to output in a structured format:
response_schema = {
    "answer": "string",
    "citations": [
        {
            "claim": "string",
            "source_chunk_id": "string",
            "quote": "string"  # Exact quote from the source
        }
    ],
    "confidence": "high | medium | low"
}
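Structured output also makes verification mechanical: every `quote` should appear verbatim in the chunk it cites. A sketch of that check, assuming the model returns JSON and a hypothetical `chunks_by_id` dict maps chunk IDs to their text:
import json

def check_structured_answer(raw_response, chunks_by_id):
    data = json.loads(raw_response)
    unverified = []
    for citation in data["citations"]:
        chunk_text = chunks_by_id.get(citation["source_chunk_id"], "")
        # The quoted text must appear verbatim in the cited chunk
        if citation["quote"] not in chunk_text:
            unverified.append(citation["claim"])
    return data["answer"], unverified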
Problem 4: "I Don't Know" When Answer Exists
Symptom
The information is clearly in your documents, but the chatbot says it doesn't have it.
User: "What integrations do you support?"
Answer: "I don't have information about integrations."
Document (definitely indexed): "We support integrations with
Slack, Discord, Zapier, Make, and n8n..."
Why It Happens
Several possible causes:
- Document processing failed silently
- Embedding model didn't capture the semantic relationship
- Retrieval threshold too high
- Query phrasing doesn't match document phrasing
How to Fix It
Fix 1: Add logging to diagnose
def search_with_logging(query):
    # Log the query and a preview of its embedding
    logger.info(f"Query: {query}")
    logger.info(f"Query embedding: {embed(query)[:5]}...")  # First 5 dims

    # Log the retrieval results
    results = vector_search(query, k=10)
    for i, result in enumerate(results):
        logger.info(f"Result {i}: {result.text[:100]}... Score: {result.score}")

    return results
If results are empty or low-scoring, you know retrieval failed.
Fix 2: Lower retrieval threshold
# Instead of hard threshold
results = search(query, min_score=0.8) # Too strict
# Use top-k without threshold
results = search(query, k=10)
# Then filter in context
Fix 3: Synonym expansion
def expand_query(query):
    # Add synonyms and related terms
    synonyms = get_synonyms(query)  # "integrations" -> "connections", "apps", "plugins"
    expanded = f"{query} {' '.join(synonyms)}"
    return expanded
Fix 4: Verify document processing
def verify_document_indexed(doc_id):
    # Check whether the document has chunks at all
    chunks = get_chunks_for_document(doc_id)
    if not chunks:
        return {"status": "not_indexed", "reason": "no_chunks"}

    # Check whether every chunk has an embedding
    chunks_with_embeddings = [c for c in chunks if c.embedding is not None]
    if len(chunks_with_embeddings) < len(chunks):
        return {"status": "partial", "reason": "missing_embeddings"}

    return {"status": "fully_indexed"}
Problem 5: Stale or Contradictory Information
Symptom
The chatbot gives outdated answers or mixes information from different document versions.
User: "What's the API endpoint for user creation?"
Answer: "Use POST /api/v1/users"
Reality: That was v1. Current docs say POST /api/v2/users
Both versions are indexed.
Why It Happens
You indexed multiple document versions without version awareness. The retriever might find the old version.
How to Fix It
Fix 1: Add timestamp metadata
chunk = {
    "text": "...",
    "embedding": [...],
    "metadata": {
        "document_id": "api-docs",
        "version": "2.0",
        "last_updated": "2026-01-01",
        "is_current": True
    }
}
Fix 2: Filter by recency
results = search(
    query,
    filter={"is_current": True}
)

# Or prefer the most recent documents
results = search(
    query,
    sort_by="last_updated",
    order="desc"
)
Fix 3: Remove old versions
When uploading new documents, explicitly remove old versions:
def upload_document(doc, version):
    # Remove previous versions
    delete_chunks_where(document_id=doc.id, version__lt=version)
    # Add the new version
    add_chunks(doc, version)
Problem 6: Context Window Overflow
Symptom
For complex queries, the chatbot's answer quality degrades or it ignores some retrieved information.
Why It Happens
You're stuffing too many chunks into the prompt. Even when everything technically fits in the context window, the model struggles to use information buried in the middle of a long context.
How to Fix It
Fix 1: Limit retrieved chunks
# Instead of retrieving many chunks
results = search(query, k=20) # Too many
# Retrieve fewer, higher quality
results = search(query, k=5)
Fix 2: Summarize before including
def prepare_context(chunks, max_tokens=2000):
    total_tokens = sum(count_tokens(c) for c in chunks)
    if total_tokens > max_tokens:
        # Keep the top chunks in full, summarize the less relevant rest
        important_chunks = chunks[:3]
        summarized = summarize(chunks[3:])
        return important_chunks + [summarized]
    return chunks
Fix 3: Iterative retrieval
def iterative_answer(query):
    # Start with a small context
    results = search(query, k=3)
    answer = generate(query, results)

    # If the answer is incomplete, fetch additional context and retry
    if needs_more_info(answer):
        more_results = search(refine_query(query, answer), k=3)
        answer = generate(query, results + more_results)

    return answer
Problem 7: Poor Handling of Tables and Structured Data
Symptom
Questions about data in tables return wrong answers or "not found."
Document contains:
| Plan | Price | Users |
|-------|-------|-------|
| Free | $0 | 1 |
| Pro | $29 | 5 |
User: "How many users does the Pro plan support?"
Answer: "I couldn't find information about Pro plan user limits."
Why It Happens
Tables don't chunk or embed well. Row context gets split from headers.
How to Fix It
Fix 1: Flatten tables
def flatten_table(table):
    rows = []
    headers = table.headers
    for row in table.rows:
        # e.g. "Pro: Price is $29, Users is 5"
        row_text = f"{row[0]}: " + ", ".join(
            f"{h} is {v}" for h, v in zip(headers[1:], row[1:])
        )
        rows.append(row_text)
    return "\n".join(rows)
Fix 2: Index tables separately
# Create special table chunks with full context
table_chunk = {
    "text": table.to_markdown(),  # Full table as markdown
    "type": "table",
    "metadata": {
        "headers": table.headers,
        "row_count": len(table.rows)
    }
}
Fix 3: Add table descriptions
def describe_table(table):
    return f"""
Table: {table.caption or 'Unnamed'}
Columns: {', '.join(table.headers)}
Contains information about: {infer_topic(table)}
"""
Problem 8: Ignoring User Context
Symptom
The chatbot doesn't use information from earlier in the conversation.
User: "I'm on the Enterprise plan"
Bot: "Got it!"
User: "What's my rate limit?"
Bot: "Rate limits depend on your plan. Free is 100/min, Pro is 500/min, Enterprise is unlimited."
(Should have directly said "unlimited" based on context)
Why It Happens
Each query is processed independently without conversation context.
How to Fix It
Fix 1: Include conversation history in retrieval
def search_with_context(query, conversation_history):
    # Combine recent conversation context with the query
    context = summarize_recent(conversation_history[-5:])
    enriched_query = f"{context}\n\nCurrent question: {query}"
    return search(enriched_query)
Fix 2: Extract and store facts
def extract_user_facts(message):
    facts = llm_extract(message, schema={"plan": "string", "company": "string"})
    return facts

# Store facts per conversation
conversation.facts["plan"] = "Enterprise"

# Use facts in generation
prompt = f"""
Known about user:
- Plan: {conversation.facts.get('plan', 'unknown')}

Answer their question using this context...
"""
Problem 9: Slow Response Times
Symptom
Queries take 5-10+ seconds. Users abandon before seeing answers.
Why It Happens
- Large embeddings
- Slow vector search
- Too many chunks retrieved
- Large context sent to LLM
- No caching
How to Fix It
Fix 1: Optimize embedding
# Batch embedding requests
chunks = [c1, c2, c3, ...]
embeddings = embed_batch(chunks) # One API call
# Use smaller models where quality permits
embedding = embed(text, model="text-embedding-3-small") # Faster
Fix 2: Index optimization
# Use HNSW index for faster approximate search
index = create_index(
    type="hnsw",
    m=16,
    ef_construction=200
)
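For a concrete example, this is roughly what an HNSW index looks like with FAISS (other stores such as Qdrant or pgvector expose similar knobs; `all_chunk_embeddings` and `query_embedding` stand in for your own data):
import faiss
import numpy as np

dim = 1536  # must match your embedding model (1536 for text-embedding-3-small)
index = faiss.IndexHNSWFlat(dim, 16)  # M = 16 graph neighbors per node
index.hnsw.efConstruction = 200       # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64              # query-time accuracy/speed trade-off

index.add(np.asarray(all_chunk_embeddings, dtype="float32"))
distances, ids = index.search(np.asarray([query_embedding], dtype="float32"), 10)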
Fix 3: Caching
@cache(ttl=3600)  # your caching library's TTL decorator
def search(query):
    return vector_search(query)

# Cache common queries
# Cache embedding computations
# Cache LLM responses for identical inputs
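If you don't have a caching layer yet, even a small in-process TTL cache keyed on the exact query string is a meaningful win. A sketch built around the same hypothetical `vector_search` call:
import time

_search_cache = {}  # query string -> (timestamp, results)

def cached_search(query, ttl=3600):
    now = time.time()
    hit = _search_cache.get(query)
    if hit and now - hit[0] < ttl:
        return hit[1]  # Cache hit: skip embedding and vector search entirely
    results = vector_search(query)
    _search_cache[query] = (now, results)
    return results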
Fix 4: Streaming
# Stream response as it generates
for chunk in llm.stream(prompt):
    yield chunk
# User sees response building, feels faster
Problem 10: No Way to Debug
Symptom
Something's wrong but you can't figure out what.
How to Fix It
Build observability from day one:
@trace
def answer_query(query):
    # Log the query
    span.log("query", query)

    # Log the embedding
    embedding = embed(query)
    span.log("embedding_dims", len(embedding))

    # Log retrieval
    results = search(query)
    span.log("results_count", len(results))
    span.log("top_score", results[0].score if results else 0)
    span.log("retrieved_texts", [r.text[:100] for r in results])

    # Log generation
    answer = generate(query, results)
    span.log("answer_length", len(answer))

    return answer
When things go wrong, you can trace:
- What was the query?
- What got retrieved?
- What scores did results have?
- What went into the prompt?
- What came out?
Quick Diagnostic Checklist
When your RAG chatbot fails, check in order:
- Is the document processed? Check chunk count, embedding presence
- Is retrieval working? Log retrieved chunks, check relevance
- Are the right chunks found? Manual inspection of top results
- Is the prompt correct? Log full prompt sent to LLM
- Is the LLM responding well? Check for hallucinations, formatting issues
Most problems are retrieval problems. Fix those first.
Building RAG well is hard. That's why it took us three iterations to get NovaKit's Document Chat right. Try it yourself and see how document AI should work.