Why Your RAG Chatbot Sucks (And How to Fix It)
Your RAG chatbot gives wrong answers, misses obvious information, and hallucinates sources. Here's why—and a practical guide to fixing the most common problems.
You built a RAG chatbot. You uploaded your documents. You asked it a question that's clearly answered on page 3.
It confidently gave you the wrong answer. Or it said "I don't have that information" when you're staring at it in the source document.
You're not alone. Most RAG implementations are broken in predictable ways.
This guide will help you diagnose why your RAG chatbot sucks and show you how to fix it.
Problem 1: Wrong Chunks Retrieved
Symptom
The chatbot retrieves chunks that seem related but don't actually answer the question.
User: "What's the cancellation policy for annual plans?"
Retrieved chunk: "Our plans include monthly and annual options.
Annual plans offer a 20% discount compared to monthly billing."
Answer: "Annual plans offer a 20% discount!"
(Completely missed the actual cancellation policy)
Why It Happens
Semantic similarity ≠ relevance. The embedding model found text about "annual plans" but not about "cancellation."
Vector search optimizes for topical similarity, not answer relevance.
How to Fix It
Fix 1: Hybrid search
Add keyword search alongside vector search:
# Vector results
vector_results = vector_search(query, k=10)
# Keyword results
keyword_results = bm25_search(query, k=10)
# Combine with reciprocal rank fusion
final_results = rrf_combine(vector_results, keyword_results)
Keyword search catches exact term matches that vectors miss.
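Reciprocal rank fusion itself is simple: score each chunk by the inverse of its rank in every result list it appears in, then sum. A minimal sketch of an `rrf_combine` helper, assuming each result object exposes a stable `id`:
def rrf_combine(vector_results, keyword_results, k=60):
    # RRF score: sum over result lists of 1 / (k + rank); k=60 is the usual default
    scores = {}
    by_id = {}
    for results in (vector_results, keyword_results):
        for rank, result in enumerate(results, start=1):
            scores[result.id] = scores.get(result.id, 0.0) + 1.0 / (k + rank)
            by_id[result.id] = result
    # Best fused score first
    return [by_id[i] for i in sorted(scores, key=scores.get, reverse=True)]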
Fix 2: Query decomposition
Break complex queries into sub-queries:
def decompose_query(query):
    # "cancellation policy for annual plans" becomes:
    return [
        "cancellation policy",
        "annual plan terms",
        "refund policy annual",
    ]

# Search for each sub-query, then combine the results (sketch below)
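One way to do that combination: run a retrieval pass per sub-query and keep the best-scoring hit for each chunk, so chunks that match several sub-queries aren't duplicated. A rough sketch, assuming each result carries an `id` and a `score`:
def search_decomposed(query, k=10):
    best = {}  # chunk id -> best-scoring result seen so far
    for sub_query in decompose_query(query):
        for result in vector_search(sub_query, k=k):
            seen = best.get(result.id)
            if seen is None or result.score > seen.score:
                best[result.id] = result
    # Return the strongest hits across all sub-queries
    return sorted(best.values(), key=lambda r: r.score, reverse=True)[:k]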
Fix 3: Better metadata filtering
If your documents have categories:
# Filter to relevant sections first
results = vector_search(
    query,
    filter={"category": "policies"}
)
Problem 2: Chunk Boundaries Split Information
Symptom
The answer exists but spans two chunks. The chatbot retrieves one chunk and generates a partial or wrong answer.
Document text:
"...The API rate limit is 100 requests per minute for free tier users.
[CHUNK BOUNDARY]
For paid users, the limit increases to 1000 requests per minute..."
User: "What's the rate limit for paid users?"
Retrieved: First chunk only
Answer: "The rate limit is 100 requests per minute."
(Wrong! That's free tier.)
Why It Happens
Fixed-size chunking doesn't respect semantic boundaries. Information gets split at arbitrary points.
How to Fix It
Fix 1: Increase chunk overlap
# Add overlap so context spans chunks
chunks = split_text(
    document,
    chunk_size=500,
    overlap=100  # 100 tokens of overlap
)
Fix 2: Semantic chunking
Split at natural boundaries:
def semantic_chunk(text, max_chunk_size=500):
    # Split at paragraph breaks
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            chunks.append(current_chunk)
            current_chunk = para
    if current_chunk:
        chunks.append(current_chunk)  # don't lose the final chunk
    return chunks
Fix 3: Parent-child retrieval
Retrieve small chunks for precision, return larger context:
# Index small chunks for search
small_chunks = split_text(doc, chunk_size=200)

# Keep a mapping from each small chunk to its larger parent section
for chunk in small_chunks:
    chunk.parent_section = get_parent_section(chunk)

# At query time: search over small chunks...
retrieved_chunks = search(query)

# ...but return the parent sections, not just the chunks
# (deduplicate if several chunks share the same parent)
parent_sections = [chunk.parent_section for chunk in retrieved_chunks]
Problem 3: Hallucinated Sources
Symptom
The chatbot cites sources that don't exist or attributes information to the wrong document.
User: "What's our refund policy?"
Answer: "According to section 4.2 of the Terms of Service,
refunds are available within 30 days."
Reality: Section 4.2 doesn't exist. Refund policy is in
a completely different document.
Why It Happens
The LLM generates plausible-sounding citations based on patterns in training data, not the actual retrieved content.
How to Fix It
Fix 1: Explicit citation requirements
prompt = """
Answer based ONLY on the following sources.
For EVERY claim, cite the exact source in brackets [Source: filename, page X].
If information isn't in the sources, say "This isn't covered in the provided documents."
Sources:
{retrieved_chunks}
Question: {query}
"""
Fix 2: Validate citations post-generation
def validate_citations(answer, sources):
    # Extract all citations from the answer
    citations = extract_citations(answer)
    for citation in citations:
        # Check whether the cited text actually exists in the sources
        if not verify_in_sources(citation.text, sources):
            # Flag or remove the hallucinated citation
            answer = flag_unverified(answer, citation)
    return answer
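The helpers can start out very simple. If you use the bracketed citation format from the prompt above, a regex plus a membership check against the retrieved sources is a reasonable first pass. A simplified sketch that checks cited filenames rather than quoted text (`source.filename` is an assumed attribute on your retrieved sources):
import re

CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")

def extract_citations(answer):
    # Pull out every "[Source: filename, page X]" style citation
    return CITATION_PATTERN.findall(answer)

def verify_in_sources(cited, sources):
    # Accept a citation only if it names a document we actually retrieved
    known_files = {source.filename for source in sources}
    cited_file = cited.split(",")[0].strip()
    return cited_file in known_files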
Fix 3: Structured output
Force the model to output in a structured format:
response_schema = {
    "answer": "string",
    "citations": [
        {
            "claim": "string",
            "source_chunk_id": "string",
            "quote": "string"  # Exact quote from the source
        }
    ],
    "confidence": "high | medium | low"
}
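Structured output also makes verification mechanical: every `quote` should appear verbatim in the chunk it cites. A sketch of that check, assuming the model returns JSON and a hypothetical `chunks_by_id` dict maps chunk IDs to their text:
import json

def check_structured_answer(raw_response, chunks_by_id):
    data = json.loads(raw_response)
    unverified = []
    for citation in data["citations"]:
        chunk_text = chunks_by_id.get(citation["source_chunk_id"], "")
        # The quoted text must appear verbatim in the cited chunk
        if citation["quote"] not in chunk_text:
            unverified.append(citation["claim"])
    return data["answer"], unverified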
Problem 4: "I Don't Know" When Answer Exists
Symptom
The information is clearly in your documents, but the chatbot says it doesn't have it.
User: "What integrations do you support?"
Answer: "I don't have information about integrations."
Document (definitely indexed): "We support integrations with
Slack, Discord, Zapier, Make, and n8n..."
Why It Happens
Several possible causes:
- Document processing failed silently
- Embedding model didn't capture the semantic relationship
- Retrieval threshold too high
- Query phrasing doesn't match document phrasing
How to Fix It
Fix 1: Add logging to diagnose
def search_with_logging(query):
    # Log the query and a preview of its embedding
    logger.info(f"Query: {query}")
    logger.info(f"Query embedding: {embed(query)[:5]}...")  # First 5 dims

    # Log the retrieval results
    results = vector_search(query, k=10)
    for i, result in enumerate(results):
        logger.info(f"Result {i}: {result.text[:100]}... Score: {result.score}")

    return results
If results are empty or low-scoring, you know retrieval failed.
Fix 2: Lower retrieval threshold
# Instead of hard threshold
results = search(query, min_score=0.8) # Too strict
# Use top-k without threshold
results = search(query, k=10)
# Then filter in context
Fix 3: Synonym expansion
def expand_query(query):
    # Add synonyms and related terms
    synonyms = get_synonyms(query)  # "integrations" -> "connections", "apps", "plugins"
    expanded = f"{query} {' '.join(synonyms)}"
    return expanded
Fix 4: Verify document processing
def verify_document_indexed(doc_id):
    # Check whether the document has chunks at all
    chunks = get_chunks_for_document(doc_id)
    if not chunks:
        return {"status": "not_indexed", "reason": "no_chunks"}

    # Check whether every chunk has an embedding
    chunks_with_embeddings = [c for c in chunks if c.embedding is not None]
    if len(chunks_with_embeddings) < len(chunks):
        return {"status": "partial", "reason": "missing_embeddings"}

    return {"status": "fully_indexed"}
Problem 5: Stale or Contradictory Information
Symptom
The chatbot gives outdated answers or mixes information from different document versions.
User: "What's the API endpoint for user creation?"
Answer: "Use POST /api/v1/users"
Reality: That was v1. Current docs say POST /api/v2/users
Both versions are indexed.
Why It Happens
You indexed multiple document versions without version awareness. The retriever might find the old version.
How to Fix It
Fix 1: Add timestamp metadata
chunk = {
    "text": "...",
    "embedding": [...],
    "metadata": {
        "document_id": "api-docs",
        "version": "2.0",
        "last_updated": "2026-01-01",
        "is_current": True
    }
}
Fix 2: Filter by recency
results = search(
    query,
    filter={"is_current": True}
)

# Or prefer the most recent documents
results = search(
    query,
    sort_by="last_updated",
    order="desc"
)
Fix 3: Remove old versions
When uploading new documents, explicitly remove old versions:
def upload_document(doc, version):
    # Remove previous versions
    delete_chunks_where(document_id=doc.id, version__lt=version)
    # Add the new version
    add_chunks(doc, version)
Problem 6: Context Window Overflow
Symptom
For complex queries, the chatbot's answer quality degrades or it ignores some retrieved information.
Why It Happens
You're stuffing too many chunks into the prompt. Even when everything technically fits in the context window, the model struggles to use information buried in the middle of a long context.
How to Fix It
Fix 1: Limit retrieved chunks
# Instead of retrieving many chunks
results = search(query, k=20) # Too many
# Retrieve fewer, higher quality
results = search(query, k=5)
Fix 2: Summarize before including
def prepare_context(chunks, max_tokens=2000):
    total_tokens = sum(count_tokens(c) for c in chunks)
    if total_tokens > max_tokens:
        # Keep the top chunks in full, summarize the less relevant rest
        important_chunks = chunks[:3]
        summarized = summarize(chunks[3:])
        return important_chunks + [summarized]
    return chunks
Fix 3: Iterative retrieval
def iterative_answer(query):
    # Start with a small context
    results = search(query, k=3)
    answer = generate(query, results)

    # If the answer is incomplete, fetch additional context and retry
    if needs_more_info(answer):
        more_results = search(refine_query(query, answer), k=3)
        answer = generate(query, results + more_results)

    return answer
Problem 7: Poor Handling of Tables and Structured Data
Symptom
Questions about data in tables return wrong answers or "not found."
Document contains:
| Plan | Price | Users |
|-------|-------|-------|
| Free | $0 | 1 |
| Pro | $29 | 5 |
User: "How many users does the Pro plan support?"
Answer: "I couldn't find information about Pro plan user limits."
Why It Happens
Tables don't chunk or embed well. Row context gets split from headers.
How to Fix It
Fix 1: Flatten tables
def flatten_table(table):
    rows = []
    headers = table.headers
    for row in table.rows:
        # e.g. "Pro: Price is $29, Users is 5"
        row_text = f"{row[0]}: " + ", ".join(
            f"{h} is {v}" for h, v in zip(headers[1:], row[1:])
        )
        rows.append(row_text)
    return "\n".join(rows)
Fix 2: Index tables separately
# Create special table chunks with full context
table_chunk = {
    "text": table.to_markdown(),  # Full table as markdown
    "type": "table",
    "metadata": {
        "headers": table.headers,
        "row_count": len(table.rows)
    }
}
Fix 3: Add table descriptions
def describe_table(table):
    return f"""
Table: {table.caption or 'Unnamed'}
Columns: {', '.join(table.headers)}
Contains information about: {infer_topic(table)}
"""
Problem 8: Ignoring User Context
Symptom
The chatbot doesn't use information from earlier in the conversation.
User: "I'm on the Enterprise plan"
Bot: "Got it!"
User: "What's my rate limit?"
Bot: "Rate limits depend on your plan. Free is 100/min, Pro is 500/min, Enterprise is unlimited."
(Should have directly said "unlimited" based on context)
Why It Happens
Each query is processed independently without conversation context.
How to Fix It
Fix 1: Include conversation history in retrieval
def search_with_context(query, conversation_history):
    # Combine recent conversation context with the query
    context = summarize_recent(conversation_history[-5:])
    enriched_query = f"{context}\n\nCurrent question: {query}"
    return search(enriched_query)
Fix 2: Extract and store facts
def extract_user_facts(message):
    facts = llm_extract(message, schema={"plan": "string", "company": "string"})
    return facts

# Store facts per conversation
conversation.facts["plan"] = "Enterprise"

# Use facts in generation
prompt = f"""
Known about user:
- Plan: {conversation.facts.get('plan', 'unknown')}

Answer their question using this context...
"""
Problem 9: Slow Response Times
Symptom
Queries take 5-10+ seconds. Users abandon before seeing answers.
Why It Happens
- Large embeddings
- Slow vector search
- Too many chunks retrieved
- Large context sent to LLM
- No caching
How to Fix It
Fix 1: Optimize embedding
# Batch embedding requests
chunks = [c1, c2, c3, ...]
embeddings = embed_batch(chunks) # One API call
# Use smaller models where quality permits
embedding = embed(text, model="text-embedding-3-small") # Faster
Fix 2: Index optimization
# Use HNSW index for faster approximate search
index = create_index(
    type="hnsw",
    m=16,
    ef_construction=200
)
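For a concrete example, this is roughly what an HNSW index looks like with FAISS (other stores such as Qdrant or pgvector expose similar knobs; `all_chunk_embeddings` and `query_embedding` stand in for your own data):
import faiss
import numpy as np

dim = 1536  # must match your embedding model (1536 for text-embedding-3-small)
index = faiss.IndexHNSWFlat(dim, 16)  # M = 16 graph neighbors per node
index.hnsw.efConstruction = 200       # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64              # query-time accuracy/speed trade-off

index.add(np.asarray(all_chunk_embeddings, dtype="float32"))
distances, ids = index.search(np.asarray([query_embedding], dtype="float32"), 10)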
Fix 3: Caching
@cache(ttl=3600)  # your caching library's TTL decorator
def search(query):
    return vector_search(query)

# Cache common queries
# Cache embedding computations
# Cache LLM responses for identical inputs
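If you don't have a caching layer yet, even a small in-process TTL cache keyed on the exact query string is a meaningful win. A sketch built around the same hypothetical `vector_search` call:
import time

_search_cache = {}  # query string -> (timestamp, results)

def cached_search(query, ttl=3600):
    now = time.time()
    hit = _search_cache.get(query)
    if hit and now - hit[0] < ttl:
        return hit[1]  # Cache hit: skip embedding and vector search entirely
    results = vector_search(query)
    _search_cache[query] = (now, results)
    return results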
Fix 4: Streaming
# Stream response as it generates
for chunk in llm.stream(prompt):
    yield chunk
# User sees response building, feels faster
Problem 10: No Way to Debug
Symptom
Something's wrong but you can't figure out what.
How to Fix It
Build observability from day one:
@trace
def answer_query(query):
    # Log the query
    span.log("query", query)

    # Log the embedding
    embedding = embed(query)
    span.log("embedding_dims", len(embedding))

    # Log retrieval
    results = search(query)
    span.log("results_count", len(results))
    span.log("top_score", results[0].score if results else 0)
    span.log("retrieved_texts", [r.text[:100] for r in results])

    # Log generation
    answer = generate(query, results)
    span.log("answer_length", len(answer))

    return answer
When things go wrong, you can trace:
- What was the query?
- What got retrieved?
- What scores did results have?
- What went into the prompt?
- What came out?
Quick Diagnostic Checklist
When your RAG chatbot fails, check in order:
- Is the document processed? Check chunk count, embedding presence
- Is retrieval working? Log retrieved chunks, check relevance
- Are the right chunks found? Manual inspection of top results
- Is the prompt correct? Log full prompt sent to LLM
- Is the LLM responding well? Check for hallucinations, formatting issues
Most problems are retrieval problems. Fix those first.
Building RAG well is hard. That's why it took us three iterations to get NovaKit's Document Chat right. Try it yourself and see how document AI should work.