March 28, 2026

RAG in Production: What Nobody Tells You

Retrieval-augmented generation sounds simple in tutorials. Here's what actually breaks when you ship it to real users.

Every RAG tutorial makes it look easy: chunk your documents, embed them, retrieve the top-k, stuff them into a prompt. Ship it. Done.

Then you deploy to production and everything falls apart.

The chunking problem

Fixed-size chunks destroy context. A 512-token window that slices a paragraph mid-sentence embeds to noise and drags down relevance scores. Semantic chunking helps, but it's slower and harder to get right. And tables? Good luck chunking a table meaningfully.
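One alternative to fixed-size windows is to split on the document's own structure. Here's a minimal sketch (the function name and size limit are illustrative): split on blank lines so paragraphs stay intact, pack paragraphs into chunks up to a size budget, and hard-split only when a single paragraph blows the budget.

```python
import re

def chunk_by_structure(text: str, max_chars: int = 2000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars,
    so no paragraph is cut mid-sentence."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}" if current else para
        else:
            if current:
                chunks.append(current)
            # Oversized paragraph: hard-split as a last resort.
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Tables and headers still need their own handling; this only fixes the mid-sentence split.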

The query problem

Your users don't write queries the way search engines expect. They type "what was that thing about the Q3 budget" and expect the system to infer context, recency, and intent. Cosine similarity on embeddings alone doesn't cut it.
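For reference, the metric in question is purely geometric — here's a minimal sketch using toy vectors in place of real embedding-model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy vectors: angular closeness in embedding space says nothing
# about recency, user context, or what "that thing" refers to.
query_vec = [1.0, 0.0, 1.0]
chunk_vec = [0.9, 0.1, 0.8]
score = cosine_similarity(query_vec, chunk_vec)
```

The score measures how similar two texts *sound*, not whether the chunk actually answers the question.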

The answer quality trap

The retriever found relevant chunks. The LLM generated a fluent answer. But is it correct? Without evaluation infrastructure, you have no idea. And "it sounds right" is not an evaluation strategy.
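Even a crude harness beats no harness. A minimal sketch, assuming you keep a labeled set of (question, expected fact) pairs and can call your pipeline as a function — the names here are illustrative, and the substring check should eventually be swapped for an LLM judge or stricter grading:

```python
def evaluate(answer_fn, eval_set: list[dict]) -> dict:
    """Score a RAG pipeline against labeled examples.

    Each example holds a question and a fact the answer must contain.
    Substring matching is crude but catches regressions immediately.
    """
    hits = 0
    failures = []
    for ex in eval_set:
        answer = answer_fn(ex["question"])
        if ex["expected_fact"].lower() in answer.lower():
            hits += 1
        else:
            failures.append(ex["question"])
    return {"accuracy": hits / len(eval_set), "failures": failures}
```

Run it in CI on every change to the chunker, retriever, or prompt — that's what "evaluation from day one" means in practice.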

What actually works

We've shipped RAG systems to production for dozens of clients. Here's what that taught us:

  1. Hybrid search — combine embeddings with BM25 keyword search, then rerank.
  2. Smart chunking — respect document structure. Headers, paragraphs, and tables need different strategies.
  3. Query transformation — rewrite the user's query before retrieval. HyDE and multi-query both help.
  4. Evaluation from day one — build automated accuracy benchmarks before you ship, not after users complain.
  5. Guardrails — know when the system doesn't know. Confident wrong answers are worse than "I don't have that information."
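The first item — hybrid search — can be sketched with reciprocal rank fusion, a simple way to merge a BM25 ranking and a vector ranking without tuning score weights. This assumes you already have the two ranked lists; the document IDs are illustrative:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank)).

    k=60 is the constant from the original RRF paper; it damps the
    influence of any single ranker's top positions.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative ranked lists from two retrievers.
bm25_results = ["doc_a", "doc_c", "doc_b"]      # keyword ranking
vector_results = ["doc_b", "doc_a", "doc_d"]    # embedding ranking
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```

Because RRF works on ranks rather than raw scores, the BM25 and embedding scores never need to be on the same scale — which is exactly the problem naive score averaging runs into. A cross-encoder reranker on the fused top-k then does the fine-grained ordering.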

RAG is not a solved problem. But it is a solvable one — if you treat it as an engineering challenge, not a tutorial project.