RAG architecture for the enterprise: what actually works
A practical reference architecture for retrieval-augmented generation in regulated environments — plus the seven production failures nobody warns you about, and how to avoid each one.
Every week a team ships its first RAG system, and every week some of those systems fail in the same predictable ways. The core technology is easy. The production engineering around it is where most teams hit the wall.
This post is the condensed version of what we build and what we avoid, based on systems we’ve shipped to healthcare, financial services, and public-sector clients.
The reference architecture
A production RAG system has six stages. Each stage has its own failure modes. Skipping any one will cost you three weeks of debugging during UAT.
- Ingestion. Files land in your system — PDFs, HTML, Office docs, Confluence exports.
- Parsing. Convert heterogeneous formats into clean text with structural hints (headings, tables, lists).
- Chunking. Break text into retrieval units with overlap and rich metadata.
- Indexing. Embed chunks, store in a vector index with filterable metadata.
- Retrieval. Hybrid search (BM25 + dense) then rerank.
- Generation. Assemble prompt, constrain output, validate before display.
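In code, the six stages reduce to a small pluggable pipeline. This is a minimal sketch, not any framework's API — every stage here is an illustrative callable you would swap for your real parser, index, and model:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A retrieval unit: text plus the metadata needed for citations."""
    text: str
    meta: dict = field(default_factory=dict)

def run_pipeline(raw_files, parse, chunk, index, retrieve, rerank, generate, query):
    """Wire the six stages together; each stage is a pluggable callable."""
    docs = [parse(f) for f in raw_files]           # Parsing
    chunks = [c for d in docs for c in chunk(d)]   # Chunking
    store = index(chunks)                          # Indexing
    candidates = retrieve(store, query)            # Retrieval (hybrid)
    top = rerank(query, candidates)[:5]            # Rerank to top-5
    return generate(query, top)                    # Generation
```

Starting from a skeleton like this keeps each stage independently testable, which matters once you start debugging the failure modes below.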
Failure 1: PDF parsing ruins everything downstream
In our experience, something like 80% of enterprise RAG failures trace back to bad text extraction from PDFs. Tables become scrambled word soup. Multi-column layouts get read top-to-bottom instead of column by column. Footnotes are inlined into the main body and change meaning.
Use a layout-aware parser: Unstructured.io, LlamaParse, AWS Textract, or Azure Document Intelligence. Do not use plain `pdfplumber` or `PyPDF2` for anything beyond simple prose. Validate output on a sample before trusting it at scale.
Failure 2: Chunking that ignores document structure
Naive character-count chunking splits in the middle of sentences, tables, and lists. Your retrieval returns chunks with half a thought.
Fix: structural chunking. Split on headings first, then on paragraphs, only falling back to character count for very long paragraphs. Preserve 10-20% overlap to keep boundary context. Tag every chunk with document ID, section heading, and page number — you’ll need these for citations.
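A minimal sketch of that chunking strategy, assuming markdown-style headings in the parsed text. Document ID and page number would come from your parser's metadata; only the section heading is carried here to keep the example short:

```python
import re

def structural_chunks(text, max_chars=1500, overlap_frac=0.15):
    """Split on headings first, then paragraphs; fall back to overlapping
    character windows only for very long paragraphs."""
    chunks = []
    # Zero-width split keeps each heading attached to its own section body.
    for sec in re.split(r"(?m)^(?=#+ )", text):
        if not sec.strip():
            continue
        lines = sec.splitlines()
        if lines[0].lstrip().startswith("#"):
            heading = lines[0].lstrip("# ").strip()
            body = "\n".join(lines[1:])
        else:
            heading, body = "", sec
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            if len(para) <= max_chars:
                chunks.append({"text": para, "heading": heading})
            else:
                # Fallback: fixed windows with ~15% overlap for boundary context.
                step = int(max_chars * (1 - overlap_frac))
                for i in range(0, len(para), step):
                    chunks.append({"text": para[i:i + max_chars], "heading": heading})
    return chunks
```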
Failure 3: Embedding a 1-line chunk the same way as a 500-line one
Embedding models are trained on sentence-to-paragraph-length text. A one-line "Section 4" heading and a 500-line explanation produce embeddings that are incomparable in the vector space.
Normalize chunk length (aim for 200-500 tokens). Merge very short chunks with their neighbors. Never embed tables of contents or index pages — they will poison your retrieval.
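Merging short chunks into their neighbors can be a single pass over the chunk list. This sketch approximates token counts with whitespace word counts — swap in your embedding model's tokenizer for real length checks:

```python
def normalize_chunks(chunks, min_tokens=50, max_tokens=500):
    """Fold any too-short chunk into the previous one, as long as the
    combined chunk stays under the upper token bound."""
    merged = []
    for c in chunks:
        words = len(c["text"].split())
        if merged:
            prev_words = len(merged[-1]["text"].split())
            if (words < min_tokens or prev_words < min_tokens) \
                    and prev_words + words <= max_tokens:
                # Merge: a bare "Section 4" heading ends up attached to its body.
                merged[-1] = {**merged[-1],
                              "text": merged[-1]["text"] + "\n" + c["text"]}
                continue
        merged.append(dict(c))
    return merged
```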
Failure 4: Dense retrieval misses exact-match queries
Someone searches "policy section 4.3(b)" and dense retrieval returns semantically similar content instead of the exact clause. Users lose trust immediately.
Fix: hybrid search. Run BM25 and dense retrieval in parallel, merge results, then rerank with a cross-encoder. BM25 catches exact keyword matches. Dense catches semantic ones. Rerank picks the best of both for the prompt.
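A minimal sketch of the merge step using reciprocal rank fusion over ranked lists of document IDs (`k=60` follows the original RRF paper's default):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc ID,
    and the sums decide the fused order. k damps the dominance of any single
    list's top ranks."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc that appears high in both the BM25 list and the dense list outranks one that tops only a single list, which is exactly the behavior you want before reranking.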
Failure 5: No reranker means you send garbage to the model
Top-50 retrieval recall is usually fine. Top-5 precision is usually terrible without a reranker. If you stuff the top-5 raw retrieval results into your prompt, you’re paying context-window tokens for low-quality snippets.
Add a cross-encoder reranker (BGE-reranker, Cohere reranker, or a purpose-trained one). Retrieve top-50, rerank to top-5, send those five to the model. Costs ~100ms, yields substantially better answers.
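The retrieve-then-rerank step looks like this, with the cross-encoder stubbed out as `score_fn` — in a real system that callable would wrap something like a BGE-reranker model; the name and signature are illustrative, not a library API:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, chunk) pair with a cross-encoder and keep the best.
    Call this on the top-50 retrieval results, not the whole corpus:
    cross-encoders are accurate but too slow for first-stage retrieval."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_k]
```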
Failure 6: Letting the model generate without grounding checks
LLMs will happily produce confident answers when the retrieved context doesn’t actually support them. In regulated environments this is a direct liability.
Enforce a strict output schema: verdict, citation (chunk ID + source document + page), confidence. If the model cannot produce all three, drop the answer or escalate. Never show ungrounded output to the user.
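The grounding gate itself can be a few lines. This sketch assumes the model's output is parsed into a dict with the three fields above; the key names are illustrative, and the extra check that the cited chunk was actually retrieved catches fabricated citations:

```python
REQUIRED_CITATION_KEYS = {"chunk_id", "source_doc", "page"}

def validate_answer(answer, retrieved_chunk_ids):
    """Reject any output missing verdict, a complete citation, or confidence,
    or citing a chunk that was never in the retrieved context."""
    if not all(k in answer for k in ("verdict", "citation", "confidence")):
        return False
    citation = answer.get("citation") or {}
    if not REQUIRED_CITATION_KEYS <= set(citation):
        return False
    return citation["chunk_id"] in retrieved_chunk_ids
```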
Failure 7: No evaluation harness, no drift detection
"It worked in testing" is the most expensive sentence in AI engineering. Without a nightly eval harness you cannot detect when a retrieval tweak, a model update, or a corpus change has broken a feature.
Build 100-500 golden question-answer pairs. Run them nightly. Track retrieval recall, answer accuracy, citation validity, and refusal rate. Alert on deltas beyond thresholds.
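A minimal harness over such a golden set might look like the following. The `run_system` contract and every key name here are assumptions for the sketch, not any evaluation tool's API:

```python
def eval_golden_set(golden, run_system):
    """Run each golden QA pair through the system and aggregate the four
    tracked metrics. run_system(question) is assumed to return a dict with
    'refused', 'retrieved_ids', 'answer', and 'citation_ids'."""
    n = len(golden)
    recall_hits = answer_hits = valid_citations = refusals = 0
    for case in golden:
        out = run_system(case["question"])
        if out["refused"]:
            refusals += 1
            continue
        if set(case["gold_chunk_ids"]) & set(out["retrieved_ids"]):
            recall_hits += 1
        if case["expected_answer"].lower() in out["answer"].lower():
            answer_hits += 1
        if set(out["citation_ids"]) <= set(out["retrieved_ids"]):
            valid_citations += 1
    return {
        "retrieval_recall": recall_hits / n,
        "answer_accuracy": answer_hits / n,
        "citation_validity": valid_citations / n,
        "refusal_rate": refusals / n,
    }
```

Wire the returned dict into whatever alerting you already have and page on deltas beyond your thresholds.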
What we use by default
- Parsing: Unstructured.io or Azure Document Intelligence depending on compliance posture
- Vector store: pgvector for simple deployments, Qdrant or Weaviate for scale
- Embeddings: BGE-large or nomic-embed (both run locally)
- Retrieval: hybrid BM25 + dense with RRF (reciprocal rank fusion) merge
- Reranker: BGE-reranker-large
- Generation: self-hosted Llama 3 or Mistral for sensitive workloads, Azure OpenAI otherwise
- Evaluation: Ragas + custom golden-set harness
Complexity you don’t need
Things we see teams build that rarely pay off in our experience:
- Agent-based retrieval with 5+ tools. Adds latency and failure modes. Use it only when single-shot retrieval genuinely can’t answer.
- Graph RAG before you’ve mastered vector RAG. Solve the 80% case first.
- Fine-tuning your embedding model before you’ve fixed parsing and chunking. You’re polishing a bad foundation.
- Custom orchestration frameworks. Start with 50 lines of Python. Adopt LangChain / LlamaIndex / etc. only when you feel the pain they solve.
The 80/20 of a production RAG launch
If you have limited time, spend it in this order:
- Parsing quality (30% of total impact)
- Chunking with structure (20%)
- Hybrid retrieval + reranker (20%)
- Strict output grounding (15%)
- Evaluation harness (10%)
- Everything else (5%)
Notice what’s not on this list: prompt engineering, model selection, and fine-tuning. Those are last-mile optimizations. The failures above account for almost every RAG system we’ve had to rescue.