In 2024, a €180M professional services firm in Zurich hired us to fix their RAG implementation. They had spent €240K building a knowledge retrieval system for their consultants. Six months after launch, usage was under 5% and the CTO wanted it shut down. The system was technically sound: vector database, embedding model, LLM integration, and a beautiful UI. The problem was that it didn't work for real documents. Legal contracts had redacted sections. Financial reports had footnotes that changed meaning. Client proposals had confidential annexes that couldn't be vectorized. The system was built for clean data. Real data is never clean.
This article shares the production-tested RAG architecture we rebuilt for them, which we now use with every client building enterprise knowledge systems. It's designed for messy documents, changing schemas, compliance requirements, and the reality that your data will never be as clean as your demo data.
Why Most RAG Systems Fail in Production
RAG tutorials show you how to index a PDF and ask questions about it. Production RAG requires handling: documents in 15+ formats (PDF, DOCX, scanned images, emails, spreadsheets), tables and charts that lose meaning when converted to text, confidential information that must be redacted before indexing, documents that change monthly (financial reports, compliance filings), and queries that require information from multiple documents with conflicting answers. The gap between tutorial RAG and production RAG is the gap between a bicycle and a freight train.
The Production RAG Architecture
Layer 1: Ingestion — Handling Messy Documents
Our ingestion layer uses a multi-format parser that handles PDFs (including scanned documents via OCR), Word documents with track changes, Excel files with multiple sheets, PowerPoint presentations with speaker notes, and email threads with attachments. Each document type has a specialized extractor that preserves structure: tables remain tables, headers remain headers, and footnotes remain associated with their parent text. For the Zurich client, this alone increased retrieval accuracy from 34% to 78%.
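One way to organize per-format extractors is a small registry keyed by file extension. The sketch below is illustrative only: `Block`, `register`, `extract`, and the plain-text extractor are hypothetical names, and a real pipeline would plug PDF, DOCX, spreadsheet, and email extractors into the same interface.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Callable, Dict, List


@dataclass
class Block:
    kind: str  # "header" or "paragraph" in this sketch; real extractors add "table", "footnote", ...
    text: str


Extractor = Callable[[bytes], List[Block]]
EXTRACTORS: Dict[str, Extractor] = {}  # file suffix -> extractor (hypothetical registry)


def register(*suffixes: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers one extractor for one or more file suffixes."""
    def wrap(fn: Extractor) -> Extractor:
        for suffix in suffixes:
            EXTRACTORS[suffix] = fn
        return fn
    return wrap


@register(".txt", ".md")
def extract_plain_text(data: bytes) -> List[Block]:
    """Toy extractor: blank-line-separated blocks, '#' prefix marks a header."""
    blocks: List[Block] = []
    for part in data.decode("utf-8").split("\n\n"):
        part = part.strip()
        if not part:
            continue
        if part.startswith("#"):
            blocks.append(Block("header", part.lstrip("# ").strip()))
        else:
            blocks.append(Block("paragraph", part))
    return blocks


def extract(filename: str, data: bytes) -> List[Block]:
    """Route a document to its extractor; unknown formats fail loudly."""
    suffix = PurePath(filename).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}") from None
    return extractor(data)
```

Failing loudly on unknown formats is a deliberate choice in this sketch: a document that is silently skipped at ingestion becomes a silent gap at retrieval.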
We also implement a document classification system that tags each document by type (contract, report, email, etc.), sensitivity level (public, internal, confidential, restricted), and freshness (date of last update). These tags are stored as metadata and used in the retrieval layer to filter results by user permissions and query context. A junior consultant asking about client engagement protocols shouldn't see the partner compensation model, even if both documents contain similar keywords.
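The permission filter can be sketched as a plain metadata check that runs before any similarity scoring. The names here (`Sensitivity`, `DocMeta`, `visible_documents`) and the idea of a single numeric clearance per user are assumptions for illustration; a real deployment would map these to its own access-control model.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import Iterable, List, Optional


class Sensitivity(IntEnum):
    # Ordered so that a higher value means more restricted.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class DocMeta:
    doc_id: str
    doc_type: str  # e.g. "contract", "report", "email"
    sensitivity: Sensitivity
    last_updated: date


def visible_documents(
    docs: Iterable[DocMeta],
    clearance: Sensitivity,
    doc_type: Optional[str] = None,
    updated_since: Optional[date] = None,
) -> List[DocMeta]:
    """Keep only documents the user may see, optionally narrowed by type and freshness."""
    result = []
    for doc in docs:
        if doc.sensitivity > clearance:
            continue  # above the user's clearance: never reaches retrieval
        if doc_type is not None and doc.doc_type != doc_type:
            continue
        if updated_since is not None and doc.last_updated < updated_since:
            continue
        result.append(doc)
    return result
```

With this in place, a user with `Sensitivity.INTERNAL` clearance never has a `RESTRICTED` document in the candidate set, regardless of how similar its keywords are to the query.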
Layer 2: Chunking — Preserving Context and Meaning
Standard chunking (split every 500 tokens) destroys meaning. A contract clause split in half becomes meaningless. A table row separated from its header becomes uninterpretable. Our chunking strategy is semantic: we split at document structure boundaries (sections, subsections, tables, lists) and include hierarchical context in each chunk. A chunk from Section 4.2 includes metadata that it's from Section 4.2, which is under Section 4, which is part of the engagement terms. This allows the retrieval system to understand not just what the text says, but where it sits in the document hierarchy.
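As a minimal sketch of the hierarchical context idea, assume the ingestion layer has already split a document into numbered sections. `Section`, `Chunk`, and `chunk_by_structure` are hypothetical names, and deriving ancestors from the dotted section number is a simplification of what a real parser would do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    number: str  # dotted section number, e.g. "4.2"
    title: str
    text: str


@dataclass
class Chunk:
    text: str
    path: List[str]  # ancestor headings, outermost first


def chunk_by_structure(sections: List[Section]) -> List[Chunk]:
    """Emit one chunk per non-empty section, tagged with its heading path."""
    titles = {s.number: s.title for s in sections}
    chunks: List[Chunk] = []
    for section in sections:
        if not section.text.strip():
            continue  # heading-only sections contribute context, not chunks
        parts = section.number.split(".")
        # "4.2" -> ["4", "4.2"]: every prefix is an ancestor (or the section itself)
        lineage = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
        path = [f"{n} {titles[n]}" for n in lineage if n in titles]
        chunks.append(Chunk(text=section.text, path=path))
    return chunks
```

The `path` can be stored as metadata, prepended to the chunk text before embedding, or both; which works better is an empirical question for each corpus.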
Layer 3: Embedding — Domain-Specific Representation
Generic embedding models (like OpenAI's text-embedding-3-small and text-embedding-3-large) work well for general queries but poorly for domain-specific language. A query about 'material adverse change clauses' in legal contracts requires embeddings that understand legal terminology, not general English. We use domain-specific embedding models for clients in regulated industries: legal embeddings for law firms, financial embeddings for banks, technical embeddings for engineering firms. The improvement is dramatic: for legal queries, domain-specific embeddings improved top-3 retrieval accuracy by 41%.
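Comparing embedding models requires a fixed evaluation set and a fixed metric. Top-k accuracy can be sketched as below, assuming each test query has exactly one known relevant chunk; `top_k_accuracy` and the input shapes are illustrative, not a specific library's API.

```python
from typing import Dict, List


def top_k_accuracy(
    ranked: Dict[str, List[str]],  # query -> chunk ids, best match first
    relevant: Dict[str, str],      # query -> the one chunk id that answers it
    k: int = 3,
) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results."""
    if not ranked:
        raise ValueError("need at least one query")
    hits = sum(
        1 for query, chunk_ids in ranked.items()
        if relevant[query] in chunk_ids[:k]
    )
    return hits / len(ranked)
```

Running the same labeled queries through each candidate model and comparing the scores gives a like-for-like result. When reporting a gain, it helps to state whether it is in percentage points or relative to the baseline.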
Layer 4: Retrieval — Hybrid Search with Re-ranking
Pure vector search fails on exact-match requirements. A query for 'Section 4.2(b) of the Master Services Agreement' requires exact text matching, not semantic similarity. We implement hybrid retrieval: vector search for conceptual queries, keyword search for exact matches, and a re-ranking model that combines both signals. The re-ranker uses cross-attention to score each candidate chunk against the full query, catching nuances that either search method would miss alone. For the Zurich client, hybrid retrieval improved answer accuracy from 52% to 89%.
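To illustrate how two ranked lists can be combined, the sketch below uses reciprocal rank fusion (RRF). This is a deliberately simpler stand-in, not the cross-attention re-ranker described above: RRF only looks at rank positions, whereas a learned re-ranker scores each chunk against the query text. The function name and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk ids; chunks ranked high in several lists win."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for one query from the two search methods.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["c1", "c3", "c2", "c4"]
```

In a pipeline like the one described, a fusion step of this kind would typically produce the candidate set, and the re-ranking model would then reorder the top candidates.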
Layer 5: Generation — Controlled, Traceable, Compliant
The generation layer must do three things: answer accurately (with citations to source documents), comply with regulations (no hallucinated facts, no confidential data leakage), and be traceable (every answer links to its source chunks for audit purposes). We implement a citation system that requires the LLM to reference specific document chunks for every claim. If the system can't find a relevant source, it says 'I don't have information about that' instead of hallucinating. For regulated industries, this is mandatory, not optional.
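A citation requirement is only useful if it is checked after generation. The sketch below shows one possible post-generation guard, assuming the LLM is prompted to end each sentence with bracketed chunk ids such as `[c1]` placed before the final punctuation. The marker format, `enforce_citations`, and the naive sentence splitter are all assumptions for illustration.

```python
import re
from typing import Set

REFUSAL = "I don't have information about that"
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")  # assumed marker format, e.g. [c1]


def enforce_citations(draft: str, retrieved_ids: Set[str]) -> str:
    """Return the draft only if every sentence cites at least one retrieved chunk."""
    if not retrieved_ids:
        return REFUSAL  # nothing was retrieved, so nothing can be grounded
    # Naive split on sentence-ending punctuation; abbreviations would need more care.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", draft.strip()) if s]
    if not sentences:
        return REFUSAL
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        # Reject uncited sentences and citations to chunks that were never retrieved.
        if not cited or not cited <= retrieved_ids:
            return REFUSAL
    return draft
```

Rejecting the whole answer on a single uncited sentence is the strictest policy. A softer variant could drop only the offending sentences, at the cost of a more complicated audit trail.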
The Implementation Timeline
A production RAG system typically takes 6-10 weeks to build and 2-4 weeks to tune. At the upper end of the build range, the schedule looks like this. Weeks 1-2: data audit and ingestion pipeline. Weeks 3-4: chunking strategy and embedding model selection. Weeks 5-6: retrieval layer and re-ranking. Weeks 7-8: generation layer and citation system. Weeks 9-10: integration, security review, and user testing. Weeks 11-12: performance tuning and edge case handling. The total investment for a mid-market enterprise is €85K-€180K depending on document volume and compliance requirements. The typical ROI is 300-600% within 12 months through productivity gains and error reduction.
Building an enterprise knowledge system? Our AI Engineering team designs and deploys RAG pipelines that handle real-world data complexity.
Design Your RAG Pipeline







