RAG in the Real World: Lessons from Building 'ChatGPT Over Our Docs'

This article examines real-world challenges of building Retrieval-Augmented Generation systems for enterprise knowledge bases, exploring critical decisions around document chunking, hybrid search strategies, context window management, and the complex permissioning requirements that make production deployment difficult. Drawing on lessons from teams nine months into ChatGPT-era deployments, it identifies common failure modes and emerging best practices for systems that actually work at scale.

9/4/2023 · 6 min read

Over the past nine months, "ChatGPT for our internal documents" has become one of the most common LLM use cases enterprises pursue. The pitch is compelling: give employees conversational access to company knowledge—policies, procedures, technical documentation, meeting notes—eliminating the frustration of searching through SharePoint or Confluence.

The architecture seems straightforward: embed documents into a vector database, retrieve relevant chunks based on user queries, inject them into ChatGPT's context, and return synthesized answers. Dozens of startups offer platforms promising this functionality in days. Yet teams building these systems are discovering that "ChatGPT over our docs" is deceptively complex. By early September 2023, enough production systems exist to identify consistent failure modes and emerging best practices.

The Basic RAG Architecture

Retrieval-Augmented Generation works by separating knowledge from the language model. The model provides language understanding and generation capability. The retrieval system provides factual information. Together, they answer questions the base model alone couldn't handle.

The standard implementation: documents are split into chunks (typically 500-1000 tokens), embedded using models like OpenAI's text-embedding-ada-002, and stored in vector databases like Pinecone or Weaviate. When users ask questions, the question is embedded, similar chunks are retrieved via vector similarity search, and these chunks are injected into the LLM prompt as context for answering.
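
To make the shape of that pipeline concrete, here is a minimal sketch. It assumes the pre-1.0 OpenAI Python client and substitutes an in-memory list for a real vector database; batching, persistence, and error handling are all omitted.

```python
# Minimal RAG sketch: in-memory store in place of Pinecone/Weaviate,
# OpenAI Python client (pre-1.0 interface). Not production-ready.
import math
import openai

EMBED_MODEL = "text-embedding-ada-002"
CHAT_MODEL = "gpt-3.5-turbo"

def embed(text: str) -> list[float]:
    resp = openai.Embedding.create(model=EMBED_MODEL, input=text)
    return resp["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class InMemoryStore:
    def __init__(self):
        self.items = []  # list of (embedding, chunk_text)

    def add(self, chunk: str):
        self.items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 5) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

def answer(store: InMemoryStore, question: str) -> str:
    # Inject retrieved chunks as context for the chat model.
    context = "\n\n".join(store.search(question, k=5))
    resp = openai.ChatCompletion.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```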

This works beautifully in demos. Production reveals the complexities.

The Chunking Problem

Document chunking seems trivial—split text into overlapping segments—but significantly impacts quality. Chunks too small lose context. Chunks too large exceed context windows or include irrelevant information that confuses the model.

Semantic chunking has emerged as superior to naive character-count splitting. Rather than splitting every 500 tokens, identify semantic boundaries—paragraphs, section headers, topic shifts. Several teams report building custom chunking logic that respects document structure: keeping tables intact, not splitting code blocks mid-function, maintaining bullet list coherence.

One engineering team shared that naive chunking split their API documentation mid-example, with the code sample's setup in one chunk and explanation in another. Queries about that API returned incomplete, confusing answers. They rebuilt their chunking to preserve example integrity, dramatically improving quality.

Overlap strategies matter. Chunks typically overlap 10-20% to prevent information split across boundaries from being lost. But determining optimal overlap requires experimentation—too little misses context, too much wastes tokens and retrieval slots.
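
A rough sketch of structure-aware chunking with overlap, assuming plain-text or Markdown-like documents: it splits on blank lines, keeps fenced code blocks intact, and uses word counts as a stand-in for token counts (swap in a real tokenizer such as tiktoken in practice).

```python
# Structure-aware chunking sketch: blank-line boundaries, code fences kept whole,
# trailing blocks carried forward as overlap between chunks.
def split_blocks(text: str) -> list[str]:
    blocks, current, in_code = [], [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_code = not in_code
            current.append(line)
        elif line.strip() == "" and not in_code:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def chunk(text: str, max_words: int = 400, overlap_blocks: int = 1) -> list[str]:
    blocks = split_blocks(text)
    chunks, window, new_since_flush = [], [], False
    for block in blocks:
        window.append(block)
        new_since_flush = True
        if sum(len(b.split()) for b in window) >= max_words:
            chunks.append("\n\n".join(window))
            window = window[-overlap_blocks:]  # keep trailing blocks as overlap
            new_since_flush = False
    if window and new_since_flush:
        chunks.append("\n\n".join(window))
    return chunks
```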

Metadata inclusion has proven critical. Effective chunks include document title, section headers, creation date, and author as metadata. This allows filtering ("find information from engineering docs created this year") and provides context the raw text lacks. Teams that embedded chunks without metadata struggled with disambiguation—"the policy" could refer to dozens of documents.
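
One way to represent this, with illustrative field names: store each chunk as a record that carries its document metadata, so filters can run before or alongside similarity scoring.

```python
# Chunk record sketch: metadata travels with the text so retrieval can answer
# queries like "engineering docs created this year" by filtering first.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Chunk:
    text: str
    doc_title: str
    section: str
    source: str          # e.g. "engineering", "hr", "finance"
    created: date
    author: str
    embedding: list[float] = field(default_factory=list)

def filter_chunks(chunks: list[Chunk], source: str, since: date) -> list[Chunk]:
    # Metadata pre-filter applied before (or alongside) vector similarity scoring.
    return [c for c in chunks if c.source == source and c.created >= since]
```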

Retrieval Quality: The Hardest Problem

Vector similarity search finds semantically similar text, but semantic similarity doesn't always equal relevance. Users ask questions in casual language; documentation uses formal terminology. The embedding model must bridge this gap, and sometimes it doesn't.

Hybrid search combining vector similarity with keyword search has become standard practice. Weaviate and other databases support hybrid queries that score results using both semantic embeddings and BM25 keyword matching. Teams report this catches edge cases pure vector search misses—particularly acronyms, product names, and technical identifiers that embed poorly.
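
If your database doesn't expose hybrid scoring natively, reciprocal rank fusion (RRF) is one simple way to blend the two rankings yourself. The sketch below assumes you already have separately ranked lists of document IDs from vector search and keyword search.

```python
# Reciprocal rank fusion: merge multiple rankings by summing 1/(k + rank) scores.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Document IDs ordered by each search method (illustrative).
vector_hits = ["doc_12", "doc_07", "doc_33"]
keyword_hits = ["doc_33", "doc_12", "doc_91"]  # keyword match catches the acronym doc
print(rrf([vector_hits, keyword_hits]))
```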

Query rewriting improves retrieval significantly. Before searching, use the LLM to expand or clarify the query. A user asking "how do I submit expenses?" might need documents about "expense report submission," "reimbursement process," or "financial policies." Having the LLM generate multiple query variations and searching with all of them captures more relevant documents.
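
A sketch of that idea, reusing the in-memory store from the earlier example; the rewriting prompt and line-based parsing are deliberately naive and purely illustrative.

```python
# Query expansion sketch: ask the LLM for paraphrases, search with each, de-duplicate.
import openai

def expand_query(question: str, n: int = 3) -> list[str]:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question {n} different ways using formal "
                       f"workplace terminology, one per line:\n{question}",
        }],
    )
    variants = resp["choices"][0]["message"]["content"].splitlines()
    return [question] + [v.strip() for v in variants if v.strip()]

def retrieve_expanded(store, question: str, k: int = 5) -> list[str]:
    seen, merged = set(), []
    for q in expand_query(question):
        for chunk in store.search(q, k=k):  # store.search from the earlier sketch
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```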

Re-ranking applies additional intelligence after initial retrieval. Retrieve 20 candidates via vector search, then use a more sophisticated model to re-rank them for relevance to the specific query. This two-stage approach balances speed (fast vector search) with quality (careful re-ranking).
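
A sketch of the two-stage pattern, using a sentence-transformers cross-encoder as the re-ranker; an LLM scoring prompt is a heavier-weight alternative.

```python
# Two-stage retrieval sketch: wide, fast vector search, then careful re-ranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(store, query: str, candidates: int = 20, final_k: int = 5) -> list[str]:
    chunks = store.search(query, k=candidates)                 # stage 1: approximate
    scores = reranker.predict([(query, c) for c in chunks])    # stage 2: precise scoring
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:final_k]]
```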

Several teams described scenarios where correct information existed in documents but retrieval failed. A query about "parental leave policy" missed documents titled "Family Leave Guidelines" because the embeddings didn't capture the semantic equivalence strongly enough. Hybrid search and query expansion help, but retrieval remains the primary quality bottleneck.

Context Window Management

Claude's 100K token window helps, but most systems use GPT-3.5 or GPT-4's shorter windows. With 8K tokens, you can include perhaps 5-7 retrieved chunks plus conversation history. Choosing which chunks to include is critical.

Diversity in retrieval prevents redundancy. If the top 5 most similar chunks all come from the same document section, you're wasting context window. Systems now implement diversity algorithms ensuring retrieved chunks come from different documents or sections, providing broader coverage.
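
Maximal marginal relevance (MMR) is one common way to implement this. The sketch below reuses the cosine helper from the earlier pipeline example; the relevance/diversity weight is a tunable assumption, not a recommendation.

```python
# MMR sketch: balance similarity to the query against similarity to chunks
# already selected, so the context window isn't filled with near-duplicates.
def mmr(query_emb, candidates, k: int = 5, lam: float = 0.7):
    # candidates: list of (embedding, chunk_text); cosine() from the earlier sketch
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query_emb, item[0])
            redundancy = max((cosine(item[0], s[0]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for _, text in selected]
```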

Dynamic chunk limits adjust based on query complexity. Simple factual questions might need 2-3 chunks. Complex analytical questions benefit from 7-10 chunks providing comprehensive context. Some systems use a small LLM call to classify query complexity before determining retrieval count.
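
A minimal version of that classification step, with arbitrary labels and chunk counts you would tune for your own corpus.

```python
# Complexity-based retrieval sizing sketch: a cheap classification call picks k.
import openai

def retrieval_k(question: str) -> int:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Classify this question as SIMPLE (single fact) or COMPLEX "
                       f"(analysis across sources). Reply with one word.\n{question}",
        }],
    )
    label = resp["choices"][0]["message"]["content"].strip().upper()
    return 3 if label.startswith("SIMPLE") else 8
```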

Conversation history management creates tension with retrieval. Long conversations accumulate history that consumes token budget. Systems must balance maintaining conversational context against including sufficient retrieved information. Strategies include summarizing old conversation turns, dropping ancient history, or explicitly choosing between conversational vs. retrieval mode.
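
A sketch of the simplest of these strategies: trimming history to a token budget with tiktoken, leaving the rest of the window for retrieved chunks. Summarizing the dropped turns (not shown) is the usual next refinement.

```python
# History trimming sketch: keep the most recent turns that fit the token budget.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def trim_history(messages: list[dict], budget: int = 2000) -> list[dict]:
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        tokens = len(enc.encode(msg["content"]))
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))
```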

The Permissioning Nightmare

This is where many projects fail. Enterprise documents have complex access controls. Engineering can see technical docs. HR can see personnel files. Finance sees financial data. Your ChatGPT interface must respect these boundaries or create catastrophic data leaks.

Document-level permissions are table stakes. Before embedding documents, capture who can access them. At query time, filter retrieved results to only documents the user has permission to see. This requires integrating with identity systems (Active Directory, Okta) and document repositories (SharePoint, Google Drive) to understand permissions.

Row-level security becomes necessary for structured data. A sales database where reps see only their accounts requires filtering at retrieval time based on user identity. Vector databases increasingly support metadata filtering for this purpose.
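
A sketch of how both levels of filtering can hang off chunk metadata, assuming illustrative field names and a store whose search returns records with a metadata dict rather than raw text. Real deployments push the filter into the vector database's metadata query so unauthorized chunks are never scored at all.

```python
# Permission-aware filtering sketch: group-level and row-level checks on chunk
# metadata captured at ingestion (e.g. from SharePoint/Active Directory).
def permitted(chunk_meta: dict, user: dict) -> bool:
    group_ok = bool(set(chunk_meta.get("allowed_groups", [])) & set(user["groups"]))
    # Row-level rule: account-scoped records are visible only to their owner.
    owner = chunk_meta.get("account_owner")
    row_ok = owner is None or owner == user["id"]
    return group_ok and row_ok

def secure_search(store, user: dict, query: str, k: int = 5):
    # Over-fetch, then drop anything the user can't see before it reaches the prompt.
    return [c for c in store.search(query, k=k * 4)
            if permitted(c["metadata"], user)][:k]
```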

Permission lag creates security risks. If someone loses access to a document, when does the RAG system reflect that? Real-time permission checks at query time are safest but add latency. Periodic permission refresh (daily, hourly) creates windows where unauthorized access is possible. Teams must choose based on their risk tolerance.

One company described a near-miss where their system almost exposed confidential M&A documents to general employees because permission sync lagged 24 hours behind SharePoint. They implemented real-time permission verification before launching, accepting the latency cost.

What Actually Breaks in Production

Beyond architectural challenges, production systems encounter practical issues:

Hallucination despite retrieval: Even with relevant documents retrieved, models sometimes generate plausible-sounding nonsense. Users can't distinguish between information from retrieved docs and model hallucination. Solutions include citation—showing which documents informed the answer—and explicit confidence scoring.

Citation accuracy: Implementing "show me where that information came from" is harder than expected. The model's answer often paraphrases or synthesizes multiple sources. Mapping generated text back to specific source chunks requires careful prompt engineering or post-processing.

Stale information: Documents update, but embeddings don't automatically refresh. Systems need pipelines to detect document changes, re-chunk, re-embed, and update the vector database. Teams underestimate this operational complexity.

Cost spirals: Every query triggers embedding API calls (for the query and sometimes re-ranking), vector database queries, and LLM API calls with large contexts. At scale, costs exceed budgets. Caching strategies (cache embeddings for common queries, cache LLM responses for identical contexts) become essential; a minimal sketch follows this list.

Latency budgets: Retrieval adds 200-500ms. LLM calls with large contexts take 2-5 seconds. Total latency of 3-6 seconds frustrates users expecting instant answers. Optimizations include running retrieval calls in parallel, streaming responses so partial output appears sooner, and precomputing embeddings for likely queries.

Quality degradation over time: As document collections grow, retrieval precision decreases. Systems tuned for 1,000 documents perform poorly at 100,000 documents. Regular re-evaluation and tuning become necessary.
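
The caching sketch referenced above, keyed on content hashes and held in process-local dicts for brevity; a real deployment would put this in Redis or similar with expiry.

```python
# Caching sketch: memoize embeddings by text hash and answers by (question, context) hash.
import hashlib

_embedding_cache: dict[str, list[float]] = {}
_answer_cache: dict[str, str] = {}

def _key(*parts: str) -> str:
    return hashlib.sha256("||".join(parts).encode()).hexdigest()

def cached_embed(text: str, embed_fn) -> list[float]:
    k = _key(text)
    if k not in _embedding_cache:
        _embedding_cache[k] = embed_fn(text)
    return _embedding_cache[k]

def cached_answer(question: str, context: str, llm_fn) -> str:
    k = _key(question, context)
    if k not in _answer_cache:
        _answer_cache[k] = llm_fn(question, context)
    return _answer_cache[k]
```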

Emerging Best Practices

Successful production systems share common patterns:

Start small and focused: Don't embed the entire company knowledge base. Begin with a bounded domain (one department, one product area) where you can achieve high quality and prove value.

Invest in evaluation infrastructure: Build test sets of queries with known correct answers. Measure retrieval precision, answer accuracy, and latency continuously. Treat this like traditional software testing (a small example follows this list).

Make retrieval transparent: Show users which documents were consulted. This builds trust and helps users verify accuracy. When retrieval fails, users can identify the gap.

Implement feedback loops: Let users rate answer quality. Use negative feedback to improve retrieval or identify documentation gaps. This creates a flywheel of continuous improvement.

Separate retrieval and generation: Treat these as independent systems. This enables testing retrieval quality separately and swapping LLM providers without rebuilding retrieval infrastructure.

Plan for scale from day one: Even if starting small, design data pipelines, permission systems, and observability for the eventual scale you'll reach.
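
The evaluation example referenced above covers the retrieval half of such a harness, scored with recall@k over a hand-built test set; document IDs and structure are illustrative, and answer accuracy and latency need their own checks.

```python
# Retrieval evaluation sketch: recall@k over queries with known relevant documents.
test_set = [
    {"query": "how do I submit expenses?", "relevant_docs": {"finance/reimbursement"}},
    {"query": "parental leave policy", "relevant_docs": {"hr/family-leave-guidelines"}},
]

def recall_at_k(retrieve_fn, k: int = 5) -> float:
    hits = 0
    for case in test_set:
        retrieved_ids = {c["doc_id"] for c in retrieve_fn(case["query"], k)}
        if retrieved_ids & case["relevant_docs"]:
            hits += 1
    return hits / len(test_set)

# Run on every retrieval change, like a regression test:
# print(f"recall@5 = {recall_at_k(my_retriever):.2f}")
```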

The Path Forward

RAG remains the most practical architecture for "ChatGPT over our docs," but it's not trivial. Teams that succeed treat it as a serious engineering project requiring data engineering, search quality expertise, security rigor, and operational investment.

The technology is maturing rapidly. Vector databases are improving. Embedding models are getting better. LLMOps tools are making operations easier. But fundamentals remain: chunking quality, retrieval precision, and permission correctness determine success or failure.

For organizations considering these projects: budget more time and expertise than you initially think necessary. The demo is quick. Production is hard. But when done well, conversational access to company knowledge delivers transformative value. The key is recognizing the complexity early and building accordingly.