What Is RAG and Why Does It Matter in Production?
Retrieval-Augmented Generation (RAG) is the most practical pattern for grounding LLMs in real, up-to-date data. Instead of fine-tuning a model on your proprietary knowledge — which is expensive and hard to update — RAG retrieves relevant documents at query time and passes them into the context window. The model then answers based on actual source material, not just its training weights.
Building a RAG demo is easy. Getting it to perform reliably in production is a completely different challenge. Chunking strategy, embedding quality, retrieval precision, and prompt design all affect answer quality in ways that only surface under real workloads. This guide covers each layer with production considerations built in from the start.
Step 1: Document Ingestion and Chunking
How you split documents is the single biggest factor in retrieval quality. Chunks that are too small lose context. Chunks that are too large dilute relevance scores and waste context window space.
Recommended Chunking Strategy
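Use recursive character text splitting with overlap. A chunk size of 512–1024 tokens with a 10–15% overlap works well for most prose documents. Note that RecursiveCharacterTextSplitter counts characters by default, so the chunk_size=1000 in the example below is closer to 200–250 tokens of English text; to size chunks in tokens, construct the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder instead. For structured content like code or tables, use a separator-aware splitter.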
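from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in characters by default.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    # Try paragraph breaks first, then line breaks, sentences, words, characters.
    separators=["\n\n", "\n", ". ", " ", ""],
)
# raw_docs is the list of Document objects produced by your loader.
chunks = splitter.split_documents(raw_docs)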
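To make the role of overlap concrete, here is a minimal sliding-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also respects separators). The function name `split_with_overlap` and the use of whitespace-separated words as a stand-in for tokens are assumptions made for illustration.

```python
def split_with_overlap(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical ten-word document, windows of 4 words sharing 1 word.
words = [f"w{i}" for i in range(10)]
windows = split_with_overlap(words, chunk_size=4, overlap=1)
```

Each window starts with the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.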
For each chunk, store metadata: source file, page number, section title, and a creation timestamp. This metadata is critical for filtering and attribution later.
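A minimal sketch of what such a metadata record might look like, using only the standard library. The class name `ChunkRecord`, the helper `filter_by_source`, and the file names are hypothetical; with LangChain the same fields would typically live in each Document's `metadata` dictionary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkRecord:
    """One chunk plus the metadata the text recommends storing."""
    text: str
    source: str   # source file
    page: int     # page number
    section: str  # section title
    # Creation timestamp in UTC, filled in automatically.
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def filter_by_source(records, source):
    """Keep only the chunks that came from one source file."""
    return [r for r in records if r.source == source]


# Hypothetical example data.
records = [
    ChunkRecord("Refunds are processed within 14 days.", "policy.pdf", 3, "Refunds"),
    ChunkRecord("Install with the bundled script.", "setup.md", 1, "Installation"),
]
policy_chunks = filter_by_source(records, "policy.pdf")
```

Keeping the source and page on every record is what later lets the answer cite where a statement came from.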
Step 2: Embedding and Vector Storage
OpenAI's text-embedding-3-small model is a strong cost-performance choice for most use cases. It outputs 1536-dimensional vectors by default and costs roughly $0.02 per million tokens, about one-fifth the price of the previous ada-002 model, with better benchmark performance.
import os
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import PGVector
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Embed every chunk and write the vectors to Postgres in one call.
vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection_string=os.environ["DATABASE_URL"],
    collection_name="knowledge_base",
)
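The $0.02 per million tokens figure quoted above translates into a rough ingestion budget with simple arithmetic. The sketch below is only an estimate: the corpus size and average chunk length are hypothetical, and the price should be checked against current OpenAI pricing.

```python
# Price quoted in the text; verify against current pricing before relying on it.
PRICE_PER_MILLION_TOKENS_USD = 0.02


def embedding_cost_usd(num_chunks, avg_tokens_per_chunk,
                       price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate the one-off cost of embedding a corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 200,000 chunks averaging 500 tokens each (100M tokens).
cost = embedding_cost_usd(num_chunks=200_000, avg_tokens_per_chunk=500)
print(f"Estimated embedding cost: ${cost:.2f}")
```

At this price the initial embedding pass is rarely the dominant cost; re-embedding after every chunking change is what adds up, which is one reason to cache.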
In production, prefer pgvector (PostgreSQL extension) or Pinecone over Chroma. pgvector runs inside your existing Postgres instance and eliminates a separate service to manage. Pinecone is a good choice when you need managed horizontal scaling with millions of vectors.
Step 3: Retrieval Chain with Reranking
Basic similarity search returns the top-k vectors by cosine distance. This works, but production systems benefit from a reranking step that uses a cross-encoder to score the actual (query, chunk) pair relevance rather than relying purely on embedding distance.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
# Stage 1: cast a wide net with vector search.
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
# Stage 2: rescore the 20 candidates against the query and keep the best 5.
compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
This two-stage approach — retrieve 20, rerank to 5 — significantly improves answer quality for domain-specific corpora where embedding distance alone misses nuanced relevance.
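The retrieve-then-rerank flow can be illustrated without any external service. In the sketch below the vectors are made-up toy embeddings, and `overlap_score` is a deliberately crude word-overlap stand-in for a cross-encoder; a real reranker scores the pair with a trained model. All function names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k):
    """Stage 1: top-k chunks by cosine similarity of embeddings."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


def overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, top_n, score_fn=overlap_score):
    """Stage 2: rescore each (query, chunk) pair and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:top_n]


# Toy index with hypothetical embeddings.
index = [
    {"text": "reset your password from the account page", "vector": [0.8, 0.2, 0.1]},
    {"text": "password rules require twelve characters", "vector": [0.9, 0.1, 0.0]},
    {"text": "billing cycles start on the first day", "vector": [0.1, 0.9, 0.2]},
]
query = "how do I reset my password"
query_vec = [1.0, 0.0, 0.0]  # hypothetical query embedding

candidates = retrieve(query_vec, index, k=2)
top = rerank(query, candidates, top_n=1)
```

In this toy data the embedding stage ranks the password-rules chunk first, while the pairwise scorer promotes the chunk that actually answers the question, which is the effect the reranking stage is meant to have.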
Step 4: The Generation Prompt
The prompt template determines how the model uses retrieved context. Be explicit about attribution, fallback behavior, and answer format.
SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY
the context provided below. If the answer cannot be found in the context, say
"I don't have enough information to answer that" — do not make up an answer.
Always cite the source document when you use information from it.
Context:
{context}"""
Never instruct the model to "try its best" when context is insufficient. That's an invitation to hallucinate. Explicit fallback instructions are critical for trust.
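For citation to work, the context passed into the prompt has to carry source labels the model can refer to. The following is one possible way to assemble it; the shortened `TEMPLATE`, the helper names, and the sample chunks are assumptions for illustration, not part of any library.

```python
# Shortened stand-in for a system prompt with a {context} placeholder.
TEMPLATE = "Answer using ONLY the context below.\n\nContext:\n{context}"


def format_context(chunks):
    """Render each chunk under a numbered source label the model can cite."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        header = f"[{number}] Source: {chunk['source']}, page {chunk['page']}"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


def build_system_prompt(chunks, template=TEMPLATE):
    """Fill the template's {context} slot with the labeled chunks."""
    return template.format(context=format_context(chunks))


# Hypothetical retrieved chunks with the metadata stored at ingestion time.
retrieved = [
    {"source": "policy.pdf", "page": 3, "text": "Refunds are processed within 14 days."},
    {"source": "faq.md", "page": 1, "text": "Contact support to start a refund."},
]
prompt = build_system_prompt(retrieved)
```

Numbered labels give the model a short handle such as [1] to cite, and they let you map a citation in the answer back to the stored chunk when debugging.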
Step 5: Evaluation and Continuous Improvement
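RAG systems degrade silently. Implement automated evaluation from day one using RAGAS, a framework that scores retrieval precision, recall, faithfulness, and answer relevance using LLM-based judgments, so you do not need human labels for every query. Some metrics, such as context recall, still need a reference answer to compare against.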
- Faithfulness: Is the answer supported by the retrieved chunks?
- Answer Relevance: Does the answer actually address the question?
- Context Precision: Are the retrieved chunks actually relevant?
- Context Recall: Did retrieval capture all necessary information?
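To build intuition for the two retrieval metrics, here is a simplified sketch using plain set overlap on hand-labeled chunk IDs. This is an assumption-laden stand-in: RAGAS derives its scores from LLM judgments and its exact definitions differ in detail. The function names and chunk IDs are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of the relevant chunks that retrieval managed to find."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing needed to be found
    return len(set(retrieved_ids) & relevant) / len(relevant)


# Hypothetical labels for one question.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```

Low precision means the context window is being filled with noise; low recall means the answer may be missing facts no matter how good the prompt is.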
Set up a nightly evaluation run against a curated golden dataset of 50–100 question-answer pairs. Track scores over time and alert when faithfulness drops below 0.85.
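The alerting half of that nightly run can be quite small. The sketch below assumes the evaluation step has already produced one score dictionary per golden question; the function names and the sample scores are hypothetical, and only the 0.85 threshold comes from the text.

```python
from statistics import mean

FAITHFULNESS_THRESHOLD = 0.85  # alert level suggested in the text


def summarize_run(per_question_scores):
    """Average each metric across all questions in the golden dataset."""
    if not per_question_scores:
        raise ValueError("no scores to summarize")
    metrics = per_question_scores[0].keys()
    return {m: mean(row[m] for row in per_question_scores) for m in metrics}


def check_alerts(summary, thresholds):
    """Return one message per metric that fell below its threshold."""
    return [
        f"{name} dropped to {summary[name]:.2f} (threshold {limit:.2f})"
        for name, limit in thresholds.items()
        if summary[name] < limit
    ]


# Hypothetical scores from one nightly run over two questions.
scores = [
    {"faithfulness": 0.9, "answer_relevance": 0.8},
    {"faithfulness": 0.7, "answer_relevance": 0.9},
]
summary = summarize_run(scores)
alerts = check_alerts(summary, {"faithfulness": FAITHFULNESS_THRESHOLD})
```

Storing each night's `summary` alongside a date makes it possible to plot trends and to tie a drop to a specific deployment or data refresh.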
Production Checklist
- Use connection pooling (PgBouncer) for the vector DB under concurrent load
- Cache embedding requests, keyed on model and text: re-embedding identical text is wasted spend
- Rate-limit the OpenAI API client with exponential backoff
- Log every retrieval with its source chunks for debugging
- Store user feedback (thumbs up/down) to build a fine-tuning dataset over time
- Set a context window budget — never pass more than 60% of the model's limit
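Three of the checklist items (embedding cache, exponential backoff, context budget) are sketched below using only the standard library. Every name here is hypothetical: the in-memory dictionary stands in for whatever shared cache you use, `fake_embed` simulates an API client, and the broad `except Exception` would normally be narrowed to the client's rate-limit and timeout errors.

```python
import hashlib
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to rate-limit errors in practice
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


class EmbeddingCache:
    """Memoize embeddings by a hash of (model, text) so repeats skip the API."""

    def __init__(self, embed_fn, model, sleep=time.sleep):
        self.embed_fn = embed_fn
        self.model = model
        self.sleep = sleep
        self.store = {}  # in-memory stand-in for a shared cache

    def embed(self, text):
        key = hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()
        if key not in self.store:
            self.store[key] = with_backoff(
                lambda: self.embed_fn(text), sleep=self.sleep
            )
        return self.store[key]


def fit_to_budget(chunks, model_limit, count_tokens, percent=60):
    """Keep chunks in order until the next one would exceed percent% of the limit."""
    budget = model_limit * percent // 100
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


# Demo with a fake embedder that fails once, then succeeds.
calls = {"n": 0}


def fake_embed(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated rate limit")
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, model="demo-model", sleep=lambda seconds: None)
first = cache.embed("hello")   # one failed call, one successful retry
second = cache.embed("hello")  # served from the cache, no new call

kept = fit_to_budget(
    ["a b c", "d e", "f g h i"],
    model_limit=10,
    count_tokens=lambda chunk: len(chunk.split()),  # crude word count as a token proxy
)
```

Because chunks arrive ranked by relevance, trimming from the end when the budget runs out drops the least useful context first.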
RAG built right is a genuine competitive advantage. The teams that win with it are the ones who invest in evaluation infrastructure, not just the initial pipeline.