January 10, 2025 · 2 min read
# Building Production RAG Systems: Lessons from the Trenches
A deep dive into building retrieval-augmented generation systems that actually work at scale, covering architecture decisions, embedding strategies, and production pitfalls.
Tags: RAG, LLM, Production ML, Vector Databases
Building RAG systems that work in demos is easy. Building RAG systems that work in production is a different story entirely.
After deploying multiple RAG systems at scale, I've found that the gap between a working prototype and a production system is enormous. Here's what I've learned.
## The Architecture Decision
The first critical decision is your retrieval architecture. There are three main approaches:
1. **Naive RAG**: Simple semantic search + LLM generation
2. **Advanced RAG**: Query rewriting, reranking, and hybrid search
3. **Modular RAG**: Composable pipeline with specialized retrievers
For production systems, I strongly recommend starting with Advanced RAG and evolving to Modular RAG as needed.
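To make the "composable pipeline with specialized retrievers" idea concrete, here is a minimal sketch of one possible building block: a router that sends each query to a specialized retriever. The names `Retriever` and `RoutedRetriever` and the predicate-based routing are assumptions for illustration, not a prescribed design.

```python
from typing import Callable, List, Protocol, Tuple


class Retriever(Protocol):
    """Hypothetical interface: anything with a `retrieve` method fits."""

    def retrieve(self, query: str, k: int) -> List[str]: ...


class RoutedRetriever:
    """Sends each query to the first specialized retriever whose predicate matches."""

    def __init__(
        self,
        routes: List[Tuple[Callable[[str], bool], Retriever]],
        default: Retriever,
    ):
        self.routes = routes
        self.default = default

    def retrieve(self, query: str, k: int) -> List[str]:
        for matches, retriever in self.routes:
            if matches(query):
                return retriever.retrieve(query, k)
        # No specialized route matched, so fall back to the general retriever.
        return self.default.retrieve(query, k)
```

Because `RoutedRetriever` itself satisfies the `Retriever` interface, routers can be nested or swapped in wherever a single retriever was used, which is the property that makes the pipeline composable.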
## Embedding Strategy
Your embedding model choice matters more than you think. Here's how I choose by use case:
- **General purpose**: text-embedding-3-small (OpenAI) for most use cases
- **Domain specific**: Fine-tuned embeddings for specialized vocabularies
- **Multilingual**: multilingual-e5-large for multi-language content
## Production Pitfalls
### 1. Chunk Size Matters
Too small = fragmented context. Too large = diluted relevance.
Sweet spot: 512-1024 tokens with 50-100 token overlap.
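A sliding window is one simple way to get fixed-size chunks with overlap. The sketch below assumes whitespace splitting as a stand-in for a real tokenizer, so the counts only approximate model tokens; `chunk_tokens` and `chunk_text` are illustrative names.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    # Assumption: whitespace splitting stands in for the embedding model's tokenizer.
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size, overlap)]
```

For real token counts, swap the `text.split()` call for the tokenizer that matches your embedding model and keep the windowing logic as is.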
### 2. Hybrid Search is Non-Negotiable
Pure semantic search misses keyword-specific queries, such as exact identifiers, error codes, or product names. Combine:
- BM25 for keyword matching
- Vector search for semantic understanding
- Reranking for precision
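One common way to merge the BM25 and vector result lists before reranking is reciprocal rank fusion (RRF). This is a sketch of that technique, not necessarily the fusion used in any particular system; it assumes each retriever returns document IDs in ranked order, and `k = 60` is a commonly used default for the smoothing constant.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on document ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


bm25_hits = ["a", "b", "c"]    # hypothetical keyword results
vector_hits = ["b", "d", "a"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never need to be put on a common scale. The fused list is what you would hand to the reranker.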
### 3. Evaluation is Everything
You can't improve what you can't measure. Track:
- Retrieval recall@k
- Answer faithfulness
- Answer relevance
- End-to-end latency
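Of these metrics, recall@k is the simplest to compute once you have a labeled set of relevant documents per query. A minimal sketch follows, with `recall_at_k` and `mean_recall_at_k` as illustrative names. Faithfulness and relevance usually need human labels or a judge model, so they are not sketched here.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    if not scores:
        raise ValueError("runs must contain at least one query")
    return sum(scores) / len(scores)
```

Tracking this number for both the retrieval stage and the reranked output shows whether a drop in answer quality comes from retrieval or from a later stage.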
## The Code
```python
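# Placeholder system prompt (assumption); tune it for your domain.
RAG_PROMPT = "Answer the question using only the provided context."


class RAGPipeline:
    def __init__(self, vector_store, reranker, llm):
        self.vector_store = vector_store
        self.reranker = reranker
        self.llm = llm

    def _build_context(self, docs) -> str:
        # Assumes each document renders to text via str(); adapt to your schema.
        return "\n\n".join(str(doc) for doc in docs)

    async def query(self, question: str, k: int = 10) -> str:
        # Retrieve: over-fetch 3x so the reranker has candidates to choose from
        candidates = await self.vector_store.search(question, k=k * 3)

        # Rerank: keep only the top k candidates
        ranked = await self.reranker.rerank(question, candidates, top_k=k)

        # Generate: answer from the reranked context
        context = self._build_context(ranked)
        response = await self.llm.generate(
            system=RAG_PROMPT,
            user=f"Context:\n{context}\n\nQuestion: {question}",
        )
        return response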
```
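To exercise a pipeline like the one above without real services, in-memory stand-ins that expose the same three async methods (`search`, `rerank`, `generate`) are enough. Everything below is hypothetical scaffolding for tests: the class names, the word-overlap scoring, and the `run_query` helper are assumptions, not part of any library.

```python
import asyncio
from typing import List


class InMemoryVectorStore:
    """Hypothetical stand-in: ranks documents by word overlap with the query."""

    def __init__(self, docs: List[str]):
        self.docs = docs

    async def search(self, question: str, k: int) -> List[str]:
        query_words = set(question.lower().split())
        scored = sorted(
            self.docs,
            key=lambda doc: len(query_words & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:k]


class PassthroughReranker:
    """Hypothetical stand-in: keeps the retrieval order and truncates to top_k."""

    async def rerank(self, question: str, candidates: List[str], top_k: int) -> List[str]:
        return candidates[:top_k]


class EchoLLM:
    """Hypothetical stand-in: returns the prompt it received instead of calling a model."""

    async def generate(self, system: str, user: str) -> str:
        return f"[system: {system}] {user}"


def run_query(pipeline, question: str) -> str:
    """Run an async `query` call from synchronous code such as a test."""
    return asyncio.run(pipeline.query(question))
```

With these in place, `run_query(RAGPipeline(InMemoryVectorStore(docs), PassthroughReranker(), EchoLLM()), "your question")` returns the assembled prompt, which makes it easy to assert on what context actually reached the model.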
## Lessons Learned
1. **Start simple, add complexity gradually** - Don't over-engineer from day one
2. **Monitor everything** - Log retrieval scores, latency, and user feedback
3. **Build feedback loops** - Let users flag bad answers
4. **Cache aggressively** - Many queries repeat, so cached answers save both latency and cost
5. **Test with real data** - Synthetic data gives false confidence
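As one way to act on the caching lesson, here is a minimal sketch of an in-process answer cache keyed on a normalized question, with a time-to-live so stale answers expire. `QueryCache` and its parameters are illustrative assumptions; a production setup would more likely use a shared store.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class QueryCache:
    """Small in-process cache keyed on a normalized form of the question."""

    def __init__(self, ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivially different phrasings share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question: str) -> Optional[str]:
        key = self._key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: drop it so stale answers are not served
            return None
        return answer

    def put(self, question: str, answer: str) -> None:
        self._entries[self._key(question)] = (self._clock(), answer)
```

Check the cache before retrieval and write to it after generation. Exact-match keys only catch literal repeats, so the hit rate depends on how much normalization you apply.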
## Conclusion
Production RAG is an ongoing process, not a one-time implementation. Build for iteration, not perfection.