Building RAG Applications: Step-by-Step Implementation Guide
Retrieval-Augmented Generation (RAG) combines the power of large language models with your own data. This comprehensive guide walks you through building production-ready RAG applications.
What is RAG?
RAG enhances LLMs by retrieving relevant information from a knowledge base before generating responses.
The RAG Flow:
User Query → Retrieve Documents → Augment Prompt → Generate Response
Why RAG?
- Access to private/up-to-date data
- Reduced hallucinations
- Source citations
- Cost-effective compared to fine-tuning
Architecture Overview
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Document   │────▶│ Embeddings  │────▶│   Vector    │
│   Store     │     │   Model     │     │  Database   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐     ┌─────────────┐            │
│    User     │────▶│    Query    │────────────┤
│   Query     │     │  Embedding  │            │
└─────────────┘     └─────────────┘            ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Generated  │◀────│     LLM     │◀────│  Retrieved  │
│  Response   │     │             │     │   Context   │
└─────────────┘     └─────────────┘     └─────────────┘
Implementation Steps
Step 1: Environment Setup
pip install langchain langchain-openai langchain-community
pip install chromadb tiktoken pypdf
Step 2: Document Loading
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
# Single file
loader = PyPDFLoader("document.pdf")
docs = loader.load()
# Directory
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
Step 3: Text Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")
Chunking Strategies:
| Strategy | Best For | Chunk Size |
|---|---|---|
| Fixed Size | General text | 500-1000 chars |
| Recursive | Documents with structure | 1000-2000 chars |
| Semantic | Meaning preservation | Varies |
| Token-based | LLM context limits | 256-512 tokens |
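If you are splitting against LLM context limits, the same splitter class can count in tokens instead of characters via its tiktoken-aware constructor (a sketch; tiktoken is installed in Step 1, and the 512/50 token sizes are starting points to tune for your corpus):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Token-aware splitting: chunk_size and chunk_overlap are measured in tokens
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50
)
token_chunks = token_splitter.split_documents(docs)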
Step 4: Embeddings
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small"
)
# Test embedding
test_vector = embeddings.embed_query("test")
print(f"Vector dimension: {len(test_vector)}")
Embedding Models Comparison:
| Model | Dimensions | Cost/1M | Performance |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Good |
| text-embedding-3-large | 3072 | $0.13 | Best |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy |
| BGE-large | 1024 | Free | Good |
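If you want the free BGE-large option from the table, a local model can be swapped in through langchain_community (a sketch; it assumes you also install sentence-transformers, which Step 1 does not cover):
from langchain_community.embeddings import HuggingFaceEmbeddings
# Local, no-cost embeddings (1024 dimensions); requires: pip install sentence-transformers
local_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")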
Step 5: Vector Store
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
# Persist to disk so the index can be reloaded later
# (recent Chroma versions persist automatically when persist_directory is set)
vectorstore.persist()
Vector Database Options:
| Database | Best For | Deployment |
|---|---|---|
| Chroma | Prototyping, local | Self-hosted |
| Pinecone | Production, scale | Cloud |
| Weaviate | Hybrid search | Cloud/Self-hosted |
| Qdrant | Performance | Self-hosted |
| pgvector | Postgres users | Extension |
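Whichever database you choose, make sure the index can be reopened without re-embedding everything. With the Chroma store created above, reloading it later looks roughly like this (a sketch reusing the ./chroma_db directory and the embeddings object from Step 4):
from langchain_community.vectorstores import Chroma
# Reopen the persisted index without re-embedding the documents
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)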
Step 6: Retrieval
# Basic similarity search
results = vectorstore.similarity_search(
"What is machine learning?",
k=4
)
# With scores
results = vectorstore.similarity_search_with_score(
"query",
k=4
)
# MMR (maximal marginal relevance) for diversity
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 10, "lambda_mult": 0.5}
)
Step 7: Complete RAG Chain
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True
)
# Query
result = qa_chain.invoke({"query": "What are the main topics?"})
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
print(f"- {doc.metadata['source']}")
Advanced RAG Techniques
1. Multi-Query Retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=llm
)
2. Contextual Compression
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
3. Hybrid Search
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever  # requires: pip install rank_bm25
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.5, 0.5]
)
4. Reranking
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
# Drop near-duplicate chunks before they reach the LLM; additional compressors
# (such as a reranker) can be appended to the same pipeline
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])
compression_retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(),
    base_compressor=pipeline
)
Production Considerations
1. Performance Optimization
# Caching
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache())
# Async operations (call from within an async function)
results = await vectorstore.asimilarity_search("query")
2. Monitoring
# Callbacks for logging
from langchain.callbacks import StdOutCallbackHandler
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=retriever,
callbacks=[StdOutCallbackHandler()]
)
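Beyond stdout logging, it helps to track token usage and spend. With the OpenAI models used here, one option is LangChain's OpenAI callback (a sketch; the cost figure is an estimate derived from the model's pricing table):
from langchain_community.callbacks import get_openai_callback
# Record token counts and estimated cost for a single query
with get_openai_callback() as cb:
    qa_chain.invoke({"query": "What are the main topics?"})
    print(f"Tokens: {cb.total_tokens}, cost: ${cb.total_cost:.4f}")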
3. Error Handling
from tenacity import retry, stop_after_attempt, wait_exponential  # requires: pip install tenacity
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def query_rag(question):
return qa_chain({"query": question})
Evaluation
Metrics
from langchain.evaluation import QAEvalChain
# Create evaluation set
eval_questions = [
    {"query": "What is X?", "answer": "X is..."}
]
# Run the chain over the evaluation set
predictions = qa_chain.apply(eval_questions)
# Grade predictions against the reference answers with an LLM judge
eval_chain = QAEvalChain.from_llm(llm)
graded = eval_chain.evaluate(eval_questions, predictions)
Key Metrics:
- Retrieval Accuracy: proportion of retrieved chunks that are relevant to the query
- Answer Relevance: how well the response addresses the question
- Faithfulness: whether the answer is supported by the retrieved context
- Latency: end-to-end response time (see the timing sketch below)
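Answer quality needs an LLM or human grader (as in the QAEvalChain example above), but latency can be measured directly. A minimal sketch, reusing qa_chain and eval_questions from earlier:
import time
# Measure end-to-end response time over the evaluation set
latencies = []
for example in eval_questions:
    start = time.perf_counter()
    qa_chain.invoke({"query": example["query"]})
    latencies.append(time.perf_counter() - start)
print(f"Average latency: {sum(latencies) / len(latencies):.2f}s")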
Complete Example Application
class RAGApplication:
    def __init__(self, docs_path, model="gpt-4"):
        self.embeddings = OpenAIEmbeddings()
        self.llm = ChatOpenAI(model=model)
        self.vectorstore = self._load_documents(docs_path)
        self.qa_chain = self._create_chain()

    def _load_documents(self, path):
        loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
        docs = loader.load()
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        chunks = splitter.split_documents(docs)
        return Chroma.from_documents(chunks, self.embeddings)

    def _create_chain(self):
        return RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(),
            return_source_documents=True
        )

    def query(self, question):
        result = self.qa_chain.invoke({"query": question})
        return {
            "answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]]
        }
# Usage
app = RAGApplication("./documents")
response = app.query("What is the main topic?")
Common Pitfalls
- Poor Chunking: Too large/small chunks hurt retrieval
- No Metadata: Missing source tracking
- Ignoring Edge Cases: No fallback for failed retrieval (see the sketch after this list)
- Over-reliance: Not validating LLM outputs
- No Monitoring: Blind to production issues
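For the edge-case pitfall, a minimal fallback sketch: check what retrieval returns before the question ever reaches the LLM. It assumes the vectorstore and the Step 7 qa_chain (with return_source_documents=True); the wording of the fallback answer is up to you:
def safe_query(question, k=4):
    # Check retrieval first so an empty result never reaches the LLM
    docs = vectorstore.similarity_search(question, k=k)
    if not docs:
        return {"answer": "No relevant documents found in the knowledge base.", "sources": []}
    result = qa_chain.invoke({"query": question})
    return {
        "answer": result["result"],
        "sources": [doc.metadata.get("source", "unknown") for doc in result["source_documents"]]
    }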
Next Steps
- Add conversation memory (see the sketch after this list)
- Implement user authentication
- Add rate limiting
- Deploy to cloud
- Set up CI/CD
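Conversation memory is the most self-contained of these next steps. A minimal sketch using LangChain's conversational retrieval chain with buffer memory (one option among several; it reuses the llm and vectorstore built earlier):
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Keep chat history so follow-up questions can refer back to earlier answers
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory
)
print(chat_chain.invoke({"question": "What is the main topic?"})["answer"])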
Learn more AI development at LearnClub AI and explore our AI tools directory.