
Building RAG Applications: Step-by-Step Implementation Guide

LearnClub AI
February 28, 2026
5 min read

Retrieval-Augmented Generation (RAG) combines the power of large language models with your own data. This comprehensive guide walks you through building production-ready RAG applications.

What is RAG?

RAG enhances LLMs by retrieving relevant information from a knowledge base before generating responses.

The RAG Flow:

User Query β†’ Retrieve Documents β†’ Augment Prompt β†’ Generate Response
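
In code, the same flow is only a few steps. Here is a minimal sketch in plain Python, where embed(), search(), and generate() are hypothetical placeholders for the embedding model, vector store, and LLM set up later in this guide:

# Sketch of the RAG flow; embed(), search(), and generate() are hypothetical
# placeholders for the components built in the steps below.
def answer(query):
    query_vector = embed(query)                     # embed the user query
    context = search(query_vector, k=4)             # retrieve relevant chunks
    prompt = f"Context:\n{context}\n\nQuestion: {query}"  # augment the prompt
    return generate(prompt)                         # generate a grounded response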

Why RAG?

  • Access to private/up-to-date data
  • Reduced hallucinations
  • Source citations
  • Cost-effective vs fine-tuning

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Document  │────▢│  Embeddings  │────▢│   Vector    β”‚
β”‚    Store    β”‚     β”‚    Model     β”‚     β”‚   Database  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                                                 β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚    User     │────▢│    Query     β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚    Query    β”‚     β”‚  Embedding   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Generated │◀────│     LLM      │◀────│  Retrieved  β”‚
β”‚   Response  β”‚     β”‚              β”‚     β”‚   Context   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation Steps

Step 1: Environment Setup

pip install langchain langchain-openai langchain-community
pip install chromadb tiktoken pypdf
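
The OpenAI embedding and chat models used throughout this guide read your API key from the OPENAI_API_KEY environment variable. One way to set it from Python (the key below is a placeholder, not a real value):

import os

# LangChain's OpenAI integrations pick up the key from the environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder; use your own key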

Step 2: Document Loading

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.document_loaders.directory import DirectoryLoader

# Single file
loader = PyPDFLoader("document.pdf")
docs = loader.load()

# Directory
loader = DirectoryLoader(
    "./documents",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
docs = loader.load()

print(f"Loaded {len(docs)} documents")

Step 3: Text Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

Chunking Strategies:

Strategy       Best For                    Chunk Size
Fixed Size     General text                500-1000 chars
Recursive      Documents with structure    1000-2000 chars
Semantic       Meaning preservation        Varies
Token-based    LLM context limits          256-512 tokens
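
For the token-based strategy from the table, the same splitter class can measure chunk sizes in tokens via tiktoken (installed in Step 1). A minimal sketch; the 512/50 sizes are example values:

# Token-based chunking: sizes are counted in tokens rather than characters.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50
)
token_chunks = token_splitter.split_documents(docs)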

Step 4: Embeddings

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

# Test embedding
test_vector = embeddings.embed_query("test")
print(f"Vector dimension: {len(test_vector)}")

Embedding Models Comparison:

Model                     Dimensions   Cost per 1M tokens   Performance
text-embedding-3-small    1536         $0.02                Good
text-embedding-3-large    3072         $0.13                Best
text-embedding-ada-002    1536         $0.10                Legacy
BGE-large                 1024         Free (self-hosted)   Good
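
If you prefer the free, self-hosted BGE option from the table, a minimal sketch using LangChain's HuggingFace integration (assumes the sentence-transformers package is installed; the model name is one common BGE variant):

from langchain_community.embeddings import HuggingFaceEmbeddings

# Runs locally on your machine; no per-token API cost.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")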

Step 5: Vector Store

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Save for later
vectorstore.persist()

Vector Database Options:

Database    Best For             Deployment
Chroma      Prototyping, local   Self-hosted
Pinecone    Production, scale    Cloud
Weaviate    Hybrid search        Cloud / Self-hosted
Qdrant      Performance          Self-hosted
pgvector    Postgres users       Postgres extension
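
Because the store above was persisted to ./chroma_db, later runs can reload it instead of re-embedding every document:

# Reload the persisted store in a new session.
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)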

Step 6: Retrieval

# Basic similarity search
results = vectorstore.similarity_search(
    "What is machine learning?",
    k=4
)

# With scores
results = vectorstore.similarity_search_with_score(
    "query",
    k=4
)

# MMR (maximal marginal relevance) for diverse results
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 10, "lambda_mult": 0.5}
)

Step 7: Complete RAG Chain

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

# Query
result = qa_chain({"query": "What are the main topics?"})
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"- {doc.metadata['source']}")

Advanced RAG Techniques

1. Multi-Query Retrieval

from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

2. Contextual Compression

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

3. Hybrid Search (BM25 + Vector)

from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

# Keyword-based BM25 retrieval (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

4. Filtering and Reranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Drop near-duplicate chunks before they reach the LLM.
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])

compression_retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(),
    base_compressor=pipeline
)
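
The pipeline above removes near-duplicate chunks; for relevance-based reranking, one option is to score query/chunk pairs with a cross-encoder and reorder the results yourself. A minimal sketch using the sentence-transformers library (an extra dependency this guide does not install; the model name is one common choice):

from sentence_transformers import CrossEncoder

# Score each retrieved chunk against the query, then keep the top 4.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is machine learning?"
docs = vectorstore.similarity_search(query, k=10)
scores = reranker.predict([(query, d.page_content) for d in docs])
reranked = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)][:4]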

Production Considerations

1. Performance Optimization

# Caching
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())

# Async operations (must be awaited inside an async function)
results = await vectorstore.asimilarity_search("query")

2. Monitoring

# Callbacks for logging
from langchain.callbacks import StdOutCallbackHandler

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[StdOutCallbackHandler()]
)

3. Error Handling

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def query_rag(question):
    return qa_chain({"query": question})

Evaluation

Metrics

from langchain.evaluation import QAEvalChain

# Create an evaluation set of questions with reference answers
eval_questions = [
    {"query": "What is X?", "answer": "X is..."}
]

# Generate predictions with the RAG chain
predictions = qa_chain.apply(eval_questions)

# Grade each prediction against its reference answer with an LLM
eval_chain = QAEvalChain.from_llm(llm)
graded = eval_chain.evaluate(eval_questions, predictions)

Key Metrics:

  • Retrieval Accuracy: Relevant docs retrieved (see the hit-rate sketch after this list)
  • Answer Relevance: Response matches question
  • Faithfulness: Answer supported by context
  • Latency: Response time
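
A rough proxy for retrieval accuracy is hit rate: the fraction of evaluation questions for which at least one retrieved chunk comes from the expected source file. A minimal sketch, assuming each evaluation item carries a hypothetical expected_source field:

# Hit rate: share of questions whose retrieval includes the expected source.
def retrieval_hit_rate(vectorstore, eval_items, k=4):
    hits = 0
    for item in eval_items:
        docs = vectorstore.similarity_search(item["query"], k=k)
        if any(d.metadata.get("source") == item["expected_source"] for d in docs):
            hits += 1
    return hits / len(eval_items)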

Complete Example Application

class RAGApplication:
    def __init__(self, docs_path, model="gpt-4"):
        self.embeddings = OpenAIEmbeddings()
        self.llm = ChatOpenAI(model=model)
        self.vectorstore = self._load_documents(docs_path)
        self.qa_chain = self._create_chain()
    
    def _load_documents(self, path):
        loader = DirectoryLoader(path, glob="**/*.pdf", loader_cls=PyPDFLoader)
        docs = loader.load()
        
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        chunks = splitter.split_documents(docs)
        
        return Chroma.from_documents(chunks, self.embeddings)
    
    def _create_chain(self):
        return RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(),
            return_source_documents=True
        )
    
    def query(self, question):
        result = self.qa_chain({"query": question})
        return {
            "answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]]
        }

# Usage
app = RAGApplication("./documents")
response = app.query("What is the main topic?")

Common Pitfalls

  1. Poor Chunking: Too large/small chunks hurt retrieval
  2. No Metadata: Missing source tracking
  3. Ignoring Edge Cases: No fallback when retrieval finds nothing relevant (see the sketch after this list)
  4. Over-reliance: Not validating LLM outputs
  5. No Monitoring: Blind to production issues
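
For pitfall 3, one simple guard is to check retrieval relevance before calling the LLM and fall back to a safe answer otherwise. A minimal sketch; the 0.3 threshold is an arbitrary example value to tune against your own data:

# Fall back gracefully when nothing sufficiently relevant is retrieved.
def safe_query(question, min_relevance=0.3):
    results = vectorstore.similarity_search_with_relevance_scores(question, k=4)
    if not any(score >= min_relevance for _, score in results):
        return {"answer": "I couldn't find anything relevant in the knowledge base.",
                "sources": []}
    result = qa_chain({"query": question})
    return {"answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]]}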

Next Steps

  • Add conversation memory (see the sketch after this list)
  • Implement user authentication
  • Add rate limiting
  • Deploy to cloud
  • Set up CI/CD
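
For the first item, conversation memory, LangChain's ConversationalRetrievalChain wraps the same retriever with chat history so follow-up questions keep their context. A minimal sketch:

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Chat history lets the chain rewrite follow-up questions before retrieval.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chat_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory
)

response = chat_chain({"question": "What is the main topic?"})
print(response["answer"])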

Learn more AI development at LearnClub AI and explore our AI tools directory.
