What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external knowledge bases. Instead of relying solely on the model's training data, RAG allows you to ground responses in your specific documents, databases, or knowledge repositories.
This approach addresses several key limitations of traditional LLMs:
- Knowledge cutoff: LLMs only know what was in their training data
- Hallucinations: grounding answers in retrieved sources reduces fabricated claims
- Domain specificity: responses can draw on your proprietary data
- Traceability: answers can be traced back to their source documents
Architecture Overview
A typical RAG system consists of these components:
- Document Processing: Split documents into chunks
- Embedding Generation: Convert text to vectors using OpenAI
- Vector Storage: Store embeddings in Pinecone
- Retrieval: Find relevant chunks for a query
- Generation: Use retrieved context with GPT-4
Setting Up the Environment
First, install the required dependencies:
```bash
pip install openai pinecone-client langchain tiktoken
```
Set up your environment variables:
```python
import os

# For illustration only; in production, load keys from a secrets manager
# or your deployment environment rather than hardcoding them.
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["PINECONE_API_KEY"] = "your-pinecone-key"
```
Document Processing & Chunking
Effective chunking is crucial for RAG performance. Here's a robust approach:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, overlap=200):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,  # measures size in characters, not tokens
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    return chunks
```
💡 Chunking Best Practices
Optimal chunk size depends on your use case. For Q&A, 500–1000 tokens works well; for summarization, larger chunks (2000+) may be better. Always include overlap to maintain context across chunk boundaries. Note that the splitter above measures length in characters (`length_function=len`), so if you need to enforce a token budget, measure your chunks with tiktoken.
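To make the overlap idea concrete, here is a minimal character-based sliding-window chunker — a simplified stand-in for the LangChain splitter above, not a replacement for it:

```python
def sliding_window_chunks(text, chunk_size=100, overlap=20):
    """Split text into fixed-size windows; consecutive chunks share
    `overlap` characters so context survives chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_window_chunks("some long document text..." * 40, overlap=20)
# Each chunk starts (chunk_size - overlap) characters after the previous one,
# so the last `overlap` characters of one chunk open the next.
```

The recursive splitter is smarter than this — it prefers to break on paragraph and sentence boundaries — but the overlap mechanism is the same.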
Generating Embeddings with OpenAI
Use OpenAI's text-embedding-3-small model for efficient embeddings:
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Batch processing for efficiency: one API call embeds many texts
def get_embeddings_batch(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]
```
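One caveat with batching: the embeddings endpoint caps how many inputs a single request may carry (on the order of 2,048 inputs per request at the time of writing — treat the exact limit as an assumption and check the current API docs). A small helper keeps arbitrarily long lists under the cap:

```python
def batched(items, batch_size=2048):
    """Yield successive slices of `items`, each no longer than batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage sketch: embed any number of texts without exceeding the per-request cap
# all_embeddings = [e for b in batched(texts) for e in get_embeddings_batch(b)]
```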
Storing Vectors in Pinecone
Initialize Pinecone and create an index:
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index if it doesn't exist
index_name = "rag-knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small output dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)
```
Upsert your embeddings:
```python
def upsert_documents(chunks, batch_size=100):
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk.page_content for chunk in batch]
        embeddings = get_embeddings_batch(texts)
        vectors = [
            {
                "id": f"doc_{i + j}",
                "values": embedding,
                "metadata": {
                    "text": text,
                    "source": batch[j].metadata.get("source", "")
                }
            }
            for j, (text, embedding) in enumerate(zip(texts, embeddings))
        ]
        index.upsert(vectors=vectors)
```
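Under the hood, a cosine-metric index ranks stored vectors by cosine similarity to the query vector. A toy in-memory version (pure Python, no Pinecone) makes the mechanics visible:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_toy_index(stored, query_vector, top_k=2):
    """stored: list of (doc_id, vector). Returns ids ranked by similarity."""
    scored = [(doc_id, cosine_similarity(vec, query_vector))
              for doc_id, vec in stored]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

stored = [("doc_0", [1.0, 0.0]), ("doc_1", [0.0, 1.0]), ("doc_2", [0.7, 0.7])]
query_toy_index(stored, [1.0, 0.1])  # → ["doc_0", "doc_2"]
```

A real vector database does the same ranking over millions of 1536-dimensional vectors, using approximate nearest-neighbor structures to avoid the linear scan.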
Implementing the RAG Pipeline
Now let's build the complete retrieval and generation pipeline:
```python
def retrieve_context(query, top_k=5):
    # Embed the query the same way the documents were embedded
    query_embedding = get_embedding(query)

    # Search Pinecone for the nearest chunks
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Extract the stored chunk texts from the matches
    return [match.metadata["text"] for match in results.matches]

def generate_response(query, contexts):
    # Build the prompt with retrieved context
    context_text = "\n\n".join(contexts)
    prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."

Context:
{context_text}

Question: {query}

Answer:"""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

def rag_query(query):
    contexts = retrieve_context(query)
    return generate_response(query, contexts)
```
Advanced Techniques
Hybrid Search
Combine semantic search with keyword matching for better results:
Pinecone supports sparse-dense ("hybrid") vectors natively; alternatively, run a keyword scorer such as BM25 alongside semantic search and fuse the two ranked lists.
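One simple, widely used way to fuse a keyword ranking with a semantic ranking is Reciprocal Rank Fusion (RRF): each list contributes 1/(k + rank) per document, so items that rank well in either list surface. A minimal sketch (k=60 is the commonly cited default constant):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into a single ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_results = ["d1", "d2", "d3"]   # from the vector index
keyword_results = ["d3", "d1", "d4"]    # from BM25 or similar
reciprocal_rank_fusion([semantic_results, keyword_results])
# d1 and d3 rise to the top because both lists rank them highly
```

RRF needs no score normalization, which is its main appeal — the two retrievers' raw scores are never compared directly.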
Re-ranking
Use a cross-encoder to re-rank retrieved results:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query, documents, top_k=3):
    # Score each (query, document) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```
Query Transformation
Improve retrieval by reformulating queries:
```python
def expand_query(query):
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "Generate 3 alternative phrasings of this query for better search. Return one phrasing per line."},
            {"role": "user", "content": query}
        ]
    )
    # Returns newline-separated variants; split them before retrieval
    return response.choices[0].message.content
```
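Each reformulated query can then be retrieved independently, and the per-query result lists merged without duplicates while preserving retrieval order. The merge itself is pure Python; the usage sketch below assumes the `retrieve_context` and `expand_query` functions defined earlier:

```python
def merge_contexts(context_lists):
    """Keep the first occurrence of each context, preserving order."""
    seen = set()
    merged = []
    for contexts in context_lists:
        for text in contexts:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged

# Usage sketch:
# variants = [query] + expand_query(query).splitlines()
# contexts = merge_contexts([retrieve_context(v) for v in variants])
```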
🚀 Production Tips
For production deployments: implement caching for embeddings, use async operations for better throughput, monitor token usage, and set up proper error handling with retries.
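Caching embeddings is straightforward because an embedding depends only on the (model, text) pair. A minimal in-memory cache keyed by a content hash — shown here with a stub embedder for illustration; in real use you would pass `get_embedding` instead:

```python
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.misses = 0

    def get(self, text, model="text-embedding-3-small"):
        key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1  # only cache misses reach the embed function
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

cache = EmbeddingCache(lambda text: [float(len(text))])  # stub embedder
cache.get("hello")
cache.get("hello")  # second lookup is served from the cache
```

For persistence across restarts, the same keying scheme works with Redis or a database table in place of the dict.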
Evaluation Metrics
Measure your RAG system's performance with:
- Retrieval metrics: Precision@K, Recall@K, MRR
- Generation metrics: BLEU, ROUGE, BERTScore
- End-to-end: Answer relevance, faithfulness, context relevance
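The retrieval metrics above are easy to compute directly. A sketch for a single query, where `retrieved` is the ranked result list and `relevant` is the ground-truth set (MRR is this reciprocal rank averaged over a query set):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d2", "d5", "d1"]
relevant = {"d1", "d3"}
precision_at_k(retrieved, relevant, 3)   # 1 hit in top 3 → 1/3
reciprocal_rank(retrieved, relevant)     # first hit at rank 3 → 1/3
```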
Conclusion
RAG represents a powerful paradigm for building AI applications that can leverage your organization's knowledge. By combining the reasoning capabilities of LLMs with precise retrieval from your data, you can create systems that are both intelligent and grounded in facts.
At VESTLABZ AI Labs, we help organizations implement production-ready RAG systems tailored to their specific needs. Whether you're building a customer support bot, internal knowledge assistant, or document analysis tool, our team can help you get there.