Implementing RAG with OpenAI and Pinecone

Learn how to build powerful AI applications using Retrieval-Augmented Generation (RAG) to provide accurate, context-aware responses grounded in your own data.

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external knowledge bases. Instead of relying solely on the model's training data, RAG allows you to ground responses in your specific documents, databases, or knowledge repositories.

This approach solves several key limitations of traditional LLMs:

  • Knowledge cutoff: LLMs only know what they were trained on
  • Hallucinations: RAG provides factual grounding
  • Domain specificity: Access your proprietary data
  • Traceability: Know exactly where answers come from

Architecture Overview

A typical RAG system consists of these components:

  1. Document Processing: Split documents into chunks
  2. Embedding Generation: Convert text to vectors using OpenAI
  3. Vector Storage: Store embeddings in Pinecone
  4. Retrieval: Find relevant chunks for a query
  5. Generation: Use retrieved context with GPT-4

Setting Up the Environment

First, install the required dependencies:

pip install openai pinecone-client langchain tiktoken

Set up your environment variables:

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["PINECONE_API_KEY"] = "your-pinecone-key"

Document Processing & Chunking

Effective chunking is crucial for RAG performance. Here's a robust approach:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, overlap=200):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    return chunks

💡 Chunking Best Practices

Optimal chunk size depends on your use case. For Q&A, 500-1000 tokens work well; for summarization, larger chunks (2000+) may be better. Note that the splitter above measures length in characters (length_function=len), so supply a token-counting length function if you want token-based sizes. Always include overlap to maintain context across chunk boundaries.
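To make the overlap idea concrete, here is a minimal character-based sliding window, independent of LangChain (the function name and defaults are illustrative, not part of any library):

```python
def sliding_window_chunks(text, chunk_size=1000, overlap=200):
    """Split text into overlapping windows; each chunk repeats the
    last `overlap` characters of its predecessor."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Because each chunk repeats the tail of the previous one, a sentence straddling a boundary still appears intact in at least one chunk.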

Generating Embeddings with OpenAI

Use OpenAI's text-embedding-3-small model for efficient embeddings:

from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Batch processing for efficiency
def get_embeddings_batch(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]

Storing Vectors in Pinecone

Initialize Pinecone and create an index:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if it doesn't exist
index_name = "rag-knowledge-base"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # OpenAI embedding dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

Upsert your embeddings:

def upsert_documents(chunks, batch_size=100):
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk.page_content for chunk in batch]
        embeddings = get_embeddings_batch(texts)

        # i + j is the chunk's global position, giving each vector a stable ID
        vectors = [
            {
                "id": f"doc_{i+j}",
                "values": embedding,
                "metadata": {
                    "text": text,
                    "source": batch[j].metadata.get("source", "")
                }
            }
            for j, (text, embedding) in enumerate(zip(texts, embeddings))
        ]

        index.upsert(vectors=vectors)

Implementing the RAG Pipeline

Now let's build the complete retrieval and generation pipeline:

def retrieve_context(query, top_k=5):
    # Generate query embedding
    query_embedding = get_embedding(query)
    
    # Search Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # Extract relevant texts
    contexts = [match.metadata["text"] for match in results.matches]
    return contexts

def generate_response(query, contexts):
    # Build the prompt with retrieved context
    context_text = "\n\n".join(contexts)
    
    prompt = f"""Use the following context to answer the question. 
If the answer is not in the context, say "I don't have enough information."

Context:
{context_text}

Question: {query}

Answer:"""
    
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1
    )
    
    return response.choices[0].message.content

def rag_query(query):
    contexts = retrieve_context(query)
    response = generate_response(query, contexts)
    return response

Advanced Techniques

Hybrid Search

Combine semantic search with keyword matching for better results:

# Use Pinecone's hybrid search with sparse-dense vectors
# or implement BM25 alongside semantic search
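One simple way to merge a semantic ranking with a keyword ranking, without tuning score weights, is reciprocal rank fusion (RRF). Here is a sketch; each ranking is a list of document IDs, best first, and the IDs below are purely illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it; k=60 is the commonly used RRF constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a semantic ranking with a keyword (e.g. BM25) ranking
semantic = ["doc3", "doc1", "doc2"]
keyword = ["doc1", "doc4", "doc3"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that rank well in both lists rise to the top, so RRF rewards agreement between the two retrievers rather than trusting either one alone.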

Re-ranking

Use a cross-encoder to re-rank retrieved results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, documents, top_k=3):
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

Query Transformation

Improve retrieval by reformulating queries:

def expand_query(query):
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "Generate 3 alternative phrasings of this query for better search:"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content
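To put the expanded queries to work, one sketch is to retrieve for each phrasing and merge the results, deduplicating while preserving order. Here `retrieve_fn` stands in for the retrieve_context function defined earlier (passed as a parameter to keep the sketch self-contained):

```python
def multi_query_retrieve(queries, retrieve_fn, top_k=5):
    """Run retrieval for each query phrasing and merge the results,
    keeping only the first occurrence of each context."""
    seen = set()
    merged = []
    for query in queries:
        for context in retrieve_fn(query, top_k=top_k):
            if context not in seen:
                seen.add(context)
                merged.append(context)
    return merged
```

This helps when the user's wording differs from the wording in your documents: any phrasing that hits the right chunk contributes it to the merged context.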

🚀 Production Tips

For production deployments: implement caching for embeddings, use async operations for better throughput, monitor token usage, and set up proper error handling with retries.
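Two of these tips can be sketched in a few lines: retries with exponential backoff, and an in-memory embedding cache. These are minimal illustrations (the function names are ours, and the embedding function is passed in rather than assumed), not a substitute for a production retry library or a shared cache such as Redis:

```python
import time

def with_retries(fn, max_attempts=4, base_delay=1.0, retriable=(Exception,)):
    """Call fn(), retrying on failure with exponential backoff:
    waits base_delay, 2*base_delay, 4*base_delay, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

_embedding_cache = {}

def cached_embedding(text, embed_fn):
    """Embed each distinct text only once; repeated chunks and
    repeated queries hit the cache instead of the API."""
    if text not in _embedding_cache:
        _embedding_cache[text] = with_retries(lambda: embed_fn(text))
    return _embedding_cache[text]
```

In practice you would narrow `retriable` to the transient errors of your client (rate limits, timeouts) so that genuine bugs still fail fast.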

Evaluation Metrics

Measure your RAG system's performance with:

  • Retrieval metrics: Precision@K, Recall@K, MRR
  • Generation metrics: BLEU, ROUGE, BERTScore
  • End-to-end: Answer relevance, faithfulness, context relevance
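The retrieval metrics above take only a few lines to compute. A sketch, where `retrieved` is a ranked list of document IDs and `relevant` is the set of ground-truth IDs (the IDs in the tests are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs found in the top-k results."""
    top = retrieved[:k]
    return sum(1 for doc in relevant if doc in top) / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average over queries of 1/rank of the first relevant result."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Generation and end-to-end metrics are harder to automate; frameworks built for RAG evaluation typically use an LLM judge for faithfulness and relevance.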

Conclusion

RAG represents a powerful paradigm for building AI applications that can leverage your organization's knowledge. By combining the reasoning capabilities of LLMs with precise retrieval from your data, you can create systems that are both intelligent and grounded in facts.

At VESTLABZ AI Labs, we help organizations implement production-ready RAG systems tailored to their specific needs. Whether you're building a customer support bot, internal knowledge assistant, or document analysis tool, our team can help you get there.

Emily Chen

Head of AI Labs at VESTLABZ
