Session 3: Building a RAG Pipeline in Python
Synopsis
Walks through the core stages of a RAG system, including document loading, chunking, embedding generation, indexing, retrieval, and response synthesis. This session connects earlier API and application skills to a powerful production pattern.
Session Content
Session 3: Building a RAG Pipeline in Python
Session Overview
In this session, learners will build a complete Retrieval-Augmented Generation (RAG) pipeline in Python using the OpenAI Python SDK and the Responses API. The goal is to understand how to ground model outputs in external knowledge by retrieving relevant documents and supplying them as context to the model.
By the end of this session, learners will be able to:
- Explain what RAG is and when to use it
- Break a RAG system into ingestion, chunking, embedding, retrieval, and generation stages
- Implement a simple local RAG pipeline in Python
- Use the OpenAI Responses API with gpt-5.4-mini
- Evaluate the quality of retrieval and answer generation
- Identify common pitfalls such as poor chunking, noisy context, and prompt issues
Learning Objectives
After this session, learners should be able to:
- Define Retrieval-Augmented Generation and distinguish it from plain prompting
- Explain the role of chunking and retrieval in a RAG system
- Implement a basic keyword-based retriever in Python
- Connect retrieval output to a grounded generation step using the OpenAI Responses API
- Improve a baseline RAG pipeline with better chunking and prompt design
- Inspect outputs and reason about failure modes
Session Agenda (~45 minutes)
- 0–5 min: Introduction to RAG
- 5–12 min: Core architecture of a RAG pipeline
- 12–20 min: Data preparation and chunking
- 20–30 min: Hands-on Exercise 1 — Build a simple retriever
- 30–40 min: Hands-on Exercise 2 — Add grounded answer generation with the Responses API
- 40–45 min: Wrap-up, pitfalls, and next steps
1. What Is RAG?
1.1 Definition
Retrieval-Augmented Generation (RAG) is a pattern where an LLM first receives relevant information retrieved from a knowledge source, then uses that information to generate an answer.
Instead of relying only on its internal training knowledge, the model is given specific, up-to-date, or domain-specific content.
Plain prompting
User asks:
“What does our company refund policy say about digital products?”
Without access to company documents, the model may guess or respond generically.
RAG prompting
The system:
1. Searches company policy documents
2. Retrieves the most relevant passages
3. Sends those passages to the model
4. Asks the model to answer using the retrieved information
This makes answers:
- More grounded
- More accurate for private/domain-specific knowledge
- Easier to inspect and debug
1.2 When to Use RAG
Use RAG when:
- The knowledge changes frequently
- You need answers based on private/internal documents
- You want citations or source-aware answers
- You want to reduce hallucinations by grounding outputs
Do not assume RAG solves everything. If retrieval is poor, generation quality will also be poor.
2. Anatomy of a RAG Pipeline
A simple RAG pipeline usually has these stages:
2.1 Ingestion
Load documents from files, databases, APIs, or internal systems.
Examples:
- Markdown files
- PDFs
- Product documentation
- Wiki pages
- FAQs
2.2 Chunking
Split long documents into smaller pieces called chunks.
Why chunk?
- LLM context windows are limited
- Retrieval works better on focused passages
- Small chunks are easier to rank and inspect
2.3 Embedding or Retrieval Indexing
Convert chunks into a searchable format.
Common approaches:
- Keyword-based retrieval
- TF-IDF / BM25
- Dense vector embeddings
For this session, we will begin with a simple keyword-overlap retriever so learners can understand the mechanics without additional dependencies.
2.4 Retrieval
Given a user query:
- Search the chunk collection
- Rank relevant chunks
- Return top-k passages
2.5 Augmented Generation
Pass retrieved chunks to the LLM with instructions such as:
- Answer only from the provided context
- Say when the answer is not present
- Quote or cite the supporting chunks
3. Designing a Good RAG Workflow
3.1 Good Chunking Matters
A chunk should:
- Be semantically coherent
- Not be too large
- Preserve meaning without requiring the whole document

Bad chunking:
- Splitting in the middle of a sentence
- Huge chunks with many unrelated topics
- Tiny chunks with no context

A useful starting point:
- Split by paragraphs
- Keep chunk sizes moderate
- Include metadata like document title and chunk ID
3.2 Good Retrieval Matters
If retrieval returns the wrong chunks:
- The model may answer incorrectly
- The answer may be incomplete
- The model may say "not found" when the answer exists elsewhere
3.3 Good Prompting Matters
The generation prompt should:
- Clearly separate question and context
- Instruct the model to rely on the context
- Request a fallback behavior if context is insufficient
Example instruction:
Answer using only the provided context. If the answer is not in the context, say: "I could not find that information in the provided documents."
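As a minimal sketch, this instruction can live in a small prompt-building helper so the three parts stay clearly separated. The function name `build_grounded_prompt` and the exact template wording here are illustrative assumptions, not part of the exercise code that follows later.

```python
# The fallback phrase the model should emit when context is insufficient.
FALLBACK = "I could not find that information in the provided documents."


def build_grounded_prompt(question: str, context: str) -> str:
    """Combine the instruction, the question, and the retrieved context.

    Keeping the parts in clearly labeled blocks makes the prompt easy
    to print and inspect during development.
    """
    return (
        "Answer using only the provided context. "
        f'If the answer is not in the context, say: "{FALLBACK}"\n\n'
        f"Question:\n{question}\n\n"
        f"Context:\n{context}"
    )


prompt = build_grounded_prompt(
    "Are digital products refundable?",
    "Digital products are non-refundable once the download has started.",
)
print(prompt)
```

Because the prompt is just a string, you can print it during debugging and see exactly what evidence the model received.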
4. Preparing Example Knowledge Base
For the hands-on portion, we will use a small in-memory knowledge base that simulates internal documentation.
Example Documents
We will work with:
- Refund policy
- Shipping policy
- Account security guide
5. Hands-on Exercise 1: Build a Simple Retriever
Goal
Create a minimal RAG retrieval stage in Python:
- Load documents
- Chunk them
- Score chunks against a user query using keyword overlap
- Return the most relevant chunks
This exercise teaches the core retrieval loop before adding the LLM.
5.1 Code: Document Loading, Chunking, and Retrieval
"""
Exercise 1: Build a simple local retriever for a RAG pipeline.
What this script does:
1. Defines a small in-memory document collection
2. Splits documents into paragraph-level chunks
3. Implements a simple keyword-overlap scoring function
4. Retrieves the top matching chunks for a user query
This is intentionally simple so the retrieval mechanics are easy to understand.
"""
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import List
# -----------------------------
# Data model
# -----------------------------
@dataclass
class Chunk:
"""Represents a chunk of text plus metadata."""
doc_id: str
chunk_id: int
text: str
# -----------------------------
# Example knowledge base
# -----------------------------
DOCUMENTS = {
"refund_policy": """
Refund Policy
Physical products may be returned within 30 days of delivery if they are unused and in their original packaging.
Digital products are non-refundable once the download has started, except where required by law.
Refunds for approved returns are processed within 5 to 7 business days after inspection.
""",
"shipping_policy": """
Shipping Policy
Standard shipping usually takes 3 to 5 business days.
Express shipping usually takes 1 to 2 business days.
International shipping times vary by destination and customs processing.
""",
"account_security": """
Account Security Guide
Users should enable multi-factor authentication to improve account security.
If you suspect unauthorized access, reset your password immediately and contact support.
Password reset links expire after 30 minutes for security reasons.
""",
}
# -----------------------------
# Utility functions
# -----------------------------
def normalize_text(text: str) -> List[str]:
"""
Lowercase text and extract alphanumeric tokens.
Returns:
A list of normalized tokens.
"""
return re.findall(r"[a-z0-9]+", text.lower())
def chunk_document(doc_id: str, text: str) -> List[Chunk]:
"""
Split a document into paragraph chunks.
Paragraph splitting is simple and often a decent starting point
for structured internal docs.
Args:
doc_id: The document identifier
text: The raw document text
Returns:
A list of Chunk objects
"""
paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
return [
Chunk(doc_id=doc_id, chunk_id=i, text=paragraph)
for i, paragraph in enumerate(paragraphs)
]
def build_chunk_index(documents: dict[str, str]) -> List[Chunk]:
"""
Convert a document dictionary into a flat list of chunks.
"""
all_chunks: List[Chunk] = []
for doc_id, text in documents.items():
all_chunks.extend(chunk_document(doc_id, text))
return all_chunks
def score_chunk(query: str, chunk: Chunk) -> int:
"""
Score a chunk based on keyword overlap with the query.
This is a naive retriever:
- Tokenize query and chunk
- Compute overlap count
Args:
query: The user's question
chunk: A candidate chunk
Returns:
Integer overlap score
"""
query_tokens = set(normalize_text(query))
chunk_tokens = set(normalize_text(chunk.text))
return len(query_tokens.intersection(chunk_tokens))
def retrieve(query: str, chunks: List[Chunk], top_k: int = 3) -> List[tuple[Chunk, int]]:
"""
Retrieve the top-k chunks for a query.
Args:
query: The user's question
chunks: The chunk collection
top_k: Number of chunks to return
Returns:
A list of (chunk, score) tuples sorted by descending score
"""
scored = [(chunk, score_chunk(query, chunk)) for chunk in chunks]
scored.sort(key=lambda item: item[1], reverse=True)
# Filter out zero-score chunks so only relevant matches remain
return [(chunk, score) for chunk, score in scored if score > 0][:top_k]
def main() -> None:
"""
Run a retrieval example.
"""
chunks = build_chunk_index(DOCUMENTS)
user_query = "Can I get a refund for a digital product?"
results = retrieve(user_query, chunks, top_k=3)
print(f"User query: {user_query}\n")
print("Top retrieved chunks:\n")
if not results:
print("No relevant chunks found.")
return
for rank, (chunk, score) in enumerate(results, start=1):
print(f"[Rank {rank}] doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}")
print(chunk.text)
print("-" * 60)
if __name__ == "__main__":
main()
5.2 Example Output
User query: Can I get a refund for a digital product?
Top retrieved chunks:
[Rank 1] doc_id=refund_policy, chunk_id=2, score=4
Digital products are non-refundable once the download has started, except where required by law.
------------------------------------------------------------
[Rank 2] doc_id=refund_policy, chunk_id=0, score=2
Refund Policy
------------------------------------------------------------
[Rank 3] doc_id=refund_policy, chunk_id=3, score=1
Refunds for approved returns are processed within 5 to 7 business days after inspection.
------------------------------------------------------------
5.3 Discussion
This works, but has limitations:
- It matches words, not meaning
- It may overvalue titles
- It does not understand synonyms
- It does not rank semantically similar chunks well

Still, it is very useful for understanding:
- Chunking
- Scoring
- Ranking
- Passing evidence to the generation stage
5.4 Mini Exercise
Modify the script to answer these questions:
- “How long does express shipping take?”
- “What should I do if my account may be compromised?”
- “How long do password reset links remain valid?”
Suggested learner task
- Change user_query
- Observe retrieved chunks
- Inspect whether the top result is correct
6. Hands-on Exercise 2: Add Grounded Generation with the OpenAI Responses API
Goal
Take the retrieved chunks and pass them to gpt-5.4-mini using the Responses API so the model answers using only retrieved context.
This is the core RAG loop:
1. Retrieve evidence
2. Build grounded prompt
3. Generate answer
6.1 Prerequisites
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
On Windows PowerShell:
$env:OPENAI_API_KEY="your_api_key_here"
6.2 Code: Full Basic RAG Pipeline
"""
Exercise 2: End-to-end basic RAG pipeline using the OpenAI Responses API.
What this script does:
1. Builds a chunk index from local documents
2. Retrieves the top matching chunks for a user question
3. Constructs a grounded prompt with the retrieved context
4. Calls gpt-5.4-mini via the Responses API
5. Prints the final answer and the supporting chunks
Requirements:
pip install openai
Environment:
OPENAI_API_KEY must be set
"""
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import List
from openai import OpenAI
# Create the OpenAI client once and reuse it.
client = OpenAI()
# -----------------------------
# Data model
# -----------------------------
@dataclass
class Chunk:
"""A retrievable chunk of text with metadata."""
doc_id: str
chunk_id: int
text: str
# -----------------------------
# Example knowledge base
# -----------------------------
DOCUMENTS = {
"refund_policy": """
Refund Policy
Physical products may be returned within 30 days of delivery if they are unused and in their original packaging.
Digital products are non-refundable once the download has started, except where required by law.
Refunds for approved returns are processed within 5 to 7 business days after inspection.
""",
"shipping_policy": """
Shipping Policy
Standard shipping usually takes 3 to 5 business days.
Express shipping usually takes 1 to 2 business days.
International shipping times vary by destination and customs processing.
""",
"account_security": """
Account Security Guide
Users should enable multi-factor authentication to improve account security.
If you suspect unauthorized access, reset your password immediately and contact support.
Password reset links expire after 30 minutes for security reasons.
""",
}
# -----------------------------
# Retrieval helpers
# -----------------------------
def normalize_text(text: str) -> List[str]:
"""Normalize text into lowercase alphanumeric tokens."""
return re.findall(r"[a-z0-9]+", text.lower())
def chunk_document(doc_id: str, text: str) -> List[Chunk]:
"""Split a document into paragraph chunks."""
paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
return [
Chunk(doc_id=doc_id, chunk_id=i, text=paragraph)
for i, paragraph in enumerate(paragraphs)
]
def build_chunk_index(documents: dict[str, str]) -> List[Chunk]:
"""Build a flat chunk list from all documents."""
chunks: List[Chunk] = []
for doc_id, text in documents.items():
chunks.extend(chunk_document(doc_id, text))
return chunks
def score_chunk(query: str, chunk: Chunk) -> int:
"""Score a chunk by keyword overlap."""
query_tokens = set(normalize_text(query))
chunk_tokens = set(normalize_text(chunk.text))
return len(query_tokens.intersection(chunk_tokens))
def retrieve(query: str, chunks: List[Chunk], top_k: int = 3) -> List[tuple[Chunk, int]]:
"""Return the top-k relevant chunks."""
scored = [(chunk, score_chunk(query, chunk)) for chunk in chunks]
scored.sort(key=lambda item: item[1], reverse=True)
return [(chunk, score) for chunk, score in scored if score > 0][:top_k]
# -----------------------------
# Prompt construction
# -----------------------------
def build_context(retrieved_chunks: List[tuple[Chunk, int]]) -> str:
"""
Format retrieved chunks into a context block for the model.
"""
lines = []
for chunk, score in retrieved_chunks:
lines.append(
f"[doc_id={chunk.doc_id} | chunk_id={chunk.chunk_id} | score={score}]\n{chunk.text}"
)
return "\n\n".join(lines)
def answer_with_rag(user_query: str, chunks: List[Chunk], top_k: int = 3) -> str:
"""
Retrieve context and ask the model to answer using only that context.
Args:
user_query: The user's question
chunks: Available retrievable chunks
top_k: Number of chunks to include
Returns:
The model's grounded answer as plain text
"""
retrieved = retrieve(user_query, chunks, top_k=top_k)
if not retrieved:
return "No relevant documents were found for this question."
context = build_context(retrieved)
prompt = f"""
You are a helpful assistant answering questions using only the provided context.
Instructions:
- Answer only from the context below.
- If the answer cannot be found in the context, say:
"I could not find that information in the provided documents."
- Be concise and clear.
- If possible, mention the supporting document ID.
User question:
{user_query}
Context:
{context}
""".strip()
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt,
)
return response.output_text
def main() -> None:
"""
Run the full RAG example.
"""
chunks = build_chunk_index(DOCUMENTS)
user_query = "Can I get a refund for a digital product?"
retrieved = retrieve(user_query, chunks, top_k=3)
answer = answer_with_rag(user_query, chunks, top_k=3)
print(f"Question: {user_query}\n")
print("Retrieved context:")
for rank, (chunk, score) in enumerate(retrieved, start=1):
print(f"\n[Rank {rank}] doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}")
print(chunk.text)
print("\n" + "=" * 60)
print("Model answer:")
print(answer)
if __name__ == "__main__":
main()
6.3 Example Output
Question: Can I get a refund for a digital product?
Retrieved context:
[Rank 1] doc_id=refund_policy, chunk_id=2, score=4
Digital products are non-refundable once the download has started, except where required by law.
[Rank 2] doc_id=refund_policy, chunk_id=0, score=2
Refund Policy
[Rank 3] doc_id=refund_policy, chunk_id=3, score=1
Refunds for approved returns are processed within 5 to 7 business days after inspection.
============================================================
Model answer:
According to document refund_policy, digital products are non-refundable once the download has started, except where required by law.
6.4 Key Takeaways
This basic pipeline already demonstrates the essential RAG pattern:
- Retrieval first
- Generation second
- Answer grounded in evidence
Even with a naive retriever, this approach is often better than asking the model with no context.
7. Improving the Baseline RAG Pipeline
7.1 Better Chunking
Current approach:
- Splits on paragraphs only

Potential improvements:
- Merge short paragraphs with nearby ones
- Add chunk overlap
- Preserve section titles with content
- Split long sections by sentence windows
Example idea
If a heading like “Refund Policy” appears alone, attach it to the next paragraph so retrieval is more useful.
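This heading merge can be sketched with a small length-based heuristic. The helper name `merge_headings` and the 40-character threshold below are illustrative assumptions, not part of the session code:

```python
def merge_headings(paragraphs: list[str], max_heading_len: int = 40) -> list[str]:
    """Attach short heading-like paragraphs to the paragraph that follows.

    A paragraph is treated as a heading when it is shorter than
    max_heading_len characters; real documents may need a smarter test.
    """
    merged: list[str] = []
    pending_heading = ""
    for para in paragraphs:
        if len(para) < max_heading_len:
            # Likely a bare heading: hold it and prepend it to the next paragraph.
            pending_heading = (pending_heading + "\n" + para).strip()
        else:
            merged.append((pending_heading + "\n" + para).strip())
            pending_heading = ""
    if pending_heading:
        # A trailing heading with no body still becomes its own chunk.
        merged.append(pending_heading)
    return merged


chunks = merge_headings([
    "Refund Policy",
    "Digital products are non-refundable once the download has started.",
])
print(chunks)  # One merged chunk containing heading plus body
```

With this preprocessing, a query about refunds retrieves a chunk that carries both the title and the substantive sentence, instead of a bare "Refund Policy" chunk with little content.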
7.2 Better Prompting
A stronger answer prompt can ask for:
- Short answers
- Bullet points
- Citations
- Explicit uncertainty if context is missing
Example refinement:
Answer using only the context.
Include a short citation in parentheses like (refund_policy, chunk 2).
If the answer is not present, say so clearly.
7.3 Better Retrieval
Keyword matching is easy to understand but weak in practice.
Real-world retrievers often use:
- BM25
- Dense vector embeddings
- Hybrid retrieval
- Metadata filters
Future sessions can extend this to embedding-based retrieval for semantic matching.
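As a preview of that direction, TF-IDF cosine similarity can be sketched with only the standard library. This is an illustration of the idea, not the retriever used in the exercises, and it folds the query into the document-frequency statistics as a deliberate simplification:

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts: list[str]) -> list[dict[str, float]]:
    """Turn each text into a sparse {token: tf-idf weight} vector."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: how many texts contain each token at least once.
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    return [
        {tok: count * math.log((1 + n) / (1 + df[tok]))
         for tok, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, docs: list[str]) -> list[tuple[int, float]]:
    """Rank documents by TF-IDF cosine similarity to the query."""
    # Vectorize the query together with the docs so it shares their df stats.
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scores = [(i, cosine(vec, query_vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda item: item[1], reverse=True)


docs = [
    "Express shipping usually takes 1 to 2 business days.",
    "Digital products are non-refundable once the download has started.",
]
print(rank("How fast is express shipping?", docs)[0][0])  # → 0
```

Unlike raw keyword overlap, TF-IDF down-weights tokens that appear everywhere, so distinctive terms like "express" dominate the ranking. Dense embeddings go one step further by also matching synonyms.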
8. Hands-on Exercise 3: Improve the Prompt with Citations
Goal
Update the answer generation prompt so the model cites chunk sources.
8.1 Code: Citation-Friendly RAG Prompt
"""
Exercise 3: Improve the RAG answer format with explicit citations.
This example focuses on prompt engineering rather than retrieval changes.
"""
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import List
from openai import OpenAI
client = OpenAI()
@dataclass
class Chunk:
doc_id: str
chunk_id: int
text: str
DOCUMENTS = {
"shipping_policy": """
Shipping Policy
Standard shipping usually takes 3 to 5 business days.
Express shipping usually takes 1 to 2 business days.
International shipping times vary by destination and customs processing.
"""
}
def normalize_text(text: str) -> List[str]:
return re.findall(r"[a-z0-9]+", text.lower())
def chunk_document(doc_id: str, text: str) -> List[Chunk]:
paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
return [Chunk(doc_id=doc_id, chunk_id=i, text=p) for i, p in enumerate(paragraphs)]
def retrieve(query: str, chunks: List[Chunk], top_k: int = 2) -> List[tuple[Chunk, int]]:
query_tokens = set(normalize_text(query))
scored = []
for chunk in chunks:
chunk_tokens = set(normalize_text(chunk.text))
score = len(query_tokens.intersection(chunk_tokens))
if score > 0:
scored.append((chunk, score))
scored.sort(key=lambda item: item[1], reverse=True)
return scored[:top_k]
def build_context(retrieved_chunks: List[tuple[Chunk, int]]) -> str:
return "\n\n".join(
f"[doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}]\n{chunk.text}"
for chunk, score in retrieved_chunks
)
def main() -> None:
chunks = chunk_document("shipping_policy", DOCUMENTS["shipping_policy"])
question = "How long does express shipping take?"
retrieved = retrieve(question, chunks)
context = build_context(retrieved)
prompt = f"""
Answer the user's question using only the provided context.
Requirements:
- Give a concise answer.
- Include a citation in the format: (doc_id, chunk_id).
- If the answer is not in the context, say:
"I could not find that information in the provided documents."
Question:
{question}
Context:
{context}
""".strip()
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt,
)
print("Retrieved context:")
print(context)
print("\nAnswer:")
print(response.output_text)
if __name__ == "__main__":
main()
8.2 Example Output
Retrieved context:
[doc_id=shipping_policy, chunk_id=2, score=2]
Express shipping usually takes 1 to 2 business days.
[doc_id=shipping_policy, chunk_id=0, score=1]
Shipping Policy
Answer:
Express shipping usually takes 1 to 2 business days. (shipping_policy, 2)
9. Common RAG Failure Modes
9.1 Retrieval Misses the Right Chunk
Causes:
- Bad chunking
- Weak matching
- Poor query phrasing

Symptoms:
- Model says it cannot find the answer
- Model answers from weakly related text
9.2 Too Much Irrelevant Context
Causes:
- Top-k too large
- Weak retriever returns noisy chunks

Symptoms:
- Model gets distracted
- Answer becomes verbose or incorrect
9.3 Prompt Does Not Restrict the Model
Causes:
- Vague instructions
- No fallback behavior

Symptom:
- Model fills gaps with guesses
9.4 Chunk Too Small or Too Large
Too small:
- Lacks enough context

Too large:
- Includes unrelated information
- Harder to rank accurately
10. Best Practices
- Start simple and inspect your chunks
- Print retrieved chunks during development
- Keep metadata with every chunk
- Explicitly instruct the model to use only context
- Add a fallback phrase for missing information
- Evaluate both retrieval quality and answer quality
- Prefer smaller, inspectable experiments before scaling up
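The last two practices can be combined into a tiny retrieval-evaluation harness: label a handful of queries with the document that should rank first, then measure how often retrieval gets it right. This is a minimal sketch using a keyword-overlap scorer; the labeled pairs and helper names are illustrative:

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_doc(query: str, docs: dict[str, str]) -> str:
    """Return the doc_id whose text shares the most tokens with the query."""
    return max(docs, key=lambda doc_id: len(tokenize(query) & tokenize(docs[doc_id])))


def hit_rate(labeled: list[tuple[str, str]], docs: dict[str, str]) -> float:
    """Fraction of labeled queries whose expected document ranks first."""
    hits = sum(1 for query, expected in labeled if top_doc(query, docs) == expected)
    return hits / len(labeled)


DOCS = {
    "refund_policy": "Digital products are non-refundable once the download has started.",
    "shipping_policy": "Express shipping usually takes 1 to 2 business days.",
}

# Hand-labeled (query, expected doc_id) pairs used as a tiny test set.
LABELED = [
    ("Are digital products refundable?", "refund_policy"),
    ("How fast is express shipping?", "shipping_policy"),
]

print(hit_rate(LABELED, DOCS))  # → 1.0
```

Even ten labeled pairs like this catch most chunking and retriever regressions before any model call is involved, which keeps the evaluation loop fast and free.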
11. Guided Practice Questions
Use the RAG script and try these prompts:
- “Are digital products refundable?”
- “How fast is standard shipping?”
- “What should I do after unauthorized access?”
- “How long until approved refunds are processed?”
- “What is the customs duty fee for international shipping?”
Reflection prompts
- Did retrieval return the right chunk?
- Did the model answer only from context?
- What happened when the answer was not present?
- Which chunking or prompt improvements would help?
12. Summary
In this session, learners built a minimal RAG pipeline in Python:
- Loaded local documents
- Split them into chunks
- Retrieved relevant chunks with a simple scoring strategy
- Passed context into gpt-5.4-mini
- Generated grounded answers using the OpenAI Responses API
This is the conceptual foundation of many practical GenAI systems:
- Internal document assistants
- FAQ bots
- Knowledge-grounded support tools
- Enterprise search assistants
The most important lesson is that RAG quality depends on retrieval quality. The generation model can only be as grounded as the evidence it receives.
13. Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
- Python dataclasses docs: https://docs.python.org/3/library/dataclasses.html
14. Suggested Homework
Homework Task 1
Extend the document set with a new policy, such as:
- Subscription cancellation
- Warranty policy
- Technical support SLA
Then test whether your retriever can find the right chunk.
Homework Task 2
Modify chunking so that headings are merged into the next paragraph.
Homework Task 3
Add a function that returns both:
- The generated answer
- The list of cited chunks
Homework Task 4
Test edge cases where the answer is not present and verify the fallback response is used.
15. Instructor Notes
Recommended emphasis
- RAG is a pipeline, not just a prompt
- Retrieval quality is the main bottleneck
- Debugging starts by printing chunks and scores
- Grounded prompting reduces hallucinations
Optional extension if time remains
Ask learners to compare:
- A direct question to the model without context
- A RAG-grounded question with retrieved context

Then discuss:
- Accuracy
- Confidence
- Inspectability
End of Session