
Session 3: Building a RAG Pipeline in Python

Synopsis

Walks through the core stages of a RAG system, including document loading, chunking, embedding generation, indexing, retrieval, and response synthesis. This session connects earlier API and application skills to a powerful production pattern.

Session Content


Session Overview

In this session, learners will build a complete Retrieval-Augmented Generation (RAG) pipeline in Python using the OpenAI Python SDK and the Responses API. The goal is to understand how to ground model outputs in external knowledge by retrieving relevant documents and supplying them as context to the model.

By the end of this session, learners will be able to:

  • Explain what RAG is and when to use it
  • Break a RAG system into ingestion, chunking, embedding, retrieval, and generation stages
  • Implement a simple local RAG pipeline in Python
  • Use the OpenAI Responses API with gpt-5.4-mini
  • Evaluate the quality of retrieval and answer generation
  • Identify common pitfalls such as poor chunking, noisy context, and prompt issues

Learning Objectives

After this session, learners should be able to:

  1. Define Retrieval-Augmented Generation and distinguish it from plain prompting
  2. Explain the role of chunking and retrieval in a RAG system
  3. Implement a basic keyword-based retriever in Python
  4. Connect retrieval output to a grounded generation step using the OpenAI Responses API
  5. Improve a baseline RAG pipeline with better chunking and prompt design
  6. Inspect outputs and reason about failure modes

Session Agenda (~45 minutes)

  • 0–5 min: Introduction to RAG
  • 5–12 min: Core architecture of a RAG pipeline
  • 12–20 min: Data preparation and chunking
  • 20–30 min: Hands-on Exercise 1 — Build a simple retriever
  • 30–40 min: Hands-on Exercise 2 — Add grounded answer generation with the Responses API
  • 40–45 min: Wrap-up, pitfalls, and next steps

1. What Is RAG?

1.1 Definition

Retrieval-Augmented Generation (RAG) is a pattern where an LLM first receives relevant information retrieved from a knowledge source, then uses that information to generate an answer.

Instead of relying only on its internal training knowledge, the model is given specific, up-to-date, or domain-specific content.

Plain prompting

User asks:

“What does our company refund policy say about digital products?”

Without access to company documents, the model may guess or respond generically.

RAG prompting

The system:

  1. Searches company policy documents
  2. Retrieves the most relevant passages
  3. Sends those passages to the model
  4. Asks the model to answer using the retrieved information

This makes answers:

  • More grounded
  • More accurate for private/domain-specific knowledge
  • Easier to inspect and debug


1.2 When to Use RAG

Use RAG when:

  • The knowledge changes frequently
  • You need answers based on private/internal documents
  • You want citations or source-aware answers
  • You want to reduce hallucinations by grounding outputs

Do not assume RAG solves everything. If retrieval is poor, generation quality will also be poor.


2. Anatomy of a RAG Pipeline

A simple RAG pipeline usually has these stages:

2.1 Ingestion

Load documents from files, databases, APIs, or internal systems.

Examples:

  • Markdown files
  • PDFs
  • Product documentation
  • Wiki pages
  • FAQs

2.2 Chunking

Split long documents into smaller pieces called chunks.

Why chunk?

  • LLM context windows are limited
  • Retrieval works better on focused passages
  • Small chunks are easier to rank and inspect

2.3 Embedding or Retrieval Indexing

Convert chunks into a searchable format.

Common approaches:

  • Keyword-based retrieval
  • TF-IDF / BM25
  • Dense vector embeddings

For this session, we will begin with a simple keyword-overlap retriever so learners can understand the mechanics without additional dependencies.
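Before settling on plain keyword overlap, it can help to see what TF-IDF adds: rare terms count for more than common ones. The sketch below is an illustrative, dependency-free implementation (all names are our own, not from a library), not a production scorer.

```python
import math
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Toy corpus: each "document" is one chunk.
chunks = [
    "Digital products are non-refundable once the download has started.",
    "Standard shipping usually takes 3 to 5 business days.",
    "Express shipping usually takes 1 to 2 business days.",
]

# Document frequency: in how many chunks does each term appear?
df: dict[str, int] = {}
for chunk in chunks:
    for term in set(tokenize(chunk)):
        df[term] = df.get(term, 0) + 1

def tf_idf_score(query: str, chunk: str) -> float:
    """Sum tf * idf over query terms; rarer terms weigh more."""
    chunk_terms = tokenize(chunk)
    score = 0.0
    for term in set(tokenize(query)):
        tf = chunk_terms.count(term)
        if tf == 0:
            continue
        idf = math.log(len(chunks) / df[term])
        score += tf * idf
    return score

query = "how long does express shipping take"
ranked = sorted(chunks, key=lambda c: tf_idf_score(query, c), reverse=True)
print(ranked[0])  # the express shipping chunk ranks first
```

Here "express" appears in only one chunk, so it carries more weight than "shipping", which appears in two; plain overlap counting would weigh them equally.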

2.4 Retrieval

Given a user query:

  1. Search the chunk collection
  2. Rank relevant chunks
  3. Return the top-k passages

2.5 Augmented Generation

Pass the retrieved chunks to the LLM with instructions such as:

  • Answer only from the provided context
  • Say when the answer is not present
  • Quote or cite the supporting chunks
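Those instructions can be sketched as a simple prompt-building function (the function name and wording below are illustrative; the full pipeline in Exercise 2 uses a similar template):

```python
def build_grounded_prompt(question: str, context: str) -> str:
    """Combine grounding instructions, the question, and retrieved context."""
    return (
        "Answer only from the provided context.\n"
        "If the answer is not present, say: "
        '"I could not find that information in the provided documents."\n\n'
        f"Question:\n{question}\n\n"
        f"Context:\n{context}"
    )

print(build_grounded_prompt(
    "How long does express shipping take?",
    "Express shipping usually takes 1 to 2 business days.",
))
```

Keeping the instructions, question, and context in clearly separated blocks makes prompts easier to debug later.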


3. Designing a Good RAG Workflow

3.1 Good Chunking Matters

A chunk should:

  • Be semantically coherent
  • Not be too large
  • Preserve meaning without requiring the whole document

Bad chunking:

  • Splitting in the middle of a sentence
  • Huge chunks with many unrelated topics
  • Tiny chunks with no context

A useful starting point:

  • Split by paragraphs
  • Keep chunk sizes moderate
  • Include metadata like document title and chunk ID

3.2 Good Retrieval Matters

If retrieval returns the wrong chunks:

  • The model may answer incorrectly
  • The answer may be incomplete
  • The model may say “not found” even though the answer exists elsewhere

3.3 Good Prompting Matters

The generation prompt should:

  • Clearly separate the question and the context
  • Instruct the model to rely on the context
  • Request a fallback behavior if the context is insufficient

Example instruction:

Answer using only the provided context. If the answer is not in the context, say: "I could not find that information in the provided documents."


4. Preparing the Example Knowledge Base

For the hands-on portion, we will use a small in-memory knowledge base that simulates internal documentation.

Example Documents

We will work with:

  • Refund policy
  • Shipping policy
  • Account security guide


5. Hands-on Exercise 1: Build a Simple Retriever

Goal

Create a minimal RAG retrieval stage in Python:

  • Load documents
  • Chunk them
  • Score chunks against a user query using keyword overlap
  • Return the most relevant chunks

This exercise teaches the core retrieval loop before adding the LLM.


5.1 Code: Document Loading, Chunking, and Retrieval

"""
Exercise 1: Build a simple local retriever for a RAG pipeline.

What this script does:
1. Defines a small in-memory document collection
2. Splits documents into paragraph-level chunks
3. Implements a simple keyword-overlap scoring function
4. Retrieves the top matching chunks for a user query

This is intentionally simple so the retrieval mechanics are easy to understand.
"""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import List


# -----------------------------
# Data model
# -----------------------------
@dataclass
class Chunk:
    """Represents a chunk of text plus metadata."""
    doc_id: str
    chunk_id: int
    text: str


# -----------------------------
# Example knowledge base
# -----------------------------
DOCUMENTS = {
    "refund_policy": """
Refund Policy

Physical products may be returned within 30 days of delivery if they are unused and in their original packaging.

Digital products are non-refundable once the download has started, except where required by law.

Refunds for approved returns are processed within 5 to 7 business days after inspection.
""",
    "shipping_policy": """
Shipping Policy

Standard shipping usually takes 3 to 5 business days.

Express shipping usually takes 1 to 2 business days.

International shipping times vary by destination and customs processing.
""",
    "account_security": """
Account Security Guide

Users should enable multi-factor authentication to improve account security.

If you suspect unauthorized access, reset your password immediately and contact support.

Password reset links expire after 30 minutes for security reasons.
""",
}


# -----------------------------
# Utility functions
# -----------------------------
def normalize_text(text: str) -> List[str]:
    """
    Lowercase text and extract alphanumeric tokens.

    Returns:
        A list of normalized tokens.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(doc_id: str, text: str) -> List[Chunk]:
    """
    Split a document into paragraph chunks.

    Paragraph splitting is simple and often a decent starting point
    for structured internal docs.

    Args:
        doc_id: The document identifier
        text: The raw document text

    Returns:
        A list of Chunk objects
    """
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    return [
        Chunk(doc_id=doc_id, chunk_id=i, text=paragraph)
        for i, paragraph in enumerate(paragraphs)
    ]


def build_chunk_index(documents: dict[str, str]) -> List[Chunk]:
    """
    Convert a document dictionary into a flat list of chunks.
    """
    all_chunks: List[Chunk] = []
    for doc_id, text in documents.items():
        all_chunks.extend(chunk_document(doc_id, text))
    return all_chunks


def score_chunk(query: str, chunk: Chunk) -> int:
    """
    Score a chunk based on keyword overlap with the query.

    This is a naive retriever:
    - Tokenize query and chunk
    - Compute overlap count

    Args:
        query: The user's question
        chunk: A candidate chunk

    Returns:
        Integer overlap score
    """
    query_tokens = set(normalize_text(query))
    chunk_tokens = set(normalize_text(chunk.text))
    return len(query_tokens.intersection(chunk_tokens))


def retrieve(query: str, chunks: List[Chunk], top_k: int = 3) -> List[tuple[Chunk, int]]:
    """
    Retrieve the top-k chunks for a query.

    Args:
        query: The user's question
        chunks: The chunk collection
        top_k: Number of chunks to return

    Returns:
        A list of (chunk, score) tuples sorted by descending score
    """
    scored = [(chunk, score_chunk(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda item: item[1], reverse=True)

    # Filter out zero-score chunks so only relevant matches remain
    return [(chunk, score) for chunk, score in scored if score > 0][:top_k]


def main() -> None:
    """
    Run a retrieval example.
    """
    chunks = build_chunk_index(DOCUMENTS)

    user_query = "Can I get a refund for a digital product?"
    results = retrieve(user_query, chunks, top_k=3)

    print(f"User query: {user_query}\n")
    print("Top retrieved chunks:\n")

    if not results:
        print("No relevant chunks found.")
        return

    for rank, (chunk, score) in enumerate(results, start=1):
        print(f"[Rank {rank}] doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}")
        print(chunk.text)
        print("-" * 60)


if __name__ == "__main__":
    main()

5.2 Example Output

User query: Can I get a refund for a digital product?

Top retrieved chunks:

[Rank 1] doc_id=refund_policy, chunk_id=0, score=1
Refund Policy
------------------------------------------------------------
[Rank 2] doc_id=refund_policy, chunk_id=2, score=1
Digital products are non-refundable once the download has started, except where required by law.
------------------------------------------------------------
[Rank 3] doc_id=refund_policy, chunk_id=3, score=1
Refunds for approved returns are processed within 5 to 7 business days after inspection.
------------------------------------------------------------

5.3 Discussion

This works, but it has limitations:

  • It matches words, not meaning
  • It may overvalue titles
  • It does not understand synonyms
  • It does not rank semantically similar chunks well

Still, it is very useful for understanding:

  • Chunking
  • Scoring
  • Ranking
  • Passing evidence to the generation stage
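The synonym limitation is easy to demonstrate. The snippet below re-implements the same token-overlap scoring in a self-contained form (the queries are our own illustrative examples):

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric token set, same normalization as the retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

chunk = "Digital products are non-refundable once the download has started."

# Literal wording overlaps heavily, so this query scores well.
print(len(tokens("Are digital products refundable?") & tokens(chunk)))  # → 4

# A paraphrase with synonyms shares no tokens, so this chunk is never retrieved.
print(len(tokens("Can I get my money back for a purchased file?") & tokens(chunk)))  # → 0
```

Both questions ask about the same policy, but only the first one would surface the relevant chunk under keyword overlap.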


5.4 Mini Exercise

Modify the script to answer these questions:

  1. “How long does express shipping take?”
  2. “What should I do if my account may be compromised?”
  3. “How long do password reset links remain valid?”

Suggested learner task

  • Change user_query
  • Observe retrieved chunks
  • Inspect whether the top result is correct

6. Hands-on Exercise 2: Add Grounded Generation with the OpenAI Responses API

Goal

Take the retrieved chunks and pass them to gpt-5.4-mini using the Responses API so the model answers using only retrieved context.

This is the core RAG loop:

  1. Retrieve evidence
  2. Build a grounded prompt
  3. Generate the answer


6.1 Prerequisites

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

On Windows PowerShell:

$env:OPENAI_API_KEY="your_api_key_here"

6.2 Code: Full Basic RAG Pipeline

"""
Exercise 2: End-to-end basic RAG pipeline using the OpenAI Responses API.

What this script does:
1. Builds a chunk index from local documents
2. Retrieves the top matching chunks for a user question
3. Constructs a grounded prompt with the retrieved context
4. Calls gpt-5.4-mini via the Responses API
5. Prints the final answer and the supporting chunks

Requirements:
    pip install openai

Environment:
    OPENAI_API_KEY must be set
"""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import List

from openai import OpenAI


# Create the OpenAI client once and reuse it.
client = OpenAI()


# -----------------------------
# Data model
# -----------------------------
@dataclass
class Chunk:
    """A retrievable chunk of text with metadata."""
    doc_id: str
    chunk_id: int
    text: str


# -----------------------------
# Example knowledge base
# -----------------------------
DOCUMENTS = {
    "refund_policy": """
Refund Policy

Physical products may be returned within 30 days of delivery if they are unused and in their original packaging.

Digital products are non-refundable once the download has started, except where required by law.

Refunds for approved returns are processed within 5 to 7 business days after inspection.
""",
    "shipping_policy": """
Shipping Policy

Standard shipping usually takes 3 to 5 business days.

Express shipping usually takes 1 to 2 business days.

International shipping times vary by destination and customs processing.
""",
    "account_security": """
Account Security Guide

Users should enable multi-factor authentication to improve account security.

If you suspect unauthorized access, reset your password immediately and contact support.

Password reset links expire after 30 minutes for security reasons.
""",
}


# -----------------------------
# Retrieval helpers
# -----------------------------
def normalize_text(text: str) -> List[str]:
    """Normalize text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(doc_id: str, text: str) -> List[Chunk]:
    """Split a document into paragraph chunks."""
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    return [
        Chunk(doc_id=doc_id, chunk_id=i, text=paragraph)
        for i, paragraph in enumerate(paragraphs)
    ]


def build_chunk_index(documents: dict[str, str]) -> List[Chunk]:
    """Build a flat chunk list from all documents."""
    chunks: List[Chunk] = []
    for doc_id, text in documents.items():
        chunks.extend(chunk_document(doc_id, text))
    return chunks


def score_chunk(query: str, chunk: Chunk) -> int:
    """Score a chunk by keyword overlap."""
    query_tokens = set(normalize_text(query))
    chunk_tokens = set(normalize_text(chunk.text))
    return len(query_tokens.intersection(chunk_tokens))


def retrieve(query: str, chunks: List[Chunk], top_k: int = 3) -> List[tuple[Chunk, int]]:
    """Return the top-k relevant chunks."""
    scored = [(chunk, score_chunk(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(chunk, score) for chunk, score in scored if score > 0][:top_k]


# -----------------------------
# Prompt construction
# -----------------------------
def build_context(retrieved_chunks: List[tuple[Chunk, int]]) -> str:
    """
    Format retrieved chunks into a context block for the model.
    """
    lines = []
    for chunk, score in retrieved_chunks:
        lines.append(
            f"[doc_id={chunk.doc_id} | chunk_id={chunk.chunk_id} | score={score}]\n{chunk.text}"
        )
    return "\n\n".join(lines)


def answer_with_rag(user_query: str, chunks: List[Chunk], top_k: int = 3) -> str:
    """
    Retrieve context and ask the model to answer using only that context.

    Args:
        user_query: The user's question
        chunks: Available retrievable chunks
        top_k: Number of chunks to include

    Returns:
        The model's grounded answer as plain text
    """
    retrieved = retrieve(user_query, chunks, top_k=top_k)

    if not retrieved:
        return "No relevant documents were found for this question."

    context = build_context(retrieved)

    prompt = f"""
You are a helpful assistant answering questions using only the provided context.

Instructions:
- Answer only from the context below.
- If the answer cannot be found in the context, say:
  "I could not find that information in the provided documents."
- Be concise and clear.
- If possible, mention the supporting document ID.

User question:
{user_query}

Context:
{context}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )

    return response.output_text


def main() -> None:
    """
    Run the full RAG example.
    """
    chunks = build_chunk_index(DOCUMENTS)

    user_query = "Can I get a refund for a digital product?"
    # retrieve() also runs inside answer_with_rag(); we call it separately here
    # so the retrieved context can be printed for inspection.
    retrieved = retrieve(user_query, chunks, top_k=3)
    answer = answer_with_rag(user_query, chunks, top_k=3)

    print(f"Question: {user_query}\n")
    print("Retrieved context:")
    for rank, (chunk, score) in enumerate(retrieved, start=1):
        print(f"\n[Rank {rank}] doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}")
        print(chunk.text)

    print("\n" + "=" * 60)
    print("Model answer:")
    print(answer)


if __name__ == "__main__":
    main()

6.3 Example Output

Question: Can I get a refund for a digital product?

Retrieved context:

[Rank 1] doc_id=refund_policy, chunk_id=0, score=1
Refund Policy

[Rank 2] doc_id=refund_policy, chunk_id=2, score=1
Digital products are non-refundable once the download has started, except where required by law.

[Rank 3] doc_id=refund_policy, chunk_id=3, score=1
Refunds for approved returns are processed within 5 to 7 business days after inspection.

============================================================
Model answer:
According to document refund_policy, digital products are non-refundable once the download has started, except where required by law.

6.4 Key Takeaways

This basic pipeline already demonstrates the essential RAG pattern:

  • Retrieval first
  • Generation second
  • Answer grounded in evidence

Even with a naive retriever, this approach is often better than asking the model with no context.


7. Improving the Baseline RAG Pipeline

7.1 Better Chunking

The current approach splits on paragraphs only.

Potential improvements:

  • Merge short paragraphs with nearby ones
  • Add chunk overlap
  • Preserve section titles with content
  • Split long sections by sentence windows

Example idea

If a heading like “Refund Policy” appears alone, attach it to the next paragraph so retrieval is more useful.
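One possible sketch of that idea, under the (hedged) heuristic that a short, period-free paragraph is a heading; the function and threshold below are our own, not part of the earlier exercises:

```python
def chunk_with_headings(doc_id: str, text: str) -> list[tuple[str, int, str]]:
    """Paragraph chunking, but heading-like paragraphs are merged into
    the paragraph that follows them."""
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    merged: list[str] = []
    pending_heading = ""
    for p in paragraphs:
        # Heuristic: treat short paragraphs without a trailing period as headings.
        if len(p) < 40 and not p.endswith("."):
            pending_heading = p
            continue
        merged.append(f"{pending_heading}\n{p}" if pending_heading else p)
        pending_heading = ""
    if pending_heading:  # trailing heading with no body paragraph
        merged.append(pending_heading)
    return [(doc_id, i, chunk) for i, chunk in enumerate(merged)]

doc = """Refund Policy

Digital products are non-refundable once the download has started."""
for doc_id, chunk_id, text in chunk_with_headings("refund_policy", doc):
    print(f"[{doc_id}:{chunk_id}] {text}")
```

With this change, a query matching "refund" retrieves the heading and its body together instead of a title-only chunk.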


7.2 Better Prompting

A stronger answer prompt can ask for:

  • Short answers
  • Bullet points
  • Citations
  • Explicit uncertainty if context is missing

Example refinement:

Answer using only the context.
Include a short citation in parentheses like (refund_policy, chunk 2).
If the answer is not present, say so clearly.

7.3 Better Retrieval

Keyword matching is easy to understand but weak in practice.

Real-world retrievers often use:

  • BM25
  • Dense vector embeddings
  • Hybrid retrieval
  • Metadata filters

Future sessions can extend this to embedding-based retrieval for semantic matching.
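Embedding-based retrieval replaces keyword overlap with vector similarity. The sketch below shows only the ranking step with cosine similarity; the vectors are hand-made stand-ins so the example runs offline, whereas real vectors would come from an embeddings endpoint (for example OpenAI's text-embedding-3-small, used here only as an assumption about a later session).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors standing in for real embeddings.
chunk_vectors = {
    "refund chunk": [0.9, 0.1, 0.0],
    "shipping chunk": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this embeds "can I get my money back?"

ranked = sorted(
    chunk_vectors.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
print(ranked[0][0])  # → refund chunk
```

Because similarity is computed between vectors rather than tokens, a paraphrase like "money back" can still land near the refund chunk, which keyword overlap cannot do.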


8. Hands-on Exercise 3: Improve the Prompt with Citations

Goal

Update the answer generation prompt so the model cites chunk sources.


8.1 Code: Citation-Friendly RAG Prompt

"""
Exercise 3: Improve the RAG answer format with explicit citations.

This example focuses on prompt engineering rather than retrieval changes.
"""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import List

from openai import OpenAI

client = OpenAI()


@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    text: str


DOCUMENTS = {
    "shipping_policy": """
Shipping Policy

Standard shipping usually takes 3 to 5 business days.

Express shipping usually takes 1 to 2 business days.

International shipping times vary by destination and customs processing.
"""
}


def normalize_text(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(doc_id: str, text: str) -> List[Chunk]:
    paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
    return [Chunk(doc_id=doc_id, chunk_id=i, text=p) for i, p in enumerate(paragraphs)]


def retrieve(query: str, chunks: List[Chunk], top_k: int = 2) -> List[tuple[Chunk, int]]:
    query_tokens = set(normalize_text(query))
    scored = []
    for chunk in chunks:
        chunk_tokens = set(normalize_text(chunk.text))
        score = len(query_tokens.intersection(chunk_tokens))
        if score > 0:
            scored.append((chunk, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def build_context(retrieved_chunks: List[tuple[Chunk, int]]) -> str:
    return "\n\n".join(
        f"[doc_id={chunk.doc_id}, chunk_id={chunk.chunk_id}, score={score}]\n{chunk.text}"
        for chunk, score in retrieved_chunks
    )


def main() -> None:
    chunks = chunk_document("shipping_policy", DOCUMENTS["shipping_policy"])
    question = "How long does express shipping take?"
    retrieved = retrieve(question, chunks)

    context = build_context(retrieved)

    prompt = f"""
Answer the user's question using only the provided context.

Requirements:
- Give a concise answer.
- Include a citation in the format: (doc_id, chunk_id).
- If the answer is not in the context, say:
  "I could not find that information in the provided documents."

Question:
{question}

Context:
{context}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )

    print("Retrieved context:")
    print(context)
    print("\nAnswer:")
    print(response.output_text)


if __name__ == "__main__":
    main()

8.2 Example Output

Retrieved context:
[doc_id=shipping_policy, chunk_id=2, score=2]
Express shipping usually takes 1 to 2 business days.

[doc_id=shipping_policy, chunk_id=0, score=1]
Shipping Policy

Answer:
Express shipping usually takes 1 to 2 business days. (shipping_policy, 2)

9. Common RAG Failure Modes

9.1 Retrieval Misses the Right Chunk

Cause:

  • Bad chunking
  • Weak matching
  • Poor query phrasing

Symptom:

  • Model says it cannot find the answer
  • Model answers from weakly related text

9.2 Too Much Irrelevant Context

Cause:

  • Top-k too large
  • Weak retriever returns noisy chunks

Symptom:

  • Model gets distracted
  • Answer becomes verbose or incorrect

9.3 Prompt Does Not Restrict the Model

Cause:

  • Vague instructions
  • No fallback behavior

Symptom:

  • Model fills gaps with guesses

9.4 Chunk Too Small or Too Large

Too small:

  • Lacks enough context

Too large:

  • Includes unrelated information
  • Harder to rank accurately


10. Best Practices

  • Start simple and inspect your chunks
  • Print retrieved chunks during development
  • Keep metadata with every chunk
  • Explicitly instruct the model to use only context
  • Add a fallback phrase for missing information
  • Evaluate both retrieval quality and answer quality
  • Prefer smaller, inspectable experiments before scaling up
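"Print retrieved chunks during development" can be packaged as a small helper. This is a sketch under our own names (it re-declares a minimal Chunk and scorer so it runs on its own; in the exercises you would reuse the existing ones):

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    text: str

def score(query: str, chunk: Chunk) -> int:
    """Keyword-overlap score, same normalization as the exercise retriever."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.text.lower()))
    return len(q & c)

def debug_retrieval(query: str, chunks: list[Chunk]) -> None:
    """Print every chunk with its score, best first, so retrieval problems
    are visible before the LLM ever sees the context."""
    for chunk in sorted(chunks, key=lambda c: score(query, c), reverse=True):
        s = score(query, chunk)
        marker = "HIT " if s > 0 else "miss"
        print(f"{marker} score={s:<3} {chunk.doc_id}#{chunk.chunk_id}: {chunk.text[:60]}")

chunks = [
    Chunk("refund_policy", 0, "Digital products are non-refundable once the download has started."),
    Chunk("shipping_policy", 0, "Express shipping usually takes 1 to 2 business days."),
]
debug_retrieval("Are digital products refundable?", chunks)
```

Running this before wiring up generation makes it obvious whether a bad answer is a retrieval problem or a prompting problem.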

11. Guided Practice Questions

Use the RAG script and try these prompts:

  1. “Are digital products refundable?”
  2. “How fast is standard shipping?”
  3. “What should I do after unauthorized access?”
  4. “How long until approved refunds are processed?”
  5. “What is the customs duty fee for international shipping?”

Reflection prompts

  • Did retrieval return the right chunk?
  • Did the model answer only from context?
  • What happened when the answer was not present?
  • Which chunking or prompt improvements would help?

12. Summary

In this session, learners built a minimal RAG pipeline in Python:

  • Loaded local documents
  • Split them into chunks
  • Retrieved relevant chunks with a simple scoring strategy
  • Passed context into gpt-5.4-mini
  • Generated grounded answers using the OpenAI Responses API

This is the conceptual foundation of many practical GenAI systems:

  • Internal document assistants
  • FAQ bots
  • Knowledge-grounded support tools
  • Enterprise search assistants

The most important lesson is that RAG quality depends on retrieval quality. The generation model can only be as grounded as the evidence it receives.


13. Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • Python dataclasses docs: https://docs.python.org/3/library/dataclasses.html

14. Suggested Homework

Homework Task 1

Extend the document set with a new policy, such as:

  • Subscription cancellation
  • Warranty policy
  • Technical support SLA

Then test whether your retriever can find the right chunk.

Homework Task 2

Modify chunking so that headings are merged into the next paragraph.

Homework Task 3

Add a function that returns both:

  • The generated answer
  • The list of cited chunks

Homework Task 4

Test edge cases where the answer is not present and verify the fallback response is used.


15. Instructor Notes

  • RAG is a pipeline, not just a prompt
  • Retrieval quality is the main bottleneck
  • Debugging starts by printing chunks and scores
  • Grounded prompting reduces hallucinations

Optional extension if time remains

Ask learners to compare:

  • A direct question to the model without context
  • A RAG-grounded question with retrieved context

Then discuss:

  • Accuracy
  • Confidence
  • Inspectability


End of Session

