Session 4: Improving Retrieval Quality and Relevance

Synopsis

Covers practical strategies for better chunking, metadata use, reranking, query reformulation, and context selection. Learners understand how retrieval quality affects answer quality and system reliability.

Session Content

Session 4: Improving Retrieval Quality and Relevance

Session Overview

In this session, learners will focus on improving the quality of retrieved context in Retrieval-Augmented Generation (RAG) systems. A basic retriever may return loosely related or noisy chunks, which can reduce answer quality. This session covers practical strategies for making retrieval more relevant, precise, and useful for downstream generation.

By the end of this session, learners will be able to:

Diagnose common retrieval quality issues
Improve chunking strategies for better semantic matching
Apply metadata filtering to narrow results
Use query rewriting to improve retrieval recall
Add reranking logic to improve result ordering
Evaluate retrieval quality with simple practical metrics

Learning Objectives

After this session, learners should be able to:

Explain why retrieval quality matters in a RAG pipeline
Compare chunking strategies and their impact on retrieval results
Use metadata to constrain search results
Implement query rewriting to improve document recall
Build a lightweight reranking step
Measure retrieval relevance with practical evaluation techniques

Prerequisites

Learners should already be comfortable with:

Basic Python programming
Reading and writing lists/dictionaries
Calling an LLM with the OpenAI Python SDK
The high-level idea of embeddings and vector search
Basic RAG architecture

Session Timing (~45 minutes)

0–5 min: Why retrieval quality matters
5–15 min: Chunking strategies and metadata filtering
15–25 min: Query rewriting for better recall
25–35 min: Reranking retrieved chunks
35–42 min: Evaluating retrieval quality
42–45 min: Wrap-up and next steps

1. Why Retrieval Quality Matters

In a RAG system, the generator depends on the retriever. If the retriever supplies irrelevant, incomplete, or poorly ordered chunks, the final answer may be:

Factually incomplete
Overly generic
Focused on the wrong topic
Unable to answer despite the information existing in the corpus

Common retrieval problems include:

Chunk too large: the important concept is diluted by surrounding text
Chunk too small: useful context is split across many fragments
Ambiguous user queries: the retriever misses relevant terminology
No metadata constraints: results come from the wrong source, date, or topic
Poor ranking: relevant results are present but buried below weaker matches

A strong retrieval pipeline often uses multiple quality-improvement steps:

Better chunking
Metadata filtering
Query rewriting
Reranking
Retrieval evaluation

2. Chunking Strategies and Their Impact

2.1 Why Chunking Matters

Vector search generally operates on chunks rather than whole documents. Chunking determines what semantic units are embedded and retrieved.

Poor chunking can create these issues:

Key facts get separated from their explanation
Irrelevant neighboring text dominates the chunk meaning
Chunks become too long for accurate matching
Results lose document structure

2.2 Common Chunking Strategies

Fixed-size chunking

Split text every N characters or tokens.

Pros - Easy to implement - Predictable chunk size

Cons - May cut through sentences or ideas - Can separate related content awkwardly

Sentence-based chunking

Group one or more sentences into a chunk.

Pros - Better semantic coherence - Easy to explain and inspect

Cons - Chunk sizes may vary - Some sections may still be too small or too large

Paragraph-based chunking

Use paragraph boundaries as chunk units.

Pros - Preserves author structure - Often semantically meaningful

Cons - Paragraphs may be uneven - Some paragraphs may contain multiple topics

Sliding window chunking

Create overlapping chunks so neighboring context is preserved.

Pros - Helps preserve context across chunk boundaries - Improves recall for split concepts

Cons - Increases storage and retrieval redundancy - Can return near-duplicate results

2.3 Practical Guidance

A useful rule of thumb:

Start with semantically meaningful boundaries if available
Add overlap when information spans boundaries
Store metadata like source, section, date, and topic
Inspect retrieved chunks manually before optimizing further

Hands-On Exercise 1: Compare Chunking Strategies

In this exercise, learners will:

Create a small sample corpus
Chunk the same document in different ways
Inspect the resulting chunks
Discuss which strategy might improve retrieval

"""
Exercise 1: Compare chunking strategies.

This example demonstrates:
- Fixed-size chunking
- Sentence-based chunking
- Sliding window chunking

Run:
    python chunking_demo.py
"""

from typing import List


sample_text = """
Retrieval-augmented generation combines information retrieval with text generation.
A retriever identifies relevant documents or chunks from a knowledge base.
A generator then uses that retrieved context to answer the user's question.
If retrieval quality is poor, the generated answer may be incomplete or incorrect.
Chunking strategy has a major impact on retrieval performance.
Metadata filtering can further improve relevance by narrowing the search space.
Query rewriting helps when the user's wording differs from the source text.
Reranking can improve the ordering of retrieved results before generation.
""".strip()


def fixed_size_chunk(text: str, size: int = 120) -> List[str]:
    """
    Split text into fixed-size character chunks.

    Args:
        text: Input text.
        size: Maximum size of each chunk.

    Returns:
        A list of text chunks.
    """
    return [text[i:i + size].strip() for i in range(0, len(text), size)]


def sentence_chunk(text: str, sentences_per_chunk: int = 2) -> List[str]:
    """
    Split text by periods and group sentences.

    Note:
        This is a simple educational implementation and not a production-grade
        sentence segmenter.

    Args:
        text: Input text.
        sentences_per_chunk: Number of sentences per chunk.

    Returns:
        A list of grouped sentence chunks.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks = []

    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ". ".join(sentences[i:i + sentences_per_chunk]) + "."
        chunks.append(chunk)

    return chunks


def sliding_window_chunk(text: str, window_size: int = 2, step: int = 1) -> List[str]:
    """
    Create overlapping chunks from sentence windows.

    Args:
        text: Input text.
        window_size: Number of sentences in each chunk.
        step: Number of sentences to move each time.

    Returns:
        A list of overlapping chunks.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks = []

    for i in range(0, len(sentences) - window_size + 1, step):
        chunk = ". ".join(sentences[i:i + window_size]) + "."
        chunks.append(chunk)

    return chunks


print("=== Fixed-size chunks ===")
for idx, chunk in enumerate(fixed_size_chunk(sample_text, size=120), start=1):
    print(f"\nChunk {idx}:\n{chunk}")

print("\n=== Sentence-based chunks ===")
for idx, chunk in enumerate(sentence_chunk(sample_text, sentences_per_chunk=2), start=1):
    print(f"\nChunk {idx}:\n{chunk}")

print("\n=== Sliding window chunks ===")
for idx, chunk in enumerate(sliding_window_chunk(sample_text, window_size=2, step=1), start=1):
    print(f"\nChunk {idx}:\n{chunk}")

Example Output

=== Fixed-size chunks ===

Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation.
A retriever identifies relevant

Chunk 2:
documents or chunks from a knowledge base.
A generator then uses that retrieved context to answer the user's questio

Chunk 3:
n.
If retrieval quality is poor, the generated answer may be incomplete or incorrect.
Chunking strategy has a

...

=== Sentence-based chunks ===

Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation. A retriever identifies relevant documents or chunks from a knowledge base.

Chunk 2:
A generator then uses that retrieved context to answer the user's question. If retrieval quality is poor, the generated answer may be incomplete or incorrect.

...

=== Sliding window chunks ===

Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation. A retriever identifies relevant documents or chunks from a knowledge base.

Chunk 2:
A retriever identifies relevant documents or chunks from a knowledge base. A generator then uses that retrieved context to answer the user's question.

...

Discussion Prompts

Which chunks best preserve meaning?
Which strategy would likely match a semantic query more accurately?
What trade-offs appear when adding overlap?

3. Metadata Filtering

3.1 What Is Metadata?

Metadata is structured information attached to a document or chunk, such as:

Source filename
Topic
Product name
Date
Author
Access level
Language
Document type

Metadata lets us narrow retrieval before or after semantic search.

3.2 Why Metadata Helps

Suppose a user asks:

How do I configure retries in the Python SDK?

Without metadata filtering, a retriever may return:

Python SDK docs
JavaScript SDK docs
API retry behavior docs
Old release notes

With metadata filtering, we can constrain results to:

language=python
doc_type=guide
product=sdk

This reduces noise and improves precision.

3.3 Pre-filter vs Post-filter

Pre-filtering

Apply metadata constraints before similarity search.

Pros - Faster search on smaller subset - Better precision

Cons - If filters are too strict, relevant results may be excluded

Post-filtering

Retrieve semantically, then discard irrelevant metadata matches.

Pros - Easier to implement - Preserves recall

Cons - May waste retrieval slots on irrelevant documents

Hands-On Exercise 2: Add Metadata Filtering

This exercise simulates retrieval over a small chunk store and adds metadata-based filtering.

"""
Exercise 2: Metadata filtering for retrieval.

This example uses a toy keyword-overlap retriever so learners can focus on
retrieval logic without requiring a full vector database.

Run:
    python metadata_filtering_demo.py
"""

from typing import Dict, List, Any


documents: List[Dict[str, Any]] = [
    {
        "id": "doc1",
        "text": "The Python SDK supports configurable retries and timeout handling.",
        "metadata": {"language": "python", "doc_type": "guide", "topic": "sdk"},
    },
    {
        "id": "doc2",
        "text": "The JavaScript SDK supports retries using client configuration.",
        "metadata": {"language": "javascript", "doc_type": "guide", "topic": "sdk"},
    },
    {
        "id": "doc3",
        "text": "Release notes for API updates and rate-limit behavior.",
        "metadata": {"language": "generic", "doc_type": "release_notes", "topic": "api"},
    },
    {
        "id": "doc4",
        "text": "Python examples for streaming responses with the OpenAI SDK.",
        "metadata": {"language": "python", "doc_type": "example", "topic": "sdk"},
    },
]


def tokenize(text: str) -> set[str]:
    """
    Convert text into a lowercase token set.

    Args:
        text: Input text.

    Returns:
        Set of lowercase word tokens.
    """
    return set(text.lower().replace(",", "").replace(".", "").split())


def retrieve(
    query: str,
    docs: List[Dict[str, Any]],
    top_k: int = 3,
    filters: Dict[str, str] | None = None,
) -> List[Dict[str, Any]]:
    """
    Retrieve documents using simple token overlap plus optional metadata filters.

    Args:
        query: User query.
        docs: Document store.
        top_k: Number of results to return.
        filters: Optional metadata constraints.

    Returns:
        Ranked list of matching documents.
    """
    query_tokens = tokenize(query)
    filtered_docs = []

    for doc in docs:
        if filters:
            # Check that all requested metadata fields match.
            if not all(doc["metadata"].get(k) == v for k, v in filters.items()):
                continue

        doc_tokens = tokenize(doc["text"])
        score = len(query_tokens & doc_tokens)

        filtered_docs.append({
            "id": doc["id"],
            "text": doc["text"],
            "metadata": doc["metadata"],
            "score": score,
        })

    # Sort descending by overlap score.
    filtered_docs.sort(key=lambda item: item["score"], reverse=True)
    return filtered_docs[:top_k]


query = "How do retries work in the Python SDK?"

print("=== Without metadata filtering ===")
for result in retrieve(query, documents, top_k=3):
    print(f"{result['id']} | score={result['score']} | metadata={result['metadata']}")
    print(f"  {result['text']}")

print("\n=== With metadata filtering (language=python, topic=sdk) ===")
for result in retrieve(
    query,
    documents,
    top_k=3,
    filters={"language": "python", "topic": "sdk"},
):
    print(f"{result['id']} | score={result['score']} | metadata={result['metadata']}")
    print(f"  {result['text']}")

Example Output

=== Without metadata filtering ===
doc1 | score=4 | metadata={'language': 'python', 'doc_type': 'guide', 'topic': 'sdk'}
  The Python SDK supports configurable retries and timeout handling.
doc2 | score=3 | metadata={'language': 'javascript', 'doc_type': 'guide', 'topic': 'sdk'}
  The JavaScript SDK supports retries using client configuration.
doc4 | score=2 | metadata={'language': 'python', 'doc_type': 'example', 'topic': 'sdk'}
  Python examples for streaming responses with the OpenAI SDK.

=== With metadata filtering (language=python, topic=sdk) ===
doc1 | score=4 | metadata={'language': 'python', 'doc_type': 'guide', 'topic': 'sdk'}
  The Python SDK supports configurable retries and timeout handling.
doc4 | score=2 | metadata={'language': 'python', 'doc_type': 'example', 'topic': 'sdk'}
  Python examples for streaming responses with the OpenAI SDK.

Reflection Questions

Which irrelevant result disappeared after filtering?
What happens if filters are too strict?
Which metadata fields would help most in your own project?

4. Query Rewriting for Better Recall

4.1 Why Query Rewriting Helps

Users often ask questions using language different from the source material. For example:

User says: “How do I make it more reliable?”
Docs say: “configure retries and timeouts”

A literal retriever may miss relevant chunks because the terms do not overlap well.

Query rewriting improves recall by transforming the user query into one or more retrieval-friendly variants.

4.2 Common Query Rewriting Techniques

Synonym expansion

Add related terms.

Example: - “reliable” → “retries”, “timeouts”, “error handling”

Clarification rewrite

Rewrite vague phrasing into a more explicit search query.

Example: - “How do I make API calls more reliable?”
→ “How to configure retries and timeouts in the Python SDK”

Multi-query retrieval

Generate multiple reformulations and retrieve for each.

Example: - “configure retries in python sdk” - “timeout handling python client” - “error recovery python sdk”

This often improves recall because different phrasings surface different relevant chunks.

Hands-On Exercise 3: Query Rewriting with the OpenAI Responses API

This exercise uses the OpenAI Python SDK and the Responses API to generate better retrieval queries.

Setup

Install the SDK if needed:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Code

"""
Exercise 3: Query rewriting with the OpenAI Responses API.

This script asks the model to rewrite a user query into multiple
retrieval-optimized search queries.

Run:
    python query_rewrite_demo.py
"""

import os
from openai import OpenAI

# Create the client using the API key from the environment.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def rewrite_query(user_query: str) -> str:
    """
    Rewrite a user question into retrieval-friendly queries.

    Args:
        user_query: The original user question.

    Returns:
        The model's rewritten query suggestions as text.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "You improve retrieval queries for a RAG system. "
                    "Rewrite the user's question into 3 short search queries. "
                    "Focus on likely terminology found in technical documentation. "
                    "Return a numbered list only."
                ),
            },
            {
                "role": "user",
                "content": user_query,
            },
        ],
    )

    return response.output_text


if __name__ == "__main__":
    query = "How can I make my API calls more reliable in Python?"
    rewritten = rewrite_query(query)

    print("Original query:")
    print(query)

    print("\nRewritten retrieval queries:")
    print(rewritten)

Example Output

Original query:
How can I make my API calls more reliable in Python?

Rewritten retrieval queries:
1. configure retries in Python SDK
2. timeout handling for API requests in Python
3. Python client error handling and retry settings

Extension Idea

Use all rewritten queries in parallel retrieval, combine results, remove duplicates, and rerank the final candidates.

5. Reranking Retrieved Chunks

5.1 Why Reranking Helps

Initial retrieval is often optimized for speed. It may bring back a candidate set that includes relevant chunks, but the ranking may not be ideal.

Reranking is a second pass that scores the candidate chunks more carefully based on the exact query.

Typical pipeline:

Retrieve top 10–20 candidates quickly
Rerank them with a stronger relevance signal
Pass top 3–5 reranked chunks to the generator

This often improves answer quality because the most useful evidence is placed first.

5.2 Lightweight Reranking Approaches

Heuristic reranking

Use manually defined signals such as:

Exact term match
Metadata boost
Title match
Recency boost

LLM-based reranking

Ask an LLM to score or order candidate chunks by relevance.

Pros - Strong semantic understanding - Handles nuanced matching

Cons - More expensive and slower than heuristic scoring - Requires careful prompt design

Hands-On Exercise 4: LLM-Based Reranking with the Responses API

This exercise shows how to rerank a small set of candidate chunks using gpt-5.4-mini.

"""
Exercise 4: LLM-based reranking using the OpenAI Responses API.

This script sends a user query and candidate chunks to the model and asks it
to rank them by relevance.

Run:
    python rerank_demo.py
"""

import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def rerank_chunks(query: str, chunks: list[dict]) -> str:
    """
    Ask the model to rank chunks by relevance.

    Args:
        query: User question.
        chunks: Candidate chunks, each with an id and text.

    Returns:
        Ranked result in JSON text form.
    """
    chunk_text = json.dumps(chunks, indent=2)

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "You are a reranker for a RAG pipeline. "
                    "Given a user query and candidate chunks, rank the chunks "
                    "from most relevant to least relevant. "
                    "Return JSON with a single key 'ranked_ids' containing an array of chunk IDs only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Query:\n{query}\n\n"
                    f"Candidate chunks:\n{chunk_text}"
                ),
            },
        ],
    )

    return response.output_text


if __name__ == "__main__":
    query = "How do I configure retries in the Python SDK?"

    candidate_chunks = [
        {
            "id": "chunk_a",
            "text": "The JavaScript SDK supports configurable retry behavior for failed requests.",
        },
        {
            "id": "chunk_b",
            "text": "The Python SDK supports retries, request timeouts, and client configuration options.",
        },
        {
            "id": "chunk_c",
            "text": "Streaming responses can be processed incrementally in Python applications.",
        },
    ]

    result = rerank_chunks(query, candidate_chunks)

    print("Reranked chunk IDs:")
    print(result)

Example Output

Reranked chunk IDs:
{"ranked_ids":["chunk_b","chunk_a","chunk_c"]}

Production Notes

In production, you would typically:

Parse the JSON response safely
Validate returned chunk IDs
Keep a fallback if parsing fails
Limit reranking to a small candidate set for speed

6. Evaluating Retrieval Quality

6.1 Why Evaluate?

Retrieval improvements should be tested rather than assumed. A change that feels better may not actually improve relevance.

Evaluation helps answer questions like:

Are the top results more relevant?
Are important chunks being missed?
Did metadata filtering improve precision but hurt recall?
Did query rewriting recover more useful evidence?

6.2 Simple Practical Metrics

Precision@k

Of the top-k retrieved chunks, how many are relevant?

Example: - Top 5 results contain 3 relevant chunks - Precision@5 = 3/5 = 0.60

Recall@k

Of all known relevant chunks, how many were retrieved in the top-k?

Example: - There are 4 relevant chunks total - Top 5 retrieved 3 of them - Recall@5 = 3/4 = 0.75

MRR (Mean Reciprocal Rank)

How early does the first relevant result appear?

Example: - First relevant result is rank 2 - Reciprocal rank = 1/2 = 0.5

For education and prototyping, a small labeled dataset of queries and relevant chunk IDs is enough to start.

Hands-On Exercise 5: Compute Simple Retrieval Metrics

"""
Exercise 5: Evaluate retrieval quality with simple metrics.

Run:
    python retrieval_metrics_demo.py
"""

from typing import List


def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """
    Compute Precision@k.

    Args:
        retrieved: Ranked retrieved IDs.
        relevant: Ground-truth relevant IDs.
        k: Cutoff rank.

    Returns:
        Precision@k score.
    """
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k if k > 0 else 0.0


def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """
    Compute Recall@k.

    Args:
        retrieved: Ranked retrieved IDs.
        relevant: Ground-truth relevant IDs.
        k: Cutoff rank.

    Returns:
        Recall@k score.
    """
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved: List[str], relevant: List[str]) -> float:
    """
    Compute reciprocal rank for the first relevant result.

    Args:
        retrieved: Ranked retrieved IDs.
        relevant: Ground-truth relevant IDs.

    Returns:
        Reciprocal rank score.
    """
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    retrieved_ids = ["chunk_c", "chunk_b", "chunk_a", "chunk_d"]
    relevant_ids = ["chunk_b", "chunk_d"]

    print(f"Retrieved IDs: {retrieved_ids}")
    print(f"Relevant IDs: {relevant_ids}")

    print("\nMetrics:")
    print(f"Precision@1: {precision_at_k(retrieved_ids, relevant_ids, 1):.2f}")
    print(f"Precision@3: {precision_at_k(retrieved_ids, relevant_ids, 3):.2f}")
    print(f"Recall@3:    {recall_at_k(retrieved_ids, relevant_ids, 3):.2f}")
    print(f"MRR:         {reciprocal_rank(retrieved_ids, relevant_ids):.2f}")

Example Output

Retrieved IDs: ['chunk_c', 'chunk_b', 'chunk_a', 'chunk_d']
Relevant IDs: ['chunk_b', 'chunk_d']

Metrics:
Precision@1: 0.00
Precision@3: 0.33
Recall@3:    0.50
MRR:         0.50

Suggested Mini-Activity

Ask learners to compare two retrieval strategies:

baseline retrieval
retrieval + metadata filtering
retrieval + query rewriting
retrieval + reranking

Then compute which version produces better Precision@k or MRR.

7. End-to-End Mini Pipeline

This final exercise combines several techniques:

Metadata filtering
Query rewriting
Candidate retrieval
LLM reranking

This is still simplified, but it mirrors a realistic retrieval-quality improvement workflow.

Hands-On Exercise 6: Build a Small Retrieval-Improvement Pipeline

"""
Exercise 6: End-to-end retrieval quality improvement pipeline.

This script demonstrates:
1. Query rewriting with the OpenAI Responses API
2. Metadata filtering
3. Simple candidate retrieval
4. LLM-based reranking

Run:
    python retrieval_pipeline_demo.py
"""

import json
import os
from typing import Any, Dict, List
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


DOCUMENTS: List[Dict[str, Any]] = [
    {
        "id": "chunk_1",
        "text": "The Python SDK supports configurable retries and request timeouts.",
        "metadata": {"language": "python", "topic": "sdk"},
    },
    {
        "id": "chunk_2",
        "text": "The JavaScript SDK supports retry configuration for failed requests.",
        "metadata": {"language": "javascript", "topic": "sdk"},
    },
    {
        "id": "chunk_3",
        "text": "Streaming responses allow incremental processing of generated output.",
        "metadata": {"language": "python", "topic": "streaming"},
    },
    {
        "id": "chunk_4",
        "text": "Client configuration options include timeout settings and retry behavior in Python.",
        "metadata": {"language": "python", "topic": "sdk"},
    },
]


def tokenize(text: str) -> set[str]:
    """
    Tokenize text into a lowercase word set.
    """
    return set(text.lower().replace(",", "").replace(".", "").split())


def simple_retrieve(
    query: str,
    docs: List[Dict[str, Any]],
    filters: Dict[str, str] | None = None,
    top_k: int = 3,
) -> List[Dict[str, Any]]:
    """
    Retrieve candidate chunks using simple token overlap and optional filters.
    """
    query_tokens = tokenize(query)
    results = []

    for doc in docs:
        if filters and not all(doc["metadata"].get(k) == v for k, v in filters.items()):
            continue

        score = len(query_tokens & tokenize(doc["text"]))
        results.append({
            "id": doc["id"],
            "text": doc["text"],
            "metadata": doc["metadata"],
            "score": score,
        })

    results.sort(key=lambda item: item["score"], reverse=True)
    return results[:top_k]


def rewrite_query(user_query: str) -> List[str]:
    """
    Generate retrieval-friendly rewrites of the user's query.

    Returns:
        A list of query strings parsed from the model output.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's question into 3 short retrieval queries "
                    "suitable for searching technical documentation. "
                    "Return JSON with a single key 'queries' containing an array of strings."
                ),
            },
            {
                "role": "user",
                "content": user_query,
            },
        ],
    )

    raw_text = response.output_text
    data = json.loads(raw_text)
    return data["queries"]


def rerank(query: str, candidates: List[Dict[str, Any]]) -> List[str]:
    """
    Rerank candidate chunks using the LLM.

    Returns:
        Ordered list of chunk IDs.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "Rank candidate chunks for relevance to the user's query. "
                    "Return JSON with a single key 'ranked_ids' containing an array of chunk IDs."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Query:\n{query}\n\n"
                    f"Candidates:\n{json.dumps(candidates, indent=2)}"
                ),
            },
        ],
    )

    raw_text = response.output_text
    data = json.loads(raw_text)
    return data["ranked_ids"]


if __name__ == "__main__":
    user_query = "How can I make Python API requests more reliable?"

    print("User query:")
    print(user_query)

    rewrites = rewrite_query(user_query)
    print("\nRewritten queries:")
    for q in rewrites:
        print(f"- {q}")

    candidate_map = {}

    for q in rewrites:
        results = simple_retrieve(
            q,
            DOCUMENTS,
            filters={"language": "python"},
            top_k=3,
        )
        for item in results:
            candidate_map[item["id"]] = item

    candidates = list(candidate_map.values())

    print("\nRetrieved candidates before reranking:")
    for candidate in candidates:
        print(f"- {candidate['id']} | score={candidate['score']} | {candidate['text']}")

    ranked_ids = rerank(user_query, candidates)

    print("\nFinal reranked order:")
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        print(f"{rank}. {chunk_id}")

Example Output

User query:
How can I make Python API requests more reliable?

Rewritten queries:
- configure retries python sdk
- timeout handling python api client
- python sdk retry and reliability settings

Retrieved candidates before reranking:
- chunk_1 | score=2 | The Python SDK supports configurable retries and request timeouts.
- chunk_4 | score=3 | Client configuration options include timeout settings and retry behavior in Python.
- chunk_3 | score=1 | Streaming responses allow incremental processing of generated output.

Final reranked order:
1. chunk_4
2. chunk_1
3. chunk_3

8. Common Pitfalls and Best Practices

Common Pitfalls

Using chunks that are too large to be semantically precise
Filtering so aggressively that relevant evidence is excluded
Rewriting queries into terms not actually used in the corpus
Reranking too many candidates, making the pipeline slow
Optimizing retrieval without measuring results

Best Practices

Start simple and inspect failures manually
Keep chunk metadata rich and consistent
Use query rewriting when user phrasing is often vague
Rerank only a small candidate set
Create a small labeled evaluation set early
Improve one retrieval component at a time and compare metrics

9. Summary

In this session, learners explored how to improve retrieval quality in RAG systems through multiple practical techniques.

Key takeaways:

Better retrieval leads to better generated answers
Chunking strongly affects semantic match quality
Metadata filtering improves precision
Query rewriting improves recall when phrasing differs
Reranking helps promote the best evidence
Simple metrics like Precision@k, Recall@k, and MRR make improvements measurable

A good retrieval system is rarely just “embed and search.” It is usually a pipeline with several quality-focused steps.

10. Practice Challenges

Try these after the session:

Modify the chunking exercise to support paragraph-based chunking
Add a doc_type filter to the end-to-end pipeline
Extend query rewriting to generate 5 candidate queries instead of 3
Add duplicate-removal logic based on chunk text similarity
Evaluate baseline retrieval vs reranked retrieval on 5 sample queries
Add fallback behavior if the LLM returns invalid JSON

Useful Resources

OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
OpenAI API docs: https://platform.openai.com/docs
OpenAI Python SDK: https://github.com/openai/openai-python
RAG overview and best practices: https://platform.openai.com/docs/guides
Python json module documentation: https://docs.python.org/3/library/json.html

Suggested Instructor Wrap-Up

Close the session by asking learners:

Which retrieval problem feels most common in real applications?
Which technique seems easiest to add first?
How would you know if a retrieval improvement actually helped?

Preview for the next session:

Building more robust agentic workflows that use retrieval as one tool in a larger reasoning loop

Back to Chapter | Back to Master Plan | Previous Session