Session 4: Improving Retrieval Quality and Relevance
Synopsis
Covers practical strategies for better chunking, metadata use, reranking, query reformulation, and context selection. Learners understand how retrieval quality affects answer quality and system reliability.
Session Content
Session Overview
In this session, learners will focus on improving the quality of retrieved context in Retrieval-Augmented Generation (RAG) systems. A basic retriever may return loosely related or noisy chunks, which can reduce answer quality. This session covers practical strategies for making retrieval more relevant, precise, and useful for downstream generation.
By the end of this session, learners will be able to:
- Diagnose common retrieval quality issues
- Improve chunking strategies for better semantic matching
- Apply metadata filtering to narrow results
- Use query rewriting to improve retrieval recall
- Add reranking logic to improve result ordering
- Evaluate retrieval quality with simple practical metrics
Learning Objectives
After this session, learners should be able to:
- Explain why retrieval quality matters in a RAG pipeline
- Compare chunking strategies and their impact on retrieval results
- Use metadata to constrain search results
- Implement query rewriting to improve document recall
- Build a lightweight reranking step
- Measure retrieval relevance with practical evaluation techniques
Prerequisites
Learners should already be comfortable with:
- Basic Python programming
- Reading and writing lists/dictionaries
- Calling an LLM with the OpenAI Python SDK
- The high-level idea of embeddings and vector search
- Basic RAG architecture
Session Timing (~45 minutes)
- 0–5 min: Why retrieval quality matters
- 5–15 min: Chunking strategies and metadata filtering
- 15–25 min: Query rewriting for better recall
- 25–35 min: Reranking retrieved chunks
- 35–42 min: Evaluating retrieval quality
- 42–45 min: Wrap-up and next steps
1. Why Retrieval Quality Matters
In a RAG system, the generator depends on the retriever. If the retriever supplies irrelevant, incomplete, or poorly ordered chunks, the final answer may be:
- Factually incomplete
- Overly generic
- Focused on the wrong topic
- Unable to answer despite the information existing in the corpus
Common retrieval problems include:
- Chunk too large: the important concept is diluted by surrounding text
- Chunk too small: useful context is split across many fragments
- Ambiguous user queries: the retriever misses relevant terminology
- No metadata constraints: results come from the wrong source, date, or topic
- Poor ranking: relevant results are present but buried below weaker matches
A strong retrieval pipeline often uses multiple quality-improvement steps:
- Better chunking
- Metadata filtering
- Query rewriting
- Reranking
- Retrieval evaluation
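The steps above can be combined into a single pipeline. The sketch below is illustrative only: every stage is a trivial stand-in (keyword matching instead of vector search, a lowercase variant instead of LLM rewriting) so the control flow is runnable; the rest of this session replaces each stand-in with a real technique.

```python
# Illustrative sketch of how the quality-improvement steps fit together.
# Each stage is a trivial stand-in so the flow is runnable end to end.
from typing import Any, Dict, List


def rewrite(query: str) -> List[str]:
    # Stand-in for query rewriting (Section 4).
    return [query, query.lower()]


def search(query: str, store: List[Dict[str, Any]],
           filters: Dict[str, str]) -> List[Dict[str, Any]]:
    # Stand-in retriever with metadata filtering (Section 3).
    hits = []
    for chunk in store:
        if any(chunk["metadata"].get(k) != v for k, v in filters.items()):
            continue
        if any(word in chunk["text"].lower() for word in query.lower().split()):
            hits.append(chunk)
    return hits


def rerank(query: str, candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Stand-in reranker (Section 5): order by query-word coverage.
    words = query.lower().split()
    return sorted(candidates,
                  key=lambda c: sum(w in c["text"].lower() for w in words),
                  reverse=True)


def improve_retrieval(query: str, store: List[Dict[str, Any]],
                      filters: Dict[str, str]) -> List[Dict[str, Any]]:
    candidates: Dict[str, Dict[str, Any]] = {}
    for q in rewrite(query):
        for chunk in search(q, store, filters):
            candidates[chunk["id"]] = chunk  # merge and dedupe by ID
    return rerank(query, list(candidates.values()))[:5]


store = [
    {"id": "a", "text": "Python SDK retries", "metadata": {"topic": "sdk"}},
    {"id": "b", "text": "Streaming output", "metadata": {"topic": "streaming"}},
]
print([c["id"] for c in improve_retrieval("retries", store, {"topic": "sdk"})])
```

The point is the shape of the pipeline, not the stand-in logic: rewrite, filter-and-retrieve, dedupe, rerank.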
2. Chunking Strategies and Their Impact
2.1 Why Chunking Matters
Vector search generally operates on chunks rather than whole documents. Chunking determines what semantic units are embedded and retrieved.
Poor chunking can create these issues:
- Key facts get separated from their explanation
- Irrelevant neighboring text dominates the chunk meaning
- Chunks become too long for accurate matching
- Results lose document structure
2.2 Common Chunking Strategies
Fixed-size chunking
Split text every N characters or tokens.
Pros:
- Easy to implement
- Predictable chunk size
Cons:
- May cut through sentences or ideas
- Can separate related content awkwardly
Sentence-based chunking
Group one or more sentences into a chunk.
Pros:
- Better semantic coherence
- Easy to explain and inspect
Cons:
- Chunk sizes may vary
- Some sections may still be too small or too large
Paragraph-based chunking
Use paragraph boundaries as chunk units.
Pros:
- Preserves author structure
- Often semantically meaningful
Cons:
- Paragraphs may be uneven
- Some paragraphs may contain multiple topics
Sliding window chunking
Create overlapping chunks so neighboring context is preserved.
Pros:
- Helps preserve context across chunk boundaries
- Improves recall for split concepts
Cons:
- Increases storage and retrieval redundancy
- Can return near-duplicate results
2.3 Practical Guidance
A useful rule of thumb:
- Start with semantically meaningful boundaries if available
- Add overlap when information spans boundaries
- Store metadata like source, section, date, and topic
- Inspect retrieved chunks manually before optimizing further
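As a concrete sketch of the metadata point, chunks can be stored as small dictionaries that carry their metadata alongside the text. The field names below are illustrative, not a fixed schema:

```python
# Sketch: attach metadata to each chunk at ingestion time so it can be
# used for filtering later. Field names here are illustrative.
from typing import Dict, List


def make_chunks(paragraphs: List[str], source: str, topic: str) -> List[Dict]:
    return [
        {
            "id": f"{source}-{i}",
            "text": para.strip(),
            "metadata": {"source": source, "section": i, "topic": topic},
        }
        for i, para in enumerate(paragraphs, start=1)
    ]


chunks = make_chunks(
    ["Retries are configurable.", "Timeouts default to 60 seconds."],
    source="sdk_guide.md",
    topic="sdk",
)
for chunk in chunks:
    print(chunk["id"], chunk["metadata"])
```

Storing metadata per chunk (rather than per document) is what makes the filtering in Section 3 possible at retrieval time.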
Hands-On Exercise 1: Compare Chunking Strategies
In this exercise, learners will:
- Create a small sample corpus
- Chunk the same document in different ways
- Inspect the resulting chunks
- Discuss which strategy might improve retrieval
"""
Exercise 1: Compare chunking strategies.
This example demonstrates:
- Fixed-size chunking
- Sentence-based chunking
- Sliding window chunking
Run:
python chunking_demo.py
"""
from typing import List
sample_text = """
Retrieval-augmented generation combines information retrieval with text generation.
A retriever identifies relevant documents or chunks from a knowledge base.
A generator then uses that retrieved context to answer the user's question.
If retrieval quality is poor, the generated answer may be incomplete or incorrect.
Chunking strategy has a major impact on retrieval performance.
Metadata filtering can further improve relevance by narrowing the search space.
Query rewriting helps when the user's wording differs from the source text.
Reranking can improve the ordering of retrieved results before generation.
""".strip()
def fixed_size_chunk(text: str, size: int = 120) -> List[str]:
"""
Split text into fixed-size character chunks.
Args:
text: Input text.
size: Maximum size of each chunk.
Returns:
A list of text chunks.
"""
return [text[i:i + size].strip() for i in range(0, len(text), size)]
def sentence_chunk(text: str, sentences_per_chunk: int = 2) -> List[str]:
"""
Split text by periods and group sentences.
Note:
This is a simple educational implementation and not a production-grade
sentence segmenter.
Args:
text: Input text.
sentences_per_chunk: Number of sentences per chunk.
Returns:
A list of grouped sentence chunks.
"""
sentences = [s.strip() for s in text.split(".") if s.strip()]
chunks = []
for i in range(0, len(sentences), sentences_per_chunk):
chunk = ". ".join(sentences[i:i + sentences_per_chunk]) + "."
chunks.append(chunk)
return chunks
def sliding_window_chunk(text: str, window_size: int = 2, step: int = 1) -> List[str]:
"""
Create overlapping chunks from sentence windows.
Args:
text: Input text.
window_size: Number of sentences in each chunk.
step: Number of sentences to move each time.
Returns:
A list of overlapping chunks.
"""
sentences = [s.strip() for s in text.split(".") if s.strip()]
chunks = []
for i in range(0, len(sentences) - window_size + 1, step):
chunk = ". ".join(sentences[i:i + window_size]) + "."
chunks.append(chunk)
return chunks
print("=== Fixed-size chunks ===")
for idx, chunk in enumerate(fixed_size_chunk(sample_text, size=120), start=1):
print(f"\nChunk {idx}:\n{chunk}")
print("\n=== Sentence-based chunks ===")
for idx, chunk in enumerate(sentence_chunk(sample_text, sentences_per_chunk=2), start=1):
print(f"\nChunk {idx}:\n{chunk}")
print("\n=== Sliding window chunks ===")
for idx, chunk in enumerate(sliding_window_chunk(sample_text, window_size=2, step=1), start=1):
print(f"\nChunk {idx}:\n{chunk}")
Example Output
=== Fixed-size chunks ===
Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation.
A retriever identifies relevant
Chunk 2:
documents or chunks from a knowledge base.
A generator then uses that retrieved context to answer the user's questio
Chunk 3:
n.
If retrieval quality is poor, the generated answer may be incomplete or incorrect.
Chunking strategy has a
...
=== Sentence-based chunks ===
Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation. A retriever identifies relevant documents or chunks from a knowledge base.
Chunk 2:
A generator then uses that retrieved context to answer the user's question. If retrieval quality is poor, the generated answer may be incomplete or incorrect.
...
=== Sliding window chunks ===
Chunk 1:
Retrieval-augmented generation combines information retrieval with text generation. A retriever identifies relevant documents or chunks from a knowledge base.
Chunk 2:
A retriever identifies relevant documents or chunks from a knowledge base. A generator then uses that retrieved context to answer the user's question.
...
Discussion Prompts
- Which chunks best preserve meaning?
- Which strategy would likely match a semantic query more accurately?
- What trade-offs appear when adding overlap?
3. Metadata Filtering
3.1 What Is Metadata?
Metadata is structured information attached to a document or chunk, such as:
- Source filename
- Topic
- Product name
- Date
- Author
- Access level
- Language
- Document type
Metadata lets us narrow retrieval before or after semantic search.
3.2 Why Metadata Helps
Suppose a user asks:
How do I configure retries in the Python SDK?
Without metadata filtering, a retriever may return:
- Python SDK docs
- JavaScript SDK docs
- API retry behavior docs
- Old release notes
With metadata filtering, we can constrain results to:
- language=python
- doc_type=guide
- product=sdk
This reduces noise and improves precision.
3.3 Pre-filter vs Post-filter
Pre-filtering
Apply metadata constraints before similarity search.
Pros:
- Faster search on a smaller subset
- Better precision
Cons:
- If filters are too strict, relevant results may be excluded
Post-filtering
Retrieve semantically, then discard irrelevant metadata matches.
Pros:
- Easier to implement
- Preserves recall
Cons:
- May waste retrieval slots on irrelevant documents
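A minimal sketch of the two orderings, using token overlap as a stand-in for vector similarity:

```python
# Sketch: pre-filtering vs post-filtering around a similarity search.
# `similarity` is a stand-in; a real system would use vector similarity.
from typing import Any, Dict, List


def similarity(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))


def pre_filter_search(query: str, docs: List[Dict[str, Any]],
                      filters: Dict[str, str], top_k: int = 3) -> List[Dict[str, Any]]:
    # Constrain first, then score only the surviving subset.
    subset = [d for d in docs
              if all(d["metadata"].get(k) == v for k, v in filters.items())]
    return sorted(subset, key=lambda d: similarity(query, d["text"]),
                  reverse=True)[:top_k]


def post_filter_search(query: str, docs: List[Dict[str, Any]],
                       filters: Dict[str, str], top_k: int = 3) -> List[Dict[str, Any]]:
    # Score everything, take top_k, then discard non-matching metadata.
    # Some of the top_k slots may be wasted on filtered-out documents.
    ranked = sorted(docs, key=lambda d: similarity(query, d["text"]),
                    reverse=True)[:top_k]
    return [d for d in ranked
            if all(d["metadata"].get(k) == v for k, v in filters.items())]


docs = [
    {"text": "python sdk retries", "metadata": {"language": "python"}},
    {"text": "javascript sdk retries", "metadata": {"language": "javascript"}},
]
filters = {"language": "python"}
print(len(pre_filter_search("sdk retries", docs, filters)))
print(len(post_filter_search("sdk retries", docs, filters)))
```

With a small `top_k` and many filtered-out documents, the post-filter version can return fewer than `top_k` usable results, which is the wasted-slots trade-off described above.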
Hands-On Exercise 2: Add Metadata Filtering
This exercise simulates retrieval over a small chunk store and adds metadata-based filtering.
"""
Exercise 2: Metadata filtering for retrieval.
This example uses a toy keyword-overlap retriever so learners can focus on
retrieval logic without requiring a full vector database.
Run:
python metadata_filtering_demo.py
"""
from typing import Dict, List, Any
documents: List[Dict[str, Any]] = [
{
"id": "doc1",
"text": "The Python SDK supports configurable retries and timeout handling.",
"metadata": {"language": "python", "doc_type": "guide", "topic": "sdk"},
},
{
"id": "doc2",
"text": "The JavaScript SDK supports retries using client configuration.",
"metadata": {"language": "javascript", "doc_type": "guide", "topic": "sdk"},
},
{
"id": "doc3",
"text": "Release notes for API updates and rate-limit behavior.",
"metadata": {"language": "generic", "doc_type": "release_notes", "topic": "api"},
},
{
"id": "doc4",
"text": "Python examples for streaming responses with the OpenAI SDK.",
"metadata": {"language": "python", "doc_type": "example", "topic": "sdk"},
},
]
def tokenize(text: str) -> set[str]:
"""
Convert text into a lowercase token set.
Args:
text: Input text.
Returns:
Set of lowercase word tokens.
"""
return set(text.lower().replace(",", "").replace(".", "").split())
def retrieve(
query: str,
docs: List[Dict[str, Any]],
top_k: int = 3,
filters: Dict[str, str] | None = None,
) -> List[Dict[str, Any]]:
"""
Retrieve documents using simple token overlap plus optional metadata filters.
Args:
query: User query.
docs: Document store.
top_k: Number of results to return.
filters: Optional metadata constraints.
Returns:
Ranked list of matching documents.
"""
query_tokens = tokenize(query)
filtered_docs = []
for doc in docs:
if filters:
# Check that all requested metadata fields match.
if not all(doc["metadata"].get(k) == v for k, v in filters.items()):
continue
doc_tokens = tokenize(doc["text"])
score = len(query_tokens & doc_tokens)
filtered_docs.append({
"id": doc["id"],
"text": doc["text"],
"metadata": doc["metadata"],
"score": score,
})
# Sort descending by overlap score.
filtered_docs.sort(key=lambda item: item["score"], reverse=True)
return filtered_docs[:top_k]
query = "How do retries work in the Python SDK?"
print("=== Without metadata filtering ===")
for result in retrieve(query, documents, top_k=3):
print(f"{result['id']} | score={result['score']} | metadata={result['metadata']}")
print(f" {result['text']}")
print("\n=== With metadata filtering (language=python, topic=sdk) ===")
for result in retrieve(
query,
documents,
top_k=3,
filters={"language": "python", "topic": "sdk"},
):
print(f"{result['id']} | score={result['score']} | metadata={result['metadata']}")
print(f" {result['text']}")
Example Output
=== Without metadata filtering ===
doc1 | score=4 | metadata={'language': 'python', 'doc_type': 'guide', 'topic': 'sdk'}
  The Python SDK supports configurable retries and timeout handling.
doc2 | score=3 | metadata={'language': 'javascript', 'doc_type': 'guide', 'topic': 'sdk'}
  The JavaScript SDK supports retries using client configuration.
doc4 | score=3 | metadata={'language': 'python', 'doc_type': 'example', 'topic': 'sdk'}
  Python examples for streaming responses with the OpenAI SDK.

=== With metadata filtering (language=python, topic=sdk) ===
doc1 | score=4 | metadata={'language': 'python', 'doc_type': 'guide', 'topic': 'sdk'}
  The Python SDK supports configurable retries and timeout handling.
doc4 | score=3 | metadata={'language': 'python', 'doc_type': 'example', 'topic': 'sdk'}
  Python examples for streaming responses with the OpenAI SDK.
Reflection Questions
- Which irrelevant result disappeared after filtering?
- What happens if filters are too strict?
- Which metadata fields would help most in your own project?
4. Query Rewriting for Better Recall
4.1 Why Query Rewriting Helps
Users often ask questions using language different from the source material. For example:
- User says: “How do I make it more reliable?”
- Docs say: “configure retries and timeouts”
A literal retriever may miss relevant chunks because the terms do not overlap well.
Query rewriting improves recall by transforming the user query into one or more retrieval-friendly variants.
4.2 Common Query Rewriting Techniques
Synonym expansion
Add related terms.
Example:
- “reliable” → “retries”, “timeouts”, “error handling”
Clarification rewrite
Rewrite vague phrasing into a more explicit search query.
Example:
- “How do I make API calls more reliable?”
→ “How to configure retries and timeouts in the Python SDK”
Multi-query retrieval
Generate multiple reformulations and retrieve for each.
Example:
- “configure retries in python sdk”
- “timeout handling python client”
- “error recovery python sdk”
This often improves recall because different phrasings surface different relevant chunks.
Hands-On Exercise 3: Query Rewriting with the OpenAI Responses API
This exercise uses the OpenAI Python SDK and the Responses API to generate better retrieval queries.
Setup
Install the SDK if needed:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Code
"""
Exercise 3: Query rewriting with the OpenAI Responses API.
This script asks the model to rewrite a user query into multiple
retrieval-optimized search queries.
Run:
python query_rewrite_demo.py
"""
import os
from openai import OpenAI
# Create the client using the API key from the environment.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def rewrite_query(user_query: str) -> str:
"""
Rewrite a user question into retrieval-friendly queries.
Args:
user_query: The original user question.
Returns:
The model's rewritten query suggestions as text.
"""
response = client.responses.create(
model="gpt-5.4-mini",
input=[
{
"role": "system",
"content": (
"You improve retrieval queries for a RAG system. "
"Rewrite the user's question into 3 short search queries. "
"Focus on likely terminology found in technical documentation. "
"Return a numbered list only."
),
},
{
"role": "user",
"content": user_query,
},
],
)
return response.output_text
if __name__ == "__main__":
query = "How can I make my API calls more reliable in Python?"
rewritten = rewrite_query(query)
print("Original query:")
print(query)
print("\nRewritten retrieval queries:")
print(rewritten)
Example Output (model-generated; your exact rewrites may differ)
Original query:
How can I make my API calls more reliable in Python?
Rewritten retrieval queries:
1. configure retries in Python SDK
2. timeout handling for API requests in Python
3. Python client error handling and retry settings
Extension Idea
Use all rewritten queries in parallel retrieval, combine results, remove duplicates, and rerank the final candidates.
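A sketch of that extension, using a toy token-overlap retriever in place of real vector search (a full version, including LLM reranking, appears in Section 7):

```python
# Sketch: retrieve with several rewritten queries, merge the result sets,
# and deduplicate by chunk ID, keeping each chunk's best score.
from typing import Any, Dict, List


def toy_retrieve(query: str, docs: List[Dict[str, Any]],
                 top_k: int = 2) -> List[Dict[str, Any]]:
    tokens = set(query.lower().split())
    scored = [
        {**doc, "score": len(tokens & set(doc["text"].lower().split()))}
        for doc in docs
    ]
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:top_k]


def multi_query_retrieve(queries: List[str],
                         docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    merged: Dict[str, Dict[str, Any]] = {}
    for q in queries:
        for result in toy_retrieve(q, docs):
            # Keep the best score seen for each chunk across all queries.
            existing = merged.get(result["id"])
            if existing is None or result["score"] > existing["score"]:
                merged[result["id"]] = result
    return list(merged.values())


docs = [
    {"id": "c1", "text": "configure retries in the python sdk"},
    {"id": "c2", "text": "timeout handling for the python client"},
    {"id": "c3", "text": "streaming responses in python"},
]
results = multi_query_retrieve(
    ["configure retries python sdk", "timeout handling python client"], docs
)
print(sorted(r["id"] for r in results))
```

Note how each rewritten query surfaces a different relevant chunk, which is exactly why multi-query retrieval improves recall.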
5. Reranking Retrieved Chunks
5.1 Why Reranking Helps
Initial retrieval is often optimized for speed. It may bring back a candidate set that includes relevant chunks, but the ranking may not be ideal.
Reranking is a second pass that scores the candidate chunks more carefully based on the exact query.
Typical pipeline:
- Retrieve top 10–20 candidates quickly
- Rerank them with a stronger relevance signal
- Pass top 3–5 reranked chunks to the generator
This often improves answer quality because the most useful evidence is placed first.
5.2 Lightweight Reranking Approaches
Heuristic reranking
Use manually defined signals such as:
- Exact term match
- Metadata boost
- Title match
- Recency boost
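A minimal heuristic reranker combining two of these signals (exact term match and a metadata boost) might look like the sketch below; the weights are arbitrary illustrative values, not tuned defaults:

```python
# Sketch: heuristic reranking combining term overlap with a metadata boost.
from typing import Any, Dict, List


def heuristic_score(query: str, chunk: Dict[str, Any],
                    preferred_metadata: Dict[str, str]) -> float:
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk["text"].lower().split())
    score = float(len(query_tokens & chunk_tokens))  # exact term match
    for key, value in preferred_metadata.items():    # metadata boost
        if chunk["metadata"].get(key) == value:
            score += 2.0
    return score


def heuristic_rerank(query: str, chunks: List[Dict[str, Any]],
                     preferred_metadata: Dict[str, str]) -> List[Dict[str, Any]]:
    return sorted(chunks,
                  key=lambda c: heuristic_score(query, c, preferred_metadata),
                  reverse=True)


chunks = [
    {"id": "a", "text": "javascript sdk retries",
     "metadata": {"language": "javascript"}},
    {"id": "b", "text": "python sdk retries",
     "metadata": {"language": "python"}},
]
ranked = heuristic_rerank("sdk retries", chunks, {"language": "python"})
print([c["id"] for c in ranked])
```

Both chunks tie on term overlap here; the metadata boost breaks the tie in favor of the Python chunk, which is the behavior a heuristic reranker is meant to encode.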
LLM-based reranking
Ask an LLM to score or order candidate chunks by relevance.
Pros:
- Strong semantic understanding
- Handles nuanced matching
Cons:
- More expensive and slower than heuristic scoring
- Requires careful prompt design
Hands-On Exercise 4: LLM-Based Reranking with the Responses API
This exercise shows how to rerank a small set of candidate chunks using gpt-5.4-mini.
"""
Exercise 4: LLM-based reranking using the OpenAI Responses API.
This script sends a user query and candidate chunks to the model and asks it
to rank them by relevance.
Run:
python rerank_demo.py
"""
import json
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def rerank_chunks(query: str, chunks: list[dict]) -> str:
"""
Ask the model to rank chunks by relevance.
Args:
query: User question.
chunks: Candidate chunks, each with an id and text.
Returns:
Ranked result in JSON text form.
"""
chunk_text = json.dumps(chunks, indent=2)
response = client.responses.create(
model="gpt-5.4-mini",
input=[
{
"role": "system",
"content": (
"You are a reranker for a RAG pipeline. "
"Given a user query and candidate chunks, rank the chunks "
"from most relevant to least relevant. "
"Return JSON with a single key 'ranked_ids' containing an array of chunk IDs only."
),
},
{
"role": "user",
"content": (
f"Query:\n{query}\n\n"
f"Candidate chunks:\n{chunk_text}"
),
},
],
)
return response.output_text
if __name__ == "__main__":
query = "How do I configure retries in the Python SDK?"
candidate_chunks = [
{
"id": "chunk_a",
"text": "The JavaScript SDK supports configurable retry behavior for failed requests.",
},
{
"id": "chunk_b",
"text": "The Python SDK supports retries, request timeouts, and client configuration options.",
},
{
"id": "chunk_c",
"text": "Streaming responses can be processed incrementally in Python applications.",
},
]
result = rerank_chunks(query, candidate_chunks)
print("Reranked chunk IDs:")
print(result)
Example Output (model-generated; exact output may vary)
Reranked chunk IDs:
{"ranked_ids":["chunk_b","chunk_a","chunk_c"]}
Production Notes
In production, you would typically:
- Parse the JSON response safely
- Validate returned chunk IDs
- Keep a fallback if parsing fails
- Limit reranking to a small candidate set for speed
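A sketch of that defensive parsing, where the fallback keeps the original retrieval order. `parse_ranked_ids` is an illustrative helper written for this session, not part of the SDK:

```python
# Sketch: safely parse a reranker's JSON output and validate the chunk
# IDs, falling back to the original order when anything is malformed.
import json
from typing import List


def parse_ranked_ids(raw_text: str, known_ids: List[str]) -> List[str]:
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return known_ids  # fallback: keep original retrieval order
    ranked = data.get("ranked_ids") if isinstance(data, dict) else None
    if not isinstance(ranked, list):
        return known_ids
    # Keep only IDs we actually retrieved, preserving the model's order.
    valid = [cid for cid in ranked if cid in known_ids]
    # Append anything the model dropped so no candidate is lost.
    valid += [cid for cid in known_ids if cid not in valid]
    return valid


ids = ["chunk_a", "chunk_b", "chunk_c"]
print(parse_ranked_ids('{"ranked_ids": ["chunk_b", "chunk_x"]}', ids))
print(parse_ranked_ids("not json at all", ids))
```

Invented IDs (`chunk_x` above) are dropped, dropped candidates are re-appended, and unparseable output degrades gracefully to the retriever's ordering.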
6. Evaluating Retrieval Quality
6.1 Why Evaluate?
Retrieval improvements should be tested rather than assumed. A change that feels better may not actually improve relevance.
Evaluation helps answer questions like:
- Are the top results more relevant?
- Are important chunks being missed?
- Did metadata filtering improve precision but hurt recall?
- Did query rewriting recover more useful evidence?
6.2 Simple Practical Metrics
Precision@k
Of the top-k retrieved chunks, how many are relevant?
Example:
- Top 5 results contain 3 relevant chunks
- Precision@5 = 3/5 = 0.60
Recall@k
Of all known relevant chunks, how many were retrieved in the top-k?
Example:
- There are 4 relevant chunks total
- Top 5 retrieved 3 of them
- Recall@5 = 3/4 = 0.75
MRR (Mean Reciprocal Rank)
How early does the first relevant result appear? MRR averages the reciprocal rank of the first relevant result over a set of queries; for a single query it is just the reciprocal rank.
Example:
- First relevant result is rank 2
- Reciprocal rank = 1/2 = 0.5
For education and prototyping, a small labeled dataset of queries and relevant chunk IDs is enough to start.
Hands-On Exercise 5: Compute Simple Retrieval Metrics
"""
Exercise 5: Evaluate retrieval quality with simple metrics.
Run:
python retrieval_metrics_demo.py
"""
from typing import List
def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""
Compute Precision@k.
Args:
retrieved: Ranked retrieved IDs.
relevant: Ground-truth relevant IDs.
k: Cutoff rank.
Returns:
Precision@k score.
"""
top_k = retrieved[:k]
hits = sum(1 for item in top_k if item in relevant)
return hits / k if k > 0 else 0.0
def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
"""
Compute Recall@k.
Args:
retrieved: Ranked retrieved IDs.
relevant: Ground-truth relevant IDs.
k: Cutoff rank.
Returns:
Recall@k score.
"""
top_k = retrieved[:k]
hits = sum(1 for item in top_k if item in relevant)
return hits / len(relevant) if relevant else 0.0
def reciprocal_rank(retrieved: List[str], relevant: List[str]) -> float:
"""
Compute reciprocal rank for the first relevant result.
Args:
retrieved: Ranked retrieved IDs.
relevant: Ground-truth relevant IDs.
Returns:
Reciprocal rank score.
"""
for rank, item in enumerate(retrieved, start=1):
if item in relevant:
return 1.0 / rank
return 0.0
if __name__ == "__main__":
retrieved_ids = ["chunk_c", "chunk_b", "chunk_a", "chunk_d"]
relevant_ids = ["chunk_b", "chunk_d"]
print(f"Retrieved IDs: {retrieved_ids}")
print(f"Relevant IDs: {relevant_ids}")
print("\nMetrics:")
print(f"Precision@1: {precision_at_k(retrieved_ids, relevant_ids, 1):.2f}")
print(f"Precision@3: {precision_at_k(retrieved_ids, relevant_ids, 3):.2f}")
print(f"Recall@3: {recall_at_k(retrieved_ids, relevant_ids, 3):.2f}")
print(f"MRR: {reciprocal_rank(retrieved_ids, relevant_ids):.2f}")
Example Output
Retrieved IDs: ['chunk_c', 'chunk_b', 'chunk_a', 'chunk_d']
Relevant IDs: ['chunk_b', 'chunk_d']
Metrics:
Precision@1: 0.00
Precision@3: 0.33
Recall@3: 0.50
MRR: 0.50
Suggested Mini-Activity
Ask learners to compare two retrieval strategies:
- baseline retrieval
- retrieval + metadata filtering
- retrieval + query rewriting
- retrieval + reranking
Then compute which version produces better Precision@k or MRR.
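Using the Precision@k function from Exercise 5, the comparison can be scripted by averaging the metric over a small labeled query set. The retrieval results below are made-up stand-ins for a baseline run and a reranked run, just to show the shape of the comparison:

```python
# Sketch: compare two retrieval strategies by averaging Precision@k over
# a small labeled query set. The result lists are illustrative stand-ins.
from typing import Dict, List


def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    return sum(1 for item in retrieved[:k] if item in relevant) / k if k > 0 else 0.0


# Ground truth: query -> relevant chunk IDs.
labels: Dict[str, List[str]] = {
    "q1": ["chunk_b"],
    "q2": ["chunk_d", "chunk_e"],
}

# Retrieval output per strategy: query -> ranked chunk IDs.
baseline = {"q1": ["chunk_a", "chunk_b", "chunk_c"],
            "q2": ["chunk_c", "chunk_d", "chunk_a"]}
reranked = {"q1": ["chunk_b", "chunk_a", "chunk_c"],
            "q2": ["chunk_d", "chunk_e", "chunk_a"]}


def mean_precision_at_k(runs: Dict[str, List[str]], k: int) -> float:
    scores = [precision_at_k(runs[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


print(f"Baseline mean P@2: {mean_precision_at_k(baseline, 2):.2f}")
print(f"Reranked mean P@2: {mean_precision_at_k(reranked, 2):.2f}")
```

The same pattern works for Recall@k or MRR: compute the per-query score, average across the labeled set, and compare strategies on the averages rather than on single anecdotal queries.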
7. End-to-End Mini Pipeline
This final exercise combines several techniques:
- Metadata filtering
- Query rewriting
- Candidate retrieval
- LLM reranking
This is still simplified, but it mirrors a realistic retrieval-quality improvement workflow.
Hands-On Exercise 6: Build a Small Retrieval-Improvement Pipeline
"""
Exercise 6: End-to-end retrieval quality improvement pipeline.
This script demonstrates:
1. Query rewriting with the OpenAI Responses API
2. Metadata filtering
3. Simple candidate retrieval
4. LLM-based reranking
Run:
python retrieval_pipeline_demo.py
"""
import json
import os
from typing import Any, Dict, List
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
DOCUMENTS: List[Dict[str, Any]] = [
{
"id": "chunk_1",
"text": "The Python SDK supports configurable retries and request timeouts.",
"metadata": {"language": "python", "topic": "sdk"},
},
{
"id": "chunk_2",
"text": "The JavaScript SDK supports retry configuration for failed requests.",
"metadata": {"language": "javascript", "topic": "sdk"},
},
{
"id": "chunk_3",
"text": "Streaming responses allow incremental processing of generated output.",
"metadata": {"language": "python", "topic": "streaming"},
},
{
"id": "chunk_4",
"text": "Client configuration options include timeout settings and retry behavior in Python.",
"metadata": {"language": "python", "topic": "sdk"},
},
]
def tokenize(text: str) -> set[str]:
"""
Tokenize text into a lowercase word set.
"""
return set(text.lower().replace(",", "").replace(".", "").split())
def simple_retrieve(
query: str,
docs: List[Dict[str, Any]],
filters: Dict[str, str] | None = None,
top_k: int = 3,
) -> List[Dict[str, Any]]:
"""
Retrieve candidate chunks using simple token overlap and optional filters.
"""
query_tokens = tokenize(query)
results = []
for doc in docs:
if filters and not all(doc["metadata"].get(k) == v for k, v in filters.items()):
continue
score = len(query_tokens & tokenize(doc["text"]))
results.append({
"id": doc["id"],
"text": doc["text"],
"metadata": doc["metadata"],
"score": score,
})
results.sort(key=lambda item: item["score"], reverse=True)
return results[:top_k]
def rewrite_query(user_query: str) -> List[str]:
"""
Generate retrieval-friendly rewrites of the user's query.
Returns:
A list of query strings parsed from the model output.
"""
response = client.responses.create(
model="gpt-5.4-mini",
input=[
{
"role": "system",
"content": (
"Rewrite the user's question into 3 short retrieval queries "
"suitable for searching technical documentation. "
"Return JSON with a single key 'queries' containing an array of strings."
),
},
{
"role": "user",
"content": user_query,
},
],
)
raw_text = response.output_text
data = json.loads(raw_text)
return data["queries"]
def rerank(query: str, candidates: List[Dict[str, Any]]) -> List[str]:
"""
Rerank candidate chunks using the LLM.
Returns:
Ordered list of chunk IDs.
"""
response = client.responses.create(
model="gpt-5.4-mini",
input=[
{
"role": "system",
"content": (
"Rank candidate chunks for relevance to the user's query. "
"Return JSON with a single key 'ranked_ids' containing an array of chunk IDs."
),
},
{
"role": "user",
"content": (
f"Query:\n{query}\n\n"
f"Candidates:\n{json.dumps(candidates, indent=2)}"
),
},
],
)
raw_text = response.output_text
data = json.loads(raw_text)
return data["ranked_ids"]
if __name__ == "__main__":
user_query = "How can I make Python API requests more reliable?"
print("User query:")
print(user_query)
rewrites = rewrite_query(user_query)
print("\nRewritten queries:")
for q in rewrites:
print(f"- {q}")
candidate_map = {}
for q in rewrites:
results = simple_retrieve(
q,
DOCUMENTS,
filters={"language": "python"},
top_k=3,
)
for item in results:
candidate_map[item["id"]] = item
candidates = list(candidate_map.values())
print("\nRetrieved candidates before reranking:")
for candidate in candidates:
print(f"- {candidate['id']} | score={candidate['score']} | {candidate['text']}")
ranked_ids = rerank(user_query, candidates)
print("\nFinal reranked order:")
for rank, chunk_id in enumerate(ranked_ids, start=1):
print(f"{rank}. {chunk_id}")
Example Output (model-generated; the rewrites, and therefore the scores, will vary between runs)
User query:
How can I make Python API requests more reliable?
Rewritten queries:
- configure retries python sdk
- timeout handling python api client
- python sdk retry and reliability settings
Retrieved candidates before reranking:
- chunk_1 | score=2 | The Python SDK supports configurable retries and request timeouts.
- chunk_4 | score=3 | Client configuration options include timeout settings and retry behavior in Python.
- chunk_3 | score=1 | Streaming responses allow incremental processing of generated output.
Final reranked order:
1. chunk_4
2. chunk_1
3. chunk_3
8. Common Pitfalls and Best Practices
Common Pitfalls
- Using chunks that are too large to be semantically precise
- Filtering so aggressively that relevant evidence is excluded
- Rewriting queries into terms not actually used in the corpus
- Reranking too many candidates, making the pipeline slow
- Optimizing retrieval without measuring results
Best Practices
- Start simple and inspect failures manually
- Keep chunk metadata rich and consistent
- Use query rewriting when user phrasing is often vague
- Rerank only a small candidate set
- Create a small labeled evaluation set early
- Improve one retrieval component at a time and compare metrics
9. Summary
In this session, learners explored how to improve retrieval quality in RAG systems through multiple practical techniques.
Key takeaways:
- Better retrieval leads to better generated answers
- Chunking strongly affects semantic match quality
- Metadata filtering improves precision
- Query rewriting improves recall when phrasing differs
- Reranking helps promote the best evidence
- Simple metrics like Precision@k, Recall@k, and MRR make improvements measurable
A good retrieval system is rarely just “embed and search.” It is usually a pipeline with several quality-focused steps.
10. Practice Challenges
Try these after the session:
- Modify the chunking exercise to support paragraph-based chunking
- Add a doc_type filter to the end-to-end pipeline
- Extend query rewriting to generate 5 candidate queries instead of 3
- Add duplicate-removal logic based on chunk text similarity
- Evaluate baseline retrieval vs reranked retrieval on 5 sample queries
- Add fallback behavior if the LLM returns invalid JSON
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- RAG overview and best practices: https://platform.openai.com/docs/guides
- Python json module documentation: https://docs.python.org/3/library/json.html
Suggested Instructor Wrap-Up
Close the session by asking learners:
- Which retrieval problem feels most common in real applications?
- Which technique seems easiest to add first?
- How would you know if a retrieval improvement actually helped?
Preview for the next session:
- Building more robust agentic workflows that use retrieval as one tool in a larger reasoning loop