Session 3: Observability for Prompts, Retrieval, and Tools
Synopsis
Explains how to inspect prompt inputs, retrieved context, tool calls, intermediate outputs, and final responses. Learners gain visibility into where failures occur within multi-component systems.
Session Content
Session Overview
In this session, learners will build a practical observability mindset for GenAI applications. The focus is on understanding how to inspect, debug, and improve systems that involve prompts, retrieval pipelines, and tool usage. By the end of the session, learners will be able to instrument a simple Python application, capture useful traces and logs, diagnose common failure modes, and evaluate outputs systematically.
Duration
~45 minutes
Learning Objectives
By the end of this session, learners should be able to:
- Explain why observability matters in LLM and agentic systems
- Identify the key signals to monitor for prompts, retrieval, and tool usage
- Add basic tracing and structured logging to a Python GenAI workflow
- Record prompt inputs, model outputs, retrieval context, and tool calls safely
- Diagnose common failures such as hallucinations, poor retrieval, and tool misuse
- Build a lightweight evaluation loop for iterative improvement
Agenda
- Why observability matters in GenAI systems
- Observability for prompts
- Observability for retrieval
- Observability for tools
- Hands-on Exercise 1: Add structured logging to an LLM workflow
- Hands-on Exercise 2: Observe retrieval quality in a mini RAG pipeline
- Hands-on Exercise 3: Track and debug tool calls
- Wrap-up and next steps
1. Why Observability Matters in GenAI Systems
Traditional software systems are often debugged with stack traces, logs, metrics, and tests. GenAI systems need all of those, but they also require visibility into probabilistic behavior.
Why GenAI systems are harder to debug
A GenAI application may fail because of:
- A vague or conflicting prompt
- Missing or irrelevant retrieval results
- Hallucinated answers
- A tool being called with incorrect arguments
- A tool not being called when it should be
- Model randomness or prompt sensitivity
- Hidden context-window issues
- Latency from external dependencies
What observability means in this context
Observability is the ability to understand what happened inside your system by looking at the signals it emits.
For GenAI applications, useful signals include:
- User input
- Final prompt sent to the model
- System/developer instructions
- Model response
- Token usage and latency
- Retrieved documents and relevance scores
- Tool selection decisions
- Tool inputs and outputs
- Error messages and retries
- Human or automated evaluation scores
The core principle
If you cannot inspect it, you cannot improve it.
2. Observability for Prompts
Prompt observability answers questions like:
- What exact instructions did the model receive?
- What context was inserted dynamically?
- Which model and parameters were used?
- How long did the request take?
- What output was produced?
Important prompt-level signals
Inputs
- User message
- System or developer prompt
- Retrieved context
- Tool results inserted into context
Request metadata
- Model name
- Temperature or generation controls where applicable
- Request timestamp
- Request ID
- Session ID or conversation ID
Outputs
- Final answer
- Refusal or uncertainty markers
- Output length
- Structured data validity
Prompt debugging checklist
When a prompt gives poor results, check:
- Was the user intent correctly captured?
- Were instructions too vague or too long?
- Did retrieval inject distracting information?
- Did the model output match the requested format?
- Was the answer grounded in provided evidence?
Best practices
- Log prompts in a structured format
- Separate template from dynamic values
- Version prompts
- Capture outputs alongside prompt versions
- Redact secrets and sensitive user data before storage
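The practices above can be sketched in a few lines. This is a minimal, illustrative example of separating a prompt template from its dynamic values and versioning it; the `support_answer_v2` name and field layout are assumptions for the sketch, not a prescribed schema.

```python
import json
import string

# Hypothetical version tag and template; names are illustrative.
PROMPT_VERSION = "support_answer_v2"
PROMPT_TEMPLATE = string.Template(
    "You are a support assistant.\n"
    "Context:\n$context\n\n"
    "Question: $question"
)

def render_and_log(context: str, question: str) -> dict:
    """Render the template and build a structured log record that keeps
    the prompt version separate from the dynamic values."""
    prompt = PROMPT_TEMPLATE.substitute(context=context, question=question)
    record = {
        "prompt_version": PROMPT_VERSION,
        "template_vars": {"context": context[:200], "question": question},
        "rendered_prompt_preview": prompt[:300],
    }
    print(json.dumps(record, indent=2))
    return record

record = render_and_log(
    "Refunds are available within 30 days.",
    "Can I get a refund?",
)
```

Because the version travels with every log entry, a regression after a template change can be traced back to the exact prompt revision that produced it.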
3. Observability for Retrieval
Retrieval is often the hidden source of poor answers in RAG systems.
Common retrieval failure modes
- No relevant documents retrieved
- Relevant documents ranked too low
- Chunk too small or too large
- Duplicate chunks dominate results
- Retrieved content is stale
- The answer exists in the corpus but is phrased differently from the query
- Retrieved text contains contradictions
What to log for retrieval
For each retrieval event, capture:
- Query text
- Query embedding version if embeddings are used
- Retrieved document IDs
- Chunk text preview
- Relevance scores
- Rank positions
- Data source
- Retrieval latency
Questions observability helps answer
- Did the system retrieve anything useful?
- Did the top-ranked chunk actually contain the answer?
- Are scores tightly clustered, suggesting weak ranking?
- Are users asking questions not covered by the corpus?
- Is a bad answer actually a retrieval problem?
Lightweight retrieval diagnostics
A helpful inspection table might include:
| Rank | Doc ID | Score | Preview | Contains Answer? |
|---|---|---|---|---|
| 1 | faq_12 | 0.91 | “Refunds are available within…” | Yes |
| 2 | blog_4 | 0.88 | “Customer satisfaction is…” | No |
| 3 | policy_7 | 0.83 | “Returns are not accepted…” | Partial |
This quickly distinguishes ranking problems from answer-generation problems.
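A table like this can be generated directly from logged retrieval results. The sketch below assumes each result dict carries `id`, `score`, `text`, and a manually assigned `contains_answer` label; all of those field names, IDs, and scores are illustrative.

```python
def format_retrieval_table(results: list[dict]) -> str:
    """Render retrieval results as a markdown-style inspection table.
    The 'Contains Answer?' labels come from manual review or an eval step."""
    lines = [
        "| Rank | Doc ID | Score | Preview | Contains Answer? |",
        "|---|---|---|---|---|",
    ]
    for rank, r in enumerate(results, start=1):
        preview = r["text"][:30]
        lines.append(
            f"| {rank} | {r['id']} | {r['score']:.2f} "
            f"| {preview} | {r.get('contains_answer', '?')} |"
        )
    return "\n".join(lines)

# Illustrative results; IDs and scores are made up for the example.
results = [
    {"id": "faq_12", "score": 0.91,
     "text": "Refunds are available within 30 days.", "contains_answer": "Yes"},
    {"id": "blog_4", "score": 0.88,
     "text": "Customer satisfaction is our priority.", "contains_answer": "No"},
]
table = format_retrieval_table(results)
print(table)
```

Dumping such a table for a handful of failing queries is often enough to tell whether ranking or generation is at fault.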
4. Observability for Tools
In agentic systems, tool usage adds another debugging layer.
Typical tool-related failures
- Wrong tool chosen
- Correct tool chosen with wrong arguments
- Tool output ignored by the model
- Tool called repeatedly in a loop
- Tool error not handled gracefully
- Slow tools causing timeouts
- Tool results contradict prompt assumptions
What to log for tool execution
For every tool call, record:
- Tool name
- Why the tool was selected, if available
- Tool arguments
- Start time and end time
- Latency
- Success or failure
- Raw tool output
- Post-processed output sent back to model
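One lightweight way to capture these fields is a single record type per tool call. This is a sketch, not a required schema; the `ToolCallRecord` name and its fields are assumptions chosen to mirror the checklist above.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ToolCallRecord:
    """One structured record per tool call; field names are illustrative."""
    tool_name: str
    arguments: dict
    started_at: float
    ended_at: float = 0.0
    success: bool = False
    raw_output: str = ""

    @property
    def latency_ms(self) -> float:
        # Derived rather than stored, so it can never disagree with the timestamps.
        return round((self.ended_at - self.started_at) * 1000, 2)

record = ToolCallRecord(
    tool_name="get_weather",
    arguments={"city": "Paris"},
    started_at=time.perf_counter(),
)
# ... execute the tool here ...
record.ended_at = record.started_at + 0.005  # simulated 5 ms call
record.success = True
record.raw_output = '{"temperature_c": 18}'

print(json.dumps({**asdict(record), "latency_ms": record.latency_ms}, indent=2))
```

Keeping one record per call also makes loop detection easy: repeated records with identical `tool_name` and `arguments` in one request are a red flag.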
Tool observability checklist
- Did the tool run successfully?
- Were arguments validated?
- Was the output understandable to the model?
- Did the model ground its answer in the tool output?
- Were there repeated or unnecessary calls?
Safety considerations
Be careful not to log:
- API secrets
- Authentication tokens
- Personally identifiable information
- Full raw payloads if they contain sensitive content
Prefer:
- Redacted logs
- Hashed IDs
- Partial previews
- Structured fields over raw dumps
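Redaction can be applied before any log entry is written. The sketch below covers only two patterns, emails and bearer-style tokens; real systems need broader coverage, and the placeholder strings are illustrative choices.

```python
import re

# Minimal patterns for demonstration only; production redaction
# needs far more cases (phone numbers, keys, addresses, ...).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
TOKEN_RE = re.compile(r"(?i)bearer\s+[a-z0-9._-]+")

def redact(text: str) -> str:
    """Replace emails and bearer tokens with placeholder markers."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = TOKEN_RE.sub("[REDACTED_TOKEN]", text)
    return text

out = redact("Contact jane.doe@example.com with Bearer abc123")
print(out)
```

Running redaction in the logging helper, rather than at each call site, ensures no code path can forget it.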
5. Hands-on Exercise 1: Add Structured Logging to an LLM Workflow
Goal
Create a small Python app that:
- sends a prompt using the OpenAI Responses API
- records structured logs for request and response
- tracks latency
- prints a clean summary
What learners will practice
- Using the OpenAI Python SDK
- Calling `client.responses.create(...)`
- Creating structured logs
- Measuring latency
- Capturing prompt/response data safely
Code
import json
import time
import uuid
from datetime import datetime, timezone
from openai import OpenAI
# Create the client.
# Make sure the OPENAI_API_KEY environment variable is set.
client = OpenAI()
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""
Print a structured JSON log entry.
In production, this would usually be sent to a logging platform
or stored in a file/database.
"""
entry = {
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
}
print(json.dumps(entry, indent=2))
def ask_llm(user_question: str) -> str:
"""
Send a request to the OpenAI Responses API and log observability data.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
# A simple developer instruction.
instructions = (
"You are a concise assistant. "
"Answer clearly in 2-4 bullet points."
)
log_event(
"llm_request_started",
{
"request_id": request_id,
"model": model,
"instructions_preview": instructions[:120],
"user_question": user_question,
},
)
start = time.perf_counter()
response = client.responses.create(
model=model,
instructions=instructions,
input=user_question,
)
elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
answer_text = response.output_text
log_event(
"llm_request_completed",
{
"request_id": request_id,
"model": model,
"latency_ms": elapsed_ms,
"answer_preview": answer_text[:300],
},
)
return answer_text
if __name__ == "__main__":
question = "What are three benefits of structured logging in AI applications?"
answer = ask_llm(question)
print("\n=== Final Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:00:00.000000+00:00",
"event_type": "llm_request_started",
"request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
"model": "gpt-5.4-mini",
"instructions_preview": "You are a concise assistant. Answer clearly in 2-4 bullet points.",
"user_question": "What are three benefits of structured logging in AI applications?"
}
{
"timestamp": "2026-03-22T10:00:01.200000+00:00",
"event_type": "llm_request_completed",
"request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
"model": "gpt-5.4-mini",
"latency_ms": 1187.32,
"answer_preview": "- Makes debugging easier by preserving request and response context...\n- Enables filtering and aggregation across events...\n- Supports evaluation and performance monitoring over time..."
}
=== Final Answer ===
- Makes debugging easier by preserving request and response context.
- Enables filtering and aggregation across events, prompts, and users.
- Supports evaluation, alerting, and trend analysis over time.
Exercise Tasks
- Run the script and inspect the logs.
- Add a `session_id` field to each log entry.
- Add a `prompt_version` field.
- Redact emails if they appear in `user_question`.
- Save logs to a local file instead of printing them.
Discussion
Ask learners:
- What would be useful to search for in these logs?
- Which fields are important for debugging failures?
- Which fields should not be stored in plaintext?
6. Hands-on Exercise 2: Observe Retrieval Quality in a Mini RAG Pipeline
Goal
Build a tiny retrieval pipeline using in-memory documents and log:
- the user query
- the retrieved documents
- simple relevance scores
- the final prompt context
- the model answer
This exercise emphasizes that many “LLM problems” are actually retrieval problems.
What learners will practice
- Simulating retrieval with Python
- Logging retrieval rankings
- Passing retrieved context to the model
- Inspecting whether the answer was grounded in useful context
Code
import json
import math
import re
import time
import uuid
from collections import Counter
from datetime import datetime, timezone
from openai import OpenAI
client = OpenAI()
DOCUMENTS = [
{
"id": "doc_1",
"title": "Refund Policy",
"text": "Customers may request a full refund within 30 days of purchase with proof of payment.",
},
{
"id": "doc_2",
"title": "Shipping Policy",
"text": "Standard shipping takes 5 to 7 business days. Expedited shipping takes 2 business days.",
},
{
"id": "doc_3",
"title": "Account Security",
"text": "Users should enable two-factor authentication and use a strong unique password.",
},
{
"id": "doc_4",
"title": "Subscription Terms",
"text": "Monthly subscriptions renew automatically unless canceled before the next billing date.",
},
]
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""Print a structured event log."""
print(
json.dumps(
{
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
},
indent=2,
)
)
def tokenize(text: str) -> list[str]:
"""
Convert text into lowercase word tokens.
This is a simple tokenizer for demonstration purposes.
"""
return re.findall(r"\b\w+\b", text.lower())
def cosine_similarity(text_a: str, text_b: str) -> float:
"""
Compute cosine similarity between two texts using bag-of-words counts.
This is intentionally simple so learners can inspect retrieval behavior.
"""
tokens_a = tokenize(text_a)
tokens_b = tokenize(text_b)
counts_a = Counter(tokens_a)
counts_b = Counter(tokens_b)
all_terms = set(counts_a) | set(counts_b)
dot = sum(counts_a[t] * counts_b[t] for t in all_terms)
norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
if norm_a == 0 or norm_b == 0:
return 0.0
return dot / (norm_a * norm_b)
def retrieve(query: str, top_k: int = 2) -> list[dict]:
"""
Rank documents by simple cosine similarity against the user query.
"""
scored = []
for doc in DOCUMENTS:
score = cosine_similarity(query, doc["text"] + " " + doc["title"])
scored.append(
{
"id": doc["id"],
"title": doc["title"],
"text": doc["text"],
"score": round(score, 4),
}
)
ranked = sorted(scored, key=lambda d: d["score"], reverse=True)
return ranked[:top_k]
def answer_question_with_rag(query: str) -> str:
"""
Retrieve context, log retrieval details, then ask the model to answer
using only the retrieved information.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
log_event(
"rag_request_started",
{
"request_id": request_id,
"query": query,
"model": model,
},
)
retrieval_start = time.perf_counter()
top_docs = retrieve(query, top_k=2)
retrieval_latency_ms = round((time.perf_counter() - retrieval_start) * 1000, 2)
log_event(
"retrieval_completed",
{
"request_id": request_id,
"query": query,
"retrieval_latency_ms": retrieval_latency_ms,
"results": [
{
"rank": i + 1,
"doc_id": doc["id"],
"title": doc["title"],
"score": doc["score"],
"preview": doc["text"][:100],
}
for i, doc in enumerate(top_docs)
],
},
)
context = "\n\n".join(
[
f"[{doc['id']}] {doc['title']}\n{doc['text']}"
for doc in top_docs
]
)
instructions = (
"Answer the user's question using only the provided context. "
"If the answer is not in the context, say: 'I could not find that in the provided documents.'"
)
model_input = (
f"Context:\n{context}\n\n"
f"Question: {query}"
)
log_event(
"prompt_prepared",
{
"request_id": request_id,
"instructions_preview": instructions[:150],
"context_preview": context[:300],
},
)
llm_start = time.perf_counter()
response = client.responses.create(
model=model,
instructions=instructions,
input=model_input,
)
llm_latency_ms = round((time.perf_counter() - llm_start) * 1000, 2)
answer = response.output_text
log_event(
"rag_request_completed",
{
"request_id": request_id,
"llm_latency_ms": llm_latency_ms,
"answer_preview": answer[:300],
},
)
return answer
if __name__ == "__main__":
question = "Can I get my money back after buying a product?"
answer = answer_question_with_rag(question)
print("\n=== RAG Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:05:00.000000+00:00",
"event_type": "rag_request_started",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"query": "Can I get my money back after buying a product?",
"model": "gpt-5.4-mini"
}
{
"timestamp": "2026-03-22T10:05:00.010000+00:00",
"event_type": "retrieval_completed",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"query": "Can I get my money back after buying a product?",
"retrieval_latency_ms": 2.13,
"results": [
{
"rank": 1,
"doc_id": "doc_1",
"title": "Refund Policy",
"score": 0.1633,
"preview": "Customers may request a full refund within 30 days of purchase with proof of payment."
},
{
"rank": 2,
"doc_id": "doc_4",
"title": "Subscription Terms",
"score": 0.0,
"preview": "Monthly subscriptions renew automatically unless canceled before the next billing date."
}
]
}
{
"timestamp": "2026-03-22T10:05:01.220000+00:00",
"event_type": "rag_request_completed",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"llm_latency_ms": 1198.76,
"answer_preview": "Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment."
}
=== RAG Answer ===
Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment.
Exercise Tasks
- Run the script with three different user questions.
- For each question, inspect whether the top result truly contains the answer.
- Add a boolean field called `likely_relevant` based on a score threshold.
- Add a warning log if all retrieved scores are near zero.
- Modify the corpus to include a conflicting refund policy and observe what happens.
Reflection Questions
- Did the model fail, or did retrieval fail?
- How would you detect stale or contradictory documents?
- What metadata would help explain ranking decisions?
7. Hands-on Exercise 3: Track and Debug Tool Calls
Goal
Create a small workflow where the model can use a tool to look up weather data, and log:
- the tool schema
- the tool call arguments
- the tool result
- the final answer
This exercise helps learners understand how to observe and debug tool-enabled systems.
What learners will practice
- Defining a tool for the Responses API
- Executing the tool in Python
- Feeding tool results back to the model
- Logging each tool interaction
Code
import json
import time
import uuid
from datetime import datetime, timezone
from openai import OpenAI
client = OpenAI()
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""Print a structured event log."""
print(
json.dumps(
{
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
},
indent=2,
)
)
def get_weather(city: str) -> dict:
"""
Mock weather lookup tool.
In a real application, this function would call an external API.
"""
fake_weather_db = {
"london": {"temperature_c": 14, "condition": "Cloudy"},
"paris": {"temperature_c": 18, "condition": "Sunny"},
"tokyo": {"temperature_c": 22, "condition": "Rain showers"},
}
normalized = city.strip().lower()
return fake_weather_db.get(
normalized,
{"temperature_c": None, "condition": "Unknown city"},
)
def run_weather_agent(user_question: str) -> str:
"""
Ask the model to answer a weather question using a Python tool.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
tools = [
{
"type": "function",
"name": "get_weather",
"description": "Look up current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name to look up.",
}
},
"required": ["city"],
"additionalProperties": False,
},
}
]
log_event(
"agent_request_started",
{
"request_id": request_id,
"model": model,
"user_question": user_question,
"tools": tools,
},
)
start = time.perf_counter()
first_response = client.responses.create(
model=model,
instructions=(
"You are a helpful assistant. "
"Use the available tool when the user asks for weather information."
),
input=user_question,
tools=tools,
)
# Inspect tool calls emitted by the model.
tool_outputs = []
tool_call_count = 0
for item in first_response.output:
if item.type == "function_call":
tool_call_count += 1
tool_name = item.name
call_id = item.call_id
arguments = json.loads(item.arguments)
log_event(
"tool_call_requested",
{
"request_id": request_id,
"tool_name": tool_name,
"call_id": call_id,
"arguments": arguments,
},
)
tool_start = time.perf_counter()
if tool_name == "get_weather":
result = get_weather(arguments["city"])
else:
result = {"error": f"Unknown tool: {tool_name}"}
tool_latency_ms = round((time.perf_counter() - tool_start) * 1000, 2)
log_event(
"tool_call_completed",
{
"request_id": request_id,
"tool_name": tool_name,
"call_id": call_id,
"latency_ms": tool_latency_ms,
"result": result,
},
)
tool_outputs.append(
{
"type": "function_call_output",
"call_id": call_id,
"output": json.dumps(result),
}
)
if tool_outputs:
final_response = client.responses.create(
model=model,
previous_response_id=first_response.id,
input=tool_outputs,
)
answer = final_response.output_text
else:
answer = first_response.output_text
total_latency_ms = round((time.perf_counter() - start) * 1000, 2)
log_event(
"agent_request_completed",
{
"request_id": request_id,
"tool_call_count": tool_call_count,
"total_latency_ms": total_latency_ms,
"answer_preview": answer[:300],
},
)
return answer
if __name__ == "__main__":
question = "What's the weather like in Paris today?"
answer = run_weather_agent(question)
print("\n=== Agent Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:10:00.000000+00:00",
"event_type": "agent_request_started",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"model": "gpt-5.4-mini",
"user_question": "What's the weather like in Paris today?",
"tools": [
{
"type": "function",
"name": "get_weather",
"description": "Look up current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name to look up."
}
},
"required": [
"city"
],
"additionalProperties": false
}
}
]
}
{
"timestamp": "2026-03-22T10:10:00.900000+00:00",
"event_type": "tool_call_requested",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_name": "get_weather",
"call_id": "call_abc123",
"arguments": {
"city": "Paris"
}
}
{
"timestamp": "2026-03-22T10:10:00.905000+00:00",
"event_type": "tool_call_completed",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_name": "get_weather",
"call_id": "call_abc123",
"latency_ms": 0.08,
"result": {
"temperature_c": 18,
"condition": "Sunny"
}
}
{
"timestamp": "2026-03-22T10:10:01.600000+00:00",
"event_type": "agent_request_completed",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_call_count": 1,
"total_latency_ms": 1598.42,
"answer_preview": "The current weather in Paris is sunny with a temperature of 18°C."
}
=== Agent Answer ===
The current weather in Paris is sunny with a temperature of 18°C.
Exercise Tasks
- Run the example with cities inside and outside the fake weather database.
- Add validation for empty `city` arguments.
- Log a warning if the model answers without calling the weather tool.
- Add a second tool, such as `get_time_in_city`, and observe tool selection behavior.
- Add error handling so the tool returns structured errors rather than crashing.
Debugging Questions
- Did the model choose the correct tool?
- Were the arguments complete and valid?
- Did the final answer correctly incorporate the tool result?
- What would you alert on in production?
8. Building a Simple Evaluation Loop
Observability gives you raw signals. Evaluation turns those signals into improvement.
A practical evaluation loop
- Collect examples of user inputs
- Save prompts, retrieval results, tool calls, and outputs
- Review failures
- Label failure causes:
- prompt issue
- retrieval issue
- tool issue
- unclear user input
- model limitation
- Update the system
- Re-run the examples
- Compare results over time
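The loop above can be sketched as a small labeling pass over logged records. The example records and the `label_failure` heuristic are illustrative assumptions, not a real eval harness; the 0.1 score threshold is arbitrary.

```python
# Hypothetical records assembled from the logs of earlier exercises.
examples = [
    {"question": "Refund window?", "top_score": 0.91, "answer_grounded": True},
    {"question": "Office address?", "top_score": 0.02, "answer_grounded": False},
]

def label_failure(record: dict) -> str:
    """Assign a coarse failure cause from logged signals."""
    if record["answer_grounded"]:
        return "ok"
    if record["top_score"] < 0.1:
        # Nothing relevant was retrieved: a retrieval problem, not a model one.
        return "retrieval_issue"
    return "prompt_or_model_issue"

labels = [label_failure(r) for r in examples]
print(labels)
```

Even a crude labeler like this turns a pile of logs into counts per failure cause, which is what directs the next fix.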
Example failure taxonomy
| Failure Type | Symptom | Likely Fix |
|---|---|---|
| Prompt issue | Output format wrong | Improve instructions or examples |
| Retrieval issue | Answer misses known fact | Improve chunking/ranking/filtering |
| Tool issue | Wrong tool args | Tighten schema, validate inputs |
| Safety issue | Sensitive info in logs | Redact or minimize stored data |
| Latency issue | Slow responses | Cache results, reduce calls, optimize tools |
What to measure over time
- Answer correctness
- Groundedness in retrieved/tool evidence
- Tool success rate
- Retrieval relevance
- Request latency
- Error rate
- Cost per request
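Several of these metrics fall out of the structured events directly. The sketch below aggregates tool success rate and average latency from hypothetical log records shaped like the ones in the exercises; the event values are made up for the example.

```python
# Hypothetical structured log events; in practice these would be read
# from a log file or logging platform.
events = [
    {"event_type": "tool_call_completed", "success": True, "latency_ms": 120.0},
    {"event_type": "tool_call_completed", "success": False, "latency_ms": 80.0},
    {"event_type": "llm_request_completed", "latency_ms": 900.0},
]

tool_events = [e for e in events if e["event_type"] == "tool_call_completed"]
# Booleans sum as 0/1, giving the success count directly.
tool_success_rate = sum(e["success"] for e in tool_events) / len(tool_events)
avg_latency_ms = sum(e["latency_ms"] for e in events) / len(events)

print(f"tool_success_rate={tool_success_rate:.2f}, "
      f"avg_latency_ms={avg_latency_ms:.1f}")
```

Because every event carries the same field names, these aggregations need no parsing, which is the payoff of structured over free-form logging.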
9. Summary
In this session, learners explored observability across three major areas of GenAI systems:
- Prompts: inspect inputs, instructions, outputs, and latency
- Retrieval: inspect rankings, scores, context quality, and answer grounding
- Tools: inspect selection, arguments, execution results, and final usage
The key lesson is that good observability makes GenAI systems easier to debug, safer to operate, and faster to improve.
10. Suggested Instructor Flow
First 10 minutes
- Introduce observability and why it matters
- Compare debugging GenAI systems vs traditional software
Next 10 minutes
- Cover prompt, retrieval, and tool observability concepts
- Show examples of failure modes
Next 20 minutes
- Work through the three exercises
- Ask learners to inspect logs and identify likely causes of failure
Final 5 minutes
- Discuss evaluation loops and production-readiness
- Preview future sessions on testing, evaluation, or agent orchestration
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Python `logging` module: https://docs.python.org/3/library/logging.html
- JSON module docs: https://docs.python.org/3/library/json.html
- Time module docs: https://docs.python.org/3/library/time.html
Optional Homework
- Take one of your previous GenAI scripts and add structured observability logs.
- Create a small log schema with fields for:
- request ID
- user input
- prompt version
- retrieval results
- tool calls
- latency
- final output
- Run five test prompts and identify at least two failure patterns.
- Write a short note describing whether each issue was caused by prompting, retrieval, or tool execution.