Session 3: Observability for Prompts, Retrieval, and Tools
Synopsis
Explains how to inspect prompt inputs, retrieved context, tool calls, intermediate outputs, and final responses. Learners gain visibility into where failures occur within multi-component systems.
Session Content
Session Overview
In this session, learners will build a practical observability mindset for GenAI applications. The focus is on understanding how to inspect, debug, and improve systems that involve prompts, retrieval pipelines, and tool usage. By the end of the session, learners will be able to instrument a simple Python application, capture useful traces and logs, diagnose common failure modes, and evaluate outputs systematically.
Duration
~45 minutes
Learning Objectives
By the end of this session, learners should be able to:
- Explain why observability matters in LLM and agentic systems
- Identify the key signals to monitor for prompts, retrieval, and tool usage
- Add basic tracing and structured logging to a Python GenAI workflow
- Record prompt inputs, model outputs, retrieval context, and tool calls safely
- Diagnose common failures such as hallucinations, poor retrieval, and tool misuse
- Build a lightweight evaluation loop for iterative improvement
Agenda
- Why observability matters in GenAI systems
- Observability for prompts
- Observability for retrieval
- Observability for tools
- Hands-on Exercise 1: Add structured logging to an LLM workflow
- Hands-on Exercise 2: Observe retrieval quality in a mini RAG pipeline
- Hands-on Exercise 3: Track and debug tool calls
- Wrap-up and next steps
1. Why Observability Matters in GenAI Systems
Traditional software systems are often debugged with stack traces, logs, metrics, and tests. GenAI systems need all of those, but they also require visibility into probabilistic behavior.
Why GenAI systems are harder to debug
A GenAI application may fail because of:
- A vague or conflicting prompt
- Missing or irrelevant retrieval results
- Hallucinated answers
- A tool being called with incorrect arguments
- A tool not being called when it should be
- Model randomness or prompt sensitivity
- Hidden context-window issues
- Latency from external dependencies
What observability means in this context
Observability is the ability to understand what happened inside your system by looking at the signals it emits.
For GenAI applications, useful signals include:
- User input
- Final prompt sent to the model
- System/developer instructions
- Model response
- Token usage and latency
- Retrieved documents and relevance scores
- Tool selection decisions
- Tool inputs and outputs
- Error messages and retries
- Human or automated evaluation scores
The core principle
If you cannot inspect it, you cannot improve it.
2. Observability for Prompts
Prompt observability answers questions like:
- What exact instructions did the model receive?
- What context was inserted dynamically?
- Which model and parameters were used?
- How long did the request take?
- What output was produced?
Important prompt-level signals
Inputs
- User message
- System or developer prompt
- Retrieved context
- Tool results inserted into context
Request metadata
- Model name
- Temperature or generation controls where applicable
- Request timestamp
- Request ID
- Session ID or conversation ID
Outputs
- Final answer
- Refusal or uncertainty markers
- Output length
- Structured data validity
Prompt debugging checklist
When a prompt gives poor results, check:
- Was the user intent correctly captured?
- Were instructions too vague or too long?
- Did retrieval inject distracting information?
- Did the model output match the requested format?
- Was the answer grounded in provided evidence?
Best practices
- Log prompts in a structured format
- Separate template from dynamic values
- Version prompts
- Capture outputs alongside prompt versions
- Redact secrets and sensitive user data before storage
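The practices above can be sketched in a few lines. This is a minimal, illustrative example of separating a prompt template from its dynamic values and versioning it; the `support_answer_v2` name and field layout are assumptions for the sketch, not a prescribed schema.

```python
import json
import string

# Hypothetical version tag and template; names are illustrative.
PROMPT_VERSION = "support_answer_v2"
PROMPT_TEMPLATE = string.Template(
    "You are a support assistant.\n"
    "Context:\n$context\n\n"
    "Question: $question"
)

def render_and_log(context: str, question: str) -> dict:
    """Render the template and build a structured log record that keeps
    the prompt version separate from the dynamic values."""
    prompt = PROMPT_TEMPLATE.substitute(context=context, question=question)
    record = {
        "prompt_version": PROMPT_VERSION,
        "template_vars": {"context": context[:200], "question": question},
        "rendered_prompt_preview": prompt[:300],
    }
    print(json.dumps(record, indent=2))
    return record

record = render_and_log(
    "Refunds are available within 30 days.",
    "Can I get a refund?",
)
```

Because the version travels with every log entry, a regression after a template change can be traced back to the exact prompt revision that produced it.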
3. Observability for Retrieval
Retrieval is often the hidden source of poor answers in RAG systems.
Common retrieval failure modes
- No relevant documents retrieved
- Relevant documents ranked too low
- Chunk too small or too large
- Duplicate chunks dominate results
- Retrieved content is stale
- The answer exists in the corpus but is phrased differently from the query
- Retrieved text contains contradictions
What to log for retrieval
For each retrieval event, capture:
- Query text
- Query embedding version if embeddings are used
- Retrieved document IDs
- Chunk text preview
- Relevance scores
- Rank positions
- Data source
- Retrieval latency
Questions observability helps answer
- Did the system retrieve anything useful?
- Did the top-ranked chunk actually contain the answer?
- Are scores tightly clustered, suggesting weak ranking?
- Are users asking questions not covered by the corpus?
- Is a bad answer actually a retrieval problem?
Lightweight retrieval diagnostics
A helpful inspection table might include:
| Rank | Doc ID | Score | Preview | Contains Answer? |
|---|---|---|---|---|
| 1 | faq_12 | 0.91 | “Refunds are available within…” | Yes |
| 2 | blog_4 | 0.88 | “Customer satisfaction is…” | No |
| 3 | policy_7 | 0.83 | “Returns are not accepted…” | Partial |
This quickly distinguishes ranking problems from answer-generation problems.
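A table like this can be generated directly from logged retrieval results. The sketch below assumes each result dict carries `id`, `score`, `text`, and a manually assigned `contains_answer` label; all of those field names, IDs, and scores are illustrative.

```python
def format_retrieval_table(results: list[dict]) -> str:
    """Render retrieval results as a markdown-style inspection table.
    The 'Contains Answer?' labels come from manual review or an eval step."""
    lines = [
        "| Rank | Doc ID | Score | Preview | Contains Answer? |",
        "|---|---|---|---|---|",
    ]
    for rank, r in enumerate(results, start=1):
        preview = r["text"][:30]
        lines.append(
            f"| {rank} | {r['id']} | {r['score']:.2f} "
            f"| {preview} | {r.get('contains_answer', '?')} |"
        )
    return "\n".join(lines)

# Illustrative results; IDs and scores are made up for the example.
results = [
    {"id": "faq_12", "score": 0.91,
     "text": "Refunds are available within 30 days.", "contains_answer": "Yes"},
    {"id": "blog_4", "score": 0.88,
     "text": "Customer satisfaction is our priority.", "contains_answer": "No"},
]
table = format_retrieval_table(results)
print(table)
```

Dumping such a table for a handful of failing queries is often enough to tell whether ranking or generation is at fault.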
4. Observability for Tools
In agentic systems, tool usage adds another debugging layer.
Typical tool-related failures
- Wrong tool chosen
- Correct tool chosen with wrong arguments
- Tool output ignored by the model
- Tool called repeatedly in a loop
- Tool error not handled gracefully
- Slow tools causing timeouts
- Tool results contradict prompt assumptions
What to log for tool execution
For every tool call, record:
- Tool name
- Why the tool was selected, if available
- Tool arguments
- Start time and end time
- Latency
- Success or failure
- Raw tool output
- Post-processed output sent back to model
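One lightweight way to capture these fields is a single record type per tool call. This is a sketch, not a required schema; the `ToolCallRecord` name and its fields are assumptions chosen to mirror the checklist above.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ToolCallRecord:
    """One structured record per tool call; field names are illustrative."""
    tool_name: str
    arguments: dict
    started_at: float
    ended_at: float = 0.0
    success: bool = False
    raw_output: str = ""

    @property
    def latency_ms(self) -> float:
        # Derived rather than stored, so it can never disagree with the timestamps.
        return round((self.ended_at - self.started_at) * 1000, 2)

record = ToolCallRecord(
    tool_name="get_weather",
    arguments={"city": "Paris"},
    started_at=time.perf_counter(),
)
# ... execute the tool here ...
record.ended_at = record.started_at + 0.005  # simulated 5 ms call
record.success = True
record.raw_output = '{"temperature_c": 18}'

print(json.dumps({**asdict(record), "latency_ms": record.latency_ms}, indent=2))
```

Keeping one record per call also makes loop detection easy: repeated records with identical `tool_name` and `arguments` in one request are a red flag.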
Tool observability checklist
- Did the tool run successfully?
- Were arguments validated?
- Was the output understandable to the model?
- Did the model ground its answer in the tool output?
- Were there repeated or unnecessary calls?
Safety considerations
Be careful not to log:
- API secrets
- Authentication tokens
- Personally identifiable information
- Full raw payloads if they contain sensitive content
Prefer:
- Redacted logs
- Hashed IDs
- Partial previews
- Structured fields over raw dumps
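Redaction can be applied before any log entry is written. The sketch below covers only two patterns, emails and bearer-style tokens; real systems need broader coverage, and the placeholder strings are illustrative choices.

```python
import re

# Minimal patterns for demonstration only; production redaction
# needs far more cases (phone numbers, keys, addresses, ...).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
TOKEN_RE = re.compile(r"(?i)bearer\s+[a-z0-9._-]+")

def redact(text: str) -> str:
    """Replace emails and bearer tokens with placeholder markers."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = TOKEN_RE.sub("[REDACTED_TOKEN]", text)
    return text

out = redact("Contact jane.doe@example.com with Bearer abc123")
print(out)
```

Running redaction in the logging helper, rather than at each call site, ensures no code path can forget it.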
5. Hands-on Exercise 1: Add Structured Logging to an LLM Workflow
Goal
Create a small Python app that:
- sends a prompt using the OpenAI Responses API
- records structured logs for request and response
- tracks latency
- prints a clean summary
What learners will practice
- Using the OpenAI Python SDK
- Calling `client.responses.create(...)`
- Creating structured logs
- Measuring latency
- Capturing prompt/response data safely
Code
import json
import time
import uuid
from datetime import datetime, timezone
from openai import OpenAI
# Create the client.
# Make sure the OPENAI_API_KEY environment variable is set.
client = OpenAI()
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""
Print a structured JSON log entry.
In production, this would usually be sent to a logging platform
or stored in a file/database.
"""
entry = {
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
}
print(json.dumps(entry, indent=2))
def ask_llm(user_question: str) -> str:
"""
Send a request to the OpenAI Responses API and log observability data.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
# A simple developer instruction.
instructions = (
"You are a concise assistant. "
"Answer clearly in 2-4 bullet points."
)
log_event(
"llm_request_started",
{
"request_id": request_id,
"model": model,
"instructions_preview": instructions[:120],
"user_question": user_question,
},
)
start = time.perf_counter()
response = client.responses.create(
model=model,
instructions=instructions,
input=user_question,
)
elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
answer_text = response.output_text
log_event(
"llm_request_completed",
{
"request_id": request_id,
"model": model,
"latency_ms": elapsed_ms,
"answer_preview": answer_text[:300],
},
)
return answer_text
if __name__ == "__main__":
question = "What are three benefits of structured logging in AI applications?"
answer = ask_llm(question)
print("\n=== Final Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:00:00.000000+00:00",
"event_type": "llm_request_started",
"request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
"model": "gpt-5.4-mini",
"instructions_preview": "You are a concise assistant. Answer clearly in 2-4 bullet points.",
"user_question": "What are three benefits of structured logging in AI applications?"
}
{
"timestamp": "2026-03-22T10:00:01.200000+00:00",
"event_type": "llm_request_completed",
"request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
"model": "gpt-5.4-mini",
"latency_ms": 1187.32,
"answer_preview": "- Makes debugging easier by preserving request and response context...\n- Enables filtering and aggregation across events...\n- Supports evaluation and performance monitoring over time..."
}
=== Final Answer ===
- Makes debugging easier by preserving request and response context.
- Enables filtering and aggregation across events, prompts, and users.
- Supports evaluation, alerting, and trend analysis over time.
Exercise Tasks
- Run the script and inspect the logs.
- Add a `session_id` field to each log entry.
- Add a `prompt_version` field.
- Redact emails if they appear in `user_question`.
- Save logs to a local file instead of printing them.
Discussion
Ask learners:
- What would be useful to search for in these logs?
- Which fields are important for debugging failures?
- Which fields should not be stored in plaintext?
6. Hands-on Exercise 2: Observe Retrieval Quality in a Mini RAG Pipeline
Goal
Build a tiny retrieval pipeline using in-memory documents and log:
- the user query
- the retrieved documents
- simple relevance scores
- the final prompt context
- the model answer
This exercise emphasizes that many “LLM problems” are actually retrieval problems.
What learners will practice
- Simulating retrieval with Python
- Logging retrieval rankings
- Passing retrieved context to the model
- Inspecting whether the answer was grounded in useful context
Code
import json
import math
import re
import time
import uuid
from collections import Counter
from datetime import datetime, timezone
from openai import OpenAI
client = OpenAI()
DOCUMENTS = [
{
"id": "doc_1",
"title": "Refund Policy",
"text": "Customers may request a full refund within 30 days of purchase with proof of payment.",
},
{
"id": "doc_2",
"title": "Shipping Policy",
"text": "Standard shipping takes 5 to 7 business days. Expedited shipping takes 2 business days.",
},
{
"id": "doc_3",
"title": "Account Security",
"text": "Users should enable two-factor authentication and use a strong unique password.",
},
{
"id": "doc_4",
"title": "Subscription Terms",
"text": "Monthly subscriptions renew automatically unless canceled before the next billing date.",
},
]
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""Print a structured event log."""
print(
json.dumps(
{
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
},
indent=2,
)
)
def tokenize(text: str) -> list[str]:
"""
Convert text into lowercase word tokens.
This is a simple tokenizer for demonstration purposes.
"""
return re.findall(r"\b\w+\b", text.lower())
def cosine_similarity(text_a: str, text_b: str) -> float:
"""
Compute cosine similarity between two texts using bag-of-words counts.
This is intentionally simple so learners can inspect retrieval behavior.
"""
tokens_a = tokenize(text_a)
tokens_b = tokenize(text_b)
counts_a = Counter(tokens_a)
counts_b = Counter(tokens_b)
all_terms = set(counts_a) | set(counts_b)
dot = sum(counts_a[t] * counts_b[t] for t in all_terms)
norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
if norm_a == 0 or norm_b == 0:
return 0.0
return dot / (norm_a * norm_b)
def retrieve(query: str, top_k: int = 2) -> list[dict]:
"""
Rank documents by simple cosine similarity against the user query.
"""
scored = []
for doc in DOCUMENTS:
score = cosine_similarity(query, doc["text"] + " " + doc["title"])
scored.append(
{
"id": doc["id"],
"title": doc["title"],
"text": doc["text"],
"score": round(score, 4),
}
)
ranked = sorted(scored, key=lambda d: d["score"], reverse=True)
return ranked[:top_k]
def answer_question_with_rag(query: str) -> str:
"""
Retrieve context, log retrieval details, then ask the model to answer
using only the retrieved information.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
log_event(
"rag_request_started",
{
"request_id": request_id,
"query": query,
"model": model,
},
)
retrieval_start = time.perf_counter()
top_docs = retrieve(query, top_k=2)
retrieval_latency_ms = round((time.perf_counter() - retrieval_start) * 1000, 2)
log_event(
"retrieval_completed",
{
"request_id": request_id,
"query": query,
"retrieval_latency_ms": retrieval_latency_ms,
"results": [
{
"rank": i + 1,
"doc_id": doc["id"],
"title": doc["title"],
"score": doc["score"],
"preview": doc["text"][:100],
}
for i, doc in enumerate(top_docs)
],
},
)
context = "\n\n".join(
[
f"[{doc['id']}] {doc['title']}\n{doc['text']}"
for doc in top_docs
]
)
instructions = (
"Answer the user's question using only the provided context. "
"If the answer is not in the context, say: 'I could not find that in the provided documents.'"
)
model_input = (
f"Context:\n{context}\n\n"
f"Question: {query}"
)
log_event(
"prompt_prepared",
{
"request_id": request_id,
"instructions_preview": instructions[:150],
"context_preview": context[:300],
},
)
llm_start = time.perf_counter()
response = client.responses.create(
model=model,
instructions=instructions,
input=model_input,
)
llm_latency_ms = round((time.perf_counter() - llm_start) * 1000, 2)
answer = response.output_text
log_event(
"rag_request_completed",
{
"request_id": request_id,
"llm_latency_ms": llm_latency_ms,
"answer_preview": answer[:300],
},
)
return answer
if __name__ == "__main__":
question = "Can I get my money back after buying a product?"
answer = answer_question_with_rag(question)
print("\n=== RAG Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:05:00.000000+00:00",
"event_type": "rag_request_started",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"query": "Can I get my money back after buying a product?",
"model": "gpt-5.4-mini"
}
{
"timestamp": "2026-03-22T10:05:00.010000+00:00",
"event_type": "retrieval_completed",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"query": "Can I get my money back after buying a product?",
"retrieval_latency_ms": 2.13,
"results": [
{
"rank": 1,
"doc_id": "doc_1",
"title": "Refund Policy",
"score": 0.1633,
"preview": "Customers may request a full refund within 30 days of purchase with proof of payment."
},
{
"rank": 2,
"doc_id": "doc_4",
"title": "Subscription Terms",
"score": 0.0,
"preview": "Monthly subscriptions renew automatically unless canceled before the next billing date."
}
]
}
{
"timestamp": "2026-03-22T10:05:01.220000+00:00",
"event_type": "rag_request_completed",
"request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
"llm_latency_ms": 1198.76,
"answer_preview": "Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment."
}
=== RAG Answer ===
Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment.
Exercise Tasks
- Run the script with three different user questions.
- For each question, inspect whether the top result truly contains the answer.
- Add a boolean field called `likely_relevant` based on a score threshold.
- Add a warning log if all retrieved scores are near zero.
- Modify the corpus to include a conflicting refund policy and observe what happens.
Reflection Questions
- Did the model fail, or did retrieval fail?
- How would you detect stale or contradictory documents?
- What metadata would help explain ranking decisions?
7. Hands-on Exercise 3: Track and Debug Tool Calls
Goal
Create a small workflow where the model can use a tool to look up weather data, and log:
- the tool schema
- the tool call arguments
- the tool result
- the final answer
This exercise helps learners understand how to observe and debug tool-enabled systems.
What learners will practice
- Defining a tool for the Responses API
- Executing the tool in Python
- Feeding tool results back to the model
- Logging each tool interaction
Code
import json
import time
import uuid
from datetime import datetime, timezone
from openai import OpenAI
client = OpenAI()
def utc_now_iso() -> str:
"""Return the current UTC timestamp in ISO 8601 format."""
return datetime.now(timezone.utc).isoformat()
def log_event(event_type: str, payload: dict) -> None:
"""Print a structured event log."""
print(
json.dumps(
{
"timestamp": utc_now_iso(),
"event_type": event_type,
**payload,
},
indent=2,
)
)
def get_weather(city: str) -> dict:
"""
Mock weather lookup tool.
In a real application, this function would call an external API.
"""
fake_weather_db = {
"london": {"temperature_c": 14, "condition": "Cloudy"},
"paris": {"temperature_c": 18, "condition": "Sunny"},
"tokyo": {"temperature_c": 22, "condition": "Rain showers"},
}
normalized = city.strip().lower()
return fake_weather_db.get(
normalized,
{"temperature_c": None, "condition": "Unknown city"},
)
def run_weather_agent(user_question: str) -> str:
"""
Ask the model to answer a weather question using a Python tool.
"""
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
tools = [
{
"type": "function",
"name": "get_weather",
"description": "Look up current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name to look up.",
}
},
"required": ["city"],
"additionalProperties": False,
},
}
]
log_event(
"agent_request_started",
{
"request_id": request_id,
"model": model,
"user_question": user_question,
"tools": tools,
},
)
start = time.perf_counter()
first_response = client.responses.create(
model=model,
instructions=(
"You are a helpful assistant. "
"Use the available tool when the user asks for weather information."
),
input=user_question,
tools=tools,
)
# Inspect tool calls emitted by the model.
tool_outputs = []
tool_call_count = 0
for item in first_response.output:
if item.type == "function_call":
tool_call_count += 1
tool_name = item.name
call_id = item.call_id
arguments = json.loads(item.arguments)
log_event(
"tool_call_requested",
{
"request_id": request_id,
"tool_name": tool_name,
"call_id": call_id,
"arguments": arguments,
},
)
tool_start = time.perf_counter()
if tool_name == "get_weather":
result = get_weather(arguments["city"])
else:
result = {"error": f"Unknown tool: {tool_name}"}
tool_latency_ms = round((time.perf_counter() - tool_start) * 1000, 2)
log_event(
"tool_call_completed",
{
"request_id": request_id,
"tool_name": tool_name,
"call_id": call_id,
"latency_ms": tool_latency_ms,
"result": result,
},
)
tool_outputs.append(
{
"type": "function_call_output",
"call_id": call_id,
"output": json.dumps(result),
}
)
if tool_outputs:
final_response = client.responses.create(
model=model,
previous_response_id=first_response.id,
input=tool_outputs,
)
answer = final_response.output_text
else:
answer = first_response.output_text
total_latency_ms = round((time.perf_counter() - start) * 1000, 2)
log_event(
"agent_request_completed",
{
"request_id": request_id,
"tool_call_count": tool_call_count,
"total_latency_ms": total_latency_ms,
"answer_preview": answer[:300],
},
)
return answer
if __name__ == "__main__":
question = "What's the weather like in Paris today?"
answer = run_weather_agent(question)
print("\n=== Agent Answer ===")
print(answer)
Example Output
{
"timestamp": "2026-03-22T10:10:00.000000+00:00",
"event_type": "agent_request_started",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"model": "gpt-5.4-mini",
"user_question": "What's the weather like in Paris today?",
"tools": [
{
"type": "function",
"name": "get_weather",
"description": "Look up current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name to look up."
}
},
"required": [
"city"
],
"additionalProperties": false
}
}
]
}
{
"timestamp": "2026-03-22T10:10:00.900000+00:00",
"event_type": "tool_call_requested",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_name": "get_weather",
"call_id": "call_abc123",
"arguments": {
"city": "Paris"
}
}
{
"timestamp": "2026-03-22T10:10:00.905000+00:00",
"event_type": "tool_call_completed",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_name": "get_weather",
"call_id": "call_abc123",
"latency_ms": 0.08,
"result": {
"temperature_c": 18,
"condition": "Sunny"
}
}
{
"timestamp": "2026-03-22T10:10:01.600000+00:00",
"event_type": "agent_request_completed",
"request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
"tool_call_count": 1,
"total_latency_ms": 1598.42,
"answer_preview": "The current weather in Paris is sunny with a temperature of 18°C."
}
=== Agent Answer ===
The current weather in Paris is sunny with a temperature of 18°C.
Exercise Tasks
- Run the example with cities inside and outside the fake weather database.
- Add validation for empty `city` arguments.
- Log a warning if the model answers without calling the weather tool.
- Add a second tool, such as `get_time_in_city`, and observe tool selection behavior.
- Add error handling so the tool returns structured errors rather than crashing.
Debugging Questions
- Did the model choose the correct tool?
- Were the arguments complete and valid?
- Did the final answer correctly incorporate the tool result?
- What would you alert on in production?
8. Building a Simple Evaluation Loop
Observability gives you raw signals. Evaluation turns those signals into improvement.
A practical evaluation loop
- Collect examples of user inputs
- Save prompts, retrieval results, tool calls, and outputs
- Review failures
- Label failure causes:
- prompt issue
- retrieval issue
- tool issue
- unclear user input
- model limitation
- Update the system
- Re-run the examples
- Compare results over time
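The loop above can be sketched as a small labeling pass over logged records. The example records and the `label_failure` heuristic are illustrative assumptions, not a real eval harness; the 0.1 score threshold is arbitrary.

```python
# Hypothetical records assembled from the logs of earlier exercises.
examples = [
    {"question": "Refund window?", "top_score": 0.91, "answer_grounded": True},
    {"question": "Office address?", "top_score": 0.02, "answer_grounded": False},
]

def label_failure(record: dict) -> str:
    """Assign a coarse failure cause from logged signals."""
    if record["answer_grounded"]:
        return "ok"
    if record["top_score"] < 0.1:
        # Nothing relevant was retrieved: a retrieval problem, not a model one.
        return "retrieval_issue"
    return "prompt_or_model_issue"

labels = [label_failure(r) for r in examples]
print(labels)
```

Even a crude labeler like this turns a pile of logs into counts per failure cause, which is what directs the next fix.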
Example failure taxonomy
| Failure Type | Symptom | Likely Fix |
|---|---|---|
| Prompt issue | Output format wrong | Improve instructions or examples |
| Retrieval issue | Answer misses known fact | Improve chunking/ranking/filtering |
| Tool issue | Wrong tool args | Tighten schema, validate inputs |
| Safety issue | Sensitive info in logs | Redact or minimize stored data |
| Latency issue | Slow responses | Cache results, reduce calls, optimize tools |
What to measure over time
- Answer correctness
- Groundedness in retrieved/tool evidence
- Tool success rate
- Retrieval relevance
- Request latency
- Error rate
- Cost per request
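Several of these metrics fall out of the structured events directly. The sketch below aggregates tool success rate and average latency from hypothetical log records shaped like the ones in the exercises; the event values are made up for the example.

```python
# Hypothetical structured log events; in practice these would be read
# from a log file or logging platform.
events = [
    {"event_type": "tool_call_completed", "success": True, "latency_ms": 120.0},
    {"event_type": "tool_call_completed", "success": False, "latency_ms": 80.0},
    {"event_type": "llm_request_completed", "latency_ms": 900.0},
]

tool_events = [e for e in events if e["event_type"] == "tool_call_completed"]
# Booleans sum as 0/1, giving the success count directly.
tool_success_rate = sum(e["success"] for e in tool_events) / len(tool_events)
avg_latency_ms = sum(e["latency_ms"] for e in events) / len(events)

print(f"tool_success_rate={tool_success_rate:.2f}, "
      f"avg_latency_ms={avg_latency_ms:.1f}")
```

Because every event carries the same field names, these aggregations need no parsing, which is the payoff of structured over free-form logging.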
9. Summary
In this session, learners explored observability across three major areas of GenAI systems:
- Prompts: inspect inputs, instructions, outputs, and latency
- Retrieval: inspect rankings, scores, context quality, and answer grounding
- Tools: inspect selection, arguments, execution results, and final usage
The key lesson is that good observability makes GenAI systems easier to debug, safer to operate, and faster to improve.
10. Suggested Instructor Flow
First 10 minutes
- Introduce observability and why it matters
- Compare debugging GenAI systems vs traditional software
Next 10 minutes
- Cover prompt, retrieval, and tool observability concepts
- Show examples of failure modes
Next 20 minutes
- Work through the three exercises
- Ask learners to inspect logs and identify likely causes of failure
Final 5 minutes
- Discuss evaluation loops and production-readiness
- Preview future sessions on testing, evaluation, or agent orchestration
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Python `logging` module: https://docs.python.org/3/library/logging.html
- JSON module docs: https://docs.python.org/3/library/json.html
- Time module docs: https://docs.python.org/3/library/time.html
Optional Homework
- Take one of your previous GenAI scripts and add structured observability logs.
- Create a small log schema with fields for:
- request ID
- user input
- prompt version
- retrieval results
- tool calls
- latency
- final output
- Run five test prompts and identify at least two failure patterns.
- Write a short note describing whether each issue was caused by prompting, retrieval, or tool execution.