
Session 3: Observability for Prompts, Retrieval, and Tools

Synopsis

Explains how to inspect prompt inputs, retrieved context, tool calls, intermediate outputs, and final responses. Learners gain visibility into where failures occur within multi-component systems.

Session Content

Session Overview

In this session, learners will build a practical observability mindset for GenAI applications. The focus is on understanding how to inspect, debug, and improve systems that involve prompts, retrieval pipelines, and tool usage. By the end of the session, learners will be able to instrument a simple Python application, capture useful traces and logs, diagnose common failure modes, and evaluate outputs systematically.

Duration

~45 minutes

Learning Objectives

By the end of this session, learners should be able to:

  • Explain why observability matters in LLM and agentic systems
  • Identify the key signals to monitor for prompts, retrieval, and tool usage
  • Add basic tracing and structured logging to a Python GenAI workflow
  • Record prompt inputs, model outputs, retrieval context, and tool calls safely
  • Diagnose common failures such as hallucinations, poor retrieval, and tool misuse
  • Build a lightweight evaluation loop for iterative improvement

Agenda

  1. Why observability matters in GenAI systems
  2. Observability for prompts
  3. Observability for retrieval
  4. Observability for tools
  5. Hands-on Exercise 1: Add structured logging to an LLM workflow
  6. Hands-on Exercise 2: Observe retrieval quality in a mini RAG pipeline
  7. Hands-on Exercise 3: Track and debug tool calls
  8. Building a simple evaluation loop
  9. Wrap-up and next steps

1. Why Observability Matters in GenAI Systems

Traditional software systems are often debugged with stack traces, logs, metrics, and tests. GenAI systems need all of those, but they also require visibility into probabilistic behavior.

Why GenAI systems are harder to debug

A GenAI application may fail because of:

  • A vague or conflicting prompt
  • Missing or irrelevant retrieval results
  • Hallucinated answers
  • A tool being called with incorrect arguments
  • A tool not being called when it should be
  • Model randomness or prompt sensitivity
  • Hidden context-window issues
  • Latency from external dependencies

What observability means in this context

Observability is the ability to understand what happened inside your system by looking at the signals it emits.

For GenAI applications, useful signals include:

  • User input
  • Final prompt sent to the model
  • System/developer instructions
  • Model response
  • Token usage and latency
  • Retrieved documents and relevance scores
  • Tool selection decisions
  • Tool inputs and outputs
  • Error messages and retries
  • Human or automated evaluation scores
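These signals are easiest to work with when every event shares one schema. A minimal sketch of such a trace event follows; the field names are illustrative, not a standard:

```python
import json
import uuid
from datetime import datetime, timezone


def make_trace_event(event_type: str, **fields) -> dict:
    """Build one structured trace event covering the signals above.

    The field names are illustrative, not a standard schema.
    """
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        **fields,
    }


event = make_trace_event(
    "llm_call",
    user_input="How do refunds work?",
    model_response="Refunds are available within 30 days.",
    token_usage={"input_tokens": 212, "output_tokens": 48},
    latency_ms=1042.7,
    retrieved_doc_ids=["faq_12", "policy_7"],
    tool_calls=[],
)
print(json.dumps(event, indent=2))
```

Keeping a shared `trace_id` and `timestamp` on every event makes it possible to reconstruct a full request later.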

The core principle

If you cannot inspect it, you cannot improve it.


2. Observability for Prompts

Prompt observability answers questions like:

  • What exact instructions did the model receive?
  • What context was inserted dynamically?
  • Which model and parameters were used?
  • How long did the request take?
  • What output was produced?

Important prompt-level signals

Inputs

  • User message
  • System or developer prompt
  • Retrieved context
  • Tool results inserted into context

Request metadata

  • Model name
  • Temperature or generation controls where applicable
  • Request timestamp
  • Request ID
  • Session ID or conversation ID

Outputs

  • Final answer
  • Refusal or uncertainty markers
  • Output length
  • Structured data validity

Prompt debugging checklist

When a prompt gives poor results, check:

  1. Was the user intent correctly captured?
  2. Were instructions too vague or too long?
  3. Did retrieval inject distracting information?
  4. Did the model output match the requested format?
  5. Was the answer grounded in provided evidence?

Best practices

  • Log prompts in a structured format
  • Separate template from dynamic values
  • Version prompts
  • Capture outputs alongside prompt versions
  • Redact secrets and sensitive user data before storage
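Separating the template from dynamic values and versioning prompts can be sketched as follows; the version label and template text are illustrative:

```python
PROMPT_VERSION = "concise-answer/v3"  # illustrative version label

PROMPT_TEMPLATE = (
    "You are a concise assistant.\n"
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def render_prompt(context: str, question: str) -> dict:
    """Render the template while keeping template, values, and version
    separate, so logs can distinguish a template change from a data change."""
    return {
        "prompt_version": PROMPT_VERSION,
        "template": PROMPT_TEMPLATE,
        "values": {"context": context, "question": question},
        "rendered": PROMPT_TEMPLATE.format(context=context, question=question),
    }


record = render_prompt("Refunds are allowed within 30 days.", "Can I get a refund?")
print(record["prompt_version"])
print(record["rendered"])
```

Storing the version alongside each output makes it possible to tell whether a regression came from a template edit or from different dynamic data.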

3. Observability for Retrieval

Retrieval is often the hidden source of poor answers in RAG systems.

Common retrieval failure modes

  • No relevant documents retrieved
  • Relevant documents ranked too low
  • Chunk too small or too large
  • Duplicate chunks dominate results
  • Retrieved content is stale
  • Answerable information exists but is phrased differently
  • Retrieved text contains contradictions

What to log for retrieval

For each retrieval event, capture:

  • Query text
  • Query embedding version if embeddings are used
  • Retrieved document IDs
  • Chunk text preview
  • Relevance scores
  • Rank positions
  • Data source
  • Retrieval latency

Questions observability helps answer

  • Did the system retrieve anything useful?
  • Did the top-ranked chunk actually contain the answer?
  • Are scores tightly clustered, suggesting weak ranking?
  • Are users asking questions not covered by the corpus?
  • Is a bad answer actually a retrieval problem?

Lightweight retrieval diagnostics

A helpful inspection table might include:

Rank | Doc ID   | Score | Preview                           | Contains Answer?
-----|----------|-------|-----------------------------------|-----------------
1    | faq_12   | 0.91  | “Refunds are available within…”   | Yes
2    | blog_4   | 0.88  | “Customer satisfaction is…”       | No
3    | policy_7 | 0.83  | “Returns are not accepted…”       | Partial

This quickly distinguishes ranking problems from answer-generation problems.


4. Observability for Tools

In agentic systems, tool usage adds another debugging layer.

Common tool failure modes

  • Wrong tool chosen
  • Correct tool chosen with wrong arguments
  • Tool output ignored by the model
  • Tool called repeatedly in a loop
  • Tool error not handled gracefully
  • Slow tools causing timeouts
  • Tool results contradict prompt assumptions

What to log for tool execution

For every tool call, record:

  • Tool name
  • Why the tool was selected, if available
  • Tool arguments
  • Start time and end time
  • Latency
  • Success or failure
  • Raw tool output
  • Post-processed output sent back to model

Tool observability checklist

  • Did the tool run successfully?
  • Were arguments validated?
  • Was the output understandable to the model?
  • Did the model ground its answer in the tool output?
  • Were there repeated or unnecessary calls?

Safety considerations

Be careful not to log:

  • API secrets
  • Authentication tokens
  • Personally identifiable information
  • Full raw payloads if they contain sensitive content

Prefer:

  • Redacted logs
  • Hashed IDs
  • Partial previews
  • Structured fields over raw dumps
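A few of these practices can be sketched in plain Python. The regex and hashing choices below are illustrative, not a complete PII solution:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder before logging."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def hash_id(raw_id: str) -> str:
    """Log a stable hash instead of the raw identifier."""
    return hashlib.sha256(raw_id.encode()).hexdigest()[:16]


def preview(text: str, limit: int = 80) -> str:
    """Keep only a partial preview of long payloads."""
    return text[:limit] + ("..." if len(text) > limit else "")


print(redact_emails("Contact alice@example.com about my refund."))
print(hash_id("user-42"))
print(preview("A very long raw tool payload " * 10))
```

Hashing keeps events joinable across a session (the same user always hashes to the same value) without storing the raw identifier.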


5. Hands-on Exercise 1: Add Structured Logging to an LLM Workflow

Goal

Create a small Python app that:

  • sends a prompt using the OpenAI Responses API
  • records structured logs for request and response
  • tracks latency
  • prints a clean summary

What learners will practice

  • Using the OpenAI Python SDK
  • Calling client.responses.create(...)
  • Creating structured logs
  • Measuring latency
  • Capturing prompt/response data safely

Code

import json
import time
import uuid
from datetime import datetime, timezone

from openai import OpenAI

# Create the client.
# Make sure the OPENAI_API_KEY environment variable is set.
client = OpenAI()


def utc_now_iso() -> str:
    """Return the current UTC timestamp in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()


def log_event(event_type: str, payload: dict) -> None:
    """
    Print a structured JSON log entry.

    In production, this would usually be sent to a logging platform
    or stored in a file/database.
    """
    entry = {
        "timestamp": utc_now_iso(),
        "event_type": event_type,
        **payload,
    }
    print(json.dumps(entry, indent=2))


def ask_llm(user_question: str) -> str:
    """
    Send a request to the OpenAI Responses API and log observability data.
    """
    request_id = str(uuid.uuid4())
    model = "gpt-5.4-mini"

    # A simple developer instruction.
    instructions = (
        "You are a concise assistant. "
        "Answer clearly in 2-4 bullet points."
    )

    log_event(
        "llm_request_started",
        {
            "request_id": request_id,
            "model": model,
            "instructions_preview": instructions[:120],
            "user_question": user_question,
        },
    )

    start = time.perf_counter()

    response = client.responses.create(
        model=model,
        instructions=instructions,
        input=user_question,
    )

    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)

    answer_text = response.output_text

    log_event(
        "llm_request_completed",
        {
            "request_id": request_id,
            "model": model,
            "latency_ms": elapsed_ms,
            "answer_preview": answer_text[:300],
        },
    )

    return answer_text


if __name__ == "__main__":
    question = "What are three benefits of structured logging in AI applications?"
    answer = ask_llm(question)

    print("\n=== Final Answer ===")
    print(answer)

Example Output

{
  "timestamp": "2026-03-22T10:00:00.000000+00:00",
  "event_type": "llm_request_started",
  "request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
  "model": "gpt-5.4-mini",
  "instructions_preview": "You are a concise assistant. Answer clearly in 2-4 bullet points.",
  "user_question": "What are three benefits of structured logging in AI applications?"
}
{
  "timestamp": "2026-03-22T10:00:01.200000+00:00",
  "event_type": "llm_request_completed",
  "request_id": "5f6f6f1a-6f10-4f54-a5d8-dfd5158b77b1",
  "model": "gpt-5.4-mini",
  "latency_ms": 1187.32,
  "answer_preview": "- Makes debugging easier by preserving request and response context...\n- Enables filtering and aggregation across events...\n- Supports evaluation and performance monitoring over time..."
}

=== Final Answer ===
- Makes debugging easier by preserving request and response context.
- Enables filtering and aggregation across events, prompts, and users.
- Supports evaluation, alerting, and trend analysis over time.

Exercise Tasks

  1. Run the script and inspect the logs.
  2. Add a session_id field to each log entry.
  3. Add a prompt_version field.
  4. Redact emails if they appear in user_question.
  5. Save logs to a local file instead of printing them.
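For task 5, one possible approach is a JSON Lines file, where each event is one line of JSON. The filename and helper below are illustrative:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("llm_events.jsonl")  # illustrative filename


def log_event_to_file(event_type: str, payload: dict) -> None:
    """Append one structured event per line (JSON Lines format)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        **payload,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


log_event_to_file("llm_request_started", {"request_id": "demo-1"})
log_event_to_file("llm_request_completed", {"request_id": "demo-1", "latency_ms": 950.2})

# Read the log back, as a debugging session would.
events = [json.loads(line) for line in LOG_PATH.read_text(encoding="utf-8").splitlines()]
print(f"{len(events)} events logged for request {events[-1]['request_id']}")
```

JSON Lines keeps appends cheap and lets standard tools filter events one line at a time.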

Discussion

Ask learners:

  • What would be useful to search for in these logs?
  • Which fields are important for debugging failures?
  • Which fields should not be stored in plaintext?


6. Hands-on Exercise 2: Observe Retrieval Quality in a Mini RAG Pipeline

Goal

Build a tiny retrieval pipeline using in-memory documents and log:

  • the user query
  • the retrieved documents
  • simple relevance scores
  • the final prompt context
  • the model answer

This exercise emphasizes that many “LLM problems” are actually retrieval problems.

What learners will practice

  • Simulating retrieval with Python
  • Logging retrieval rankings
  • Passing retrieved context to the model
  • Inspecting whether the answer was grounded in useful context

Code

import json
import math
import re
import time
import uuid
from collections import Counter
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()


DOCUMENTS = [
    {
        "id": "doc_1",
        "title": "Refund Policy",
        "text": "Customers may request a full refund within 30 days of purchase with proof of payment.",
    },
    {
        "id": "doc_2",
        "title": "Shipping Policy",
        "text": "Standard shipping takes 5 to 7 business days. Expedited shipping takes 2 business days.",
    },
    {
        "id": "doc_3",
        "title": "Account Security",
        "text": "Users should enable two-factor authentication and use a strong unique password.",
    },
    {
        "id": "doc_4",
        "title": "Subscription Terms",
        "text": "Monthly subscriptions renew automatically unless canceled before the next billing date.",
    },
]


def utc_now_iso() -> str:
    """Return the current UTC timestamp in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()


def log_event(event_type: str, payload: dict) -> None:
    """Print a structured event log."""
    print(
        json.dumps(
            {
                "timestamp": utc_now_iso(),
                "event_type": event_type,
                **payload,
            },
            indent=2,
        )
    )


def tokenize(text: str) -> list[str]:
    """
    Convert text into lowercase word tokens.
    This is a simple tokenizer for demonstration purposes.
    """
    return re.findall(r"\b\w+\b", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """
    Compute cosine similarity between two texts using bag-of-words counts.
    This is intentionally simple so learners can inspect retrieval behavior.
    """
    tokens_a = tokenize(text_a)
    tokens_b = tokenize(text_b)

    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)

    all_terms = set(counts_a) | set(counts_b)

    dot = sum(counts_a[t] * counts_b[t] for t in all_terms)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0

    return dot / (norm_a * norm_b)


def retrieve(query: str, top_k: int = 2) -> list[dict]:
    """
    Rank documents by simple cosine similarity against the user query.
    """
    scored = []
    for doc in DOCUMENTS:
        score = cosine_similarity(query, doc["text"] + " " + doc["title"])
        scored.append(
            {
                "id": doc["id"],
                "title": doc["title"],
                "text": doc["text"],
                "score": round(score, 4),
            }
        )

    ranked = sorted(scored, key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]


def answer_question_with_rag(query: str) -> str:
    """
    Retrieve context, log retrieval details, then ask the model to answer
    using only the retrieved information.
    """
    request_id = str(uuid.uuid4())
    model = "gpt-5.4-mini"

    log_event(
        "rag_request_started",
        {
            "request_id": request_id,
            "query": query,
            "model": model,
        },
    )

    retrieval_start = time.perf_counter()
    top_docs = retrieve(query, top_k=2)
    retrieval_latency_ms = round((time.perf_counter() - retrieval_start) * 1000, 2)

    log_event(
        "retrieval_completed",
        {
            "request_id": request_id,
            "query": query,
            "retrieval_latency_ms": retrieval_latency_ms,
            "results": [
                {
                    "rank": i + 1,
                    "doc_id": doc["id"],
                    "title": doc["title"],
                    "score": doc["score"],
                    "preview": doc["text"][:100],
                }
                for i, doc in enumerate(top_docs)
            ],
        },
    )

    context = "\n\n".join(
        [
            f"[{doc['id']}] {doc['title']}\n{doc['text']}"
            for doc in top_docs
        ]
    )

    instructions = (
        "Answer the user's question using only the provided context. "
        "If the answer is not in the context, say: 'I could not find that in the provided documents.'"
    )

    model_input = (
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    log_event(
        "prompt_prepared",
        {
            "request_id": request_id,
            "instructions_preview": instructions[:150],
            "context_preview": context[:300],
        },
    )

    llm_start = time.perf_counter()
    response = client.responses.create(
        model=model,
        instructions=instructions,
        input=model_input,
    )
    llm_latency_ms = round((time.perf_counter() - llm_start) * 1000, 2)

    answer = response.output_text

    log_event(
        "rag_request_completed",
        {
            "request_id": request_id,
            "llm_latency_ms": llm_latency_ms,
            "answer_preview": answer[:300],
        },
    )

    return answer


if __name__ == "__main__":
    question = "Can I get my money back after buying a product?"
    answer = answer_question_with_rag(question)

    print("\n=== RAG Answer ===")
    print(answer)

Example Output

{
  "timestamp": "2026-03-22T10:05:00.000000+00:00",
  "event_type": "rag_request_started",
  "request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
  "query": "Can I get my money back after buying a product?",
  "model": "gpt-5.4-mini"
}
{
  "timestamp": "2026-03-22T10:05:00.010000+00:00",
  "event_type": "retrieval_completed",
  "request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
  "query": "Can I get my money back after buying a product?",
  "retrieval_latency_ms": 2.13,
  "results": [
    {
      "rank": 1,
      "doc_id": "doc_1",
      "title": "Refund Policy",
      "score": 0.1633,
      "preview": "Customers may request a full refund within 30 days of purchase with proof of payment."
    },
    {
      "rank": 2,
      "doc_id": "doc_4",
      "title": "Subscription Terms",
      "score": 0.0,
      "preview": "Monthly subscriptions renew automatically unless canceled before the next billing date."
    }
  ]
}
{
  "timestamp": "2026-03-22T10:05:01.220000+00:00",
  "event_type": "rag_request_completed",
  "request_id": "b30dbf8d-fcf4-4f81-b0da-d4a91315dc1c",
  "llm_latency_ms": 1198.76,
  "answer_preview": "Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment."
}

=== RAG Answer ===
Yes. According to the provided context, customers may request a full refund within 30 days of purchase with proof of payment.

Exercise Tasks

  1. Run the script with three different user questions.
  2. For each question, inspect whether the top result truly contains the answer.
  3. Add a boolean field called likely_relevant based on a score threshold.
  4. Add a warning log if all retrieved scores are near zero.
  5. Modify the corpus to include a conflicting refund policy and observe what happens.

Reflection Questions

  • Did the model fail, or did retrieval fail?
  • How would you detect stale or contradictory documents?
  • What metadata would help explain ranking decisions?

7. Hands-on Exercise 3: Track and Debug Tool Calls

Goal

Create a small workflow where the model can use a tool to look up weather data, and log:

  • the tool schema
  • the tool call arguments
  • the tool result
  • the final answer

This exercise helps learners understand how to observe and debug tool-enabled systems.

What learners will practice

  • Defining a tool for the Responses API
  • Executing the tool in Python
  • Feeding tool results back to the model
  • Logging each tool interaction

Code

import json
import time
import uuid
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()


def utc_now_iso() -> str:
    """Return the current UTC timestamp in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()


def log_event(event_type: str, payload: dict) -> None:
    """Print a structured event log."""
    print(
        json.dumps(
            {
                "timestamp": utc_now_iso(),
                "event_type": event_type,
                **payload,
            },
            indent=2,
        )
    )


def get_weather(city: str) -> dict:
    """
    Mock weather lookup tool.

    In a real application, this function would call an external API.
    """
    fake_weather_db = {
        "london": {"temperature_c": 14, "condition": "Cloudy"},
        "paris": {"temperature_c": 18, "condition": "Sunny"},
        "tokyo": {"temperature_c": 22, "condition": "Rain showers"},
    }

    normalized = city.strip().lower()
    return fake_weather_db.get(
        normalized,
        {"temperature_c": None, "condition": "Unknown city"},
    )


def run_weather_agent(user_question: str) -> str:
    """
    Ask the model to answer a weather question using a Python tool.
    """
    request_id = str(uuid.uuid4())
    model = "gpt-5.4-mini"

    tools = [
        {
            "type": "function",
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name to look up.",
                    }
                },
                "required": ["city"],
                "additionalProperties": False,
            },
        }
    ]

    log_event(
        "agent_request_started",
        {
            "request_id": request_id,
            "model": model,
            "user_question": user_question,
            "tools": tools,
        },
    )

    start = time.perf_counter()

    first_response = client.responses.create(
        model=model,
        instructions=(
            "You are a helpful assistant. "
            "Use the available tool when the user asks for weather information."
        ),
        input=user_question,
        tools=tools,
    )

    # Inspect tool calls emitted by the model.
    tool_outputs = []
    tool_call_count = 0

    for item in first_response.output:
        if item.type == "function_call":
            tool_call_count += 1
            tool_name = item.name
            call_id = item.call_id
            arguments = json.loads(item.arguments)

            log_event(
                "tool_call_requested",
                {
                    "request_id": request_id,
                    "tool_name": tool_name,
                    "call_id": call_id,
                    "arguments": arguments,
                },
            )

            tool_start = time.perf_counter()

            if tool_name == "get_weather":
                result = get_weather(arguments["city"])
            else:
                result = {"error": f"Unknown tool: {tool_name}"}

            tool_latency_ms = round((time.perf_counter() - tool_start) * 1000, 2)

            log_event(
                "tool_call_completed",
                {
                    "request_id": request_id,
                    "tool_name": tool_name,
                    "call_id": call_id,
                    "latency_ms": tool_latency_ms,
                    "result": result,
                },
            )

            tool_outputs.append(
                {
                    "type": "function_call_output",
                    "call_id": call_id,
                    "output": json.dumps(result),
                }
            )

    if tool_outputs:
        final_response = client.responses.create(
            model=model,
            previous_response_id=first_response.id,
            input=tool_outputs,
        )
        answer = final_response.output_text
    else:
        answer = first_response.output_text

    total_latency_ms = round((time.perf_counter() - start) * 1000, 2)

    log_event(
        "agent_request_completed",
        {
            "request_id": request_id,
            "tool_call_count": tool_call_count,
            "total_latency_ms": total_latency_ms,
            "answer_preview": answer[:300],
        },
    )

    return answer


if __name__ == "__main__":
    question = "What's the weather like in Paris today?"
    answer = run_weather_agent(question)

    print("\n=== Agent Answer ===")
    print(answer)

Example Output

{
  "timestamp": "2026-03-22T10:10:00.000000+00:00",
  "event_type": "agent_request_started",
  "request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
  "model": "gpt-5.4-mini",
  "user_question": "What's the weather like in Paris today?",
  "tools": [
    {
      "type": "function",
      "name": "get_weather",
      "description": "Look up current weather for a city.",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {
            "type": "string",
            "description": "The city name to look up."
          }
        },
        "required": [
          "city"
        ],
        "additionalProperties": false
      }
    }
  ]
}
{
  "timestamp": "2026-03-22T10:10:00.900000+00:00",
  "event_type": "tool_call_requested",
  "request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
  "tool_name": "get_weather",
  "call_id": "call_abc123",
  "arguments": {
    "city": "Paris"
  }
}
{
  "timestamp": "2026-03-22T10:10:00.905000+00:00",
  "event_type": "tool_call_completed",
  "request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
  "tool_name": "get_weather",
  "call_id": "call_abc123",
  "latency_ms": 0.08,
  "result": {
    "temperature_c": 18,
    "condition": "Sunny"
  }
}
{
  "timestamp": "2026-03-22T10:10:01.600000+00:00",
  "event_type": "agent_request_completed",
  "request_id": "8d1436b4-6c67-40dc-86d1-24486a2f0ec0",
  "tool_call_count": 1,
  "total_latency_ms": 1598.42,
  "answer_preview": "The current weather in Paris is sunny with a temperature of 18°C."
}

=== Agent Answer ===
The current weather in Paris is sunny with a temperature of 18°C.

Exercise Tasks

  1. Run the example with cities inside and outside the fake weather database.
  2. Add validation for empty city arguments.
  3. Log a warning if the model answers without calling the weather tool.
  4. Add a second tool, such as get_time_in_city, and observe tool selection behavior.
  5. Add error handling so the tool returns structured errors rather than crashing.

Debugging Questions

  • Did the model choose the correct tool?
  • Were the arguments complete and valid?
  • Did the final answer correctly incorporate the tool result?
  • What would you alert on in production?

8. Building a Simple Evaluation Loop

Observability gives you raw signals. Evaluation turns those signals into improvement.

A practical evaluation loop

  1. Collect examples of user inputs
  2. Save prompts, retrieval results, tool calls, and outputs
  3. Review failures
  4. Label failure causes:
     • prompt issue
     • retrieval issue
     • tool issue
     • unclear user input
     • model limitation
  5. Update the system
  6. Re-run the examples
  7. Compare results over time
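The review-and-label steps of this loop can be sketched as a small harness over saved examples. The example inputs and cause labels below are illustrative hand-labeled data:

```python
from collections import Counter

# Hand-labeled review results for saved examples (illustrative data;
# in practice these come from the collect/save/review steps above).
reviewed_cases = [
    {"input": "Can I get a refund?", "outcome": "pass", "cause": None},
    {"input": "What is the return window?", "outcome": "fail", "cause": "retrieval issue"},
    {"input": "Weather in Oslo?", "outcome": "fail", "cause": "tool issue"},
    {"input": "Summarize my order", "outcome": "fail", "cause": "prompt issue"},
]


def summarize_failures(cases: list[dict]) -> Counter:
    """Count labeled failure causes so fixes can be prioritized."""
    return Counter(c["cause"] for c in cases if c["outcome"] == "fail")


for cause, count in summarize_failures(reviewed_cases).most_common():
    print(cause, count)
```

Re-running the same labeled set after each change turns the counts into a simple before/after comparison.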

Example failure taxonomy

Failure Type    | Symptom                  | Likely Fix
----------------|--------------------------|--------------------------------------------
Prompt issue    | Output format wrong      | Improve instructions or examples
Retrieval issue | Answer misses known fact | Improve chunking/ranking/filtering
Tool issue      | Wrong tool args          | Tighten schema, validate inputs
Safety issue    | Sensitive info in logs   | Redact or minimize stored data
Latency issue   | Slow responses           | Cache results, reduce calls, optimize tools

What to measure over time

  • Answer correctness
  • Groundedness in retrieved/tool evidence
  • Tool success rate
  • Retrieval relevance
  • Request latency
  • Error rate
  • Cost per request

9. Summary

In this session, learners explored observability across three major areas of GenAI systems:

  • Prompts: inspect inputs, instructions, outputs, and latency
  • Retrieval: inspect rankings, scores, context quality, and answer grounding
  • Tools: inspect selection, arguments, execution results, and final usage

The key lesson is that good observability makes GenAI systems easier to debug, safer to operate, and faster to improve.


10. Suggested Instructor Flow

First 10 minutes

  • Introduce observability and why it matters
  • Compare debugging GenAI systems vs traditional software

Next 10 minutes

  • Cover prompt, retrieval, and tool observability concepts
  • Show examples of failure modes

Next 20 minutes

  • Work through the three exercises
  • Ask learners to inspect logs and identify likely causes of failure

Final 5 minutes

  • Discuss evaluation loops and production-readiness
  • Preview future sessions on testing, evaluation, or agent orchestration

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Python logging module: https://docs.python.org/3/library/logging.html
  • JSON module docs: https://docs.python.org/3/library/json.html
  • Time module docs: https://docs.python.org/3/library/time.html

Optional Homework

  1. Take one of your previous GenAI scripts and add structured observability logs.
  2. Create a small log schema with fields for:
     • request ID
     • user input
     • prompt version
     • retrieval results
     • tool calls
     • latency
     • final output
  3. Run five test prompts and identify at least two failure patterns.
  4. Write a short note describing whether each issue was caused by prompting, retrieval, or tool execution.
