Session 3: Monitoring, Logging, and Incident Response

Synopsis

Shows how to track application health, model behavior, tool failures, user interactions, and abnormal events in production. Learners gain the operational visibility needed to maintain trust and service quality.

Session Content

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge learning GenAI and agentic development


Session Overview

In this session, you will learn how to observe, debug, and respond to failures in GenAI applications and agentic systems. Monitoring and incident response are essential because LLM-powered systems can fail in ways that differ from traditional software: malformed outputs, prompt regressions, unexpected latency, tool misuse, cost spikes, and downstream integration errors.

By the end of this session, you will be able to:

  • Explain the role of monitoring in GenAI systems
  • Distinguish between logs, metrics, traces, and alerts
  • Add structured logging to Python applications using the OpenAI Responses API
  • Capture latency, token usage, and failure signals
  • Build simple incident-response workflows for GenAI failures
  • Debug common production issues in LLM applications

Learning Objectives

After this session, learners should be able to:

  1. Define key observability concepts for GenAI systems
  2. Instrument Python code with structured logs
  3. Track request/response metadata for OpenAI API calls
  4. Detect common issues such as retries, malformed outputs, and latency spikes
  5. Create a basic incident response checklist for LLM applications
  6. Practice debugging and remediation through hands-on exercises

Agenda

  1. Why monitoring matters for GenAI systems
  2. Core observability concepts: logs, metrics, traces, alerts
  3. What to monitor in LLM and agentic applications
  4. Logging and monitoring with Python
  5. Hands-on Exercise 1: Structured logging around Responses API calls
  6. Hands-on Exercise 2: Incident detection and response simulation
  7. Incident response for GenAI systems
  8. Common GenAI incidents and responses
  9. Best practices
  10. Mini challenge
  11. Recap and useful resources

1. Why Monitoring Matters for GenAI Systems

Traditional applications often fail in deterministic ways. GenAI applications introduce probabilistic behavior and new failure modes:

  • The model may return valid text that is semantically wrong
  • Latency may vary widely depending on prompt size and task complexity
  • Costs may spike due to excessive token use
  • Structured outputs may break parsers
  • Agents may loop, call tools incorrectly, or make poor decisions
  • Prompt or model changes can silently degrade quality

Common Failure Categories

1. Model Output Failures

  • Hallucinations
  • Incorrect formatting
  • Missing required fields
  • Unsafe or policy-violating responses

2. Operational Failures

  • API timeouts
  • Rate limits
  • Authentication errors
  • Network failures

3. Agentic Failures

  • Tool call errors
  • Infinite or excessive loops
  • Bad planning
  • Incorrect tool selection
  • State corruption between steps

4. Business Failures

  • Higher cost per request
  • User dissatisfaction
  • Increased abandonment
  • SLA breaches

Monitoring helps teams answer:

  • Is the system healthy?
  • Are users getting acceptable responses?
  • Are costs under control?
  • Did a recent deployment degrade performance?
  • What happened during an incident?

2. Core Observability Concepts

2.1 Logs

Logs are timestamped records of events.

Examples:

  • A request was sent to the model
  • A response was received
  • Parsing failed
  • A retry occurred

Logs are best for:

  • Debugging
  • Auditing
  • Post-incident analysis

2.2 Metrics

Metrics are numerical values tracked over time.

Examples:

  • Request count
  • Error rate
  • Average latency
  • P95 latency
  • Tokens per request
  • Cost per request
  • Tool call success rate

Metrics are best for:

  • Dashboards
  • Threshold alerts
  • Trend analysis

2.3 Traces

Traces show the full lifecycle of a request across components.

For agentic applications, a trace might include:

  • User message received
  • Prompt construction
  • LLM request
  • Tool selection
  • Tool execution
  • Follow-up LLM call
  • Final response sent

Traces are best for:

  • Multi-step debugging
  • Distributed systems
  • Agent workflows

2.4 Alerts

Alerts notify you when a system crosses a threshold or exhibits abnormal behavior.

Examples:

  • Error rate > 5%
  • P95 latency > 8 seconds
  • Token usage doubles after deployment
  • Tool call failures exceed threshold
  • JSON parse failures spike

Alerts should be:

  • Actionable
  • Specific
  • Low-noise
  • Mapped to a runbook or response plan
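
A minimal threshold-based alert check can be sketched in a few lines. The metric names, thresholds, and runbook labels below are illustrative, not standard:

```python
"""Sketch: evaluating simple threshold alert rules against a metric snapshot."""

# Current metric snapshot (hypothetical values).
metrics = {
    "error_rate": 0.07,             # 7% of recent requests failed
    "p95_latency_ms": 6200,
    "json_parse_failure_rate": 0.01,
}

# Each rule pairs a metric and threshold with a pointer to a runbook entry,
# so every alert that fires is actionable.
alert_rules = [
    {"metric": "error_rate", "threshold": 0.05, "runbook": "check-api-health"},
    {"metric": "p95_latency_ms", "threshold": 8000, "runbook": "review-prompt-size"},
    {"metric": "json_parse_failure_rate", "threshold": 0.02, "runbook": "validate-output-schema"},
]

fired = [rule for rule in alert_rules if metrics[rule["metric"]] > rule["threshold"]]

for rule in fired:
    print(f"ALERT {rule['metric']} > {rule['threshold']} -> runbook: {rule['runbook']}")
```

Only the error-rate rule fires here; the other metrics stay under their thresholds, which keeps the alert low-noise.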


3. What to Monitor in LLM and Agentic Applications

3.1 API Health Metrics

Track:

  • Request count
  • Success/failure count
  • Retry count
  • Timeout count
  • Rate-limit events
  • Latency per request

3.2 Model Usage Metrics

Track:

  • Input tokens
  • Output tokens
  • Total tokens
  • Estimated cost
  • Prompt sizes
  • Completion sizes
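
Estimated cost follows directly from token counts. A small sketch, assuming placeholder per-million-token prices (always check current pricing for the model you actually use):

```python
"""Sketch: estimating request cost from token usage.

The per-token prices below are placeholders, not real pricing.
"""

# Hypothetical prices in USD per 1M tokens.
PRICE_PER_1M_INPUT = 0.40
PRICE_PER_1M_OUTPUT = 1.60

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request from its token counts."""
    cost = (
        input_tokens / 1_000_000 * PRICE_PER_1M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
    )
    return round(cost, 6)

print(estimate_cost_usd(1200, 350))
```

Logging this estimate per request makes cost spikes visible on the same dashboard as latency and error rate.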

3.3 Quality Signals

Track:

  • Structured output parse success/failure
  • Human feedback score
  • User re-ask rate
  • Escalation rate to human review
  • Prompt version performance
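
Parse success/failure can be tracked with a small classifier around `json.loads`. The required keys below are a hypothetical schema for illustration:

```python
"""Sketch: classifying structured-output quality for monitoring.

Assumes the model was asked for JSON containing "answer" and "confidence";
that schema is an example, not a standard.
"""
import json

REQUIRED_KEYS = {"answer", "confidence"}

def parse_signal(raw_output: str) -> str:
    """Classify a model response for quality tracking."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "parse_failure"
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return "missing_fields"
    return "parse_success"

print(parse_signal('{"answer": "42", "confidence": 0.9}'))  # parse_success
print(parse_signal('{"answer": "42"}'))                     # missing_fields
print(parse_signal("not json"))                             # parse_failure
```

Counting these three outcomes over time gives a parse-failure rate you can alert on, per prompt version.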

3.4 Agent-Specific Metrics

Track:

  • Tool call count per task
  • Tool error rate
  • Steps per task
  • Loop detection count
  • Final task success/failure
  • Time spent in each agent step

3.5 Security and Compliance Signals

Track:

  • Prompt injection attempts
  • Sensitive data detection
  • Abuse patterns
  • Unauthorized tool access attempts


4. Logging and Monitoring with Python

A practical monitoring approach starts simple:

  1. Use structured logs in JSON format
  2. Add request IDs for correlation
  3. Measure latency
  4. Capture response metadata
  5. Record exceptions with enough context
  6. Avoid logging sensitive data

Example Structured Log Fields

  • timestamp
  • level
  • event
  • request_id
  • model
  • latency_ms
  • input_chars
  • output_chars
  • status
  • error_type
  • user_id or anonymized session ID
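
Assembled together, one structured log entry using these fields might look like the following. All values are illustrative, and `session_id` stands in for an anonymized user reference:

```python
"""Sketch: a single structured log entry with the fields listed above."""
import json
from datetime import datetime, timezone

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "INFO",
    "event": "llm_request_completed",
    "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111",
    "model": "gpt-5.4-mini",
    "latency_ms": 1198.54,
    "input_chars": 72,
    "output_chars": 187,
    "status": "success",
    "error_type": None,
    "session_id": "anon-session-42",
}

# Emit as a single JSON line so log aggregators can parse each record.
print(json.dumps(log_entry))
```

One JSON object per line is the key convention: it lets tools filter by `request_id` or `event` without fragile text parsing.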

5. Hands-on Exercise 1: Structured Logging Around Responses API Calls

Objective

Build a Python script that:

  • Sends a request using the OpenAI Responses API
  • Logs request lifecycle events in structured JSON
  • Measures latency
  • Captures errors safely
  • Prints a concise summary

Prerequisites

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Exercise Code

"""
Exercise 1: Structured logging for OpenAI Responses API calls.

What this script demonstrates:
- Structured JSON logging
- Request correlation with a request_id
- Latency measurement
- Basic error handling
- Safe logging practices

Run:
    python exercise1_logging.py
"""

import json
import logging
import os
import sys
import time
import uuid
from datetime import datetime, timezone

from openai import OpenAI


class JsonFormatter(logging.Formatter):
    """Format logs as JSON for easier ingestion by monitoring tools."""

    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }

        # Include extra structured fields if they exist on the record.
        for field in [
            "event",
            "request_id",
            "model",
            "latency_ms",
            "status",
            "input_chars",
            "output_chars",
            "error_type",
        ]:
            if hasattr(record, field):
                log_entry[field] = getattr(record, field)

        return json.dumps(log_entry)


def build_logger() -> logging.Logger:
    """Create and configure a JSON logger."""
    logger = logging.getLogger("genai_monitoring")
    logger.setLevel(logging.INFO)

    # Avoid duplicate handlers if re-run in notebooks or interactive sessions.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)

    return logger


def extract_output_text(response) -> str:
    """
    Extract output text from a Responses API object.

    The SDK provides `response.output_text` for convenience.
    """
    return getattr(response, "output_text", "") or ""


def main() -> None:
    """Send a request and log the lifecycle."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY environment variable is not set.")

    client = OpenAI(api_key=api_key)
    logger = build_logger()

    request_id = str(uuid.uuid4())
    model = "gpt-5.4-mini"
    user_prompt = "Summarize why monitoring matters in LLM applications in 3 bullet points."

    logger.info(
        "Starting model request",
        extra={
            "event": "llm_request_started",
            "request_id": request_id,
            "model": model,
            "input_chars": len(user_prompt),
            "status": "started",
        },
    )

    start_time = time.perf_counter()

    try:
        response = client.responses.create(
            model=model,
            input=user_prompt,
        )

        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        output_text = extract_output_text(response)

        logger.info(
            "Model request completed successfully",
            extra={
                "event": "llm_request_completed",
                "request_id": request_id,
                "model": model,
                "latency_ms": latency_ms,
                "status": "success",
                "output_chars": len(output_text),
            },
        )

        print("\n=== Model Output ===")
        print(output_text)
        print("\n=== Summary ===")
        print(f"Request ID : {request_id}")
        print(f"Model      : {model}")
        print(f"Latency ms : {latency_ms}")
        print(f"Chars out  : {len(output_text)}")

    except Exception as exc:
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)

        logger.error(
            f"Model request failed: {exc}",
            extra={
                "event": "llm_request_failed",
                "request_id": request_id,
                "model": model,
                "latency_ms": latency_ms,
                "status": "error",
                "error_type": type(exc).__name__,
            },
        )
        raise


if __name__ == "__main__":
    main()

Example Output

{"timestamp": "2026-03-22T10:00:00.000000+00:00", "level": "INFO", "message": "Starting model request", "event": "llm_request_started", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "status": "started", "input_chars": 72}
{"timestamp": "2026-03-22T10:00:01.200000+00:00", "level": "INFO", "message": "Model request completed successfully", "event": "llm_request_completed", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "latency_ms": 1198.54, "status": "success", "output_chars": 187}

=== Model Output ===
- Monitoring helps detect failures such as malformed outputs, latency spikes, and tool misuse.
- It provides visibility into quality, reliability, and cost trends over time.
- It supports faster debugging and incident response when production issues occur.

=== Summary ===
Request ID : 8b5d1c3b-4d2f-4b2e-9d0a-111111111111
Model      : gpt-5.4-mini
Latency ms : 1198.54
Chars out  : 187

Discussion

Questions to ask after running:

  • Which fields would help most during debugging?
  • What should never be logged?
  • How would you correlate logs across an agent workflow?
  • What happens if you need retries or fallback models?


6. Hands-on Exercise 2: Incident Detection and Response Simulation

Objective

Simulate a small GenAI service that:

  • Sends multiple requests
  • Detects slow or failed requests
  • Produces an incident summary
  • Suggests remediation actions

This exercise demonstrates the basics of incident detection logic.

Scenario

You are operating a support assistant service. Your monitoring thresholds are:

  • Alert if latency > 4000 ms
  • Alert if request fails
  • Alert if output is unexpectedly short for the prompt

Exercise Code

"""
Exercise 2: Simulate monitoring and incident response for a GenAI service.

What this script demonstrates:
- Batch request monitoring
- Threshold-based alerting
- Incident record generation
- Operational summary reporting

Run:
    python exercise2_incident_response.py
"""

import json
import os
import time
import uuid
from dataclasses import dataclass, asdict
from typing import List

from openai import OpenAI


@dataclass
class RequestResult:
    """Represents the outcome of one monitored request."""
    request_id: str
    prompt: str
    latency_ms: float
    status: str
    output_chars: int
    incident_reason: str = ""


def extract_output_text(response) -> str:
    """Extract text content from a Responses API response."""
    return getattr(response, "output_text", "") or ""


def evaluate_incident(result: RequestResult) -> bool:
    """Return True if this request should be flagged as an incident."""
    if result.status != "success":
        return True
    if result.latency_ms > 4000:
        return True
    if result.output_chars < 40:
        return True
    return False


def incident_reason(result: RequestResult) -> str:
    """Generate a human-readable incident reason."""
    reasons = []
    if result.status != "success":
        reasons.append("request_failed")
    if result.latency_ms > 4000:
        reasons.append("high_latency")
    if result.output_chars < 40:
        reasons.append("short_output")
    return ",".join(reasons)


def call_model(client: OpenAI, prompt: str, model: str) -> RequestResult:
    """Make one monitored model request."""
    request_id = str(uuid.uuid4())
    start_time = time.perf_counter()

    try:
        response = client.responses.create(
            model=model,
            input=prompt,
        )
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        output_text = extract_output_text(response)

        return RequestResult(
            request_id=request_id,
            prompt=prompt,
            latency_ms=latency_ms,
            status="success",
            output_chars=len(output_text),
        )
    except Exception:
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        return RequestResult(
            request_id=request_id,
            prompt=prompt,
            latency_ms=latency_ms,
            status="error",
            output_chars=0,
        )


def print_incident_report(results: List[RequestResult]) -> None:
    """Print a concise incident report."""
    total = len(results)
    failures = sum(1 for r in results if r.status != "success")
    incidents = [r for r in results if evaluate_incident(r)]

    print("\n=== Monitoring Summary ===")
    print(f"Total requests : {total}")
    print(f"Failures       : {failures}")
    print(f"Incidents      : {len(incidents)}")

    if results:
        avg_latency = round(sum(r.latency_ms for r in results) / len(results), 2)
        print(f"Avg latency ms : {avg_latency}")

    print("\n=== Incident Details ===")
    if not incidents:
        print("No incidents detected.")
        return

    for result in incidents:
        result.incident_reason = incident_reason(result)
        print(json.dumps(asdict(result), indent=2))

    print("\n=== Suggested Actions ===")
    print("- Check recent deployments or prompt changes.")
    print("- Review latency trends and API health.")
    print("- Inspect failing prompts and response formats.")
    print("- Consider fallback logic or retries for transient failures.")
    print("- Escalate to human review if critical workflows are impacted.")


def main() -> None:
    """Run the incident monitoring simulation."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY environment variable is not set.")

    client = OpenAI(api_key=api_key)
    model = "gpt-5.4-mini"

    prompts = [
        "Explain monitoring in LLM systems in 2 sentences.",
        "List 3 reasons logging is important in AI applications.",
        "Give a one-line explanation of incident response.",
    ]

    results = [call_model(client, prompt, model) for prompt in prompts]
    print_incident_report(results)


if __name__ == "__main__":
    main()

Example Output

=== Monitoring Summary ===
Total requests : 3
Failures       : 0
Incidents      : 1
Avg latency ms : 1520.44

=== Incident Details ===
{
  "request_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
  "prompt": "Give a one-line explanation of incident response.",
  "latency_ms": 1302.18,
  "status": "success",
  "output_chars": 32,
  "incident_reason": "short_output"
}

=== Suggested Actions ===
- Check recent deployments or prompt changes.
- Review latency trends and API health.
- Inspect failing prompts and response formats.
- Consider fallback logic or retries for transient failures.
- Escalate to human review if critical workflows are impacted.

Exercise Extension Ideas

Try modifying the script to:

  • Add retries with exponential backoff
  • Write incident reports to a JSON file
  • Track prompt version in each request
  • Add a fallback model strategy
  • Send alerts to email, Slack, or a webhook
  • Log agent step count and tool usage
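
As a starting point for the first extension, a retry wrapper with exponential backoff and jitter might look like this. `call_model_once` is a placeholder for the real API call, and the delay values are illustrative:

```python
"""Sketch: retry with exponential backoff and jitter for transient failures."""
import random
import time

def call_with_retries(call_model_once, max_attempts=3, base_delay_s=1.0):
    """Retry a callable on exception, doubling the base delay each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model_once()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the original error
            # Jitter spreads retries out so clients don't retry in lockstep.
            delay = base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Example: a stand-in call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(call_with_retries(flaky, base_delay_s=0.01))  # ok
```

In production you would retry only on errors you believe are transient (timeouts, rate limits) and log each retry as a structured event, since rising retry counts are themselves an incident signal.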


7. Incident Response for GenAI Systems

When an incident happens, teams need a repeatable process.

7.1 Incident Response Lifecycle

Detect

Examples:

  • Alert fires on error rate
  • User reports poor output quality
  • Tool failures spike after deployment

Triage

Ask:

  • Is this widespread or isolated?
  • Is the issue model-related, prompt-related, tool-related, or infrastructure-related?
  • Which users or workflows are affected?

Mitigate

Possible mitigations:

  • Roll back a prompt change
  • Disable a failing tool
  • Route traffic to a fallback workflow
  • Increase timeout thresholds temporarily
  • Escalate to human review

Investigate

Gather:

  • Logs
  • Request IDs
  • Prompt versions
  • Error patterns
  • Latency trends
  • Tool execution records

Resolve

Examples:

  • Fix parser assumptions
  • Correct prompt template
  • Add validation to outputs
  • Improve retries or backoff logic
  • Patch tool integration

Review

Perform a postmortem:

  • What happened?
  • Why did detection or mitigation take time?
  • What monitoring was missing?
  • What action items will prevent recurrence?
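
The lifecycle stages above can be captured as a timestamped timeline that feeds directly into the postmortem. This sketch uses a plain dict, and the incident details are invented:

```python
"""Sketch: recording an incident timeline for later postmortem review."""
from datetime import datetime, timezone

incident = {
    "incident_id": "INC-001",
    "summary": "JSON parse failures spiked after prompt change",
    "timeline": [],
}

def record_stage(stage: str, note: str) -> None:
    """Append a timestamped lifecycle entry to the incident record."""
    incident["timeline"].append({
        "stage": stage,
        "note": note,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_stage("detect", "Alert: JSON parse failure rate above threshold")
record_stage("triage", "Isolated to prompt version v14")
record_stage("mitigate", "Rolled back to prompt version v13")
record_stage("resolve", "Fixed schema instructions in v15")
record_stage("review", "Postmortem held; added per-prompt-version parse-rate alert")

for entry in incident["timeline"]:
    print(f'{entry["stage"]:10} {entry["note"]}')
```

Writing the timeline down as events happen, rather than reconstructing it afterwards, makes the "why did detection take time?" question answerable with data.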


8. Common GenAI Incidents and Responses

Output format failures
  Symptoms: JSON parser errors, missing fields
  Likely causes: Prompt drift, model behavior changes
  Initial response: Add validation, retry, stronger schema instructions

Latency spike
  Symptoms: Slow responses, timeouts
  Likely causes: Large prompts, upstream API delay, model overload
  Initial response: Reduce prompt size, add fallback, monitor P95 latency

Cost spike
  Symptoms: Token usage jumps
  Likely causes: Prompt expansion, agent loops, repeated retries
  Initial response: Cap steps, inspect prompt changes, add usage alerts

Tool failure
  Symptoms: Agent cannot complete task
  Likely causes: External API down, auth failure, bad arguments
  Initial response: Disable tool, retry safely, degrade gracefully

Hallucination increase
  Symptoms: More incorrect answers
  Likely causes: Prompt regression, context retrieval issues
  Initial response: Tighten grounding, evaluate prompts, add human review

Rate limits
  Symptoms: API errors and retries
  Likely causes: Traffic burst, inadequate backoff
  Initial response: Add queueing, jittered retries, traffic shaping

9. Best Practices

Logging Best Practices

  • Use structured JSON logs
  • Include correlation/request IDs
  • Log lifecycle events consistently
  • Avoid sensitive data in logs
  • Redact secrets and personally identifiable information
  • Keep messages machine-readable and human-usable

Monitoring Best Practices

  • Track both technical and quality metrics
  • Monitor by model version and prompt version
  • Use dashboards for latency, error rate, and token usage
  • Add alerts with clear thresholds
  • Review trends after every release

Incident Response Best Practices

  • Maintain simple runbooks
  • Keep alert noise low
  • Practice response drills
  • Record incident timelines
  • Conduct blameless postmortems

Agentic System Best Practices

  • Cap max steps
  • Log each tool invocation
  • Track tool inputs/outputs safely
  • Detect loops
  • Add fallback paths for tool failures
  • Use validation before acting on model output
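
Capping steps and detecting loops can be sketched as follows. The tool-call history format (a tuple of tool name and arguments) is an assumption for illustration:

```python
"""Sketch: capping agent steps and detecting repeated tool calls."""

MAX_STEPS = 10

def detect_loop(tool_calls, window=3):
    """Flag a loop if the same (tool, args) call repeats `window` times in a row."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    return all(call == recent[0] for call in recent)

# Simulated agent run: the agent keeps issuing an identical tool call.
history = []
for step in range(1, MAX_STEPS + 1):
    call = ("search_docs", "monitoring")
    history.append(call)
    if detect_loop(history):
        print(f"Loop detected at step {step}; stopping agent early.")
        break
else:
    print("Step cap reached without loop detection.")
```

Real agents vary their arguments slightly between loop iterations, so production detectors often compare normalized or fuzzy-matched calls instead of exact tuples; the step cap remains the hard backstop either way.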

10. Mini Challenge

Spend 5–10 minutes extending one of the exercises.

Challenge Options

  1. Add a prompt_version field to every log entry
  2. Write all request results to a local monitoring_report.json file
  3. Add retry logic for transient failures
  4. Trigger an alert when average latency exceeds a threshold
  5. Simulate an agent workflow with multiple monitored steps

Example: Write Results to a JSON File

"""
Mini challenge: Save monitoring results to a JSON file.
"""

import json
from dataclasses import asdict

def save_results(results, filename="monitoring_report.json"):
    """Save request results to disk as JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump([asdict(r) for r in results], f, indent=2)

# Example usage:
# save_results(results)
# print("Saved monitoring report to monitoring_report.json")

11. Recap

In this session, you learned:

  • Why monitoring is critical in LLM and agentic systems
  • The difference between logs, metrics, traces, and alerts
  • What to monitor in production GenAI applications
  • How to implement structured logging around OpenAI Responses API calls
  • How to detect incidents using latency, failure, and output-quality thresholds
  • How to build a simple incident response process

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Python logging documentation: https://docs.python.org/3/library/logging.html
  • JSON module documentation: https://docs.python.org/3/library/json.html
  • Dataclasses documentation: https://docs.python.org/3/library/dataclasses.html
  • Google SRE book: https://sre.google/sre-book/table-of-contents/
  • OpenTelemetry: https://opentelemetry.io/docs/

Suggested Homework

Build a small monitored GenAI service in Python that:

  • Accepts a user prompt
  • Calls gpt-5.4-mini through the Responses API
  • Logs request IDs, latency, and status
  • Detects short outputs or failed requests
  • Writes a local incident report JSON file
  • Includes a short runbook for what to do when an incident occurs

End of Session

Next, learners can build on this foundation by integrating monitoring into multi-step agents, adding retries, fallback strategies, and quality evaluation pipelines.

