Session 3: Monitoring, Logging, and Incident Response

Synopsis

Shows how to track application health, model behavior, tool failures, user interactions, and abnormal events in production. Learners gain the operational visibility needed to maintain trust and service quality.

Session Content

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge learning GenAI and agentic development


Session Overview

In this session, you will learn how to observe, debug, and respond to failures in GenAI applications and agentic systems. Monitoring and incident response are essential because LLM-powered systems can fail in ways that differ from traditional software: malformed outputs, prompt regressions, unexpected latency, tool misuse, cost spikes, and downstream integration errors.

By the end of this session, you will be able to:

  • Explain the role of monitoring in GenAI systems
  • Distinguish between logs, metrics, traces, and alerts
  • Add structured logging to Python applications using the OpenAI Responses API
  • Capture latency, token usage, and failure signals
  • Build simple incident-response workflows for GenAI failures
  • Debug common production issues in LLM applications

Learning Objectives

After this session, learners should be able to:

  1. Define key observability concepts for GenAI systems
  2. Instrument Python code with structured logs
  3. Track request/response metadata for OpenAI API calls
  4. Detect common issues such as retries, malformed outputs, and latency spikes
  5. Create a basic incident response checklist for LLM applications
  6. Practice debugging and remediation through hands-on exercises

Agenda

  1. Why monitoring matters for GenAI systems
  2. Core observability concepts: logs, metrics, traces, alerts
  3. What to monitor in LLM and agentic applications
  4. Logging and monitoring with Python
  5. Hands-on Exercise 1: Structured logging around Responses API calls
  6. Hands-on Exercise 2: Incident detection and response simulation
  7. Incident response for GenAI systems
  8. Common GenAI incidents and responses
  9. Best practices
  10. Mini challenge
  11. Recap and useful resources

1. Why Monitoring Matters for GenAI Systems

Traditional applications often fail in deterministic ways. GenAI applications introduce probabilistic behavior and new failure modes:

  • The model may return valid text that is semantically wrong
  • Latency may vary widely depending on prompt size and task complexity
  • Costs may spike due to excessive token use
  • Structured outputs may break parsers
  • Agents may loop, call tools incorrectly, or make poor decisions
  • Prompt or model changes can silently degrade quality

Common Failure Categories

1. Model Output Failures

  • Hallucinations
  • Incorrect formatting
  • Missing required fields
  • Unsafe or policy-violating responses

2. Operational Failures

  • API timeouts
  • Rate limits
  • Authentication errors
  • Network failures

3. Agentic Failures

  • Tool call errors
  • Infinite or excessive loops
  • Bad planning
  • Incorrect tool selection
  • State corruption between steps

4. Business Failures

  • Higher cost per request
  • User dissatisfaction
  • Increased abandonment
  • SLA breaches

Monitoring helps teams answer:

  • Is the system healthy?
  • Are users getting acceptable responses?
  • Are costs under control?
  • Did a recent deployment degrade performance?
  • What happened during an incident?

2. Core Observability Concepts

2.1 Logs

Logs are timestamped records of events.

Examples:

  • A request was sent to the model
  • A response was received
  • Parsing failed
  • A retry occurred

Logs are best for:

  • Debugging
  • Auditing
  • Post-incident analysis

2.2 Metrics

Metrics are numerical values tracked over time.

Examples:

  • Request count
  • Error rate
  • Average latency
  • P95 latency
  • Tokens per request
  • Cost per request
  • Tool call success rate

Metrics are best for:

  • Dashboards
  • Threshold alerts
  • Trend analysis

2.3 Traces

Traces show the full lifecycle of a request across components.

For agentic applications, a trace might include:

  • User message received
  • Prompt construction
  • LLM request
  • Tool selection
  • Tool execution
  • Follow-up LLM call
  • Final response sent

Traces are best for:

  • Multi-step debugging
  • Distributed systems
  • Agent workflows

2.4 Alerts

Alerts notify you when a system crosses a threshold or exhibits abnormal behavior.

Examples:

  • Error rate > 5%
  • P95 latency > 8 seconds
  • Token usage doubles after deployment
  • Tool call failures exceed threshold
  • JSON parse failures spike

Alerts should be:

  • Actionable
  • Specific
  • Low-noise
  • Mapped to a runbook or response plan
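
A minimal threshold-based alert check can be sketched in a few lines. The metric names, thresholds, and runbook labels below are illustrative, not standard:

```python
"""Sketch: evaluating simple threshold alert rules against a metric snapshot."""

# Current metric snapshot (hypothetical values).
metrics = {
    "error_rate": 0.07,             # 7% of recent requests failed
    "p95_latency_ms": 6200,
    "json_parse_failure_rate": 0.01,
}

# Each rule pairs a metric and threshold with a pointer to a runbook entry,
# so every alert that fires is actionable.
alert_rules = [
    {"metric": "error_rate", "threshold": 0.05, "runbook": "check-api-health"},
    {"metric": "p95_latency_ms", "threshold": 8000, "runbook": "review-prompt-size"},
    {"metric": "json_parse_failure_rate", "threshold": 0.02, "runbook": "validate-output-schema"},
]

fired = [rule for rule in alert_rules if metrics[rule["metric"]] > rule["threshold"]]

for rule in fired:
    print(f"ALERT {rule['metric']} > {rule['threshold']} -> runbook: {rule['runbook']}")
```

Only the error-rate rule fires here; the other metrics stay under their thresholds, which keeps the alert low-noise.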


3. What to Monitor in LLM and Agentic Applications

3.1 API Health Metrics

Track:

  • Request count
  • Success/failure count
  • Retry count
  • Timeout count
  • Rate-limit events
  • Latency per request

3.2 Model Usage Metrics

Track:

  • Input tokens
  • Output tokens
  • Total tokens
  • Estimated cost
  • Prompt sizes
  • Completion sizes
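
Estimated cost follows directly from token counts. A small sketch, assuming placeholder per-million-token prices (always check current pricing for the model you actually use):

```python
"""Sketch: estimating request cost from token usage.

The per-token prices below are placeholders, not real pricing.
"""

# Hypothetical prices in USD per 1M tokens.
PRICE_PER_1M_INPUT = 0.40
PRICE_PER_1M_OUTPUT = 1.60

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request from its token counts."""
    cost = (
        input_tokens / 1_000_000 * PRICE_PER_1M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
    )
    return round(cost, 6)

print(estimate_cost_usd(1200, 350))
```

Logging this estimate per request makes cost spikes visible on the same dashboard as latency and error rate.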

3.3 Quality Signals

Track:

  • Structured output parse success/failure
  • Human feedback score
  • User re-ask rate
  • Escalation rate to human review
  • Prompt version performance
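
Parse success/failure can be tracked with a small classifier around `json.loads`. The required keys below are a hypothetical schema for illustration:

```python
"""Sketch: classifying structured-output quality for monitoring.

Assumes the model was asked for JSON containing "answer" and "confidence";
that schema is an example, not a standard.
"""
import json

REQUIRED_KEYS = {"answer", "confidence"}

def parse_signal(raw_output: str) -> str:
    """Classify a model response for quality tracking."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "parse_failure"
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return "missing_fields"
    return "parse_success"

print(parse_signal('{"answer": "42", "confidence": 0.9}'))  # parse_success
print(parse_signal('{"answer": "42"}'))                     # missing_fields
print(parse_signal("not json"))                             # parse_failure
```

Counting these three outcomes over time gives a parse-failure rate you can alert on, per prompt version.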

3.4 Agent-Specific Metrics

Track:

  • Tool call count per task
  • Tool error rate
  • Steps per task
  • Loop detection count
  • Final task success/failure
  • Time spent in each agent step

3.5 Security and Compliance Signals

Track:

  • Prompt injection attempts
  • Sensitive data detection
  • Abuse patterns
  • Unauthorized tool access attempts


4. Logging and Monitoring with Python

A practical monitoring approach starts simple:

  1. Use structured logs in JSON format
  2. Add request IDs for correlation
  3. Measure latency
  4. Capture response metadata
  5. Record exceptions with enough context
  6. Avoid logging sensitive data

Example Structured Log Fields

  • timestamp
  • level
  • event
  • request_id
  • model
  • latency_ms
  • input_chars
  • output_chars
  • status
  • error_type
  • user_id or anonymized session ID
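
Assembled together, one structured log entry using these fields might look like the following. All values are illustrative, and `session_id` stands in for an anonymized user reference:

```python
"""Sketch: a single structured log entry with the fields listed above."""
import json
from datetime import datetime, timezone

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "INFO",
    "event": "llm_request_completed",
    "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111",
    "model": "gpt-5.4-mini",
    "latency_ms": 1198.54,
    "input_chars": 72,
    "output_chars": 187,
    "status": "success",
    "error_type": None,
    "session_id": "anon-session-42",
}

# Emit as a single JSON line so log aggregators can parse each record.
print(json.dumps(log_entry))
```

One JSON object per line is the key convention: it lets tools filter by `request_id` or `event` without fragile text parsing.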

5. Hands-on Exercise 1: Structured Logging Around Responses API Calls

Objective

Build a Python script that:

  • Sends a request using the OpenAI Responses API
  • Logs request lifecycle events in structured JSON
  • Measures latency
  • Captures errors safely
  • Prints a concise summary

Prerequisites

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Exercise Code

"""
Exercise 1: Structured logging for OpenAI Responses API calls.

What this script demonstrates:
- Structured JSON logging
- Request correlation with a request_id
- Latency measurement
- Basic error handling
- Safe logging practices

Run:
    python exercise1_logging.py
"""

import json
import logging
import os
import sys
import time
import uuid
from datetime import datetime, timezone

from openai import OpenAI


class JsonFormatter(logging.Formatter):
    """Format logs as JSON for easier ingestion by monitoring tools."""

    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }

        # Include extra structured fields if they exist on the record.
        for field in [
            "event",
            "request_id",
            "model",
            "latency_ms",
            "status",
            "input_chars",
            "output_chars",
            "error_type",
        ]:
            if hasattr(record, field):
                log_entry[field] = getattr(record, field)

        return json.dumps(log_entry)


def build_logger() -> logging.Logger:
    """Create and configure a JSON logger."""
    logger = logging.getLogger("genai_monitoring")
    logger.setLevel(logging.INFO)

    # Avoid duplicate handlers if re-run in notebooks or interactive sessions.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)

    return logger


def extract_output_text(response) -> str:
    """
    Extract output text from a Responses API object.

    The SDK provides `response.output_text` for convenience.
    """
    return getattr(response, "output_text", "") or ""


def main() -> None:
    """Send a request and log the lifecycle."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY environment variable is not set.")

    client = OpenAI(api_key=api_key)
    logger = build_logger()

    request_id = str(uuid.uuid4())
    model = "gpt-5.4-mini"
    user_prompt = "Summarize why monitoring matters in LLM applications in 3 bullet points."

    logger.info(
        "Starting model request",
        extra={
            "event": "llm_request_started",
            "request_id": request_id,
            "model": model,
            "input_chars": len(user_prompt),
            "status": "started",
        },
    )

    start_time = time.perf_counter()

    try:
        response = client.responses.create(
            model=model,
            input=user_prompt,
        )

        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        output_text = extract_output_text(response)

        logger.info(
            "Model request completed successfully",
            extra={
                "event": "llm_request_completed",
                "request_id": request_id,
                "model": model,
                "latency_ms": latency_ms,
                "status": "success",
                "output_chars": len(output_text),
            },
        )

        print("\n=== Model Output ===")
        print(output_text)
        print("\n=== Summary ===")
        print(f"Request ID : {request_id}")
        print(f"Model      : {model}")
        print(f"Latency ms : {latency_ms}")
        print(f"Chars out  : {len(output_text)}")

    except Exception as exc:
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)

        logger.error(
            f"Model request failed: {exc}",
            extra={
                "event": "llm_request_failed",
                "request_id": request_id,
                "model": model,
                "latency_ms": latency_ms,
                "status": "error",
                "error_type": type(exc).__name__,
            },
        )
        raise


if __name__ == "__main__":
    main()

Example Output

{"timestamp": "2026-03-22T10:00:00.000000+00:00", "level": "INFO", "message": "Starting model request", "event": "llm_request_started", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "status": "started", "input_chars": 72}
{"timestamp": "2026-03-22T10:00:01.200000+00:00", "level": "INFO", "message": "Model request completed successfully", "event": "llm_request_completed", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "latency_ms": 1198.54, "status": "success", "output_chars": 187}

=== Model Output ===
- Monitoring helps detect failures such as malformed outputs, latency spikes, and tool misuse.
- It provides visibility into quality, reliability, and cost trends over time.
- It supports faster debugging and incident response when production issues occur.

=== Summary ===
Request ID : 8b5d1c3b-4d2f-4b2e-9d0a-111111111111
Model      : gpt-5.4-mini
Latency ms : 1198.54
Chars out  : 187

Discussion

Questions to ask after running:

  • Which fields would help most during debugging?
  • What should never be logged?
  • How would you correlate logs across an agent workflow?
  • What happens if you need retries or fallback models?


6. Hands-on Exercise 2: Incident Detection and Response Simulation

Objective

Simulate a small GenAI service that:

  • Sends multiple requests
  • Detects slow or failed requests
  • Produces an incident summary
  • Suggests remediation actions

This exercise demonstrates the basics of incident detection logic.

Scenario

You are operating a support assistant service. Your monitoring thresholds are:

  • Alert if latency > 4000 ms
  • Alert if request fails
  • Alert if output is unexpectedly short for the prompt

Exercise Code

"""
Exercise 2: Simulate monitoring and incident response for a GenAI service.

What this script demonstrates:
- Batch request monitoring
- Threshold-based alerting
- Incident record generation
- Operational summary reporting

Run:
    python exercise2_incident_response.py
"""

import json
import os
import time
import uuid
from dataclasses import dataclass, asdict
from typing import List

from openai import OpenAI


@dataclass
class RequestResult:
    """Represents the outcome of one monitored request."""
    request_id: str
    prompt: str
    latency_ms: float
    status: str
    output_chars: int
    incident_reason: str = ""


def extract_output_text(response) -> str:
    """Extract text content from a Responses API response."""
    return getattr(response, "output_text", "") or ""


def evaluate_incident(result: RequestResult) -> bool:
    """Return True if this request should be flagged as an incident."""
    if result.status != "success":
        return True
    if result.latency_ms > 4000:
        return True
    if result.output_chars < 40:
        return True
    return False


def incident_reason(result: RequestResult) -> str:
    """Generate a human-readable incident reason."""
    reasons = []
    if result.status != "success":
        reasons.append("request_failed")
    if result.latency_ms > 4000:
        reasons.append("high_latency")
    if result.output_chars < 40:
        reasons.append("short_output")
    return ",".join(reasons)


def call_model(client: OpenAI, prompt: str, model: str) -> RequestResult:
    """Make one monitored model request."""
    request_id = str(uuid.uuid4())
    start_time = time.perf_counter()

    try:
        response = client.responses.create(
            model=model,
            input=prompt,
        )
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        output_text = extract_output_text(response)

        return RequestResult(
            request_id=request_id,
            prompt=prompt,
            latency_ms=latency_ms,
            status="success",
            output_chars=len(output_text),
        )
    except Exception:
        latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
        return RequestResult(
            request_id=request_id,
            prompt=prompt,
            latency_ms=latency_ms,
            status="error",
            output_chars=0,
        )


def print_incident_report(results: List[RequestResult]) -> None:
    """Print a concise incident report."""
    total = len(results)
    failures = sum(1 for r in results if r.status != "success")
    incidents = [r for r in results if evaluate_incident(r)]

    print("\n=== Monitoring Summary ===")
    print(f"Total requests : {total}")
    print(f"Failures       : {failures}")
    print(f"Incidents      : {len(incidents)}")

    if results:
        avg_latency = round(sum(r.latency_ms for r in results) / len(results), 2)
        print(f"Avg latency ms : {avg_latency}")

    print("\n=== Incident Details ===")
    if not incidents:
        print("No incidents detected.")
        return

    for result in incidents:
        result.incident_reason = incident_reason(result)
        print(json.dumps(asdict(result), indent=2))

    print("\n=== Suggested Actions ===")
    print("- Check recent deployments or prompt changes.")
    print("- Review latency trends and API health.")
    print("- Inspect failing prompts and response formats.")
    print("- Consider fallback logic or retries for transient failures.")
    print("- Escalate to human review if critical workflows are impacted.")


def main() -> None:
    """Run the incident monitoring simulation."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY environment variable is not set.")

    client = OpenAI(api_key=api_key)
    model = "gpt-5.4-mini"

    prompts = [
        "Explain monitoring in LLM systems in 2 sentences.",
        "List 3 reasons logging is important in AI applications.",
        "Give a one-line explanation of incident response.",
    ]

    results = [call_model(client, prompt, model) for prompt in prompts]
    print_incident_report(results)


if __name__ == "__main__":
    main()

Example Output

=== Monitoring Summary ===
Total requests : 3
Failures       : 0
Incidents      : 1
Avg latency ms : 1520.44

=== Incident Details ===
{
  "request_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
  "prompt": "Give a one-line explanation of incident response.",
  "latency_ms": 1302.18,
  "status": "success",
  "output_chars": 32,
  "incident_reason": "short_output"
}

=== Suggested Actions ===
- Check recent deployments or prompt changes.
- Review latency trends and API health.
- Inspect failing prompts and response formats.
- Consider fallback logic or retries for transient failures.
- Escalate to human review if critical workflows are impacted.

Exercise Extension Ideas

Try modifying the script to:

  • Add retries with exponential backoff
  • Write incident reports to a JSON file
  • Track prompt version in each request
  • Add a fallback model strategy
  • Send alerts to email, Slack, or a webhook
  • Log agent step count and tool usage
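
As a starting point for the first extension, a retry wrapper with exponential backoff and jitter might look like this. `call_model_once` is a placeholder for the real API call, and the delay values are illustrative:

```python
"""Sketch: retry with exponential backoff and jitter for transient failures."""
import random
import time

def call_with_retries(call_model_once, max_attempts=3, base_delay_s=1.0):
    """Retry a callable on exception, doubling the base delay each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model_once()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the original error
            # Jitter spreads retries out so clients don't retry in lockstep.
            delay = base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Example: a stand-in call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(call_with_retries(flaky, base_delay_s=0.01))  # ok
```

In production you would retry only on errors you believe are transient (timeouts, rate limits) and log each retry as a structured event, since rising retry counts are themselves an incident signal.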


7. Incident Response for GenAI Systems

When an incident happens, teams need a repeatable process.

7.1 Incident Response Lifecycle

Detect

Examples:

  • Alert fires on error rate
  • User reports poor output quality
  • Tool failures spike after deployment

Triage

Ask:

  • Is this widespread or isolated?
  • Is the issue model-related, prompt-related, tool-related, or infrastructure-related?
  • Which users or workflows are affected?

Mitigate

Possible mitigations:

  • Roll back a prompt change
  • Disable a failing tool
  • Route traffic to a fallback workflow
  • Increase timeout thresholds temporarily
  • Escalate to human review

Investigate

Gather:

  • Logs
  • Request IDs
  • Prompt versions
  • Error patterns
  • Latency trends
  • Tool execution records

Resolve

Examples:

  • Fix parser assumptions
  • Correct prompt template
  • Add validation to outputs
  • Improve retries or backoff logic
  • Patch tool integration

Review

Perform a postmortem:

  • What happened?
  • Why did detection or mitigation take time?
  • What monitoring was missing?
  • What action items will prevent recurrence?
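
The lifecycle stages above can be captured as a timestamped timeline that feeds directly into the postmortem. This sketch uses a plain dict, and the incident details are invented:

```python
"""Sketch: recording an incident timeline for later postmortem review."""
from datetime import datetime, timezone

incident = {
    "incident_id": "INC-001",
    "summary": "JSON parse failures spiked after prompt change",
    "timeline": [],
}

def record_stage(stage: str, note: str) -> None:
    """Append a timestamped lifecycle entry to the incident record."""
    incident["timeline"].append({
        "stage": stage,
        "note": note,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_stage("detect", "Alert: JSON parse failure rate above threshold")
record_stage("triage", "Isolated to prompt version v14")
record_stage("mitigate", "Rolled back to prompt version v13")
record_stage("resolve", "Fixed schema instructions in v15")
record_stage("review", "Postmortem held; added per-prompt-version parse-rate alert")

for entry in incident["timeline"]:
    print(f'{entry["stage"]:10} {entry["note"]}')
```

Writing the timeline down as events happen, rather than reconstructing it afterwards, makes the "why did detection take time?" question answerable with data.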


8. Common GenAI Incidents and Responses

Output format failures
  Symptoms: JSON parser errors, missing fields
  Likely causes: Prompt drift, model behavior changes
  Initial response: Add validation, retry, stronger schema instructions

Latency spike
  Symptoms: Slow responses, timeouts
  Likely causes: Large prompts, upstream API delay, model overload
  Initial response: Reduce prompt size, add fallback, monitor P95 latency

Cost spike
  Symptoms: Token usage jumps
  Likely causes: Prompt expansion, agent loops, repeated retries
  Initial response: Cap steps, inspect prompt changes, add usage alerts

Tool failure
  Symptoms: Agent cannot complete task
  Likely causes: External API down, auth failure, bad arguments
  Initial response: Disable tool, retry safely, degrade gracefully

Hallucination increase
  Symptoms: More incorrect answers
  Likely causes: Prompt regression, context retrieval issues
  Initial response: Tighten grounding, evaluate prompts, add human review

Rate limits
  Symptoms: API errors and retries
  Likely causes: Traffic burst, inadequate backoff
  Initial response: Add queueing, jittered retries, traffic shaping

9. Best Practices

Logging Best Practices

  • Use structured JSON logs
  • Include correlation/request IDs
  • Log lifecycle events consistently
  • Avoid sensitive data in logs
  • Redact secrets and personally identifiable information
  • Keep messages machine-readable and human-usable

Monitoring Best Practices

  • Track both technical and quality metrics
  • Monitor by model version and prompt version
  • Use dashboards for latency, error rate, and token usage
  • Add alerts with clear thresholds
  • Review trends after every release

Incident Response Best Practices

  • Maintain simple runbooks
  • Keep alert noise low
  • Practice response drills
  • Record incident timelines
  • Conduct blameless postmortems

Agentic System Best Practices

  • Cap max steps
  • Log each tool invocation
  • Track tool inputs/outputs safely
  • Detect loops
  • Add fallback paths for tool failures
  • Use validation before acting on model output
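
Capping steps and detecting loops can be sketched as follows. The tool-call history format (a tuple of tool name and arguments) is an assumption for illustration:

```python
"""Sketch: capping agent steps and detecting repeated tool calls."""

MAX_STEPS = 10

def detect_loop(tool_calls, window=3):
    """Flag a loop if the same (tool, args) call repeats `window` times in a row."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    return all(call == recent[0] for call in recent)

# Simulated agent run: the agent keeps issuing an identical tool call.
history = []
for step in range(1, MAX_STEPS + 1):
    call = ("search_docs", "monitoring")
    history.append(call)
    if detect_loop(history):
        print(f"Loop detected at step {step}; stopping agent early.")
        break
else:
    print("Step cap reached without loop detection.")
```

Real agents vary their arguments slightly between loop iterations, so production detectors often compare normalized or fuzzy-matched calls instead of exact tuples; the step cap remains the hard backstop either way.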

10. Mini Challenge

Spend 5–10 minutes extending one of the exercises.

Challenge Options

  1. Add a prompt_version field to every log entry
  2. Write all request results to a local monitoring_report.json file
  3. Add retry logic for transient failures
  4. Trigger an alert when average latency exceeds a threshold
  5. Simulate an agent workflow with multiple monitored steps

Example: Write Results to a JSON File

"""
Mini challenge: Save monitoring results to a JSON file.
"""

import json
from dataclasses import asdict

def save_results(results, filename="monitoring_report.json"):
    """Save request results to disk as JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump([asdict(r) for r in results], f, indent=2)

# Example usage:
# save_results(results)
# print("Saved monitoring report to monitoring_report.json")

11. Recap

In this session, you learned:

  • Why monitoring is critical in LLM and agentic systems
  • The difference between logs, metrics, traces, and alerts
  • What to monitor in production GenAI applications
  • How to implement structured logging around OpenAI Responses API calls
  • How to detect incidents using latency, failure, and output-quality thresholds
  • How to build a simple incident response process

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Python logging documentation: https://docs.python.org/3/library/logging.html
  • JSON module documentation: https://docs.python.org/3/library/json.html
  • Dataclasses documentation: https://docs.python.org/3/library/dataclasses.html
  • Google SRE book: https://sre.google/sre-book/table-of-contents/
  • OpenTelemetry: https://opentelemetry.io/docs/

Suggested Homework

Build a small monitored GenAI service in Python that:

  • Accepts a user prompt
  • Calls gpt-5.4-mini through the Responses API
  • Logs request IDs, latency, and status
  • Detects short outputs or failed requests
  • Writes a local incident report JSON file
  • Includes a short runbook for what to do when an incident occurs

End of Session

Next, learners can build on this foundation by integrating monitoring into multi-step agents, adding retries, fallback strategies, and quality evaluation pipelines.

