Session 3: Monitoring, Logging, and Incident Response
Synopsis
Shows how to track application health, model behavior, tool failures, user interactions, and abnormal events in production. Learners gain the operational visibility needed to maintain trust and service quality.
Session Content
Session 3: Monitoring, Logging, and Incident Response
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge learning GenAI and agentic development
Session Overview
In this session, you will learn how to observe, debug, and respond to failures in GenAI applications and agentic systems. Monitoring and incident response are essential because LLM-powered systems can fail in ways that differ from traditional software: malformed outputs, prompt regressions, unexpected latency, tool misuse, cost spikes, and downstream integration errors.
By the end of this session, you will be able to:
- Explain the role of monitoring in GenAI systems
- Distinguish between logs, metrics, traces, and alerts
- Add structured logging to Python applications using the OpenAI Responses API
- Capture latency, token usage, and failure signals
- Build simple incident-response workflows for GenAI failures
- Debug common production issues in LLM applications
Learning Objectives
After this session, learners should be able to:
- Define key observability concepts for GenAI systems
- Instrument Python code with structured logs
- Track request/response metadata for OpenAI API calls
- Detect common issues such as retries, malformed outputs, and latency spikes
- Create a basic incident response checklist for LLM applications
- Practice debugging and remediation through hands-on exercises
Agenda
- Why monitoring matters for GenAI systems
- Core observability concepts: logs, metrics, traces, alerts
- What to monitor in LLM and agentic applications
- Logging and monitoring with Python
- Hands-on Exercise 1: Structured logging around Responses API calls
- Hands-on Exercise 2: Incident detection and response simulation
- Production best practices
- Useful resources
1. Why Monitoring Matters for GenAI Systems
Traditional applications often fail in deterministic ways. GenAI applications introduce probabilistic behavior and new failure modes:
- The model may return valid text that is semantically wrong
- Latency may vary widely depending on prompt size and task complexity
- Costs may spike due to excessive token use
- Structured outputs may break parsers
- Agents may loop, call tools incorrectly, or make poor decisions
- Prompt or model changes can silently degrade quality
Common Failure Categories
1. Model Output Failures
- Hallucinations
- Incorrect formatting
- Missing required fields
- Unsafe or policy-violating responses
2. Operational Failures
- API timeouts
- Rate limits
- Authentication errors
- Network failures
3. Agentic Failures
- Tool call errors
- Infinite or excessive loops
- Bad planning
- Incorrect tool selection
- State corruption between steps
4. Business Failures
- Higher cost per request
- User dissatisfaction
- Increased abandonment
- SLA breaches
Monitoring helps teams answer:
- Is the system healthy?
- Are users getting acceptable responses?
- Are costs under control?
- Did a recent deployment degrade performance?
- What happened during an incident?
2. Core Observability Concepts
2.1 Logs
Logs are timestamped records of events.
Examples: - A request was sent to the model - A response was received - Parsing failed - A retry occurred
Logs are best for: - Debugging - Auditing - Post-incident analysis
2.2 Metrics
Metrics are numerical values tracked over time.
Examples: - Request count - Error rate - Average latency - P95 latency - Tokens per request - Cost per request - Tool call success rate
Metrics are best for: - Dashboards - Threshold alerts - Trend analysis
2.3 Traces
Traces show the full lifecycle of a request across components.
For agentic applications, a trace might include: - User message received - Prompt construction - LLM request - Tool selection - Tool execution - Follow-up LLM call - Final response sent
Traces are best for: - Multi-step debugging - Distributed systems - Agent workflows
2.4 Alerts
Alerts notify you when a system crosses a threshold or displays abnormal behavior.
Examples: - Error rate > 5% - P95 latency > 8 seconds - Token usage doubles after deployment - Tool call failures exceed threshold - JSON parse failures spike
Alerts should be: - Actionable - Specific - Low-noise - Mapped to a runbook or response plan
3. What to Monitor in LLM and Agentic Applications
3.1 API Health Metrics
Track: - Request count - Success/failure count - Retry count - Timeout count - Rate-limit events - Latency per request
3.2 Model Usage Metrics
Track: - Input tokens - Output tokens - Total tokens - Estimated cost - Prompt sizes - Completion sizes
3.3 Quality Signals
Track: - Structured output parse success/failure - Human feedback score - User re-ask rate - Escalation rate to human review - Prompt version performance
3.4 Agent-Specific Metrics
Track: - Tool call count per task - Tool error rate - Steps per task - Loop detection count - Final task success/failure - Time spent in each agent step
3.5 Security and Compliance Signals
Track: - Prompt injection attempts - Sensitive data detection - Abuse patterns - Unauthorized tool access attempts
4. Logging and Monitoring with Python
A practical monitoring approach starts simple:
- Use structured logs in JSON format
- Add request IDs for correlation
- Measure latency
- Capture response metadata
- Record exceptions with enough context
- Avoid logging sensitive data
Example Structured Log Fields
timestampleveleventrequest_idmodellatency_msinput_charsoutput_charsstatuserror_typeuser_idor anonymized session ID
5. Hands-on Exercise 1: Structured Logging Around Responses API Calls
Objective
Build a Python script that: - Sends a request using the OpenAI Responses API - Logs request lifecycle events in structured JSON - Measures latency - Captures errors safely - Prints a concise summary
Prerequisites
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Exercise Code
"""
Exercise 1: Structured logging for OpenAI Responses API calls.
What this script demonstrates:
- Structured JSON logging
- Request correlation with a request_id
- Latency measurement
- Basic error handling
- Safe logging practices
Run:
python exercise1_logging.py
"""
import json
import logging
import os
import sys
import time
import uuid
from datetime import datetime, timezone
from openai import OpenAI
class JsonFormatter(logging.Formatter):
"""Format logs as JSON for easier ingestion by monitoring tools."""
def format(self, record: logging.LogRecord) -> str:
log_entry = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"level": record.levelname,
"message": record.getMessage(),
}
# Include extra structured fields if they exist on the record.
for field in [
"event",
"request_id",
"model",
"latency_ms",
"status",
"input_chars",
"output_chars",
"error_type",
]:
if hasattr(record, field):
log_entry[field] = getattr(record, field)
return json.dumps(log_entry)
def build_logger() -> logging.Logger:
"""Create and configure a JSON logger."""
logger = logging.getLogger("genai_monitoring")
logger.setLevel(logging.INFO)
# Avoid duplicate handlers if re-run in notebooks or interactive sessions.
if not logger.handlers:
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
return logger
def extract_output_text(response) -> str:
"""
Extract output text from a Responses API object.
The SDK provides `response.output_text` for convenience.
"""
return getattr(response, "output_text", "") or ""
def main() -> None:
"""Send a request and log the lifecycle."""
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise RuntimeError("OPENAI_API_KEY environment variable is not set.")
client = OpenAI(api_key=api_key)
logger = build_logger()
request_id = str(uuid.uuid4())
model = "gpt-5.4-mini"
user_prompt = "Summarize why monitoring matters in LLM applications in 3 bullet points."
logger.info(
"Starting model request",
extra={
"event": "llm_request_started",
"request_id": request_id,
"model": model,
"input_chars": len(user_prompt),
"status": "started",
},
)
start_time = time.perf_counter()
try:
response = client.responses.create(
model=model,
input=user_prompt,
)
latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
output_text = extract_output_text(response)
logger.info(
"Model request completed successfully",
extra={
"event": "llm_request_completed",
"request_id": request_id,
"model": model,
"latency_ms": latency_ms,
"status": "success",
"output_chars": len(output_text),
},
)
print("\n=== Model Output ===")
print(output_text)
print("\n=== Summary ===")
print(f"Request ID : {request_id}")
print(f"Model : {model}")
print(f"Latency ms : {latency_ms}")
print(f"Chars out : {len(output_text)}")
except Exception as exc:
latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
logger.error(
f"Model request failed: {exc}",
extra={
"event": "llm_request_failed",
"request_id": request_id,
"model": model,
"latency_ms": latency_ms,
"status": "error",
"error_type": type(exc).__name__,
},
)
raise
if __name__ == "__main__":
main()
Example Output
{"timestamp": "2026-03-22T10:00:00.000000+00:00", "level": "INFO", "message": "Starting model request", "event": "llm_request_started", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "status": "started", "input_chars": 66}
{"timestamp": "2026-03-22T10:00:01.200000+00:00", "level": "INFO", "message": "Model request completed successfully", "event": "llm_request_completed", "request_id": "8b5d1c3b-4d2f-4b2e-9d0a-111111111111", "model": "gpt-5.4-mini", "latency_ms": 1198.54, "status": "success", "output_chars": 187}
=== Model Output ===
- Monitoring helps detect failures such as malformed outputs, latency spikes, and tool misuse.
- It provides visibility into quality, reliability, and cost trends over time.
- It supports faster debugging and incident response when production issues occur.
=== Summary ===
Request ID : 8b5d1c3b-4d2f-4b2e-9d0a-111111111111
Model : gpt-5.4-mini
Latency ms : 1198.54
Chars out : 187
Discussion
Questions to ask after running: - Which fields would help most during debugging? - What should never be logged? - How would you correlate logs across an agent workflow? - What happens if you need retries or fallback models?
6. Hands-on Exercise 2: Incident Detection and Response Simulation
Objective
Simulate a small GenAI service that: - Sends multiple requests - Detects slow or failed requests - Produces an incident summary - Suggests remediation actions
This exercise demonstrates the basics of incident detection logic.
Scenario
You are operating a support assistant service. Your monitoring thresholds are:
- Alert if latency > 4000 ms
- Alert if request fails
- Alert if output is unexpectedly short for the prompt
Exercise Code
"""
Exercise 2: Simulate monitoring and incident response for a GenAI service.
What this script demonstrates:
- Batch request monitoring
- Threshold-based alerting
- Incident record generation
- Operational summary reporting
Run:
python exercise2_incident_response.py
"""
import json
import os
import time
import uuid
from dataclasses import dataclass, asdict
from typing import List
from openai import OpenAI
@dataclass
class RequestResult:
"""Represents the outcome of one monitored request."""
request_id: str
prompt: str
latency_ms: float
status: str
output_chars: int
incident_reason: str = ""
def extract_output_text(response) -> str:
"""Extract text content from a Responses API response."""
return getattr(response, "output_text", "") or ""
def evaluate_incident(result: RequestResult) -> bool:
"""Return True if this request should be flagged as an incident."""
if result.status != "success":
return True
if result.latency_ms > 4000:
return True
if result.output_chars < 40:
return True
return False
def incident_reason(result: RequestResult) -> str:
"""Generate a human-readable incident reason."""
reasons = []
if result.status != "success":
reasons.append("request_failed")
if result.latency_ms > 4000:
reasons.append("high_latency")
if result.output_chars < 40:
reasons.append("short_output")
return ",".join(reasons)
def call_model(client: OpenAI, prompt: str, model: str) -> RequestResult:
"""Make one monitored model request."""
request_id = str(uuid.uuid4())
start_time = time.perf_counter()
try:
response = client.responses.create(
model=model,
input=prompt,
)
latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
output_text = extract_output_text(response)
return RequestResult(
request_id=request_id,
prompt=prompt,
latency_ms=latency_ms,
status="success",
output_chars=len(output_text),
)
except Exception:
latency_ms = round((time.perf_counter() - start_time) * 1000, 2)
return RequestResult(
request_id=request_id,
prompt=prompt,
latency_ms=latency_ms,
status="error",
output_chars=0,
)
def print_incident_report(results: List[RequestResult]) -> None:
"""Print a concise incident report."""
total = len(results)
failures = sum(1 for r in results if r.status != "success")
incidents = [r for r in results if evaluate_incident(r)]
print("\n=== Monitoring Summary ===")
print(f"Total requests : {total}")
print(f"Failures : {failures}")
print(f"Incidents : {len(incidents)}")
if results:
avg_latency = round(sum(r.latency_ms for r in results) / len(results), 2)
print(f"Avg latency ms : {avg_latency}")
print("\n=== Incident Details ===")
if not incidents:
print("No incidents detected.")
return
for result in incidents:
result.incident_reason = incident_reason(result)
print(json.dumps(asdict(result), indent=2))
print("\n=== Suggested Actions ===")
print("- Check recent deployments or prompt changes.")
print("- Review latency trends and API health.")
print("- Inspect failing prompts and response formats.")
print("- Consider fallback logic or retries for transient failures.")
print("- Escalate to human review if critical workflows are impacted.")
def main() -> None:
"""Run the incident monitoring simulation."""
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise RuntimeError("OPENAI_API_KEY environment variable is not set.")
client = OpenAI(api_key=api_key)
model = "gpt-5.4-mini"
prompts = [
"Explain monitoring in LLM systems in 2 sentences.",
"List 3 reasons logging is important in AI applications.",
"Give a one-line explanation of incident response.",
]
results = [call_model(client, prompt, model) for prompt in prompts]
print_incident_report(results)
if __name__ == "__main__":
main()
Example Output
=== Monitoring Summary ===
Total requests : 3
Failures : 0
Incidents : 1
Avg latency ms : 1520.44
=== Incident Details ===
{
"request_id": "a1b2c3d4-e5f6-7890-abcd-1234567890ab",
"prompt": "Give a one-line explanation of incident response.",
"latency_ms": 1302.18,
"status": "success",
"output_chars": 32,
"incident_reason": "short_output"
}
=== Suggested Actions ===
- Check recent deployments or prompt changes.
- Review latency trends and API health.
- Inspect failing prompts and response formats.
- Consider fallback logic or retries for transient failures.
- Escalate to human review if critical workflows are impacted.
Exercise Extension Ideas
Try modifying the script to: - Add retries with exponential backoff - Write incident reports to a JSON file - Track prompt version in each request - Add a fallback model strategy - Send alerts to email, Slack, or a webhook - Log agent step count and tool usage
7. Incident Response for GenAI Systems
When an incident happens, teams need a repeatable process.
7.1 Incident Response Lifecycle
Detect
Examples: - Alert fires on error rate - User reports poor output quality - Tool failures spike after deployment
Triage
Ask: - Is this widespread or isolated? - Is the issue model-related, prompt-related, tool-related, or infrastructure-related? - Which users or workflows are affected?
Mitigate
Possible mitigations: - Roll back a prompt change - Disable a failing tool - Route traffic to a fallback workflow - Increase timeout thresholds temporarily - Escalate to human review
Investigate
Gather: - Logs - Request IDs - Prompt versions - Error patterns - Latency trends - Tool execution records
Resolve
Examples: - Fix parser assumptions - Correct prompt template - Add validation to outputs - Improve retries or backoff logic - Patch tool integration
Review
Perform a postmortem: - What happened? - Why did detection or mitigation take time? - What monitoring was missing? - What action items will prevent recurrence?
8. Common GenAI Incidents and Responses
| Incident | Symptoms | Likely Causes | Initial Response |
|---|---|---|---|
| Output format failures | JSON parser errors, missing fields | Prompt drift, model behavior changes | Add validation, retry, stronger schema instructions |
| Latency spike | Slow responses, timeouts | Large prompts, upstream API delay, model overload | Reduce prompt size, add fallback, monitor P95 latency |
| Cost spike | Token usage jumps | Prompt expansion, agent loops, repeated retries | Cap steps, inspect prompt changes, add usage alerts |
| Tool failure | Agent cannot complete task | External API down, auth failure, bad arguments | Disable tool, retry safely, degrade gracefully |
| Hallucination increase | More incorrect answers | Prompt regression, context retrieval issues | Tighten grounding, evaluate prompts, add human review |
| Rate limits | API errors and retries | Traffic burst, inadequate backoff | Add queueing, jittered retries, traffic shaping |
9. Best Practices
Logging Best Practices
- Use structured JSON logs
- Include correlation/request IDs
- Log lifecycle events consistently
- Avoid sensitive data in logs
- Redact secrets and personally identifiable information
- Keep messages machine-readable and human-usable
Monitoring Best Practices
- Track both technical and quality metrics
- Monitor by model version and prompt version
- Use dashboards for latency, error rate, and token usage
- Add alerts with clear thresholds
- Review trends after every release
Incident Response Best Practices
- Maintain simple runbooks
- Keep alert noise low
- Practice response drills
- Record incident timelines
- Conduct blameless postmortems
Agentic System Best Practices
- Cap max steps
- Log each tool invocation
- Track tool inputs/outputs safely
- Detect loops
- Add fallback paths for tool failures
- Use validation before acting on model output
10. Mini Challenge
Spend 5–10 minutes extending one of the exercises.
Challenge Options
- Add a
prompt_versionfield to every log entry - Write all request results to a local
monitoring_report.jsonfile - Add retry logic for transient failures
- Trigger an alert when average latency exceeds a threshold
- Simulate an agent workflow with multiple monitored steps
Example: Write Results to a JSON File
"""
Mini challenge: Save monitoring results to a JSON file.
"""
import json
from dataclasses import asdict
def save_results(results, filename="monitoring_report.json"):
"""Save request results to disk as JSON."""
with open(filename, "w", encoding="utf-8") as f:
json.dump([asdict(r) for r in results], f, indent=2)
# Example usage:
# save_results(results)
# print("Saved monitoring report to monitoring_report.json")
11. Recap
In this session, you learned:
- Why monitoring is critical in LLM and agentic systems
- The difference between logs, metrics, traces, and alerts
- What to monitor in production GenAI applications
- How to implement structured logging around OpenAI Responses API calls
- How to detect incidents using latency, failure, and output-quality thresholds
- How to build a simple incident response process
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Python logging documentation: https://docs.python.org/3/library/logging.html
- JSON module documentation: https://docs.python.org/3/library/json.html
- Dataclasses documentation: https://docs.python.org/3/library/dataclasses.html
- Google SRE book: https://sre.google/sre-book/table-of-contents/
- OpenTelemetry: https://opentelemetry.io/docs/
Suggested Homework
Build a small monitored GenAI service in Python that:
- Accepts a user prompt
- Calls
gpt-5.4-minithrough the Responses API - Logs request IDs, latency, and status
- Detects short outputs or failed requests
- Writes a local incident report JSON file
- Includes a short runbook for what to do when an incident occurs
End of Session
Next, learners can build on this foundation by integrating monitoring into multi-step agents, adding retries, fallback strategies, and quality evaluation pipelines.
Back to Chapter | Back to Master Plan | Previous Session | Next Session