Session 4: Iterative Reliability Improvement

Synopsis

Shows how to use evaluation results to refine prompts, tune retrieval, redesign tools, adjust workflows, and add safeguards. This session reinforces the engineering loop of measure, diagnose, and improve.

Session Content

Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge, learning GenAI and agentic development
Session Goal: Learn how to improve LLM application reliability through iterative refinement, evaluation, structured outputs, and failure analysis.

Learning Objectives

By the end of this session, learners will be able to:

  • Define reliability in the context of GenAI applications
  • Identify common sources of LLM failures
  • Improve outputs using prompt iteration and tighter instructions
  • Use structured outputs to reduce ambiguity
  • Build a small evaluation loop in Python
  • Analyze failures and systematically improve application behavior

Agenda

  1. Reliability in GenAI systems
  2. Common failure modes
  3. Prompt iteration strategies
  4. Structured outputs for consistency
  5. Hands-on: Build a simple reliability evaluation loop
  6. Hands-on: Improve a support-ticket classifier iteratively
  7. Wrap-up and resources

1. Reliability in GenAI Systems

Reliability in traditional software usually means that the same input gives the same correct output every time.

In GenAI systems, reliability is more nuanced. LLM outputs can vary, and “correctness” may depend on:

  • Factual accuracy
  • Instruction following
  • Format compliance
  • Safety constraints
  • Task completion quality

Reliability Dimensions

When building LLM-powered applications, consider these dimensions:

  • Accuracy: Is the answer correct?
  • Consistency: Does the model behave similarly across repeated or similar inputs?
  • Format adherence: Does the output match the expected schema?
  • Robustness: Does the system handle edge cases well?
  • Safety: Does it avoid harmful or disallowed responses?
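
Consistency in particular is easy to quantify: run the same input several times and measure how often the most common answer appears. A minimal sketch (the `runs` list stands in for the outputs of repeated model calls):

```python
from collections import Counter

def consistency(outputs: list[str]) -> float:
    """Fraction of runs that agree with the majority answer."""
    if not outputs:
        return 0.0
    # most_common(1) returns a one-element list of (answer, count).
    (_, count), = Counter(outputs).most_common(1)
    return count / len(outputs)

# e.g. five repeated calls with the same input
runs = ["billing", "billing", "other", "billing", "billing"]
score = consistency(runs)
```

A score well below 1.0 on identical inputs is a reliability signal in its own right, independent of accuracy.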

Key Principle

Reliability is usually improved iteratively, not all at once.

A common workflow:

  1. Define the task
  2. Collect representative examples
  3. Test the current behavior
  4. Analyze failures
  5. Improve prompt/schema/logic
  6. Re-test
  7. Repeat
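
The loop above can be sketched in a few lines of Python. The `classify` function here is a hypothetical stand-in for whatever your application does; the point is the shape of the loop, not the classifier:

```python
def classify(text: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "billing" if "charge" in text.lower() else "other"

def measure(dataset: list[dict]) -> tuple[float, list[dict]]:
    """Steps 3-4 of the loop: test current behavior and collect failures."""
    failures = []
    for row in dataset:
        predicted = classify(row["text"])
        if predicted != row["expected_label"]:
            failures.append({**row, "predicted": predicted})
    accuracy = 1 - len(failures) / len(dataset)
    return accuracy, failures

dataset = [
    {"text": "I was charged twice.", "expected_label": "billing"},
    {"text": "The app crashes on upload.", "expected_label": "technical"},
]
accuracy, failures = measure(dataset)
# Inspect `failures`, adjust `classify` or its prompt, then re-run `measure`.
```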

2. Common Failure Modes

LLM applications commonly fail in predictable ways.

A. Ambiguous Instructions

If your prompt is vague, the model may produce outputs that are plausible but not useful.

Example:

  • Weak prompt: “Classify this ticket.”
  • Better prompt: “Classify this customer support ticket into exactly one of: billing, technical, account, or other.”

B. Output Format Drift

The model may return extra text, omit fields, or produce inconsistent structures.

Example problem:

  • Sometimes returns JSON
  • Sometimes returns prose
  • Sometimes includes explanations when only labels are needed
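
One defensive pattern against format drift is to extract the JSON payload from whatever the model returns, tolerating code fences or surrounding prose. This helper is an illustrative sketch, not part of the exercises below:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Parse a JSON object out of model output that may include
    markdown fences or extra commentary."""
    text = raw.strip()
    # Strip a ```json ... ``` fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # Fall back to the first {...} span in the text (flat objects only).
    if not text.startswith("{"):
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            text = match.group(0)
    return json.loads(text)

extract_json('```json\n{"label": "billing"}\n```')          # tolerates fences
extract_json('Sure! Here you go: {"label": "other"}')       # tolerates prose
```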

C. Edge Case Misclassification

Examples:

  • Billing issue framed as account access
  • Mixed intent in one ticket
  • Very short messages like “help”
  • Sarcasm or unclear language

D. Hallucination

The model may invent facts, details, or policies when asked about things beyond the available context.

E. Overconfidence

A model may provide a confident answer when it should ask for clarification or indicate uncertainty.


3. Prompt Iteration Strategies

Improving reliability often starts with better prompting.

Strategy 1: Make the Task Explicit

Specify:

  • The task
  • Allowed outputs
  • Decision criteria
  • Constraints
  • What to do when uncertain

Weak:

Summarize the message.

Better:

Summarize the customer message in one sentence under 20 words. Do not add information not present in the message.

Strategy 2: Define the Output Space

Constrain valid outputs as much as possible.

Example:

Return exactly one label from:
- billing
- technical
- account
- other

Strategy 3: Add Decision Rules

Tell the model how to choose between confusing categories.

Example:

If the message is about login, password reset, or account access, classify as account.
If the message is about charges, invoices, refunds, or payment methods, classify as billing.
If the message mentions bugs, crashes, performance, or errors, classify as technical.
If none apply, classify as other.
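
The same decision rules can double as a deterministic keyword baseline in Python, which is useful as a cheap sanity check and as a comparison point for the model. The keyword lists below are illustrative, not exhaustive:

```python
# Rules are checked in order, mirroring the prompt's tie-breaking priority.
RULES = [
    ("account", ("login", "password", "account access", "email address", "profile")),
    ("billing", ("charge", "invoice", "refund", "payment", "subscription")),
    ("technical", ("bug", "crash", "performance", "error", "slow")),
]

def classify_by_rules(text: str) -> str:
    """Apply the decision rules in order; fall back to 'other'."""
    lowered = text.lower()
    for label, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return "other"
```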

Strategy 4: Specify Behavior for Uncertainty

Example:

If the ticket is too vague to classify confidently, return "other".

Strategy 5: Separate Reasoning from Final Answer

In production, you may want either:

  • internal reasoning not shown to users, or
  • a structured final answer only

For reliability, output structure matters more than verbose prose.


4. Structured Outputs for Consistency

One of the easiest ways to improve reliability is to require structured output.

Instead of free-form text:

  • Ask for JSON-compatible fields
  • Validate the output in Python
  • Reject or retry invalid results

Why Structured Outputs Help

They reduce:

  • formatting ambiguity
  • downstream parsing errors
  • accidental extra commentary

Example Target Structure

For support ticket classification:

{
  "label": "billing",
  "confidence": 0.92,
  "short_reason": "Customer mentions duplicate charge."
}

Design Tips

Keep schemas:

  • small
  • focused
  • easy to validate
  • aligned with business needs

Avoid asking the model for unnecessary fields if you do not use them.
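
Structured outputs pay off most when paired with validation and a bounded retry. The sketch below assumes a `generate` callable that returns the model's raw text (a real implementation would wrap an API call); everything else is plain Python:

```python
import json
from typing import Callable

ALLOWED_LABELS = {"billing", "technical", "account", "other"}

def validated_classify(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `generate`, validate the JSON result, and retry on failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            result = json.loads(generate())
            if result.get("label") not in ALLOWED_LABELS:
                raise ValueError(f"invalid label: {result.get('label')}")
            return result
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc  # optionally tighten the prompt before retrying
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")

# Simulated model that fails once, then returns valid JSON.
outputs = iter(["not json at all", '{"label": "billing"}'])
result = validated_classify(lambda: next(outputs))
```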


5. Hands-On Exercise 1: Build a Simple Reliability Evaluation Loop

Goal

Create a Python script that:

  • sends several support tickets to gpt-5.4-mini
  • classifies them
  • compares predictions against expected labels
  • prints accuracy and mismatches

Setup

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Example Dataset

We will use a small labeled dataset:

  • billing
  • technical
  • account
  • other

Code: Baseline Classifier Evaluation

"""
Session 4 - Exercise 1
Baseline reliability evaluation for a support-ticket classifier.

This example uses the OpenAI Responses API with the Python SDK and:
- Sends one ticket at a time to the model
- Requests a structured JSON response
- Evaluates predicted labels against expected labels

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
from openai import OpenAI

# Create the API client. The SDK reads OPENAI_API_KEY from the environment.
client = OpenAI()

# A small labeled dataset for evaluation.
DATASET = [
    {
        "text": "I was charged twice for my monthly subscription. Please refund one payment.",
        "expected_label": "billing",
    },
    {
        "text": "The app crashes every time I upload a photo.",
        "expected_label": "technical",
    },
    {
        "text": "I forgot my password and cannot log in to my account.",
        "expected_label": "account",
    },
    {
        "text": "Do you offer student discounts?",
        "expected_label": "other",
    },
    {
        "text": "My invoice shows a payment I do not recognize.",
        "expected_label": "billing",
    },
    {
        "text": "The website is extremely slow and sometimes gives a 500 error.",
        "expected_label": "technical",
    },
    {
        "text": "Please help me change the email address on my profile.",
        "expected_label": "account",
    },
    {
        "text": "Thanks for the great product!",
        "expected_label": "other",
    },
]

ALLOWED_LABELS = {"billing", "technical", "account", "other"}


def classify_ticket(ticket_text: str) -> dict:
    """
    Classify a support ticket into one of the allowed labels.

    Returns:
        dict with keys:
            - label
            - confidence
            - short_reason
    """
    prompt = f"""
You are a support-ticket classifier.

Classify the ticket into exactly one of these labels:
- billing
- technical
- account
- other

Decision rules:
- billing: charges, invoices, refunds, subscriptions, payment methods
- technical: bugs, crashes, errors, performance, broken functionality
- account: login, password reset, email change, account access, profile access
- other: anything else or unclear requests

Return valid JSON with exactly these keys:
- label: one of billing, technical, account, other
- confidence: a number from 0.0 to 1.0
- short_reason: a short explanation under 15 words

Ticket:
{ticket_text}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )

    # The model output is returned as text. We expect valid JSON.
    raw_text = response.output_text.strip()

    # Parse the JSON into a Python dictionary.
    result = json.loads(raw_text)

    # Basic validation for reliability.
    if result["label"] not in ALLOWED_LABELS:
        raise ValueError(f"Invalid label returned: {result['label']}")

    return result


def evaluate(dataset: list[dict]) -> None:
    """
    Evaluate classifier accuracy on the dataset and print mismatches.
    """
    correct = 0
    total = len(dataset)
    mismatches = []

    for row in dataset:
        prediction = classify_ticket(row["text"])
        predicted_label = prediction["label"]
        expected_label = row["expected_label"]

        if predicted_label == expected_label:
            correct += 1
        else:
            mismatches.append(
                {
                    "text": row["text"],
                    "expected": expected_label,
                    "predicted": predicted_label,
                    "reason": prediction.get("short_reason", ""),
                    "confidence": prediction.get("confidence"),
                }
            )

    accuracy = correct / total if total else 0.0

    print("=" * 60)
    print(f"Total examples: {total}")
    print(f"Correct: {correct}")
    print(f"Accuracy: {accuracy:.2%}")
    print("=" * 60)

    if mismatches:
        print("\nMismatches:")
        for i, item in enumerate(mismatches, start=1):
            print(f"\n{i}. Ticket: {item['text']}")
            print(f"   Expected:  {item['expected']}")
            print(f"   Predicted: {item['predicted']}")
            print(f"   Confidence: {item['confidence']}")
            print(f"   Reason: {item['reason']}")
    else:
        print("\nNo mismatches found.")


if __name__ == "__main__":
    evaluate(DATASET)

Example Output

============================================================
Total examples: 8
Correct: 7
Accuracy: 87.50%
============================================================

Mismatches:

1. Ticket: Please help me change the email address on my profile.
   Expected:  account
   Predicted: other
   Confidence: 0.61
   Reason: Request about profile update.

Discussion

This baseline script already teaches several important reliability practices:

  • explicit labels
  • decision rules
  • structured output
  • basic validation
  • evaluation on a test set

But there is still room to improve.


6. Failure Analysis and Iterative Improvement

After evaluation, do not immediately “guess” improvements. First inspect what failed.

Questions to Ask

  • Which labels are most often confused?
  • Are the instructions too broad or too narrow?
  • Do the decision rules cover the edge cases?
  • Is the schema clear?
  • Are there ambiguous examples in the dataset?

Example Failure

Ticket:

“Please help me change the email address on my profile.”

If this was incorrectly labeled as other, possible causes:

  • Prompt does not strongly associate profile/email changes with account
  • Model sees “profile” as generic rather than account-related
  • Rules need a more explicit tie-breaker

Prompt Revision Approach

We can improve the prompt by:

  • emphasizing account-management requests
  • adding examples or stronger rules
  • making other only a fallback

7. Hands-On Exercise 2: Iteratively Improve the Classifier

Goal

Update the prompt and evaluation loop to improve reliability.

Improvements We Will Add

  • stronger and more explicit category rules
  • a stricter output format
  • better fallback handling
  • evaluation reporting by label

Code: Improved Classifier with Better Reliability Reporting

"""
Session 4 - Exercise 2
Iteratively improve classifier reliability with:
- stronger prompt instructions
- explicit tie-breaking rules
- stricter validation
- per-label evaluation reporting

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
from collections import Counter, defaultdict
from openai import OpenAI

client = OpenAI()

DATASET = [
    {
        "text": "I was charged twice for my monthly subscription. Please refund one payment.",
        "expected_label": "billing",
    },
    {
        "text": "The app crashes every time I upload a photo.",
        "expected_label": "technical",
    },
    {
        "text": "I forgot my password and cannot log in to my account.",
        "expected_label": "account",
    },
    {
        "text": "Do you offer student discounts?",
        "expected_label": "other",
    },
    {
        "text": "My invoice shows a payment I do not recognize.",
        "expected_label": "billing",
    },
    {
        "text": "The website is extremely slow and sometimes gives a 500 error.",
        "expected_label": "technical",
    },
    {
        "text": "Please help me change the email address on my profile.",
        "expected_label": "account",
    },
    {
        "text": "Thanks for the great product!",
        "expected_label": "other",
    },
    {
        "text": "I cannot access my profile after resetting my password.",
        "expected_label": "account",
    },
    {
        "text": "Your latest update introduced a bug in notifications.",
        "expected_label": "technical",
    },
]

ALLOWED_LABELS = {"billing", "technical", "account", "other"}


def classify_ticket_improved(ticket_text: str) -> dict:
    """
    Improved classification prompt with stricter instructions and tie-breaking rules.
    """
    prompt = f"""
You are a highly reliable support-ticket classifier.

Your task:
Classify the ticket into exactly one label from this list:
- billing
- technical
- account
- other

Definitions:
- billing: charges, invoices, refunds, subscriptions, payment methods, duplicate charges
- technical: bugs, crashes, errors, slow performance, broken features, server failures
- account: login issues, password resets, account access, email changes, profile changes, identity/access management
- other: product questions, compliments, sales questions, unclear requests, anything not covered above

Important rules:
1. Return exactly one label.
2. If a ticket involves account access, login, password reset, email change, or profile change, choose account.
3. If a ticket mentions software malfunction, crash, bug, slowness, or error codes, choose technical.
4. If a ticket mentions payment, invoice, subscription, charge, refund, or billing statement, choose billing.
5. Use other only if none of the above clearly apply.
6. Do not invent details not present in the ticket.

Return valid JSON only with exactly these keys:
- label
- confidence
- short_reason

Additional output requirements:
- label must be one of: billing, technical, account, other
- confidence must be a number between 0 and 1
- short_reason must be under 12 words

Ticket:
{ticket_text}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )

    raw_text = response.output_text.strip()
    result = json.loads(raw_text)

    validate_result(result)
    return result


def validate_result(result: dict) -> None:
    """
    Validate the model output to catch schema or value issues early.
    """
    required_keys = {"label", "confidence", "short_reason"}

    if set(result.keys()) != required_keys:
        raise ValueError(f"Unexpected keys in result: {set(result.keys())}")

    if result["label"] not in ALLOWED_LABELS:
        raise ValueError(f"Invalid label: {result['label']}")

    if not isinstance(result["confidence"], (int, float)):
        raise TypeError("confidence must be numeric")

    if not (0.0 <= float(result["confidence"]) <= 1.0):
        raise ValueError("confidence must be between 0 and 1")

    if not isinstance(result["short_reason"], str):
        raise TypeError("short_reason must be a string")


def evaluate(dataset: list[dict]) -> None:
    """
    Evaluate the improved classifier and print:
    - overall accuracy
    - per-label counts
    - confusion summary
    - mismatches
    """
    correct = 0
    total = len(dataset)

    expected_counts = Counter()
    correct_counts = Counter()
    confusion = defaultdict(Counter)
    mismatches = []

    for row in dataset:
        expected = row["expected_label"]
        expected_counts[expected] += 1

        prediction = classify_ticket_improved(row["text"])
        predicted = prediction["label"]

        confusion[expected][predicted] += 1

        if predicted == expected:
            correct += 1
            correct_counts[expected] += 1
        else:
            mismatches.append(
                {
                    "text": row["text"],
                    "expected": expected,
                    "predicted": predicted,
                    "confidence": prediction["confidence"],
                    "reason": prediction["short_reason"],
                }
            )

    accuracy = correct / total if total else 0.0

    print("=" * 60)
    print("OVERALL RESULTS")
    print("=" * 60)
    print(f"Total examples: {total}")
    print(f"Correct: {correct}")
    print(f"Accuracy: {accuracy:.2%}")

    print("\n" + "=" * 60)
    print("PER-LABEL RESULTS")
    print("=" * 60)

    for label in sorted(ALLOWED_LABELS):
        total_for_label = expected_counts[label]
        correct_for_label = correct_counts[label]
        label_accuracy = (correct_for_label / total_for_label) if total_for_label else 0.0
        print(
            f"{label:10s} | total={total_for_label:2d} | "
            f"correct={correct_for_label:2d} | accuracy={label_accuracy:.2%}"
        )

    print("\n" + "=" * 60)
    print("CONFUSION SUMMARY")
    print("=" * 60)
    for expected_label in sorted(confusion.keys()):
        print(f"{expected_label}: {dict(confusion[expected_label])}")

    if mismatches:
        print("\n" + "=" * 60)
        print("MISMATCHES")
        print("=" * 60)
        for i, item in enumerate(mismatches, start=1):
            print(f"\n{i}. Ticket: {item['text']}")
            print(f"   Expected:   {item['expected']}")
            print(f"   Predicted:  {item['predicted']}")
            print(f"   Confidence: {item['confidence']}")
            print(f"   Reason:     {item['reason']}")
    else:
        print("\nNo mismatches found.")


if __name__ == "__main__":
    evaluate(DATASET)

Example Output

============================================================
OVERALL RESULTS
============================================================
Total examples: 10
Correct: 10
Accuracy: 100.00%

============================================================
PER-LABEL RESULTS
============================================================
account    | total= 3 | correct= 3 | accuracy=100.00%
billing    | total= 2 | correct= 2 | accuracy=100.00%
other      | total= 2 | correct= 2 | accuracy=100.00%
technical  | total= 3 | correct= 3 | accuracy=100.00%

============================================================
CONFUSION SUMMARY
============================================================
account: {'account': 3}
billing: {'billing': 2}
other: {'other': 2}
technical: {'technical': 3}

No mismatches found.

8. Best Practices for Iterative Reliability Improvement

A. Use Representative Test Sets

Your evaluation data should include:

  • common cases
  • edge cases
  • ambiguous cases
  • short inputs
  • messy real-world inputs

B. Improve One Thing at a Time

When reliability changes, you want to know why.

Good candidates for iteration:

  • prompt wording
  • output schema
  • preprocessing
  • postprocessing validation
  • retry logic
  • business rules

C. Log Failures

Store:

  • input
  • model output
  • expected result
  • mismatch type
  • timestamp
  • prompt version

This makes improvements measurable.
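
A simple append-only JSONL log covers all of these fields. The field names here are a suggestion; adapt them to your own pipeline:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_failure(path: Path, record: dict, prompt_version: str) -> None:
    """Append one failure record as a JSON line."""
    entry = {
        **record,  # expected: input, model output, expected result, mismatch type
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_path = Path("failures.jsonl")
log_failure(
    log_path,
    {"input": "help", "output": "technical", "expected": "other", "mismatch": "wrong_label"},
    prompt_version="v2",
)
```

Because each line is independent JSON, the log can be grepped, diffed between prompt versions, or loaded into pandas for analysis.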

D. Prefer Narrow Tasks

A narrowly defined task is easier to make reliable than a broad one.

E. Validate Outputs

Always validate model outputs before using them in downstream systems.

Validation examples:

  • required keys present
  • values from allowed set
  • numeric fields in range
  • strings under length limits

9. Optional Extension Exercise

Task

Extend the classifier to support a second field:

  • needs_human_review: true/false

Suggested Rule

Set needs_human_review to true when:

  • confidence is below 0.70
  • the message is too vague
  • the message contains multiple conflicting intents

Starter Prompt Addition

Also return:
- needs_human_review: true if the ticket is ambiguous, vague, or low confidence

Starter Validation Logic

if not isinstance(result["needs_human_review"], bool):
    raise TypeError("needs_human_review must be a boolean")

This is a realistic pattern in agentic systems: the model decides whether to continue automatically or escalate.
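
One way to complete the extension is to derive the escalation decision in Python rather than trusting the model alone, combining the model's flag with the confidence threshold. This sketch assumes the classifier already returns the three original keys plus `needs_human_review`:

```python
def should_escalate(result: dict, threshold: float = 0.70) -> bool:
    """Escalate when the model flags the ticket or confidence is low."""
    if not isinstance(result.get("needs_human_review"), bool):
        raise TypeError("needs_human_review must be a boolean")
    return result["needs_human_review"] or float(result["confidence"]) < threshold

should_escalate({"label": "other", "confidence": 0.55, "needs_human_review": False})    # low confidence
should_escalate({"label": "billing", "confidence": 0.95, "needs_human_review": False})  # auto-continue
```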


10. Wrap-Up

Key Takeaways

  • Reliability is improved iteratively through testing and refinement
  • Prompt clarity has a major impact on output quality
  • Structured outputs make LLM systems easier to validate and trust
  • Evaluation loops help you measure progress instead of guessing
  • Failure analysis is essential for systematic improvement

Practical Reliability Loop

Use this cycle in your projects:

  1. Define the task precisely
  2. Build a small labeled test set
  3. Measure baseline performance
  4. Inspect failures
  5. Tighten prompt/schema/rules
  6. Re-run evaluation
  7. Repeat until acceptable

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • JSON module in Python: https://docs.python.org/3/library/json.html
  • Python collections module: https://docs.python.org/3/library/collections.html

Suggested Instructor Notes

Theory Time Allocation

  • Reliability concepts: 10 minutes
  • Failure modes: 8 minutes
  • Prompt iteration and structure: 10 minutes

Hands-On Time Allocation

  • Exercise 1 baseline evaluation: 8 minutes
  • Exercise 2 iterative improvement: 7 minutes
  • Discussion and recap: 2 minutes

End of Session

You now have a repeatable workflow for improving LLM application reliability: define, evaluate, inspect, refine, validate, repeat.

In the next session, this iterative mindset can be extended to more agentic workflows involving planning, tool use, and recovery from failure.

