Session 4: Iterative Reliability Improvement
Synopsis
Shows how to use evaluation results to refine prompts, tune retrieval, redesign tools, adjust workflows, and add safeguards. This session reinforces the engineering loop of measure, diagnose, and improve.
Session Content
Session 4: Iterative Reliability Improvement
Session Overview
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge, learning GenAI and agentic development
Session Goal: Learn how to improve LLM application reliability through iterative refinement, evaluation, structured outputs, and failure analysis.
Learning Objectives
By the end of this session, learners will be able to:
- Define reliability in the context of GenAI applications
- Identify common sources of LLM failures
- Improve outputs using prompt iteration and tighter instructions
- Use structured outputs to reduce ambiguity
- Build a small evaluation loop in Python
- Analyze failures and systematically improve application behavior
Agenda
- Reliability in GenAI systems
- Common failure modes
- Prompt iteration strategies
- Structured outputs for consistency
- Hands-on: Build a simple reliability evaluation loop
- Hands-on: Improve a support-ticket classifier iteratively
- Wrap-up and resources
1. Reliability in GenAI Systems
Reliability in traditional software usually means that the same input gives the same correct output every time.
In GenAI systems, reliability is more nuanced. LLM outputs can vary, and “correctness” may depend on:
- Factual accuracy
- Instruction following
- Format compliance
- Safety constraints
- Task completion quality
Reliability Dimensions
When building LLM-powered applications, consider these dimensions:
- Accuracy: Is the answer correct?
- Consistency: Does the model behave similarly across repeated or similar inputs?
- Format adherence: Does the output match the expected schema?
- Robustness: Does the system handle edge cases well?
- Safety: Does it avoid harmful or disallowed responses?
Key Principle
Reliability is usually improved iteratively, not all at once.
A common workflow:
- Define the task
- Collect representative examples
- Test the current behavior
- Analyze failures
- Improve prompt/schema/logic
- Re-test
- Repeat
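This workflow can be sketched as a skeleton loop. This is a hedged illustration, not a prescribed implementation; `run_model` and `improve` are hypothetical callables you would supply:

```python
def reliability_loop(examples, run_model, improve, target=0.9, max_rounds=5):
    """Measure, analyze failures, improve, and re-test until the target is met."""
    accuracy = 0.0
    for round_num in range(1, max_rounds + 1):
        # Test current behavior against the labeled examples.
        failures = [ex for ex in examples
                    if run_model(ex["text"]) != ex["expected_label"]]
        accuracy = 1 - len(failures) / len(examples)
        print(f"Round {round_num}: accuracy={accuracy:.0%}")
        if accuracy >= target:
            break
        # Analyze failures and refine prompt/schema/logic, then re-test.
        run_model = improve(run_model, failures)
    return accuracy
```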
2. Common Failure Modes
LLM applications commonly fail in predictable ways.
A. Ambiguous Instructions
If your prompt is vague, the model may produce outputs that are plausible but not useful.
Example:
- Weak prompt: “Classify this ticket.”
- Better prompt: “Classify this customer support ticket into exactly one of: billing, technical, account, or other.”
B. Output Format Drift
The model may return extra text, omit fields, or produce inconsistent structures.
Example problem:
- Sometimes returns JSON
- Sometimes returns prose
- Sometimes includes explanations when only labels are needed
C. Edge Case Misclassification
Examples:
- Billing issue framed as account access
- Mixed intent in one ticket
- Very short messages like “help”
- Sarcasm or unclear language
D. Hallucination
The model may invent facts, details, or policies if asked beyond available context.
E. Overconfidence
A model may provide a confident answer when it should ask for clarification or indicate uncertainty.
3. Prompt Iteration Strategies
Improving reliability often starts with better prompting.
Strategy 1: Make the Task Explicit
Specify:
- The task
- Allowed outputs
- Decision criteria
- Constraints
- What to do when uncertain
Weak:
Summarize the message.
Better:
Summarize the customer message in one sentence under 20 words. Do not add information not present in the message.
Strategy 2: Define the Output Space
Constrain valid outputs as much as possible.
Example:
Return exactly one label from:
- billing
- technical
- account
- other
Strategy 3: Add Decision Rules
Tell the model how to choose between confusing categories.
Example:
If the message is about login, password reset, or account access, classify as account.
If the message is about charges, invoices, refunds, or payment methods, classify as billing.
If the message mentions bugs, crashes, performance, or errors, classify as technical.
If none apply, classify as other.
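These rules can also be mirrored in plain Python as a deterministic sanity check on model outputs. This is an illustrative sketch; the keyword lists and the `keyword_label` helper are assumptions, not part of the exercise code:

```python
def keyword_label(text: str) -> str:
    """Apply the decision rules above with simple keyword matching."""
    t = text.lower()
    if any(w in t for w in ("login", "password", "account access")):
        return "account"
    if any(w in t for w in ("charge", "invoice", "refund", "payment")):
        return "billing"
    if any(w in t for w in ("bug", "crash", "performance", "error")):
        return "technical"
    return "other"

print(keyword_label("I was charged twice this month."))  # billing
```

Comparing such a rule-based baseline against the model's labels can reveal which category boundaries the prompt leaves fuzzy.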
Strategy 4: Specify Behavior for Uncertainty
Example:
If the ticket is too vague to classify confidently, return "other".
Strategy 5: Separate Reasoning from Final Answer
In production, you may want either:
- internal reasoning not shown to users, or
- a structured final answer only
For reliability, output structure matters more than verbose prose.
4. Structured Outputs for Consistency
One of the easiest ways to improve reliability is to require structured output.
Instead of free-form text:
- Ask for JSON-compatible fields
- Validate the output in Python
- Reject or retry invalid results
Why Structured Outputs Help
They reduce:
- formatting ambiguity
- downstream parsing errors
- accidental extra commentary
Example Target Structure
For support ticket classification:
{
  "label": "billing",
  "confidence": 0.92,
  "short_reason": "Customer mentions duplicate charge."
}
Design Tips
Keep schemas:
- small
- focused
- easy to validate
- aligned with business needs
Avoid asking the model for unnecessary fields if you do not use them.
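Validating the example structure above takes only a few lines of Python. A minimal sketch, assuming the model's reply arrives as a JSON string (the `parse_classification` name is ours, not from the exercises):

```python
import json

ALLOWED_LABELS = {"billing", "technical", "account", "other"}

def parse_classification(raw_text: str) -> dict:
    """Parse the model reply and reject anything that deviates from the schema."""
    result = json.loads(raw_text)  # raises ValueError on malformed JSON
    if set(result) != {"label", "confidence", "short_reason"}:
        raise ValueError(f"Unexpected keys: {set(result)}")
    if result["label"] not in ALLOWED_LABELS:
        raise ValueError(f"Invalid label: {result['label']}")
    if not 0.0 <= float(result["confidence"]) <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return result

raw = '{"label": "billing", "confidence": 0.92, "short_reason": "Customer mentions duplicate charge."}'
print(parse_classification(raw)["label"])  # billing
```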
5. Hands-On Exercise 1: Build a Simple Reliability Evaluation Loop
Goal
Create a Python script that:
- sends several support tickets to gpt-5.4-mini
- classifies them
- compares predictions against expected labels
- prints accuracy and mismatches
Setup
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Example Dataset
We will use a small labeled dataset:
- billing
- technical
- account
- other
Code: Baseline Classifier Evaluation
"""
Session 4 - Exercise 1
Baseline reliability evaluation for a support-ticket classifier.
This example uses the OpenAI Responses API with the Python SDK and:
- Sends one ticket at a time to the model
- Requests a structured JSON response
- Evaluates predicted labels against expected labels
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
import json
from openai import OpenAI
# Create the API client. The SDK reads OPENAI_API_KEY from the environment.
client = OpenAI()
# A small labeled dataset for evaluation.
DATASET = [
{
"text": "I was charged twice for my monthly subscription. Please refund one payment.",
"expected_label": "billing",
},
{
"text": "The app crashes every time I upload a photo.",
"expected_label": "technical",
},
{
"text": "I forgot my password and cannot log in to my account.",
"expected_label": "account",
},
{
"text": "Do you offer student discounts?",
"expected_label": "other",
},
{
"text": "My invoice shows a payment I do not recognize.",
"expected_label": "billing",
},
{
"text": "The website is extremely slow and sometimes gives a 500 error.",
"expected_label": "technical",
},
{
"text": "Please help me change the email address on my profile.",
"expected_label": "account",
},
{
"text": "Thanks for the great product!",
"expected_label": "other",
},
]
ALLOWED_LABELS = {"billing", "technical", "account", "other"}
def classify_ticket(ticket_text: str) -> dict:
"""
Classify a support ticket into one of the allowed labels.
Returns:
dict with keys:
- label
- confidence
- short_reason
"""
prompt = f"""
You are a support-ticket classifier.
Classify the ticket into exactly one of these labels:
- billing
- technical
- account
- other
Decision rules:
- billing: charges, invoices, refunds, subscriptions, payment methods
- technical: bugs, crashes, errors, performance, broken functionality
- account: login, password reset, email change, account access, profile access
- other: anything else or unclear requests
Return valid JSON with exactly these keys:
- label: one of billing, technical, account, other
- confidence: a number from 0.0 to 1.0
- short_reason: a short explanation under 15 words
Ticket:
{ticket_text}
""".strip()
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt,
)
# The model output is returned as text. We expect valid JSON.
raw_text = response.output_text.strip()
# Parse the JSON into a Python dictionary.
result = json.loads(raw_text)
# Basic validation for reliability.
if result["label"] not in ALLOWED_LABELS:
raise ValueError(f"Invalid label returned: {result['label']}")
return result
def evaluate(dataset: list[dict]) -> None:
"""
Evaluate classifier accuracy on the dataset and print mismatches.
"""
correct = 0
total = len(dataset)
mismatches = []
for row in dataset:
prediction = classify_ticket(row["text"])
predicted_label = prediction["label"]
expected_label = row["expected_label"]
if predicted_label == expected_label:
correct += 1
else:
mismatches.append(
{
"text": row["text"],
"expected": expected_label,
"predicted": predicted_label,
"reason": prediction.get("short_reason", ""),
"confidence": prediction.get("confidence"),
}
)
accuracy = correct / total if total else 0.0
print("=" * 60)
print(f"Total examples: {total}")
print(f"Correct: {correct}")
print(f"Accuracy: {accuracy:.2%}")
print("=" * 60)
if mismatches:
print("\nMismatches:")
for i, item in enumerate(mismatches, start=1):
print(f"\n{i}. Ticket: {item['text']}")
print(f" Expected: {item['expected']}")
print(f" Predicted: {item['predicted']}")
print(f" Confidence: {item['confidence']}")
print(f" Reason: {item['reason']}")
else:
print("\nNo mismatches found.")
if __name__ == "__main__":
evaluate(DATASET)
Example Output
============================================================
Total examples: 8
Correct: 7
Accuracy: 87.50%
============================================================
Mismatches:
1. Ticket: Please help me change the email address on my profile.
   Expected: account
   Predicted: other
   Confidence: 0.61
   Reason: Request about profile update.
Discussion
This baseline script already teaches several important reliability practices:
- explicit labels
- decision rules
- structured output
- basic validation
- evaluation on a test set
But there is still room to improve.
6. Failure Analysis and Iterative Improvement
After evaluation, do not immediately “guess” improvements. First inspect what failed.
Questions to Ask
- Which labels are most often confused?
- Are the instructions too broad or too narrow?
- Do the decision rules cover the edge cases?
- Is the schema clear?
- Are there ambiguous examples in the dataset?
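Several of these questions can be answered with a quick tally of (expected, predicted) pairs. A small sketch using `collections.Counter` on made-up results:

```python
from collections import Counter

# (expected, predicted) pairs collected during an evaluation run (illustrative data).
pairs = [
    ("account", "other"),
    ("account", "account"),
    ("billing", "billing"),
    ("other", "other"),
]

confusion = Counter(pairs)
for (expected, predicted), count in confusion.items():
    if expected != predicted:
        print(f"{expected} misclassified as {predicted}: {count}x")
```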
Example Failure
Ticket:
“Please help me change the email address on my profile.”
If this was incorrectly labeled as other, possible causes:
- Prompt does not strongly associate profile/email changes with account
- Model sees “profile” as generic rather than account-related
- Rules need a more explicit tie-breaker
Prompt Revision Approach
We can improve the prompt by:
- emphasizing account-management requests
- adding examples or stronger rules
- making "other" only a fallback
7. Hands-On Exercise 2: Iteratively Improve the Classifier
Goal
Update the prompt and evaluation loop to improve reliability.
Improvements We Will Add
- stronger and more explicit category rules
- a stricter output format
- better fallback handling
- evaluation reporting by label
Code: Improved Classifier with Better Reliability Reporting
"""
Session 4 - Exercise 2
Iteratively improve classifier reliability with:
- stronger prompt instructions
- explicit tie-breaking rules
- stricter validation
- per-label evaluation reporting
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
import json
from collections import Counter, defaultdict
from openai import OpenAI
client = OpenAI()
DATASET = [
{
"text": "I was charged twice for my monthly subscription. Please refund one payment.",
"expected_label": "billing",
},
{
"text": "The app crashes every time I upload a photo.",
"expected_label": "technical",
},
{
"text": "I forgot my password and cannot log in to my account.",
"expected_label": "account",
},
{
"text": "Do you offer student discounts?",
"expected_label": "other",
},
{
"text": "My invoice shows a payment I do not recognize.",
"expected_label": "billing",
},
{
"text": "The website is extremely slow and sometimes gives a 500 error.",
"expected_label": "technical",
},
{
"text": "Please help me change the email address on my profile.",
"expected_label": "account",
},
{
"text": "Thanks for the great product!",
"expected_label": "other",
},
{
"text": "I cannot access my profile after resetting my password.",
"expected_label": "account",
},
{
"text": "Your latest update introduced a bug in notifications.",
"expected_label": "technical",
},
]
ALLOWED_LABELS = {"billing", "technical", "account", "other"}
def classify_ticket_improved(ticket_text: str) -> dict:
"""
Improved classification prompt with stricter instructions and tie-breaking rules.
"""
prompt = f"""
You are a highly reliable support-ticket classifier.
Your task:
Classify the ticket into exactly one label from this list:
- billing
- technical
- account
- other
Definitions:
- billing: charges, invoices, refunds, subscriptions, payment methods, duplicate charges
- technical: bugs, crashes, errors, slow performance, broken features, server failures
- account: login issues, password resets, account access, email changes, profile changes, identity/access management
- other: product questions, compliments, sales questions, unclear requests, anything not covered above
Important rules:
1. Return exactly one label.
2. If a ticket involves account access, login, password reset, email change, or profile change, choose account.
3. If a ticket mentions software malfunction, crash, bug, slowness, or error codes, choose technical.
4. If a ticket mentions payment, invoice, subscription, charge, refund, or billing statement, choose billing.
5. Use other only if none of the above clearly apply.
6. Do not invent details not present in the ticket.
Return valid JSON only with exactly these keys:
- label
- confidence
- short_reason
Additional output requirements:
- label must be one of: billing, technical, account, other
- confidence must be a number between 0 and 1
- short_reason must be under 12 words
Ticket:
{ticket_text}
""".strip()
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt,
)
raw_text = response.output_text.strip()
result = json.loads(raw_text)
validate_result(result)
return result
def validate_result(result: dict) -> None:
"""
Validate the model output to catch schema or value issues early.
"""
required_keys = {"label", "confidence", "short_reason"}
if set(result.keys()) != required_keys:
raise ValueError(f"Unexpected keys in result: {set(result.keys())}")
if result["label"] not in ALLOWED_LABELS:
raise ValueError(f"Invalid label: {result['label']}")
if not isinstance(result["confidence"], (int, float)):
raise TypeError("confidence must be numeric")
if not (0.0 <= float(result["confidence"]) <= 1.0):
raise ValueError("confidence must be between 0 and 1")
if not isinstance(result["short_reason"], str):
raise TypeError("short_reason must be a string")
def evaluate(dataset: list[dict]) -> None:
"""
Evaluate the improved classifier and print:
- overall accuracy
- per-label counts
- confusion summary
- mismatches
"""
correct = 0
total = len(dataset)
expected_counts = Counter()
correct_counts = Counter()
confusion = defaultdict(Counter)
mismatches = []
for row in dataset:
expected = row["expected_label"]
expected_counts[expected] += 1
prediction = classify_ticket_improved(row["text"])
predicted = prediction["label"]
confusion[expected][predicted] += 1
if predicted == expected:
correct += 1
correct_counts[expected] += 1
else:
mismatches.append(
{
"text": row["text"],
"expected": expected,
"predicted": predicted,
"confidence": prediction["confidence"],
"reason": prediction["short_reason"],
}
)
accuracy = correct / total if total else 0.0
print("=" * 60)
print("OVERALL RESULTS")
print("=" * 60)
print(f"Total examples: {total}")
print(f"Correct: {correct}")
print(f"Accuracy: {accuracy:.2%}")
print("\n" + "=" * 60)
print("PER-LABEL RESULTS")
print("=" * 60)
for label in sorted(ALLOWED_LABELS):
total_for_label = expected_counts[label]
correct_for_label = correct_counts[label]
label_accuracy = (correct_for_label / total_for_label) if total_for_label else 0.0
print(
f"{label:10s} | total={total_for_label:2d} | "
f"correct={correct_for_label:2d} | accuracy={label_accuracy:.2%}"
)
print("\n" + "=" * 60)
print("CONFUSION SUMMARY")
print("=" * 60)
for expected_label in sorted(confusion.keys()):
print(f"{expected_label}: {dict(confusion[expected_label])}")
if mismatches:
print("\n" + "=" * 60)
print("MISMATCHES")
print("=" * 60)
for i, item in enumerate(mismatches, start=1):
print(f"\n{i}. Ticket: {item['text']}")
print(f" Expected: {item['expected']}")
print(f" Predicted: {item['predicted']}")
print(f" Confidence: {item['confidence']}")
print(f" Reason: {item['reason']}")
else:
print("\nNo mismatches found.")
if __name__ == "__main__":
evaluate(DATASET)
Example Output
============================================================
OVERALL RESULTS
============================================================
Total examples: 10
Correct: 10
Accuracy: 100.00%
============================================================
PER-LABEL RESULTS
============================================================
account    | total= 3 | correct= 3 | accuracy=100.00%
billing    | total= 2 | correct= 2 | accuracy=100.00%
other      | total= 2 | correct= 2 | accuracy=100.00%
technical  | total= 3 | correct= 3 | accuracy=100.00%
============================================================
CONFUSION SUMMARY
============================================================
account: {'account': 3}
billing: {'billing': 2}
other: {'other': 2}
technical: {'technical': 3}
No mismatches found.
8. Best Practices for Iterative Reliability Improvement
A. Use Representative Test Sets
Your evaluation data should include:
- common cases
- edge cases
- ambiguous cases
- short inputs
- messy real-world inputs
B. Improve One Thing at a Time
When reliability changes, you want to know why.
Good candidates for iteration:
- prompt wording
- output schema
- preprocessing
- postprocessing validation
- retry logic
- business rules
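Retry logic, for instance, can be isolated in a small wrapper so it can be added or removed without touching the prompt. A sketch under that assumption; `classify_fn` and `validate_fn` stand in for your own functions:

```python
def classify_with_retries(classify_fn, validate_fn, text, max_attempts=3):
    """Call the classifier; if validation fails, retry up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            result = classify_fn(text)
            validate_fn(result)
            return result
        except (ValueError, TypeError, KeyError) as exc:
            last_error = exc  # a real system would log this failure
    raise RuntimeError(f"All {max_attempts} attempts failed: {last_error}")
```

Because the wrapper is the only moving part, changing `max_attempts` lets you measure the effect of retries in isolation.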
C. Log Failures
Store:
- input
- model output
- expected result
- mismatch type
- timestamp
- prompt version
This makes improvements measurable.
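One lightweight option is appending each failure as a JSON line. The `log_failure` helper and its field names below are a sketch of that pattern, not a required format:

```python
import json
import time

def log_failure(path, *, input_text, model_output, expected,
                mismatch_type, prompt_version):
    """Append one failure record as a JSON line for later analysis."""
    record = {
        "input": input_text,
        "model_output": model_output,
        "expected": expected,
        "mismatch_type": mismatch_type,
        "timestamp": time.time(),
        "prompt_version": prompt_version,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A JSONL file like this can later be loaded line by line to compare failure rates across prompt versions.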
D. Prefer Narrow Tasks
A narrowly defined task is easier to make reliable than a broad one.
E. Validate Outputs
Always validate model outputs before using them in downstream systems.
Validation examples:
- required keys present
- values from allowed set
- numeric fields in range
- strings under length limits
9. Optional Extension Exercise
Task
Extend the classifier to support a second field:
needs_human_review: true/false
Suggested Rule
Set needs_human_review to true when:
- confidence is below 0.70
- the message is too vague
- the message contains multiple conflicting intents
Starter Prompt Addition
Also return:
- needs_human_review: true if the ticket is ambiguous, vague, or low confidence
Starter Validation Logic
if not isinstance(result["needs_human_review"], bool):
    raise TypeError("needs_human_review must be a boolean")
This is a realistic pattern in agentic systems: the model decides whether to continue automatically or escalate.
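Once the flag exists, the escalation decision can live in ordinary Python. A minimal routing sketch (the `route_ticket` name is ours; the 0.70 threshold mirrors the suggested rule above):

```python
def route_ticket(result: dict, threshold: float = 0.70) -> str:
    """Return 'escalate' when the model flags review or confidence is low."""
    if result.get("needs_human_review") or result.get("confidence", 0.0) < threshold:
        return "escalate"
    return "auto"

print(route_ticket({"label": "billing", "confidence": 0.92, "needs_human_review": False}))  # auto
print(route_ticket({"label": "other", "confidence": 0.40, "needs_human_review": True}))  # escalate
```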
10. Wrap-Up
Key Takeaways
- Reliability is improved iteratively through testing and refinement
- Prompt clarity has a major impact on output quality
- Structured outputs make LLM systems easier to validate and trust
- Evaluation loops help you measure progress instead of guessing
- Failure analysis is essential for systematic improvement
Practical Reliability Loop
Use this cycle in your projects:
- Define the task precisely
- Build a small labeled test set
- Measure baseline performance
- Inspect failures
- Tighten prompt/schema/rules
- Re-run evaluation
- Repeat until acceptable
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- JSON module in Python: https://docs.python.org/3/library/json.html
- Python collections module: https://docs.python.org/3/library/collections.html
Suggested Instructor Notes
Theory Time Allocation
- Reliability concepts: 10 minutes
- Failure modes: 8 minutes
- Prompt iteration and structure: 10 minutes
Hands-On Time Allocation
- Exercise 1 baseline evaluation: 8 minutes
- Exercise 2 iterative improvement: 7 minutes
- Discussion and recap: 2 minutes
End of Session
You now have a repeatable workflow for improving LLM application reliability: define, evaluate, inspect, refine, validate, repeat.
In the next session, this iterative mindset can be extended to more agentic workflows involving planning, tool use, and recovery from failure.