Session 2: Designing Test Cases and Evaluation Datasets
Synopsis
Covers how to create representative prompts, expected behaviors, edge cases, and failure-focused examples. Learners begin constructing repeatable evaluation sets for their own applications.
Session Content
Session Overview
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge, learning GenAI and agentic development
Session Goal: Learn how to design effective test cases and evaluation datasets for LLM-powered applications, and implement a simple evaluation workflow using the OpenAI Responses API with gpt-5.4-mini.
Learning Outcomes
By the end of this session, learners will be able to:
- Explain why evaluation is essential in GenAI application development
- Distinguish between ad hoc prompting and systematic testing
- Design high-quality test cases for LLM-based systems
- Build a small evaluation dataset in Python
- Run model outputs against test prompts using the OpenAI Responses API
- Apply simple rubric-based and model-assisted evaluation strategies
1. Why Evaluation Matters in GenAI
LLM applications often appear to work well in demos but fail in production because they are not tested systematically. Unlike traditional software, LLM behavior is probabilistic, context-sensitive, and can vary with prompt wording, model version, or input complexity.
Common failure modes
- Incorrect factual answers
- Incomplete responses
- Hallucinations
- Poor formatting
- Ignoring constraints
- Unsafe or off-topic outputs
- Inconsistent behavior across similar inputs
Why test cases are important
Test cases help you answer:
- Does the model follow instructions reliably?
- Does it produce the expected output format?
- Does it fail on edge cases?
- Does it remain helpful across different user phrasings?
- Is performance improving or regressing after prompt changes?
Core idea
A good GenAI workflow includes:
- Define desired behavior
- Create representative test cases
- Run evaluations regularly
- Inspect failures
- Improve prompts, tools, or orchestration
- Re-evaluate
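That loop can be sketched as a minimal harness. The `run_model` and `passes` callables below are placeholder hooks, not part of any real API: you would swap in an actual model call and your own checks.

```python
def evaluate(dataset, run_model, passes):
    """Run every test case and collect failures for inspection."""
    failures = []
    for case in dataset:
        output = run_model(case["input"])
        if not passes(output, case):
            failures.append({"id": case["id"], "output": output})
    return failures

# Toy stand-ins: a "model" that shouts back the input, and a keyword check.
dataset = [
    {"id": "t1", "input": "refund request", "must_include": "refund"},
    {"id": "t2", "input": "login issue", "must_include": "password"},
]
failures = evaluate(
    dataset,
    run_model=lambda text: text.upper(),
    passes=lambda out, case: case["must_include"].upper() in out,
)
# Inspect the failures, adjust prompts or checks, then re-run evaluate().
```

Keeping generation (`run_model`) and judging (`passes`) separate makes it easy to re-run the same dataset after every prompt change.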
2. What Makes a Good Test Case?
A test case is more than just a prompt. It should define the scenario, expectations, and success criteria.
Components of a strong test case
- Input: The user message or task
- Expected behavior: What the model should do
- Acceptance criteria: Rules for pass/fail
- Category/tag: For grouping similar tests
- Difficulty/edge indication: Optional but useful
Example
Weak test case
- Prompt: “Summarize this email”
This is too vague. There is no expected style, length, or format.
Better test case
- Input: “Summarize this customer complaint email in 2 bullet points.”
- Expected behavior:
- Produces exactly 2 bullet points
- Captures the core complaint
- Uses neutral tone
- Category: summarization
- Acceptance criteria:
- Bullet count = 2
- No invented details
- Main issue mentioned
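The stronger test case above can be written down as a plain dictionary, in the same shape used for the datasets later in this session. The `sum_000` id and the sample output are illustrative:

```python
test_case = {
    "id": "sum_000",
    "category": "summarization",
    "input": "Summarize this customer complaint email in 2 bullet points.",
    "expected_behavior": [
        "Produces exactly 2 bullet points",
        "Captures the core complaint",
        "Uses neutral tone",
    ],
    # Only the machine-checkable part goes in "checks"; tone needs human review.
    "checks": {"bullet_count": 2},
}

def bullet_count(text: str) -> int:
    """Count lines formatted as '- ' bullets."""
    return sum(1 for line in text.splitlines() if line.strip().startswith("- "))

sample_output = "- Order arrived four days late.\n- Customer requests a refund."
meets_check = bullet_count(sample_output) == test_case["checks"]["bullet_count"]
```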
Test case design principles
1. Be specific
Define exactly what success looks like.
2. Cover normal and edge cases
Include typical tasks and difficult scenarios.
3. Test one behavior at a time
Avoid combining too many requirements into a single test.
4. Include realistic inputs
Use inputs similar to what real users provide.
5. Make evaluation practical
If humans cannot judge success consistently, your criteria may be too vague.
3. Types of Evaluation Datasets
An evaluation dataset is a collection of test cases used to measure system performance.
Common dataset categories
Smoke tests
Small, fast tests to catch obvious regressions.
- “Does the app answer at all?”
- “Does it return valid JSON?”
- “Does it avoid forbidden content?”
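A smoke test like “Does it return valid JSON?” only takes a few lines. The outputs below are hard-coded stand-ins for real model responses:

```python
import json

def is_valid_json(text: str) -> bool:
    """Smoke check: does the output parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

good = '{"label": "billing"}'
bad = 'Sure! Here is the JSON: {label: billing}'  # unquoted keys, extra prose
results = [is_valid_json(good), is_valid_json(bad)]
```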
Functional tests
Check that expected capabilities work.
- Classification
- Extraction
- Summarization
- Rewriting
- Structured output
Edge case tests
Focus on difficult or unusual inputs.
- Ambiguous instructions
- Long inputs
- Contradictory information
- Misspellings
- Empty or malformed content
Safety/policy tests
Check whether the system responds appropriately to risky prompts.
Regression tests
Previously failing cases that should never break again.
4. Designing Evaluation Criteria
Evaluation criteria should map directly to the behavior you care about.
Evaluation approaches
Exact match
Best for deterministic outputs such as:
- labels
- categories
- boolean answers
- short structured fields
Rule-based checks
Useful for formatting and constraints:
- Must include 3 bullets
- Must be valid JSON
- Must be under 100 words
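Each of these constraints maps onto a small predicate; the helper names below are illustrative:

```python
import json

def has_n_bullets(text: str, n: int) -> bool:
    """Exactly n lines start with a bullet marker."""
    bullets = sum(
        1 for line in text.splitlines() if line.lstrip().startswith(("- ", "* "))
    )
    return bullets == n

def is_valid_json(text: str) -> bool:
    """The output must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def under_word_limit(text: str, limit: int) -> bool:
    """The output must stay under the given word count."""
    return len(text.split()) < limit

output = "- First point\n- Second point\n- Third point"
checks_pass = has_n_bullets(output, 3) and under_word_limit(output, 100)
```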
Rubric-based human judgment
Useful when quality is subjective:
- relevance
- completeness
- clarity
- tone
Model-assisted evaluation
Use an LLM to score another LLM’s output against a rubric.
This is practical, but should be used carefully because evaluators can also make mistakes.
Example rubric
For a customer support summary:
- Relevance (0–2)
  - 0 = misses key issue
  - 1 = partially relevant
  - 2 = captures key issue clearly
- Completeness (0–2)
  - 0 = omits major details
  - 1 = partial
  - 2 = includes all major points
- Format compliance (0–1)
  - 0 = wrong format
  - 1 = correct format

Total score: 0–5
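A rubric like this can also live in code so the same scorer is reused across test cases. The pass threshold below is an illustrative assumption, not part of the rubric:

```python
# Maximum score per rubric dimension.
RUBRIC = {"relevance": 2, "completeness": 2, "format_compliance": 1}

def total_score(scores: dict) -> int:
    """Sum rubric scores after validating each value against its range."""
    for dimension, value in scores.items():
        if not 0 <= value <= RUBRIC[dimension]:
            raise ValueError(f"{dimension} score {value} is out of range")
    return sum(scores.values())

# Scores assigned by a human reviewer or an LLM judge.
scores = {"relevance": 2, "completeness": 1, "format_compliance": 1}
total = total_score(scores)                  # 4 of a possible 5
verdict = "pass" if total >= 4 else "fail"   # threshold chosen for illustration
```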
5. Dataset Structure in Python
A simple evaluation dataset can be stored as a list of dictionaries.
Hands-On Exercise 1: Build a Small Evaluation Dataset
Goal
Create a Python dataset with several test cases for a summarization assistant.
Example dataset design
```python
eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": "Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.",
        "expected_behavior": [
            "Mentions delayed shipment",
            "Mentions damaged package",
            "Uses one sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": "Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.",
        "expected_behavior": [
            "Exactly 2 bullet points",
            "Mentions login issue",
            "Mentions delayed support response"
        ],
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": "Summarize this in one short sentence: The customer says the product is excellent, except that it stopped working after two days.",
        "expected_behavior": [
            "Captures mixed sentiment",
            "Mentions product failure",
            "Uses one short sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]
```
Discussion points
- These tests define expectations explicitly.
- The checks are simple enough for automation.
- This dataset can grow over time as the application evolves.
6. Running Model Outputs Against Test Cases
Now we will use the OpenAI Python SDK and the Responses API to generate outputs for each test case.
Hands-On Exercise 2: Generate Responses with gpt-5.4-mini
Prerequisites
Install the OpenAI SDK:

```bash
pip install openai
```

Set your API key:

```bash
export OPENAI_API_KEY="your_api_key_here"
```
Python script
```python
from openai import OpenAI

# Create a client using the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# A small evaluation dataset for summarization tests.
eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
]

def get_model_response(user_prompt: str) -> str:
    """
    Send a prompt to the OpenAI Responses API and return the generated text.

    The instruction is kept stable across test cases so changes in behavior
    can be attributed mainly to the input prompt.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text

for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print("Model Output:")
    print(output)
    print()
```
Example output
```
================================================================================
Test ID: sum_001
Prompt: Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.
Model Output:
The shipment arrived 4 days late and the package was damaged.
================================================================================
Test ID: sum_002
Prompt: Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.
Model Output:
- The user cannot log in after resetting their password.
- They are frustrated that support has not responded in 48 hours.
```
7. Automating Basic Checks
We now add rule-based evaluation logic to score outputs automatically.
Hands-On Exercise 3: Implement Rule-Based Evaluation
Goal
Automatically check whether outputs satisfy the basic constraints in the dataset.
Python script
```python
from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": (
            "Summarize this in one short sentence: "
            "The customer says the product is excellent, except that it stopped working after two days."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]

def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response using the OpenAI Responses API.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()

def count_sentences(text: str) -> int:
    """
    A simple sentence counter based on punctuation.
    This is intentionally lightweight for demo purposes.
    """
    sentence_endings = [".", "!", "?"]
    count = sum(text.count(mark) for mark in sentence_endings)
    return max(count, 1 if text else 0)

def count_bullets(text: str) -> int:
    """
    Count lines that appear to be bullet points.
    """
    bullet_prefixes = ("- ", "* ")
    return sum(
        1 for line in text.splitlines()
        if line.strip().startswith(bullet_prefixes)
    )

def evaluate_output(output: str, checks: dict) -> dict:
    """
    Apply simple rule-based checks to the model output.
    Returns detailed results for transparency.
    """
    results = {
        "passed": True,
        "details": []
    }

    # Check required substrings.
    for phrase in checks.get("must_include", []):
        found = phrase.lower() in output.lower()
        results["details"].append({
            "check": f"must_include('{phrase}')",
            "passed": found
        })
        if not found:
            results["passed"] = False

    # Check maximum sentence count.
    if "max_sentences" in checks:
        sentence_count = count_sentences(output)
        passed = sentence_count <= checks["max_sentences"]
        results["details"].append({
            "check": f"max_sentences <= {checks['max_sentences']}",
            "actual": sentence_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    # Check exact bullet count.
    if "bullet_count" in checks:
        bullet_count = count_bullets(output)
        passed = bullet_count == checks["bullet_count"]
        results["details"].append({
            "check": f"bullet_count == {checks['bullet_count']}",
            "actual": bullet_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    return results

summary = {
    "total": 0,
    "passed": 0,
    "failed": 0
}

for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    evaluation = evaluate_output(output, test_case["checks"])

    summary["total"] += 1
    if evaluation["passed"]:
        summary["passed"] += 1
    else:
        summary["failed"] += 1

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Category: {test_case['category']}")
    print("Prompt:")
    print(test_case["input"])
    print("\nOutput:")
    print(output)
    print("\nEvaluation Details:")
    for detail in evaluation["details"]:
        print(detail)
    print(f"\nOverall Result: {'PASS' if evaluation['passed'] else 'FAIL'}")

print("\n" + "=" * 80)
print("SUMMARY")
print(summary)
```
Example output
```
================================================================================
Test ID: sum_001
Category: summarization
Prompt:
Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.

Output:
The shipment was delayed by 4 days and the package arrived damaged.

Evaluation Details:
{'check': "must_include('delayed')", 'passed': True}
{'check': "must_include('damaged')", 'passed': True}
{'check': 'max_sentences <= 1', 'actual': 1, 'passed': True}

Overall Result: PASS
================================================================================
Test ID: sum_002
Category: summarization
Prompt:
Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.

Output:
- The user cannot log in after resetting their password.
- They are frustrated because support has not responded in 48 hours.

Evaluation Details:
{'check': "must_include('log in')", 'passed': True}
{'check': "must_include('48 hours')", 'passed': True}
{'check': 'bullet_count == 2', 'actual': 2, 'passed': True}

Overall Result: PASS
================================================================================
SUMMARY
{'total': 3, 'passed': 2, 'failed': 1}
```
8. Model-Assisted Evaluation
Rule-based checks are useful, but they do not capture qualities like completeness, tone, or faithfulness well. For those, we can make a second model call that scores the output against a rubric.
Hands-On Exercise 4: Use an LLM as an Evaluator
Goal
Evaluate a generated answer using a rubric and structured scoring instructions.
Important note
Model-based evaluation is helpful, but should be validated with spot checks by humans.
Python script
```python
import json
from openai import OpenAI

client = OpenAI()

test_case = {
    "id": "sum_004",
    "input": (
        "Summarize this support message in 2 bullet points: "
        "The customer was charged twice for the same subscription, "
        "and they want a refund processed immediately."
    ),
    "expected_behavior": [
        "Exactly 2 bullet points",
        "Mentions duplicate charge",
        "Mentions refund request"
    ]
}

def get_model_response(user_prompt: str) -> str:
    """
    Generate the candidate answer to be evaluated.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()

def judge_output(test_case: dict, model_output: str) -> dict:
    """
    Ask the model to evaluate the candidate output against a simple rubric.
    The evaluator is instructed to return JSON only so that the result can
    be parsed programmatically.
    """
    # Single quotes inside the f-string expressions keep this valid on
    # Python versions older than 3.12.
    evaluator_prompt = f"""
You are an evaluation assistant.

Score the candidate output against the test case using this rubric:
- relevance: 0 to 2
- completeness: 0 to 2
- format_compliance: 0 to 1

Return JSON only with keys:
relevance, completeness, format_compliance, total_score, verdict, rationale

Test case input:
{test_case['input']}

Expected behavior:
{json.dumps(test_case['expected_behavior'], indent=2)}

Candidate output:
{model_output}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "You are a strict evaluator. "
                    "Return valid JSON only with no markdown fences."
                )
            },
            {
                "role": "user",
                "content": evaluator_prompt
            }
        ]
    )

    raw_text = response.output_text.strip()
    # json.loads raises an error if the evaluator returns non-JSON;
    # in production, handle that case explicitly.
    return json.loads(raw_text)

candidate_output = get_model_response(test_case["input"])
evaluation = judge_output(test_case, candidate_output)

print("Candidate Output:")
print(candidate_output)
print("\nEvaluation:")
print(json.dumps(evaluation, indent=2))
```
Example output
```
Candidate Output:
- The customer was charged twice for the same subscription.
- They want a refund processed immediately.

Evaluation:
{
  "relevance": 2,
  "completeness": 2,
  "format_compliance": 1,
  "total_score": 5,
  "verdict": "pass",
  "rationale": "The response captures both key issues and follows the requested 2-bullet format."
}
```
9. Best Practices for Evaluation Dataset Design
Start small, then expand
Begin with 10–20 high-value test cases before creating a large benchmark.
Include failures you have already seen
Real bugs are some of the best evaluation examples.
Version your datasets
Track changes to test cases over time, just as you would with code.
Keep prompts stable during comparisons
If you are evaluating prompt changes, avoid changing the dataset simultaneously.
Mix easy, medium, and hard cases
A dataset with only simple examples can create false confidence.
Separate generation from evaluation
Your application prompt and your evaluation rubric should serve different purposes.
Review failing cases manually
Automated scoring is useful, but inspection reveals deeper issues.
10. Common Mistakes
Vague success criteria
Bad: “Should be good”
Better: “Must mention refund request and use exactly 2 bullet points”
Overfitting to the eval set
If you optimize only for known tests, your system may not generalize.
Relying only on exact match
Many good outputs can be phrased differently.
Ignoring edge cases
Production failures often happen in rare but important scenarios.
Not storing test metadata
Without categories and notes, analyzing failures becomes harder.
11. Mini Project
Hands-On Exercise 5: Create and Evaluate Your Own Dataset
Task
Build a dataset with at least 5 test cases for one of these tasks:
- email summarization
- support ticket classification
- structured data extraction
- rewriting text in a specified tone
Requirements
Each test case should include:
- id
- category
- input
- expected_behavior
- checks
Then:
- Generate outputs with gpt-5.4-mini
- Run rule-based checks
- Print a summary table
- Identify at least one weak test and improve it
Starter template
```python
from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "custom_001",
        "category": "classification",
        "input": "Classify this ticket as billing, technical, or account: I was charged twice this month.",
        "expected_behavior": [
            "Returns one valid label",
            "Chooses billing"
        ],
        "checks": {
            "allowed_labels": ["billing", "technical", "account"],
            "expected_label": "billing"
        }
    }
]

def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response for a single test case.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": "You are a precise assistant."},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.output_text.strip()

def evaluate_classification(output: str, checks: dict) -> dict:
    """
    Evaluate a classification output using simple exact-match rules.
    """
    normalized = output.lower().strip()
    details = []
    passed = True

    in_allowed = normalized in [label.lower() for label in checks["allowed_labels"]]
    details.append({
        "check": "label_in_allowed_set",
        "passed": in_allowed
    })
    if not in_allowed:
        passed = False

    matches_expected = normalized == checks["expected_label"].lower()
    details.append({
        "check": "matches_expected_label",
        "passed": matches_expected
    })
    if not matches_expected:
        passed = False

    return {
        "passed": passed,
        "details": details
    }

for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    result = evaluate_classification(output, test_case["checks"])

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print(f"Output: {output}")
    print("Evaluation:")
    for detail in result["details"]:
        print(detail)
    print(f"Result: {'PASS' if result['passed'] else 'FAIL'}")
```
Suggested extension
Add a CSV or JSON export step so results can be compared between runs.
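A minimal sketch of the JSON variant of that extension, assuming each per-test result is a small dictionary like the ones printed above; the file name is arbitrary:

```python
import json
from datetime import date

def export_results(results: list, path: str) -> None:
    """Write one run's results to a JSON file so runs can be diffed later."""
    payload = {"run_date": date.today().isoformat(), "results": results}
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)

results = [{"id": "custom_001", "passed": True}]
export_results(results, "eval_run.json")

# Reading the file back gives the same structure for comparison between runs.
with open("eval_run.json") as f:
    loaded = json.load(f)
```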
12. Wrap-Up
Key takeaways
- Evaluation is essential for reliable GenAI systems
- Strong test cases define both input and success criteria
- Evaluation datasets should cover normal, edge, and regression cases
- Rule-based checks are great for structure and constraints
- Model-assisted evaluation helps score nuanced qualities
- The best evaluation workflows combine automation with human review
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API reference: https://platform.openai.com/docs/api-reference
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
- JSON basics in Python: https://docs.python.org/3/library/json.html
Suggested Homework
- Expand the evaluation dataset from 3 test cases to 15.
- Add at least:
- 5 normal cases
- 5 edge cases
- 5 regression cases
- Implement one model-assisted evaluator with a scoring rubric.
- Compare results before and after changing your system prompt.
- Write a short reflection:
- Which tests were easiest to automate?
- Which required subjective judgment?
- Which failures would matter most in production?