Session 2: Designing Test Cases and Evaluation Datasets

Synopsis

Covers how to create representative prompts, expected behaviors, edge cases, and failure-focused examples. Learners begin constructing repeatable evaluation sets for their own applications.

Session Content

Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge, learning GenAI and agentic development
Session Goal: Learn how to design effective test cases and evaluation datasets for LLM-powered applications, and implement a simple evaluation workflow using the OpenAI Responses API with gpt-5.4-mini.

Learning Outcomes

By the end of this session, learners will be able to:

  • Explain why evaluation is essential in GenAI application development
  • Distinguish between ad hoc prompting and systematic testing
  • Design high-quality test cases for LLM-based systems
  • Build a small evaluation dataset in Python
  • Run model outputs against test prompts using the OpenAI Responses API
  • Apply simple rubric-based and model-assisted evaluation strategies

1. Why Evaluation Matters in GenAI

LLM applications often appear to work well in demos but fail in production because they are not tested systematically. Unlike traditional software, LLM behavior is probabilistic, context-sensitive, and can vary with prompt wording, model version, or input complexity.

Common failure modes

  • Incorrect factual answers
  • Incomplete responses
  • Hallucinations
  • Poor formatting
  • Ignoring constraints
  • Unsafe or off-topic outputs
  • Inconsistent behavior across similar inputs

Why test cases are important

Test cases help you answer:

  • Does the model follow instructions reliably?
  • Does it produce the expected output format?
  • Does it fail on edge cases?
  • Does it remain helpful across different user phrasings?
  • Is performance improving or regressing after prompt changes?

Core idea

A good GenAI workflow includes:

  1. Define desired behavior
  2. Create representative test cases
  3. Run evaluations regularly
  4. Inspect failures
  5. Improve prompts, tools, or orchestration
  6. Re-evaluate

2. What Makes a Good Test Case?

A test case is more than just a prompt. It should define the scenario, expectations, and success criteria.

Components of a strong test case

  • Input: The user message or task
  • Expected behavior: What the model should do
  • Acceptance criteria: Rules for pass/fail
  • Category/tag: For grouping similar tests
  • Difficulty/edge indication: Optional but useful

Example

Weak test case

  • Prompt: “Summarize this email”

This is too vague. There is no expected style, length, or format.

Better test case

  • Input: “Summarize this customer complaint email in 2 bullet points.”
  • Expected behavior:
      • Produces exactly 2 bullet points
      • Captures the core complaint
      • Uses neutral tone
  • Category: summarization
  • Acceptance criteria:
      • Bullet count = 2
      • No invented details
      • Main issue mentioned
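
Captured as Python data, the better test case above becomes directly usable in automated checks. This is a sketch; the field names here are one possible scheme, not a fixed standard:

```python
# The "better test case" above, encoded as a Python dict.
# Field names (id, category, input, expected_behavior, checks) are illustrative.
better_test_case = {
    "id": "sum_demo_001",
    "category": "summarization",
    "input": "Summarize this customer complaint email in 2 bullet points.",
    "expected_behavior": [
        "Produces exactly 2 bullet points",
        "Captures the core complaint",
        "Uses neutral tone",
    ],
    "checks": {
        "bullet_count": 2,          # automatable
        "no_invented_details": True,  # needs human or model judgment
        "main_issue_mentioned": True,
    },
}
```

Note that some acceptance criteria (bullet count) are easy to automate, while others (no invented details) still need human or model-assisted judgment.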

Test case design principles

1. Be specific

Define exactly what success looks like.

2. Cover normal and edge cases

Include typical tasks and difficult scenarios.

3. Test one behavior at a time

Avoid combining too many requirements into a single test.

4. Include realistic inputs

Use inputs similar to what real users provide.

5. Make evaluation practical

If humans cannot judge success consistently, your criteria may be too vague.


3. Types of Evaluation Datasets

An evaluation dataset is a collection of test cases used to measure system performance.

Common dataset categories

Smoke tests

Small, fast tests to catch obvious regressions.

  • “Does the app answer at all?”
  • “Does it return valid JSON?”
  • “Does it avoid forbidden content?”
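
The first and third questions above can each be phrased as a tiny automated check (JSON validity is a one-line `json.loads` call). A minimal sketch; the forbidden-term list is a hypothetical placeholder:

```python
def smoke_test(output: str, forbidden_terms: list[str]) -> bool:
    """Pass if the app answered at all and no forbidden term appears."""
    if not output.strip():
        return False  # "Does the app answer at all?" -> no
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden_terms)


# Hypothetical forbidden-term list; a real app would maintain its own.
print(smoke_test("Here is your summary.", ["system prompt"]))  # True
print(smoke_test("", ["system prompt"]))                       # False
```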

Functional tests

Check that expected capabilities work.

  • Classification
  • Extraction
  • Summarization
  • Rewriting
  • Structured output

Edge case tests

Focus on difficult or unusual inputs.

  • Ambiguous instructions
  • Long inputs
  • Contradictory information
  • Misspellings
  • Empty or malformed content

Safety/policy tests

Check whether the system responds appropriately to risky prompts.

Regression tests

Previously failing cases that should never break again.


4. Designing Evaluation Criteria

Evaluation criteria should map directly to the behavior you care about.

Evaluation approaches

Exact match

Best for deterministic outputs such as:

  • labels
  • categories
  • boolean answers
  • short structured fields
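
For outputs like these, the check can be as simple as a normalized string comparison. A sketch:

```python
def exact_match(output: str, expected: str) -> bool:
    """Compare after trimming whitespace and lowercasing,
    so 'Billing' and ' billing ' both count as matches."""
    return output.strip().lower() == expected.strip().lower()


print(exact_match("Billing", "billing"))       # True
print(exact_match("billing dept", "billing"))  # False
```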

Rule-based checks

Useful for formatting and constraints:

  • Must include 3 bullets
  • Must be valid JSON
  • Must be under 100 words
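
Each of the three constraints above maps to a few lines of Python. A sketch (helper names are illustrative):

```python
import json


def has_exact_bullets(text: str, expected: int = 3) -> bool:
    """Exactly `expected` lines start with a bullet marker."""
    bullets = [
        line for line in text.splitlines()
        if line.strip().startswith(("- ", "* "))
    ]
    return len(bullets) == expected


def is_valid_json(text: str) -> bool:
    """The output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def under_word_limit(text: str, limit: int = 100) -> bool:
    """Whitespace-separated word count is within the limit."""
    return len(text.split()) <= limit


sample = "- point one\n- point two\n- point three"
print(has_exact_bullets(sample))      # True
print(under_word_limit(sample, 100))  # True
```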

Rubric-based human judgment

Useful when quality is subjective:

  • relevance
  • completeness
  • clarity
  • tone

Model-assisted evaluation

Use an LLM to score another LLM’s output against a rubric.

This is practical, but should be used carefully because evaluators can also make mistakes.

Example rubric

For a customer support summary:

  • Relevance (0–2)
    0 = misses key issue
    1 = partially relevant
    2 = captures key issue clearly

  • Completeness (0–2)
    0 = omits major details
    1 = partial
    2 = includes all major points

  • Format compliance (0–1)
    0 = wrong format
    1 = correct format

Total score: 0–5
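
The rubric above translates directly into a small scoring helper. A sketch:

```python
def total_rubric_score(relevance: int, completeness: int, format_compliance: int) -> int:
    """Combine the three rubric components into the 0-5 total.
    Ranges match the rubric above."""
    assert 0 <= relevance <= 2
    assert 0 <= completeness <= 2
    assert 0 <= format_compliance <= 1
    return relevance + completeness + format_compliance


# A summary that captures the key issue (2), is partially complete (1),
# and uses the correct format (1):
print(total_rubric_score(relevance=2, completeness=1, format_compliance=1))  # 4
```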


5. Dataset Structure in Python

A simple evaluation dataset can be stored as a list of dictionaries.

Hands-On Exercise 1: Build a Small Evaluation Dataset

Goal

Create a Python dataset with several test cases for a summarization assistant.

Example dataset design

eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": "Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.",
        "expected_behavior": [
            "Mentions delayed shipment",
            "Mentions damaged package",
            "Uses one sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": "Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.",
        "expected_behavior": [
            "Exactly 2 bullet points",
            "Mentions login issue",
            "Mentions delayed support response"
        ],
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": "Summarize this in one short sentence: The customer says the product is excellent, except that it stopped working after two days.",
        "expected_behavior": [
            "Captures mixed sentiment",
            "Mentions product failure",
            "Uses one short sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]

Discussion points

  • These tests define expectations explicitly.
  • The checks are simple enough for automation.
  • This dataset can grow over time as the application evolves.

6. Running Model Outputs Against Test Cases

Now we will use the OpenAI Python SDK and the Responses API to generate outputs for each test case.

Hands-On Exercise 2: Generate Responses with gpt-5.4-mini

Prerequisites

Install the OpenAI SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Python script

from openai import OpenAI

# Create a client using the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# A small evaluation dataset for summarization tests.
eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
]


def get_model_response(user_prompt: str) -> str:
    """
    Send a prompt to the OpenAI Responses API and return the generated text.

    The instruction is kept stable across test cases so changes in behavior
    can be attributed mainly to the input prompt.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text


for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print("Model Output:")
    print(output)
    print()

Example output

================================================================================
Test ID: sum_001
Prompt: Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.
Model Output:
The shipment arrived 4 days late and the package was damaged.

================================================================================
Test ID: sum_002
Prompt: Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.
Model Output:
- The user cannot log in after resetting their password.
- They are frustrated that support has not responded in 48 hours.

7. Automating Basic Checks

We now add rule-based evaluation logic to score outputs automatically.

Hands-On Exercise 3: Implement Rule-Based Evaluation

Goal

Automatically check whether outputs satisfy the basic constraints in the dataset.

Python script

from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": (
            "Summarize this in one short sentence: "
            "The customer says the product is excellent, except that it stopped working after two days."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]


def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response using the OpenAI Responses API.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()


def count_sentences(text: str) -> int:
    """
    A simple sentence counter based on terminal punctuation.
    Non-empty text with no terminal punctuation counts as one sentence,
    and abbreviations such as "Mr." count as sentence endings.
    This is intentionally lightweight for demo purposes.
    """
    sentence_endings = [".", "!", "?"]
    count = sum(text.count(mark) for mark in sentence_endings)
    return max(count, 1 if text else 0)


def count_bullets(text: str) -> int:
    """
    Count lines that appear to be bullet points.
    """
    bullet_prefixes = ("- ", "* ")
    return sum(
        1 for line in text.splitlines()
        if line.strip().startswith(bullet_prefixes)
    )


def evaluate_output(output: str, checks: dict) -> dict:
    """
    Apply simple rule-based checks to the model output.
    Returns detailed results for transparency.
    """
    results = {
        "passed": True,
        "details": []
    }

    # Check required substrings.
    for phrase in checks.get("must_include", []):
        found = phrase.lower() in output.lower()
        results["details"].append({
            "check": f"must_include('{phrase}')",
            "passed": found
        })
        if not found:
            results["passed"] = False

    # Check maximum sentence count.
    if "max_sentences" in checks:
        sentence_count = count_sentences(output)
        passed = sentence_count <= checks["max_sentences"]
        results["details"].append({
            "check": f"max_sentences <= {checks['max_sentences']}",
            "actual": sentence_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    # Check exact bullet count.
    if "bullet_count" in checks:
        bullet_count = count_bullets(output)
        passed = bullet_count == checks["bullet_count"]
        results["details"].append({
            "check": f"bullet_count == {checks['bullet_count']}",
            "actual": bullet_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    return results


summary = {
    "total": 0,
    "passed": 0,
    "failed": 0
}

for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    evaluation = evaluate_output(output, test_case["checks"])

    summary["total"] += 1
    if evaluation["passed"]:
        summary["passed"] += 1
    else:
        summary["failed"] += 1

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Category: {test_case['category']}")
    print("Prompt:")
    print(test_case["input"])
    print("\nOutput:")
    print(output)
    print("\nEvaluation Details:")
    for detail in evaluation["details"]:
        print(detail)
    print(f"\nOverall Result: {'PASS' if evaluation['passed'] else 'FAIL'}")

print("\n" + "=" * 80)
print("SUMMARY")
print(summary)

Example output

================================================================================
Test ID: sum_001
Category: summarization
Prompt:
Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.

Output:
The shipment was delayed by 4 days and the package arrived damaged.

Evaluation Details:
{'check': "must_include('delayed')", 'passed': True}
{'check': "must_include('damaged')", 'passed': True}
{'check': 'max_sentences <= 1', 'actual': 1, 'passed': True}

Overall Result: PASS
================================================================================
Test ID: sum_002
Category: summarization
Prompt:
Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.

Output:
- The user cannot log in after resetting their password.
- They are frustrated because support has not responded in 48 hours.

Evaluation Details:
{'check': "must_include('log in')", 'passed': True}
{'check': "must_include('48 hours')", 'passed': True}
{'check': 'bullet_count == 2', 'actual': 2, 'passed': True}

Overall Result: PASS

(Output for sum_003 is omitted here; in this run it failed one of its checks, which the summary below reflects.)

================================================================================
SUMMARY
{'total': 3, 'passed': 2, 'failed': 1}

8. Model-Assisted Evaluation

Rule-based checks are useful, but they do not capture qualities like completeness, tone, or faithfulness very well. For those, we can make a separate model call that scores the output against a rubric.

Hands-On Exercise 4: Use an LLM as an Evaluator

Goal

Evaluate a generated answer using a rubric and structured scoring instructions.

Important note

Model-based evaluation is helpful, but should be validated with spot checks by humans.

Python script

import json
from openai import OpenAI

client = OpenAI()

test_case = {
    "id": "sum_004",
    "input": (
        "Summarize this support message in 2 bullet points: "
        "The customer was charged twice for the same subscription, "
        "and they want a refund processed immediately."
    ),
    "expected_behavior": [
        "Exactly 2 bullet points",
        "Mentions duplicate charge",
        "Mentions refund request"
    ]
}


def get_model_response(user_prompt: str) -> str:
    """
    Generate the candidate answer to be evaluated.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()


def judge_output(test_case: dict, model_output: str) -> dict:
    """
    Ask the model to evaluate the candidate output against a simple rubric.

    The evaluator is instructed to return JSON only so that the result can
    be parsed programmatically.
    """
    evaluator_prompt = f"""
You are an evaluation assistant.

Score the candidate output against the test case using this rubric:
- relevance: 0 to 2
- completeness: 0 to 2
- format_compliance: 0 to 1

Return JSON only with keys:
relevance, completeness, format_compliance, total_score, verdict, rationale

Test case input:
{test_case["input"]}

Expected behavior:
{json.dumps(test_case["expected_behavior"], indent=2)}

Candidate output:
{model_output}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "You are a strict evaluator. "
                    "Return valid JSON only with no markdown fences."
                )
            },
            {
                "role": "user",
                "content": evaluator_prompt
            }
        ]
    )

    raw_text = response.output_text.strip()
    # json.loads raises an error if the evaluator returns invalid JSON;
    # production code should catch and handle that case.
    return json.loads(raw_text)


candidate_output = get_model_response(test_case["input"])
evaluation = judge_output(test_case, candidate_output)

print("Candidate Output:")
print(candidate_output)
print("\nEvaluation:")
print(json.dumps(evaluation, indent=2))

Example output

Candidate Output:
- The customer was charged twice for the same subscription.
- They want a refund processed immediately.

Evaluation:
{
  "relevance": 2,
  "completeness": 2,
  "format_compliance": 1,
  "total_score": 5,
  "verdict": "pass",
  "rationale": "The response captures both key issues and follows the requested 2-bullet format."
}

9. Best Practices for Evaluation Dataset Design

Start small, then expand

Begin with 10–20 high-value test cases before creating a large benchmark.

Include failures you have already seen

Real bugs are some of the best evaluation examples.

Version your datasets

Track changes to test cases over time, just as you would with code.
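
One lightweight way to do this is to keep the dataset in a JSON file with an explicit version field. This is a sketch; the file name, version scheme, and changelog field are illustrative:

```python
import json

# Illustrative versioned dataset file. Any scheme works for "version":
# a date, a git tag, or a simple counter.
dataset = {
    "version": "1.2.0",
    "changelog": "Added two edge cases for empty input.",
    "test_cases": [
        {
            "id": "sum_001",
            "category": "summarization",
            "input": "Summarize this in one sentence: the shipment was delayed.",
            "checks": {"max_sentences": 1},
        }
    ],
}

with open("eval_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

Committing this file alongside your code means every evaluation run can record exactly which dataset version it used.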

Keep prompts stable during comparisons

If you are evaluating prompt changes, avoid changing the dataset simultaneously.

Mix easy, medium, and hard cases

A dataset with only simple examples can create false confidence.

Separate generation from evaluation

Your application prompt and your evaluation rubric should serve different purposes.

Review failing cases manually

Automated scoring is useful, but inspection reveals deeper issues.


10. Common Mistakes

Vague success criteria

Bad: “Should be good”
Better: “Must mention refund request and use exactly 2 bullet points”

Overfitting to the eval set

If you optimize only for known tests, your system may not generalize.

Relying only on exact match

Many good outputs can be phrased differently.

Ignoring edge cases

Production failures often happen in rare but important scenarios.

Not storing test metadata

Without categories and notes, analyzing failures becomes harder.


11. Mini Project

Hands-On Exercise 5: Create and Evaluate Your Own Dataset

Task

Build a dataset with at least 5 test cases for one of these tasks:

  • email summarization
  • support ticket classification
  • structured data extraction
  • rewriting text in a specified tone

Requirements

Each test case should include:

  • id
  • category
  • input
  • expected_behavior
  • checks

Then:

  1. Generate outputs with gpt-5.4-mini
  2. Run rule-based checks
  3. Print a summary table
  4. Identify at least one weak test and improve it

Starter template

from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "custom_001",
        "category": "classification",
        "input": "Classify this ticket as billing, technical, or account: I was charged twice this month.",
        "expected_behavior": [
            "Returns one valid label",
            "Chooses billing"
        ],
        "checks": {
            "allowed_labels": ["billing", "technical", "account"],
            "expected_label": "billing"
        }
    }
]


def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response for a single test case.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": "You are a precise assistant."},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.output_text.strip()


def evaluate_classification(output: str, checks: dict) -> dict:
    """
    Evaluate a classification output using simple exact-match rules.
    """
    normalized = output.lower().strip()

    details = []
    passed = True

    in_allowed = normalized in [label.lower() for label in checks["allowed_labels"]]
    details.append({
        "check": "label_in_allowed_set",
        "passed": in_allowed
    })
    if not in_allowed:
        passed = False

    matches_expected = normalized == checks["expected_label"].lower()
    details.append({
        "check": "matches_expected_label",
        "passed": matches_expected
    })
    if not matches_expected:
        passed = False

    return {
        "passed": passed,
        "details": details
    }


for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    result = evaluate_classification(output, test_case["checks"])

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print(f"Output: {output}")
    print("Evaluation:")
    for detail in result["details"]:
        print(detail)
    print(f"Result: {'PASS' if result['passed'] else 'FAIL'}")

Suggested extension

Add a CSV or JSON export step so results can be compared between runs.
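
A sketch of the CSV variant, using the standard library. The result rows here are hypothetical; in practice you would collect them from the evaluation loop above:

```python
import csv

# Hypothetical results from a single evaluation run.
results = [
    {"run": "run_001", "id": "custom_001", "category": "classification", "passed": True},
]

# Write one row per test case so runs can be diffed or loaded into a spreadsheet.
with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run", "id", "category", "passed"])
    writer.writeheader()
    writer.writerows(results)
```

Tagging each row with a run identifier (a timestamp or git commit works well) makes before/after comparisons straightforward.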


12. Wrap-Up

Key takeaways

  • Evaluation is essential for reliable GenAI systems
  • Strong test cases define both input and success criteria
  • Evaluation datasets should cover normal, edge, and regression cases
  • Rule-based checks are great for structure and constraints
  • Model-assisted evaluation helps score nuanced qualities
  • The best evaluation workflows combine automation with human review

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API reference: https://platform.openai.com/docs/api-reference
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • JSON basics in Python: https://docs.python.org/3/library/json.html

Suggested Homework

  1. Expand the evaluation dataset from 3 test cases to 15.
  2. Add at least:
       • 5 normal cases
       • 5 edge cases
       • 5 regression cases
  3. Implement one model-assisted evaluator with a scoring rubric.
  4. Compare results before and after changing your system prompt.
  5. Write a short reflection:
       • Which tests were easiest to automate?
       • Which required subjective judgment?
