Session 2: Designing Test Cases and Evaluation Datasets

Synopsis

Covers how to create representative prompts, expected behaviors, edge cases, and failure-focused examples. Learners begin constructing repeatable evaluation sets for their own applications.

Session Content

Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge, learning GenAI and agentic development
Session Goal: Learn how to design effective test cases and evaluation datasets for LLM-powered applications, and implement a simple evaluation workflow using the OpenAI Responses API with gpt-5.4-mini.

Learning Outcomes

By the end of this session, learners will be able to:

  • Explain why evaluation is essential in GenAI application development
  • Distinguish between ad hoc prompting and systematic testing
  • Design high-quality test cases for LLM-based systems
  • Build a small evaluation dataset in Python
  • Run model outputs against test prompts using the OpenAI Responses API
  • Apply simple rubric-based and model-assisted evaluation strategies

1. Why Evaluation Matters in GenAI

LLM applications often appear to work well in demos but fail in production because they are not tested systematically. Unlike traditional software, LLM behavior is probabilistic, context-sensitive, and can vary with prompt wording, model version, or input complexity.

Common failure modes

  • Incorrect factual answers
  • Incomplete responses
  • Hallucinations
  • Poor formatting
  • Ignoring constraints
  • Unsafe or off-topic outputs
  • Inconsistent behavior across similar inputs

Why test cases are important

Test cases help you answer:

  • Does the model follow instructions reliably?
  • Does it produce the expected output format?
  • Does it fail on edge cases?
  • Does it remain helpful across different user phrasings?
  • Is performance improving or regressing after prompt changes?

Core idea

A good GenAI workflow includes:

  1. Define desired behavior
  2. Create representative test cases
  3. Run evaluations regularly
  4. Inspect failures
  5. Improve prompts, tools, or orchestration
  6. Re-evaluate

2. What Makes a Good Test Case?

A test case is more than just a prompt. It should define the scenario, expectations, and success criteria.

Components of a strong test case

  • Input: The user message or task
  • Expected behavior: What the model should do
  • Acceptance criteria: Rules for pass/fail
  • Category/tag: For grouping similar tests
  • Difficulty/edge indication: Optional but useful

Example

Weak test case

  • Prompt: “Summarize this email”

This is too vague. There is no expected style, length, or format.

Better test case

  • Input: “Summarize this customer complaint email in 2 bullet points.”
  • Expected behavior:
      • Produces exactly 2 bullet points
      • Captures the core complaint
      • Uses neutral tone
  • Category: summarization
  • Acceptance criteria:
      • Bullet count = 2
      • No invented details
      • Main issue mentioned
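
Captured as Python data, the better test case above becomes directly usable in automated checks. This is a sketch; the field names here are one possible scheme, not a fixed standard:

```python
# The "better test case" above, encoded as a Python dict.
# Field names (id, category, input, expected_behavior, checks) are illustrative.
better_test_case = {
    "id": "sum_demo_001",
    "category": "summarization",
    "input": "Summarize this customer complaint email in 2 bullet points.",
    "expected_behavior": [
        "Produces exactly 2 bullet points",
        "Captures the core complaint",
        "Uses neutral tone",
    ],
    "checks": {
        "bullet_count": 2,          # automatable
        "no_invented_details": True,  # needs human or model judgment
        "main_issue_mentioned": True,
    },
}
```

Note that some acceptance criteria (bullet count) are easy to automate, while others (no invented details) still need human or model-assisted judgment.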

Test case design principles

1. Be specific

Define exactly what success looks like.

2. Cover normal and edge cases

Include typical tasks and difficult scenarios.

3. Test one behavior at a time

Avoid combining too many requirements into a single test.

4. Include realistic inputs

Use inputs similar to what real users provide.

5. Make evaluation practical

If humans cannot judge success consistently, your criteria may be too vague.


3. Types of Evaluation Datasets

An evaluation dataset is a collection of test cases used to measure system performance.

Common dataset categories

Smoke tests

Small, fast tests to catch obvious regressions.

  • “Does the app answer at all?”
  • “Does it return valid JSON?”
  • “Does it avoid forbidden content?”
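
The first and third questions above can each be phrased as a tiny automated check (JSON validity is a one-line `json.loads` call). A minimal sketch; the forbidden-term list is a hypothetical placeholder:

```python
def smoke_test(output: str, forbidden_terms: list[str]) -> bool:
    """Pass if the app answered at all and no forbidden term appears."""
    if not output.strip():
        return False  # "Does the app answer at all?" -> no
    lowered = output.lower()
    return not any(term.lower() in lowered for term in forbidden_terms)


# Hypothetical forbidden-term list; a real app would maintain its own.
print(smoke_test("Here is your summary.", ["system prompt"]))  # True
print(smoke_test("", ["system prompt"]))                       # False
```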

Functional tests

Check that expected capabilities work.

  • Classification
  • Extraction
  • Summarization
  • Rewriting
  • Structured output

Edge case tests

Focus on difficult or unusual inputs.

  • Ambiguous instructions
  • Long inputs
  • Contradictory information
  • Misspellings
  • Empty or malformed content

Safety/policy tests

Check whether the system responds appropriately to risky prompts.

Regression tests

Previously failing cases that should never break again.


4. Designing Evaluation Criteria

Evaluation criteria should map directly to the behavior you care about.

Evaluation approaches

Exact match

Best for deterministic outputs such as:

  • labels
  • categories
  • boolean answers
  • short structured fields
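
For outputs like these, the check can be as simple as a normalized string comparison. A sketch:

```python
def exact_match(output: str, expected: str) -> bool:
    """Compare after trimming whitespace and lowercasing,
    so 'Billing' and ' billing ' both count as matches."""
    return output.strip().lower() == expected.strip().lower()


print(exact_match("Billing", "billing"))       # True
print(exact_match("billing dept", "billing"))  # False
```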

Rule-based checks

Useful for formatting and constraints:

  • Must include 3 bullets
  • Must be valid JSON
  • Must be under 100 words
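
Each of the three constraints above maps to a few lines of Python. A sketch (helper names are illustrative):

```python
import json


def has_exact_bullets(text: str, expected: int = 3) -> bool:
    """Exactly `expected` lines start with a bullet marker."""
    bullets = [
        line for line in text.splitlines()
        if line.strip().startswith(("- ", "* "))
    ]
    return len(bullets) == expected


def is_valid_json(text: str) -> bool:
    """The output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def under_word_limit(text: str, limit: int = 100) -> bool:
    """Whitespace-separated word count is within the limit."""
    return len(text.split()) <= limit


sample = "- point one\n- point two\n- point three"
print(has_exact_bullets(sample))      # True
print(under_word_limit(sample, 100))  # True
```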

Rubric-based human judgment

Useful when quality is subjective:

  • relevance
  • completeness
  • clarity
  • tone

Model-assisted evaluation

Use an LLM to score another LLM’s output against a rubric.

This is practical, but should be used carefully because evaluators can also make mistakes.

Example rubric

For a customer support summary:

  • Relevance (0–2)
    0 = misses key issue
    1 = partially relevant
    2 = captures key issue clearly

  • Completeness (0–2)
    0 = omits major details
    1 = partial
    2 = includes all major points

  • Format compliance (0–1)
    0 = wrong format
    1 = correct format

Total score: 0–5
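
The rubric above translates directly into a small scoring helper. A sketch:

```python
def total_rubric_score(relevance: int, completeness: int, format_compliance: int) -> int:
    """Combine the three rubric components into the 0-5 total.
    Ranges match the rubric above."""
    assert 0 <= relevance <= 2
    assert 0 <= completeness <= 2
    assert 0 <= format_compliance <= 1
    return relevance + completeness + format_compliance


# A summary that captures the key issue (2), is partially complete (1),
# and uses the correct format (1):
print(total_rubric_score(relevance=2, completeness=1, format_compliance=1))  # 4
```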


5. Dataset Structure in Python

A simple evaluation dataset can be stored as a list of dictionaries.

Hands-On Exercise 1: Build a Small Evaluation Dataset

Goal

Create a Python dataset with several test cases for a summarization assistant.

Example dataset design

eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": "Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.",
        "expected_behavior": [
            "Mentions delayed shipment",
            "Mentions damaged package",
            "Uses one sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": "Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.",
        "expected_behavior": [
            "Exactly 2 bullet points",
            "Mentions login issue",
            "Mentions delayed support response"
        ],
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": "Summarize this in one short sentence: The customer says the product is excellent, except that it stopped working after two days.",
        "expected_behavior": [
            "Captures mixed sentiment",
            "Mentions product failure",
            "Uses one short sentence"
        ],
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]

Discussion points

  • These tests define expectations explicitly.
  • The checks are simple enough for automation.
  • This dataset can grow over time as the application evolves.

6. Running Model Outputs Against Test Cases

Now we will use the OpenAI Python SDK and the Responses API to generate outputs for each test case.

Hands-On Exercise 2: Generate Responses with gpt-5.4-mini

Prerequisites

Install the OpenAI SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Python script

from openai import OpenAI

# Create a client using the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# A small evaluation dataset for summarization tests.
eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
]


def get_model_response(user_prompt: str) -> str:
    """
    Send a prompt to the OpenAI Responses API and return the generated text.

    The instruction is kept stable across test cases so changes in behavior
    can be attributed mainly to the input prompt.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text


for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print("Model Output:")
    print(output)
    print()

Example output

================================================================================
Test ID: sum_001
Prompt: Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.
Model Output:
The shipment arrived 4 days late and the package was damaged.

================================================================================
Test ID: sum_002
Prompt: Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.
Model Output:
- The user cannot log in after resetting their password.
- They are frustrated that support has not responded in 48 hours.

7. Automating Basic Checks

We now add rule-based evaluation logic to score outputs automatically.

Hands-On Exercise 3: Implement Rule-Based Evaluation

Goal

Automatically check whether outputs satisfy the basic constraints in the dataset.

Python script

from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "sum_001",
        "category": "summarization",
        "input": (
            "Summarize this in one sentence: "
            "Our shipment was delayed by 4 days and the package arrived damaged."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["delayed", "damaged"]
        }
    },
    {
        "id": "sum_002",
        "category": "summarization",
        "input": (
            "Summarize this customer note in 2 bullet points: "
            "The user cannot log in after resetting their password, "
            "and they are frustrated because support has not responded in 48 hours."
        ),
        "checks": {
            "bullet_count": 2,
            "must_include": ["log in", "48 hours"]
        }
    },
    {
        "id": "sum_003",
        "category": "summarization_edge",
        "input": (
            "Summarize this in one short sentence: "
            "The customer says the product is excellent, except that it stopped working after two days."
        ),
        "checks": {
            "max_sentences": 1,
            "must_include": ["stopped working"]
        }
    }
]


def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response using the OpenAI Responses API.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()


def count_sentences(text: str) -> int:
    """
    A simple sentence counter based on terminal punctuation.
    Non-empty text with no terminal punctuation counts as one sentence,
    and abbreviations such as "Mr." count as sentence endings.
    This is intentionally lightweight for demo purposes.
    """
    sentence_endings = [".", "!", "?"]
    count = sum(text.count(mark) for mark in sentence_endings)
    return max(count, 1 if text else 0)


def count_bullets(text: str) -> int:
    """
    Count lines that appear to be bullet points.
    """
    bullet_prefixes = ("- ", "* ")
    return sum(
        1 for line in text.splitlines()
        if line.strip().startswith(bullet_prefixes)
    )


def evaluate_output(output: str, checks: dict) -> dict:
    """
    Apply simple rule-based checks to the model output.
    Returns detailed results for transparency.
    """
    results = {
        "passed": True,
        "details": []
    }

    # Check required substrings.
    for phrase in checks.get("must_include", []):
        found = phrase.lower() in output.lower()
        results["details"].append({
            "check": f"must_include('{phrase}')",
            "passed": found
        })
        if not found:
            results["passed"] = False

    # Check maximum sentence count.
    if "max_sentences" in checks:
        sentence_count = count_sentences(output)
        passed = sentence_count <= checks["max_sentences"]
        results["details"].append({
            "check": f"max_sentences <= {checks['max_sentences']}",
            "actual": sentence_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    # Check exact bullet count.
    if "bullet_count" in checks:
        bullet_count = count_bullets(output)
        passed = bullet_count == checks["bullet_count"]
        results["details"].append({
            "check": f"bullet_count == {checks['bullet_count']}",
            "actual": bullet_count,
            "passed": passed
        })
        if not passed:
            results["passed"] = False

    return results


summary = {
    "total": 0,
    "passed": 0,
    "failed": 0
}

for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    evaluation = evaluate_output(output, test_case["checks"])

    summary["total"] += 1
    if evaluation["passed"]:
        summary["passed"] += 1
    else:
        summary["failed"] += 1

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Category: {test_case['category']}")
    print("Prompt:")
    print(test_case["input"])
    print("\nOutput:")
    print(output)
    print("\nEvaluation Details:")
    for detail in evaluation["details"]:
        print(detail)
    print(f"\nOverall Result: {'PASS' if evaluation['passed'] else 'FAIL'}")

print("\n" + "=" * 80)
print("SUMMARY")
print(summary)

Example output

================================================================================
Test ID: sum_001
Category: summarization
Prompt:
Summarize this in one sentence: Our shipment was delayed by 4 days and the package arrived damaged.

Output:
The shipment was delayed by 4 days and the package arrived damaged.

Evaluation Details:
{'check': "must_include('delayed')", 'passed': True}
{'check': "must_include('damaged')", 'passed': True}
{'check': 'max_sentences <= 1', 'actual': 1, 'passed': True}

Overall Result: PASS
================================================================================
Test ID: sum_002
Category: summarization
Prompt:
Summarize this customer note in 2 bullet points: The user cannot log in after resetting their password, and they are frustrated because support has not responded in 48 hours.

Output:
- The user cannot log in after resetting their password.
- They are frustrated because support has not responded in 48 hours.

Evaluation Details:
{'check': "must_include('log in')", 'passed': True}
{'check': "must_include('48 hours')", 'passed': True}
{'check': 'bullet_count == 2', 'actual': 2, 'passed': True}

Overall Result: PASS

(Output for sum_003 is omitted here; in this run it failed one of its checks, which the summary below reflects.)

================================================================================
SUMMARY
{'total': 3, 'passed': 2, 'failed': 1}

8. Model-Assisted Evaluation

Rule-based checks are useful, but they do not capture qualities like completeness, tone, or faithfulness very well. For those, we can make a separate model call that scores the output against a rubric.

Hands-On Exercise 4: Use an LLM as an Evaluator

Goal

Evaluate a generated answer using a rubric and structured scoring instructions.

Important note

Model-based evaluation is helpful, but should be validated with spot checks by humans.

Python script

import json
from openai import OpenAI

client = OpenAI()

test_case = {
    "id": "sum_004",
    "input": (
        "Summarize this support message in 2 bullet points: "
        "The customer was charged twice for the same subscription, "
        "and they want a refund processed immediately."
    ),
    "expected_behavior": [
        "Exactly 2 bullet points",
        "Mentions duplicate charge",
        "Mentions refund request"
    ]
}


def get_model_response(user_prompt: str) -> str:
    """
    Generate the candidate answer to be evaluated.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": "You are a concise assistant that follows formatting instructions exactly."
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    )
    return response.output_text.strip()


def judge_output(test_case: dict, model_output: str) -> dict:
    """
    Ask the model to evaluate the candidate output against a simple rubric.

    The evaluator is instructed to return JSON only so that the result can
    be parsed programmatically.
    """
    evaluator_prompt = f"""
You are an evaluation assistant.

Score the candidate output against the test case using this rubric:
- relevance: 0 to 2
- completeness: 0 to 2
- format_compliance: 0 to 1

Return JSON only with keys:
relevance, completeness, format_compliance, total_score, verdict, rationale

Test case input:
{test_case["input"]}

Expected behavior:
{json.dumps(test_case["expected_behavior"], indent=2)}

Candidate output:
{model_output}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": (
                    "You are a strict evaluator. "
                    "Return valid JSON only with no markdown fences."
                )
            },
            {
                "role": "user",
                "content": evaluator_prompt
            }
        ]
    )

    raw_text = response.output_text.strip()
    # json.loads raises an error if the evaluator returns invalid JSON;
    # production code should catch and handle that case.
    return json.loads(raw_text)


candidate_output = get_model_response(test_case["input"])
evaluation = judge_output(test_case, candidate_output)

print("Candidate Output:")
print(candidate_output)
print("\nEvaluation:")
print(json.dumps(evaluation, indent=2))

Example output

Candidate Output:
- The customer was charged twice for the same subscription.
- They want a refund processed immediately.

Evaluation:
{
  "relevance": 2,
  "completeness": 2,
  "format_compliance": 1,
  "total_score": 5,
  "verdict": "pass",
  "rationale": "The response captures both key issues and follows the requested 2-bullet format."
}

9. Best Practices for Evaluation Dataset Design

Start small, then expand

Begin with 10–20 high-value test cases before creating a large benchmark.

Include failures you have already seen

Real bugs are some of the best evaluation examples.

Version your datasets

Track changes to test cases over time, just as you would with code.
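
One lightweight way to do this is to keep the dataset in a JSON file with an explicit version field. This is a sketch; the file name, version scheme, and changelog field are illustrative:

```python
import json

# Illustrative versioned dataset file. Any scheme works for "version":
# a date, a git tag, or a simple counter.
dataset = {
    "version": "1.2.0",
    "changelog": "Added two edge cases for empty input.",
    "test_cases": [
        {
            "id": "sum_001",
            "category": "summarization",
            "input": "Summarize this in one sentence: the shipment was delayed.",
            "checks": {"max_sentences": 1},
        }
    ],
}

with open("eval_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

Committing this file alongside your code means every evaluation run can record exactly which dataset version it used.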

Keep prompts stable during comparisons

If you are evaluating prompt changes, avoid changing the dataset simultaneously.

Mix easy, medium, and hard cases

A dataset with only simple examples can create false confidence.

Separate generation from evaluation

Your application prompt and your evaluation rubric should serve different purposes.

Review failing cases manually

Automated scoring is useful, but inspection reveals deeper issues.


10. Common Mistakes

Vague success criteria

Bad: “Should be good”
Better: “Must mention refund request and use exactly 2 bullet points”

Overfitting to the eval set

If you optimize only for known tests, your system may not generalize.

Relying only on exact match

Many good outputs can be phrased differently.

Ignoring edge cases

Production failures often happen in rare but important scenarios.

Not storing test metadata

Without categories and notes, analyzing failures becomes harder.


11. Mini Project

Hands-On Exercise 5: Create and Evaluate Your Own Dataset

Task

Build a dataset with at least 5 test cases for one of these tasks:

  • email summarization
  • support ticket classification
  • structured data extraction
  • rewriting text in a specified tone

Requirements

Each test case should include:

  • id
  • category
  • input
  • expected_behavior
  • checks

Then:

  1. Generate outputs with gpt-5.4-mini
  2. Run rule-based checks
  3. Print a summary table
  4. Identify at least one weak test and improve it

Starter template

from openai import OpenAI

client = OpenAI()

eval_dataset = [
    {
        "id": "custom_001",
        "category": "classification",
        "input": "Classify this ticket as billing, technical, or account: I was charged twice this month.",
        "expected_behavior": [
            "Returns one valid label",
            "Chooses billing"
        ],
        "checks": {
            "allowed_labels": ["billing", "technical", "account"],
            "expected_label": "billing"
        }
    }
]


def get_model_response(user_prompt: str) -> str:
    """
    Generate a model response for a single test case.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": "You are a precise assistant."},
            {"role": "user", "content": user_prompt}
        ]
    )
    return response.output_text.strip()


def evaluate_classification(output: str, checks: dict) -> dict:
    """
    Evaluate a classification output using simple exact-match rules.
    """
    normalized = output.lower().strip()

    details = []
    passed = True

    in_allowed = normalized in [label.lower() for label in checks["allowed_labels"]]
    details.append({
        "check": "label_in_allowed_set",
        "passed": in_allowed
    })
    if not in_allowed:
        passed = False

    matches_expected = normalized == checks["expected_label"].lower()
    details.append({
        "check": "matches_expected_label",
        "passed": matches_expected
    })
    if not matches_expected:
        passed = False

    return {
        "passed": passed,
        "details": details
    }


for test_case in eval_dataset:
    output = get_model_response(test_case["input"])
    result = evaluate_classification(output, test_case["checks"])

    print("=" * 80)
    print(f"Test ID: {test_case['id']}")
    print(f"Prompt: {test_case['input']}")
    print(f"Output: {output}")
    print("Evaluation:")
    for detail in result["details"]:
        print(detail)
    print(f"Result: {'PASS' if result['passed'] else 'FAIL'}")

Suggested extension

Add a CSV or JSON export step so results can be compared between runs.
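
A sketch of the CSV variant, using the standard library. The result rows here are hypothetical; in practice you would collect them from the evaluation loop above:

```python
import csv

# Hypothetical results from a single evaluation run.
results = [
    {"run": "run_001", "id": "custom_001", "category": "classification", "passed": True},
]

# Write one row per test case so runs can be diffed or loaded into a spreadsheet.
with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run", "id", "category", "passed"])
    writer.writeheader()
    writer.writerows(results)
```

Tagging each row with a run identifier (a timestamp or git commit works well) makes before/after comparisons straightforward.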


12. Wrap-Up

Key takeaways

  • Evaluation is essential for reliable GenAI systems
  • Strong test cases define both input and success criteria
  • Evaluation datasets should cover normal, edge, and regression cases
  • Rule-based checks are great for structure and constraints
  • Model-assisted evaluation helps score nuanced qualities
  • The best evaluation workflows combine automation with human review

Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API reference: https://platform.openai.com/docs/api-reference
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • JSON basics in Python: https://docs.python.org/3/library/json.html

Suggested Homework

  1. Expand the evaluation dataset from 3 test cases to 15.
  2. Add at least:
       • 5 normal cases
       • 5 edge cases
       • 5 regression cases
  3. Implement one model-assisted evaluator with a scoring rubric.
  4. Compare results before and after changing your system prompt.
  5. Write a short reflection:
       • Which tests were easiest to automate?
       • Which required subjective judgment?
