
Session 1: Why Evaluation Matters in GenAI Systems

Synopsis

Introduces the challenge of measuring model behavior and system quality when outputs are probabilistic and context-dependent, and motivates defining task-specific success metrics.

Session Content


Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Focus: Understanding why evaluation is essential in GenAI systems, what can go wrong without it, and how to build a simple evaluation workflow using the OpenAI Responses API with gpt-5.4-mini.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain why GenAI systems require evaluation beyond traditional software testing
  • Identify common failure modes in LLM-powered applications
  • Distinguish between qualitative and quantitative evaluation
  • Understand the difference between evaluating a model, a prompt, and a full GenAI system
  • Build a simple Python-based evaluation loop using the OpenAI Responses API
  • Use an LLM as a lightweight evaluator for structured scoring

1. Introduction: Why Evaluation Matters

Traditional software is usually deterministic:

  • Same input -> same output
  • Behavior can often be validated with fixed assertions
  • Unit and integration tests catch many issues

GenAI systems are different:

  • Outputs are probabilistic
  • Multiple answers may be acceptable
  • Quality depends on prompts, context, tools, retrieval, model choice, and system design
  • Failures are often subtle rather than binary

This means evaluation is not optional. It is a core engineering discipline for GenAI systems.

Why evaluation is essential

Evaluation helps you:

  • Detect regressions when prompts or models change
  • Compare prompt versions objectively
  • Measure quality before shipping
  • Understand tradeoffs between accuracy, cost, speed, and safety
  • Build trust in your application

Common misconception

A common mistake is:

“I tried my prompt a few times and it looked good.”

That is not evaluation; it is anecdotal inspection.

Real evaluation requires:

  • A representative dataset
  • Clear criteria
  • Consistent scoring
  • Repeatable workflows
  • Tracking results over time
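Tracking results over time only works if every run is recorded somewhere durable. A minimal sketch, assuming a JSONL history file (the `record_run` helper and its field layout are hypothetical, not part of the session code):

```python
import datetime
import json

def record_run(results: dict, path: str) -> None:
    """Append a timestamped evaluation run to a JSONL history file."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        **results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Appending one JSON object per run keeps the history trivially diffable and easy to load later for trend charts.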

2. What Can Go Wrong in GenAI Systems

A GenAI system can fail in many ways even when the output sounds convincing.

Common failure modes

2.1 Hallucination

The model produces false or unsupported information.

Example:
A support assistant invents a refund policy that does not exist.

2.2 Instruction non-compliance

The model ignores constraints.

Example:
You ask for a JSON response, but it returns prose.
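Format non-compliance like this is usually machine-checkable. A minimal sketch (the `follows_json_instruction` helper is hypothetical):

```python
import json

def follows_json_instruction(output: str) -> bool:
    """Return True only if the output parses as JSON, i.e. the format instruction was obeyed."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

follows_json_instruction('{"name": "Ada"}')      # True
follows_json_instruction("Sure! Here it is...")  # False
```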

2.3 Incomplete answers

The model answers only part of the question.

Example:
A summarizer omits key points from a legal or medical document.

2.4 Poor formatting

The content may be correct, but unusable by downstream systems.

Example:
A tool-calling pipeline expects a strict structure, but the output is free text.

2.5 Unsafe or biased content

The system generates harmful, discriminatory, or risky content.

2.6 Retrieval mismatch

In RAG systems, the model answers based on the wrong context or ignores the retrieved evidence.

2.7 Tool misuse

In agentic systems, the model selects the wrong tool, uses wrong arguments, or executes unnecessary actions.


3. What Are We Evaluating?

In GenAI, you may be evaluating different layers of the system.

3.1 Model evaluation

How capable is the model in general for your task?

Examples:

  • Accuracy on Q&A
  • Summarization quality
  • Reasoning performance
  • Cost and latency

3.2 Prompt evaluation

How well does a specific prompt perform?

Examples:

  • Does Prompt A produce more reliable JSON than Prompt B?
  • Does adding examples improve answer quality?

3.3 System evaluation

How well does the entire application perform?

Examples:

  • User query -> retrieval -> generation -> output formatting
  • Agent decides whether to call a tool -> tool returns data -> model responds

System evaluation is usually the most important in production because users experience the whole pipeline, not just the model in isolation.


4. Types of Evaluation

Evaluation can be broadly split into two categories.

4.1 Qualitative evaluation

This is manual inspection.

You review outputs and ask:

  • Is this helpful?
  • Is it accurate?
  • Is the tone appropriate?
  • Did it follow instructions?

Strengths:

  • Good for early-stage exploration
  • Helps uncover subtle issues
  • Useful for understanding failure patterns

Weaknesses:

  • Hard to scale
  • Subjective
  • Can be inconsistent

4.2 Quantitative evaluation

This assigns measurable scores.

Examples:

  • Exact match
  • Pass/fail on formatting
  • Helpfulness score from 1–5
  • Groundedness score
  • Safety violations count
  • Latency and token usage
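Two of the simplest metrics above can be sketched as plain functions (hypothetical helpers, not part of the session scripts):

```python
def exact_match(output: str, reference: str) -> int:
    """1 if the output equals the reference after trimming whitespace, else 0."""
    return int(output.strip() == reference.strip())

def passes_threshold(score: int, threshold: int = 4) -> bool:
    """Turn a 1-5 rubric score into a pass/fail signal."""
    return score >= threshold
```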

Strengths:

  • Scalable
  • Repeatable
  • Useful for comparing versions

Weaknesses:

  • Requires careful rubric design
  • Some tasks are difficult to score automatically

Best practice

Use both:

  • Qualitative evaluation to discover issues
  • Quantitative evaluation to track them over time

5. Evaluation Criteria: What Should You Measure?

Your metrics should match the application.

Common criteria

Correctness

Is the answer factually or logically correct?

Relevance

Does the answer address the user’s request?

Completeness

Does it cover all important parts?

Faithfulness / groundedness

Is the answer supported by provided context?

Instruction adherence

Did it follow the format and constraints?

Safety

Did it avoid harmful or disallowed outputs?

Latency

How long did it take?

Cost

How many tokens were used?
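Token counts translate into dollars once you know per-token prices. A rough sketch, where the default prices are placeholders only (check your provider's current pricing):

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 0.15,   # placeholder: USD per 1M input tokens
    output_price_per_m: float = 0.60,  # placeholder: USD per 1M output tokens
) -> float:
    """Estimate the dollar cost of a call from token counts and per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m + (
        output_tokens / 1_000_000
    ) * output_price_per_m
```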

Example: same output, different judgments

A concise answer may be:

  • Great for a chatbot
  • Bad for a legal summary
  • Dangerous for a medical assistant

Evaluation always depends on task context.


6. Designing a Simple Evaluation Dataset

A good evaluation starts with examples.

What an evaluation dataset should contain

For each test case, include:

  • Input: the user query or task
  • Expected behavior: what a good answer should do
  • Optional reference answer: one acceptable answer
  • Scoring rubric: how to assess the output

Example dataset shape

eval_cases = [
    {
        "id": "case_1",
        "user_input": "Summarize the benefits of unit testing in two bullet points.",
        "reference": "- Catches bugs early\n- Makes refactoring safer",
        "criteria": ["relevance", "format", "conciseness"]
    },
    {
        "id": "case_2",
        "user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
        "reference": '{"name": "Ada Lovelace", "role": "Mathematician"}',
        "criteria": ["format", "correctness"]
    }
]

Tips for building useful eval sets

  • Start small: 10–20 high-value examples
  • Include edge cases
  • Include examples that previously failed
  • Keep the set versioned
  • Update it as product requirements evolve

7. Manual vs LLM-as-Judge Evaluation

There are several ways to score outputs.

7.1 Rule-based evaluation

Useful when the expected output is structured.

Examples:

  • Check whether the output is valid JSON
  • Check whether required fields exist
  • Check max length
  • Check if forbidden phrases appear
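Several of these checks fit naturally into a single function that returns one flag per rule. A minimal sketch (the function name, limits, and phrase list are assumptions):

```python
def rule_checks(
    output: str,
    max_chars: int = 500,
    forbidden: tuple = ("lorem ipsum",),
) -> dict:
    """Run deterministic checks and return a pass/fail flag per rule."""
    return {
        "within_max_length": len(output) <= max_chars,
        "no_forbidden_phrases": not any(
            phrase.lower() in output.lower() for phrase in forbidden
        ),
    }
```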

7.2 Human evaluation

Best when quality is nuanced.

Examples:

  • Creativity
  • Tone
  • Helpfulness
  • Domain-specific adequacy

7.3 LLM-as-judge

An LLM can evaluate another model output using a rubric.

Examples:

  • Score helpfulness from 1–5
  • Judge whether an answer is grounded in supplied context
  • Compare two prompt variants

Benefits:

  • Fast
  • Scalable
  • Great for early internal evaluation

Caution:

  • Judges can be biased or inconsistent
  • Use clear rubrics
  • Validate the evaluator on a sample with human review
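One way to validate a judge is to measure how often it agrees with human scores on a labeled sample. A minimal sketch (the `judge_agreement` helper is hypothetical):

```python
def judge_agreement(judge_scores: list, human_scores: list, tolerance: int = 1) -> float:
    """Fraction of cases where the judge is within `tolerance` of the human score."""
    pairs = list(zip(judge_scores, human_scores))
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)
```

If agreement is low, tighten the rubric or add scored examples to the judging prompt before trusting the judge at scale.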

8. Session Demo Architecture

In this session, we will build a small evaluation loop:

  1. Send prompts to gpt-5.4-mini
  2. Collect outputs
  3. Evaluate them using:
       • simple rule-based checks
       • an LLM judge with a structured rubric
  4. Print scores and summary statistics

This is intentionally simple, but it mirrors the foundations of real GenAI evaluation pipelines.


9. Setup

Prerequisites

  • Python 3.9+
  • OpenAI Python SDK installed
  • OpenAI API key available as an environment variable

Install dependencies

pip install openai

Set your API key

export OPENAI_API_KEY="your_api_key_here"

On Windows PowerShell:

$env:OPENAI_API_KEY = "your_api_key_here"

(Use setx OPENAI_API_KEY "your_api_key_here" to persist the variable for future sessions; note that setx does not affect the current one.)

10. Hands-On Exercise 1: Generate Outputs for an Evaluation Set

Goal

Create a small evaluation dataset and use gpt-5.4-mini to produce outputs for each case.

What learners will practice

  • Using the OpenAI Responses API
  • Structuring evaluation cases in Python
  • Collecting model outputs for later scoring

Code

"""
Exercise 1: Generate outputs for a small evaluation dataset
using the OpenAI Responses API and gpt-5.4-mini.

Run:
    python exercise_1_generate_outputs.py
"""

from openai import OpenAI

# Create a reusable client. The SDK reads OPENAI_API_KEY from the environment.
client = OpenAI()

# A small evaluation dataset with varied requirements.
eval_cases = [
    {
        "id": "case_1",
        "task_type": "bullet_summary",
        "user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
    },
    {
        "id": "case_2",
        "task_type": "json_format",
        "user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
    },
    {
        "id": "case_3",
        "task_type": "short_explanation",
        "user_input": "Explain what a Python list comprehension is in one sentence.",
    },
]

def generate_response(user_input: str) -> str:
    """
    Generate a response from gpt-5.4-mini using the Responses API.

    Args:
        user_input: The prompt to send to the model.

    Returns:
        The model's output text.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=user_input,
    )
    return response.output_text

def main() -> None:
    print("Generating model outputs...\n")

    for case in eval_cases:
        output = generate_response(case["user_input"])
        case["model_output"] = output

        print(f"--- {case['id']} ---")
        print(f"INPUT: {case['user_input']}")
        print("OUTPUT:")
        print(output)
        print()

if __name__ == "__main__":
    main()

Example Output

Generating model outputs...

--- case_1 ---
INPUT: Summarize the benefits of unit testing in exactly two bullet points.
OUTPUT:
- Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.

--- case_2 ---
INPUT: Return a JSON object with keys name and role for Ada Lovelace.
OUTPUT:
{"name": "Ada Lovelace", "role": "Mathematician"}

--- case_3 ---
INPUT: Explain what a Python list comprehension is in one sentence.
OUTPUT:
A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.

Discussion

At this stage, we have generated outputs, but we do not yet know:

  • Whether the model followed the instructions exactly
  • Whether the formatting is valid
  • Whether one prompt version is better than another

That is where evaluation begins.


11. Hands-On Exercise 2: Add Rule-Based Evaluation

Goal

Score outputs using deterministic checks.

What learners will practice

  • Turning vague requirements into concrete checks
  • Building a simple evaluation function
  • Understanding the strengths and limits of rule-based scoring

Code

"""
Exercise 2: Add rule-based evaluation checks.

This script:
1. Generates outputs for a small eval dataset
2. Scores them with simple deterministic checks
3. Prints a summary

Run:
    python exercise_2_rule_based_eval.py
"""

import json
from openai import OpenAI

client = OpenAI()

eval_cases = [
    {
        "id": "case_1",
        "task_type": "bullet_summary",
        "user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
    },
    {
        "id": "case_2",
        "task_type": "json_format",
        "user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
    },
    {
        "id": "case_3",
        "task_type": "short_explanation",
        "user_input": "Explain what a Python list comprehension is in one sentence.",
    },
]

def generate_response(user_input: str) -> str:
    """Call the model and return plain text output."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=user_input,
    )
    return response.output_text.strip()

def score_case(case: dict, output: str) -> dict:
    """
    Score a model output with basic rule-based checks.

    Returns a dictionary of metric_name -> score (0 or 1).
    """
    scores = {}

    if case["task_type"] == "bullet_summary":
        lines = [line.strip() for line in output.splitlines() if line.strip()]
        bullet_lines = [line for line in lines if line.startswith("- ") or line.startswith("* ")]
        scores["exactly_two_bullets"] = 1 if len(bullet_lines) == 2 and len(lines) == 2 else 0

    elif case["task_type"] == "json_format":
        try:
            parsed = json.loads(output)
            has_keys = isinstance(parsed, dict) and "name" in parsed and "role" in parsed
            scores["valid_json_with_required_keys"] = 1 if has_keys else 0
        except json.JSONDecodeError:
            scores["valid_json_with_required_keys"] = 0

    elif case["task_type"] == "short_explanation":
        # Lightweight heuristic: a single sentence should contain exactly one terminator.
        sentence_endings = output.count(".") + output.count("!") + output.count("?")
        scores["one_sentence"] = 1 if sentence_endings == 1 else 0
        scores["non_empty"] = 1 if len(output.strip()) > 0 else 0

    return scores

def main() -> None:
    overall_score = 0
    total_checks = 0

    for case in eval_cases:
        output = generate_response(case["user_input"])
        scores = score_case(case, output)

        print(f"=== {case['id']} ===")
        print(f"Input: {case['user_input']}")
        print(f"Output: {output}")
        print(f"Scores: {scores}")
        print()

        overall_score += sum(scores.values())
        total_checks += len(scores)

    if total_checks > 0:
        print(f"Overall rule-based score: {overall_score}/{total_checks} = {overall_score / total_checks:.2%}")
    else:
        print("No evaluation checks were run.")

if __name__ == "__main__":
    main()

Example Output

=== case_1 ===
Input: Summarize the benefits of unit testing in exactly two bullet points.
Output: - Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.
Scores: {'exactly_two_bullets': 1}

=== case_2 ===
Input: Return a JSON object with keys name and role for Ada Lovelace.
Output: {"name": "Ada Lovelace", "role": "Mathematician"}
Scores: {'valid_json_with_required_keys': 1}

=== case_3 ===
Input: Explain what a Python list comprehension is in one sentence.
Output: A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.
Scores: {'one_sentence': 1, 'non_empty': 1}

Overall rule-based score: 4/4 = 100.00%

Discussion

Rule-based evaluation is powerful when:

  • Outputs are structured
  • Constraints are explicit
  • Success criteria are machine-checkable

But rule-based checks cannot fully measure:

  • Helpfulness
  • Accuracy in complex domains
  • Nuance
  • Writing quality

For those, we often need human review or LLM-based evaluation.


12. Hands-On Exercise 3: Use an LLM as a Judge

Goal

Use gpt-5.4-mini to score model outputs according to a rubric.

What learners will practice

  • Writing a judging prompt
  • Requesting structured JSON output
  • Turning subjective quality into a repeatable scoring workflow

Design

We will ask the model to score each answer on:

  • relevance: Does it address the user request?
  • instruction_following: Does it obey formatting and constraints?
  • clarity: Is it easy to understand?

Each score will be from 1 to 5.

Code

"""
Exercise 3: LLM-as-judge evaluation using the Responses API.

This script:
1. Generates answers with gpt-5.4-mini
2. Uses gpt-5.4-mini again as a judge
3. Requests structured JSON for easy parsing

Run:
    python exercise_3_llm_judge.py
"""

import json
from openai import OpenAI

client = OpenAI()

eval_cases = [
    {
        "id": "case_1",
        "user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
    },
    {
        "id": "case_2",
        "user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
    },
    {
        "id": "case_3",
        "user_input": "Explain what a Python list comprehension is in one sentence.",
    },
]

def generate_response(user_input: str) -> str:
    """Generate a candidate answer from the model."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=user_input,
    )
    return response.output_text.strip()

def judge_response(user_input: str, model_output: str) -> dict:
    """
    Ask the model to judge the candidate output and return strict JSON.

    The prompt defines a clear rubric and asks for machine-readable output.
    """
    judge_prompt = f"""
You are an evaluator for GenAI outputs.

Score the candidate response using this rubric:
- relevance: Does the response address the user's request? Score 1 to 5.
- instruction_following: Does the response follow explicit instructions such as format or length? Score 1 to 5.
- clarity: Is the response clear and understandable? Score 1 to 5.

Return ONLY valid JSON with this exact schema:
{{
  "relevance": <integer 1-5>,
  "instruction_following": <integer 1-5>,
  "clarity": <integer 1-5>,
  "overall": <integer 1-5>,
  "reason": "<one short sentence>"
}}

User input:
{user_input}

Candidate response:
{model_output}
""".strip()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=judge_prompt,
    )

    text = response.output_text.strip()

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback result if parsing fails.
        return {
            "relevance": 1,
            "instruction_following": 1,
            "clarity": 1,
            "overall": 1,
            "reason": f"Judge output was not valid JSON: {text}",
        }

def main() -> None:
    all_scores = []

    for case in eval_cases:
        candidate = generate_response(case["user_input"])
        judgment = judge_response(case["user_input"], candidate)
        all_scores.append(judgment["overall"])

        print(f"=== {case['id']} ===")
        print(f"Input: {case['user_input']}")
        print(f"Candidate Output: {candidate}")
        print("Judge Result:")
        print(json.dumps(judgment, indent=2))
        print()

    if all_scores:
        avg_score = sum(all_scores) / len(all_scores)
        print(f"Average overall judge score: {avg_score:.2f}/5.00")

if __name__ == "__main__":
    main()

Example Output

=== case_1 ===
Input: Summarize the benefits of unit testing in exactly two bullet points.
Candidate Output: - Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.
Judge Result:
{
  "relevance": 5,
  "instruction_following": 5,
  "clarity": 5,
  "overall": 5,
  "reason": "The response fully addresses the request in the required format."
}

=== case_2 ===
Input: Return a JSON object with keys name and role for Ada Lovelace.
Candidate Output: {"name": "Ada Lovelace", "role": "Mathematician"}
Judge Result:
{
  "relevance": 5,
  "instruction_following": 5,
  "clarity": 5,
  "overall": 5,
  "reason": "The response is correct, clear, and follows the requested JSON format."
}

=== case_3 ===
Input: Explain what a Python list comprehension is in one sentence.
Candidate Output: A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.
Judge Result:
{
  "relevance": 5,
  "instruction_following": 5,
  "clarity": 5,
  "overall": 5,
  "reason": "The response clearly explains the concept in a single sentence."
}

Average overall judge score: 5.00/5.00

Discussion

This gives us a more flexible evaluation signal than strict rules alone.

However, the judging model is still a model, so:

  • its decisions may vary
  • it may be overly generous
  • it can inherit biases from the rubric

This is why production evaluation usually combines:

  • deterministic checks
  • LLM judging
  • targeted human review

13. Hands-On Exercise 4: Compare Two Prompt Variants

Goal

Demonstrate how evaluation helps compare prompt designs instead of relying on intuition.

Scenario

We will compare:

  • Prompt A: direct request
  • Prompt B: request with added formatting guidance

Code

"""
Exercise 4: Compare two prompt variants using evaluation.

This script evaluates whether Prompt B improves instruction following
for structured output tasks.

Run:
    python exercise_4_prompt_comparison.py
"""

import json
from statistics import mean
from openai import OpenAI

client = OpenAI()

base_tasks = [
    "Return a JSON object with keys city and country for Tokyo.",
    "Return a JSON object with keys language and creator for Python.",
    "Return a JSON object with keys planet and type for Earth.",
]

def run_prompt(prompt: str) -> str:
    """Generate a response for a given prompt."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )
    return response.output_text.strip()

def is_valid_json_with_two_keys(output: str) -> int:
    """
    Return 1 if output is valid JSON with exactly two keys, else 0.
    """
    try:
        parsed = json.loads(output)
        return 1 if isinstance(parsed, dict) and len(parsed.keys()) == 2 else 0
    except json.JSONDecodeError:
        return 0

def main() -> None:
    prompt_a_scores = []
    prompt_b_scores = []

    for task in base_tasks:
        prompt_a = task
        prompt_b = (
            task
            + " Output only valid JSON. Do not include markdown, explanation, or extra text."
        )

        output_a = run_prompt(prompt_a)
        output_b = run_prompt(prompt_b)

        score_a = is_valid_json_with_two_keys(output_a)
        score_b = is_valid_json_with_two_keys(output_b)

        prompt_a_scores.append(score_a)
        prompt_b_scores.append(score_b)

        print("TASK:", task)
        print("Prompt A Output:", output_a)
        print("Prompt A Score:", score_a)
        print("Prompt B Output:", output_b)
        print("Prompt B Score:", score_b)
        print("-" * 60)

    print(f"Prompt A average JSON compliance: {mean(prompt_a_scores):.2%}")
    print(f"Prompt B average JSON compliance: {mean(prompt_b_scores):.2%}")

if __name__ == "__main__":
    main()

Example Output

TASK: Return a JSON object with keys city and country for Tokyo.
Prompt A Output: {"city": "Tokyo", "country": "Japan"}
Prompt A Score: 1
Prompt B Output: {"city": "Tokyo", "country": "Japan"}
Prompt B Score: 1
------------------------------------------------------------
TASK: Return a JSON object with keys language and creator for Python.
Prompt A Output: {"language": "Python", "creator": "Guido van Rossum"}
Prompt A Score: 1
Prompt B Output: {"language": "Python", "creator": "Guido van Rossum"}
Prompt B Score: 1
------------------------------------------------------------
TASK: Return a JSON object with keys planet and type for Earth.
Prompt A Output: {"planet": "Earth", "type": "Terrestrial"}
Prompt A Score: 1
Prompt B Output: {"planet": "Earth", "type": "Terrestrial"}
Prompt B Score: 1
------------------------------------------------------------
Prompt A average JSON compliance: 100.00%
Prompt B average JSON compliance: 100.00%

Discussion

In this tiny dataset, both prompts may perform equally well. That does not mean the prompts are equally good in all cases.

The purpose of evaluation is to help you answer questions like:

  • Which prompt is more reliable across edge cases?
  • Which one gives better formatting compliance?
  • Which one is cheaper or shorter?
  • Which one fails more gracefully?

Without evaluation, prompt comparison becomes guesswork.


14. Best Practices for GenAI Evaluation

14.1 Start with product goals

Ask:

  • What matters most to users?
  • What failures are unacceptable?
  • What does “good” mean for this task?

14.2 Evaluate the full system, not just the model

A great model can still produce a bad user experience if:

  • retrieval is poor
  • tool outputs are wrong
  • prompts are weak
  • formatting breaks downstream consumers

14.3 Use small, high-quality eval sets early

You do not need hundreds of cases to start.
A curated set of 10–20 cases can reveal major issues quickly.

14.4 Track regressions over time

Every prompt or model change should be tested against the same evaluation set, so that score drops are caught before they reach users.
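A simple way to operationalize this is to compare a candidate run's metrics against a stored baseline. A minimal sketch, assuming metrics are kept as name-to-score dictionaries (the helper name is hypothetical):

```python
def detect_regressions(baseline: dict, candidate: dict) -> list:
    """Return the metrics where the candidate run scores below the baseline run."""
    return [metric for metric in baseline if candidate.get(metric, 0) < baseline[metric]]
```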

14.5 Include edge cases

Examples:

  • ambiguous prompts
  • very short prompts
  • contradictory instructions
  • formatting-heavy requests
  • previously failed user scenarios

14.6 Combine metrics

Do not rely on a single score.
A system can improve in one dimension while getting worse in another.

Example tradeoff:

  • Better correctness
  • Worse latency
  • Higher cost

14.7 Keep humans in the loop

Especially for:

  • safety-sensitive applications
  • domain expertise
  • subjective quality judgments
  • calibration of LLM judges

15. Common Pitfalls

Pitfall 1: Evaluating only “happy path” prompts

This creates false confidence.

Pitfall 2: Using vague rubrics

If “good answer” is not defined, scoring becomes inconsistent.

Pitfall 3: Overfitting to the eval set

You may optimize for a small set of examples without improving real user experience.

Pitfall 4: Ignoring cost and latency

A slightly better output may not justify a much slower or more expensive system.

Pitfall 5: Trusting LLM judges blindly

They are useful, but not infallible.

Pitfall 6: Measuring only generation quality

For agentic systems, you must also evaluate:

  • tool selection
  • tool argument quality
  • state transitions
  • action ordering
  • recovery from failure

16. Mini Recap

Evaluation matters because GenAI systems are probabilistic, flexible, and failure-prone in ways that traditional software testing alone cannot capture.

A practical evaluation strategy often includes:

  • a curated dataset
  • rule-based checks
  • LLM-based judging
  • occasional human review
  • repeated measurement over time

If you can measure quality, you can improve it systematically.


17. Suggested 45-Minute Delivery Plan

Part 1: Theory and discussion (~20 minutes)

  • 5 min: Why GenAI needs evaluation
  • 5 min: Common failure modes
  • 5 min: What to evaluate: model vs prompt vs system
  • 5 min: Qualitative vs quantitative methods

Part 2: Hands-on coding (~20 minutes)

  • 7 min: Exercise 1 generate outputs
  • 6 min: Exercise 2 rule-based scoring
  • 7 min: Exercise 3 LLM-as-judge scoring

Part 3: Wrap-up (~5 minutes)

  • Prompt comparison concept
  • Best practices
  • Q&A

18. Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompting guide: https://platform.openai.com/docs/guides/text
  • JSON basics in Python: https://docs.python.org/3/library/json.html

19. Practice Tasks for Learners

Try these after the session:

  1. Add 5 more evaluation cases, including at least 2 edge cases.
  2. Extend the rule-based evaluator to check:
       • maximum word count
       • presence of required keywords
       • valid JSON schema-like structure
  3. Add a new LLM judge criterion:
       • factuality
       • conciseness
       • tone
  4. Compare two different prompt versions and store the scores in a CSV file.
  5. Modify the evaluator so it prints failures first, making debugging easier.
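For the CSV task, a minimal writing sketch using Python's standard library (the column names are assumptions; adapt them to your cases):

```python
import csv

def save_scores(rows: list, path: str = "scores.csv") -> None:
    """Write per-case scores to a CSV file for later comparison."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_id", "prompt_version", "score"])
        writer.writeheader()
        writer.writerows(rows)
```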

20. Takeaway

In GenAI development, evaluation is how you move from “it seems to work” to “we know how well it works.”

That shift is foundational for building reliable, scalable, and trustworthy GenAI systems.

