Session 1: Why Evaluation Matters in GenAI Systems
Synopsis
Introduces the challenge of measuring model behavior and system quality when outputs are probabilistic and context-dependent. Learners will see why defining task-specific success metrics is essential.
Session Content
Session Overview
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Focus: Understanding why evaluation is essential in GenAI systems, what can go wrong without it, and how to build a simple evaluation workflow using the OpenAI Responses API with gpt-5.4-mini.
Learning Objectives
By the end of this session, learners will be able to:
- Explain why GenAI systems require evaluation beyond traditional software testing
- Identify common failure modes in LLM-powered applications
- Distinguish between qualitative and quantitative evaluation
- Understand the difference between evaluating a model, a prompt, and a full GenAI system
- Build a simple Python-based evaluation loop using the OpenAI Responses API
- Use an LLM as a lightweight evaluator for structured scoring
1. Introduction: Why Evaluation Matters
Traditional software is usually deterministic:
- Same input -> same output
- Behavior can often be validated with fixed assertions
- Unit and integration tests catch many issues
GenAI systems are different:
- Outputs are probabilistic
- Multiple answers may be acceptable
- Quality depends on prompts, context, tools, retrieval, model choice, and system design
- Failures are often subtle rather than binary
This means evaluation is not optional. It is a core engineering discipline for GenAI systems.
Why evaluation is essential
Evaluation helps you:
- Detect regressions when prompts or models change
- Compare prompt versions objectively
- Measure quality before shipping
- Understand tradeoffs between accuracy, cost, speed, and safety
- Build trust in your application
Common misconception
A common mistake is:
“I tried my prompt a few times and it looked good.”
That is not evaluation; it is anecdotal inspection.
Real evaluation requires:
- A representative dataset
- Clear criteria
- Consistent scoring
- Repeatable workflows
- Tracking results over time
2. What Can Go Wrong in GenAI Systems
A GenAI system can fail in many ways even when the output sounds convincing.
Common failure modes
2.1 Hallucination
The model produces false or unsupported information.
Example:
A support assistant invents a refund policy that does not exist.
2.2 Instruction non-compliance
The model ignores constraints.
Example:
You ask for a JSON response, but it returns prose.
2.3 Incomplete answers
The model answers only part of the question.
Example:
A summarizer omits key points from a legal or medical document.
2.4 Poor formatting
The content may be correct, but unusable by downstream systems.
Example:
A tool-calling pipeline expects a strict structure, but the output is free text.
2.5 Unsafe or biased content
The system generates harmful, discriminatory, or risky content.
2.6 Retrieval mismatch
In RAG systems, the model answers based on the wrong context or ignores the retrieved evidence.
2.7 Tool misuse
In agentic systems, the model selects the wrong tool, uses wrong arguments, or executes unnecessary actions.
3. What Are We Evaluating?
In GenAI, you may be evaluating different layers of the system.
3.1 Model evaluation
How capable is the model in general for your task?
Examples:
- Accuracy on Q&A
- Summarization quality
- Reasoning performance
- Cost and latency
3.2 Prompt evaluation
How well does a specific prompt perform?
Examples:
- Does Prompt A produce more reliable JSON than Prompt B?
- Does adding examples improve answer quality?
3.3 System evaluation
How well does the entire application perform?
Examples:
- User query -> retrieval -> generation -> output formatting
- Agent decides whether to call a tool -> tool returns data -> model responds
System evaluation is usually the most important in production because users experience the whole pipeline, not just the model in isolation.
4. Types of Evaluation
Evaluation can be broadly split into two categories.
4.1 Qualitative evaluation
This is manual inspection.
You review outputs and ask:
- Is this helpful?
- Is it accurate?
- Is the tone appropriate?
- Did it follow instructions?
Strengths:
- Good for early-stage exploration
- Helps uncover subtle issues
- Useful for understanding failure patterns
Weaknesses:
- Hard to scale
- Subjective
- Can be inconsistent
4.2 Quantitative evaluation
This assigns measurable scores.
Examples:
- Exact match
- Pass/fail on formatting
- Helpfulness score from 1–5
- Groundedness score
- Safety violations count
- Latency and token usage
Strengths:
- Scalable
- Repeatable
- Useful for comparing versions
Weaknesses:
- Requires careful rubric design
- Some tasks are difficult to score automatically
Best practice
Use both:
- Qualitative evaluation to discover issues
- Quantitative evaluation to track them over time
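For instance, two of the simplest quantitative metrics, exact match and a pass/fail formatting check, can each be implemented in a few lines. This is a minimal sketch; the helper names are illustrative, not from a library:

```python
import json

def exact_match(output: str, reference: str) -> int:
    """Return 1 if the output matches the reference exactly, ignoring surrounding whitespace."""
    return 1 if output.strip() == reference.strip() else 0

def json_parses(output: str) -> int:
    """Pass/fail formatting check: return 1 if the output is valid JSON."""
    try:
        json.loads(output)
        return 1
    except json.JSONDecodeError:
        return 0

print(exact_match("Paris", "Paris "))    # 1
print(json_parses('{"city": "Paris"}'))  # 1
print(json_parses("The city is Paris"))  # 0
```

Checks like these are crude on their own, but because they are deterministic they can be run on every prompt or model change.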
5. Evaluation Criteria: What Should You Measure?
Your metrics should match the application.
Common criteria
Correctness
Is the answer factually or logically correct?
Relevance
Does the answer address the user’s request?
Completeness
Does it cover all important parts?
Faithfulness / groundedness
Is the answer supported by provided context?
Instruction adherence
Did it follow the format and constraints?
Safety
Did it avoid harmful or disallowed outputs?
Latency
How long did it take?
Cost
How many tokens were used?
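Latency and cost are the easiest criteria to quantify. As a minimal sketch, cost can be estimated from token counts; the per-million-token prices below are placeholders, not real rates:

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 0.15,
                      output_price_per_m: float = 0.60) -> float:
    """Estimate the cost of one request from its token counts.

    The default prices are hypothetical placeholders; substitute your
    provider's current per-million-token rates.
    """
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

print(round(estimate_cost_usd(1200, 300), 6))  # 0.00036
```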
Example: same output, different judgments
A concise answer may be:
- Great for a chatbot
- Bad for a legal summary
- Dangerous for a medical assistant
Evaluation always depends on task context.
6. Designing a Simple Evaluation Dataset
A good evaluation starts with examples.
What an evaluation dataset should contain
For each test case, include:
- Input: the user query or task
- Expected behavior: what a good answer should do
- Optional reference answer: one acceptable answer
- Scoring rubric: how to assess the output
Example dataset shape
eval_cases = [
    {
        "id": "case_1",
        "user_input": "Summarize the benefits of unit testing in two bullet points.",
        "reference": "- Catches bugs early\n- Makes refactoring safer",
        "criteria": ["relevance", "format", "conciseness"]
    },
    {
        "id": "case_2",
        "user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
        "reference": '{"name": "Ada Lovelace", "role": "Mathematician"}',
        "criteria": ["format", "correctness"]
    }
]
Tips for building useful eval sets
- Start small: 10–20 high-value examples
- Include edge cases
- Include examples that previously failed
- Keep the set versioned
- Update it as product requirements evolve
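One lightweight way to keep the set versioned is to store it as JSONL, one case per line, so changes diff cleanly in version control. A minimal sketch (the file name is arbitrary):

```python
import json

def save_eval_set(cases: list[dict], path: str) -> None:
    """Write one JSON object per line so the eval set diffs cleanly under version control."""
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")

def load_eval_set(path: str) -> list[dict]:
    """Read the eval set back, one case per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

cases = [{"id": "case_1", "user_input": "Say hello.", "criteria": ["relevance"]}]
save_eval_set(cases, "eval_cases.jsonl")
print(load_eval_set("eval_cases.jsonl") == cases)  # True
```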
7. Manual vs LLM-as-Judge Evaluation
There are several ways to score outputs.
7.1 Rule-based evaluation
Useful when the expected output is structured.
Examples:
- Check whether the output is valid JSON
- Check whether required fields exist
- Check max length
- Check if forbidden phrases appear
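The four checks above can be sketched as a single scoring function; the forbidden-phrase list and length limit below are illustrative defaults, not fixed rules:

```python
import json

# Example phrase list; adjust per application.
FORBIDDEN = ["as an AI language model"]

def check_output(output: str, required_fields: list[str], max_chars: int = 500) -> dict:
    """Run the four rule-based checks and return a 0/1 score for each."""
    scores = {}
    try:
        parsed = json.loads(output)
        scores["valid_json"] = 1
        scores["required_fields"] = (
            1 if isinstance(parsed, dict) and all(k in parsed for k in required_fields) else 0
        )
    except json.JSONDecodeError:
        scores["valid_json"] = 0
        scores["required_fields"] = 0
    scores["within_max_length"] = 1 if len(output) <= max_chars else 0
    scores["no_forbidden_phrases"] = (
        1 if not any(p.lower() in output.lower() for p in FORBIDDEN) else 0
    )
    return scores

print(check_output('{"name": "Ada", "role": "Mathematician"}', ["name", "role"]))
```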
7.2 Human evaluation
Best when quality is nuanced.
Examples:
- Creativity
- Tone
- Helpfulness
- Domain-specific adequacy
7.3 LLM-as-judge
An LLM can evaluate another model's output against a rubric.
Examples:
- Score helpfulness from 1–5
- Judge whether an answer is grounded in supplied context
- Compare two prompt variants
Benefits:
- Fast
- Scalable
- Great for early internal evaluation
Caution:
- Judges can be biased or inconsistent
- Use clear rubrics
- Validate the evaluator on a sample with human review
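One simple way to validate a judge is to measure how often it agrees with human scores on a reviewed sample. A minimal sketch with hypothetical 1-5 scores:

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int], tolerance: int = 1) -> float:
    """Fraction of cases where the LLM judge is within `tolerance` of the human score."""
    assert len(judge_scores) == len(human_scores)
    hits = sum(1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance)
    return hits / len(judge_scores)

# Hypothetical scores for eight human-reviewed cases.
judge = [5, 4, 5, 2, 3, 5, 4, 1]
human = [4, 4, 3, 2, 3, 5, 2, 1]
print(f"{judge_agreement(judge, human):.0%}")  # 75%
```

If agreement is low, tighten the rubric or re-anchor the judge with scored examples before trusting its numbers.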
8. Session Demo Architecture
In this session, we will build a small evaluation loop:
- Send prompts to gpt-5.4-mini
- Collect outputs
- Evaluate them using:
  - simple rule-based checks
  - an LLM judge with a structured rubric
- Print scores and summary statistics
This is intentionally simple, but it mirrors the foundations of real GenAI evaluation pipelines.
9. Setup
Prerequisites
- Python 3.9+
- OpenAI Python SDK installed
- OpenAI API key available as an environment variable
Install dependencies
pip install openai
Set your API key
export OPENAI_API_KEY="your_api_key_here"
On Windows PowerShell (current session only):
$env:OPENAI_API_KEY = "your_api_key_here"
To persist it for future sessions, use setx OPENAI_API_KEY "your_api_key_here" (takes effect in newly opened shells).
10. Hands-On Exercise 1: Generate Outputs for an Evaluation Set
Goal
Create a small evaluation dataset and use gpt-5.4-mini to produce outputs for each case.
What learners will practice
- Using the OpenAI Responses API
- Structuring evaluation cases in Python
- Collecting model outputs for later scoring
Code
"""
Exercise 1: Generate outputs for a small evaluation dataset
using the OpenAI Responses API and gpt-5.4-mini.
Run:
python exercise_1_generate_outputs.py
"""
import os
from openai import OpenAI
# Create a reusable client. The SDK reads OPENAI_API_KEY from the environment.
client = OpenAI()
# A small evaluation dataset with varied requirements.
eval_cases = [
{
"id": "case_1",
"task_type": "bullet_summary",
"user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
},
{
"id": "case_2",
"task_type": "json_format",
"user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
},
{
"id": "case_3",
"task_type": "short_explanation",
"user_input": "Explain what a Python list comprehension is in one sentence.",
},
]
def generate_response(user_input: str) -> str:
"""
Generate a response from gpt-5.4-mini using the Responses API.
Args:
user_input: The prompt to send to the model.
Returns:
The model's output text.
"""
response = client.responses.create(
model="gpt-5.4-mini",
input=user_input,
)
return response.output_text
def main() -> None:
print("Generating model outputs...\n")
for case in eval_cases:
output = generate_response(case["user_input"])
case["model_output"] = output
print(f"--- {case['id']} ---")
print(f"INPUT: {case['user_input']}")
print("OUTPUT:")
print(output)
print()
if __name__ == "__main__":
main()
Example Output
Generating model outputs...
--- case_1 ---
INPUT: Summarize the benefits of unit testing in exactly two bullet points.
OUTPUT:
- Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.
--- case_2 ---
INPUT: Return a JSON object with keys name and role for Ada Lovelace.
OUTPUT:
{"name": "Ada Lovelace", "role": "Mathematician"}
--- case_3 ---
INPUT: Explain what a Python list comprehension is in one sentence.
OUTPUT:
A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.
Discussion
At this stage, we have generated outputs, but we do not yet know:
- Whether the model followed the instructions exactly
- Whether the formatting is valid
- Whether one prompt version is better than another
That is where evaluation begins.
11. Hands-On Exercise 2: Add Rule-Based Evaluation
Goal
Score outputs using deterministic checks.
What learners will practice
- Turning vague requirements into concrete checks
- Building a simple evaluation function
- Understanding the strengths and limits of rule-based scoring
Code
"""
Exercise 2: Add rule-based evaluation checks.
This script:
1. Generates outputs for a small eval dataset
2. Scores them with simple deterministic checks
3. Prints a summary
Run:
python exercise_2_rule_based_eval.py
"""
import json
from openai import OpenAI
client = OpenAI()
eval_cases = [
{
"id": "case_1",
"task_type": "bullet_summary",
"user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
},
{
"id": "case_2",
"task_type": "json_format",
"user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
},
{
"id": "case_3",
"task_type": "short_explanation",
"user_input": "Explain what a Python list comprehension is in one sentence.",
},
]
def generate_response(user_input: str) -> str:
"""Call the model and return plain text output."""
response = client.responses.create(
model="gpt-5.4-mini",
input=user_input,
)
return response.output_text.strip()
def score_case(case: dict, output: str) -> dict:
"""
Score a model output with basic rule-based checks.
Returns a dictionary of metric_name -> score (0 or 1).
"""
scores = {}
if case["task_type"] == "bullet_summary":
lines = [line.strip() for line in output.splitlines() if line.strip()]
bullet_lines = [line for line in lines if line.startswith("- ") or line.startswith("* ")]
scores["exactly_two_bullets"] = 1 if len(bullet_lines) == 2 and len(lines) == 2 else 0
elif case["task_type"] == "json_format":
try:
parsed = json.loads(output)
has_keys = isinstance(parsed, dict) and "name" in parsed and "role" in parsed
scores["valid_json_with_required_keys"] = 1 if has_keys else 0
except json.JSONDecodeError:
scores["valid_json_with_required_keys"] = 0
elif case["task_type"] == "short_explanation":
# A lightweight heuristic: one sentence ends with one sentence terminator
sentence_endings = output.count(".") + output.count("!") + output.count("?")
scores["one_sentence"] = 1 if sentence_endings == 1 else 0
scores["non_empty"] = 1 if len(output.strip()) > 0 else 0
return scores
def main() -> None:
overall_score = 0
total_checks = 0
for case in eval_cases:
output = generate_response(case["user_input"])
scores = score_case(case, output)
print(f"=== {case['id']} ===")
print(f"Input: {case['user_input']}")
print(f"Output: {output}")
print(f"Scores: {scores}")
print()
overall_score += sum(scores.values())
total_checks += len(scores)
if total_checks > 0:
print(f"Overall rule-based score: {overall_score}/{total_checks} = {overall_score / total_checks:.2%}")
else:
print("No evaluation checks were run.")
if __name__ == "__main__":
main()
Example Output
=== case_1 ===
Input: Summarize the benefits of unit testing in exactly two bullet points.
Output: - Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.
Scores: {'exactly_two_bullets': 1}
=== case_2 ===
Input: Return a JSON object with keys name and role for Ada Lovelace.
Output: {"name": "Ada Lovelace", "role": "Mathematician"}
Scores: {'valid_json_with_required_keys': 1}
=== case_3 ===
Input: Explain what a Python list comprehension is in one sentence.
Output: A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.
Scores: {'one_sentence': 1, 'non_empty': 1}
Overall rule-based score: 4/4 = 100.00%
Discussion
Rule-based evaluation is powerful when:
- Outputs are structured
- Constraints are explicit
- Success criteria are machine-checkable
But rule-based checks cannot fully measure:
- Helpfulness
- Accuracy in complex domains
- Nuance
- Writing quality
For those, we often need human review or LLM-based evaluation.
12. Hands-On Exercise 3: Use an LLM as a Judge
Goal
Use gpt-5.4-mini to score model outputs according to a rubric.
What learners will practice
- Writing a judging prompt
- Requesting structured JSON output
- Turning subjective quality into a repeatable scoring workflow
Design
We will ask the model to score each answer on:
- relevance: Does it address the user request?
- instruction_following: Does it obey formatting and constraints?
- clarity: Is it easy to understand?
Each score will be from 1 to 5.
Code
"""
Exercise 3: LLM-as-judge evaluation using the Responses API.
This script:
1. Generates answers with gpt-5.4-mini
2. Uses gpt-5.4-mini again as a judge
3. Requests structured JSON for easy parsing
Run:
python exercise_3_llm_judge.py
"""
import json
from openai import OpenAI
client = OpenAI()
eval_cases = [
{
"id": "case_1",
"user_input": "Summarize the benefits of unit testing in exactly two bullet points.",
},
{
"id": "case_2",
"user_input": "Return a JSON object with keys name and role for Ada Lovelace.",
},
{
"id": "case_3",
"user_input": "Explain what a Python list comprehension is in one sentence.",
},
]
def generate_response(user_input: str) -> str:
"""Generate a candidate answer from the model."""
response = client.responses.create(
model="gpt-5.4-mini",
input=user_input,
)
return response.output_text.strip()
def judge_response(user_input: str, model_output: str) -> dict:
"""
Ask the model to judge the candidate output and return strict JSON.
The prompt defines a clear rubric and asks for machine-readable output.
"""
judge_prompt = f"""
You are an evaluator for GenAI outputs.
Score the candidate response using this rubric:
- relevance: Does the response address the user's request? Score 1 to 5.
- instruction_following: Does the response follow explicit instructions such as format or length? Score 1 to 5.
- clarity: Is the response clear and understandable? Score 1 to 5.
Return ONLY valid JSON with this exact schema:
{{
"relevance": <integer 1-5>,
"instruction_following": <integer 1-5>,
"clarity": <integer 1-5>,
"overall": <integer 1-5>,
"reason": "<one short sentence>"
}}
User input:
{user_input}
Candidate response:
{model_output}
""".strip()
response = client.responses.create(
model="gpt-5.4-mini",
input=judge_prompt,
)
text = response.output_text.strip()
try:
return json.loads(text)
except json.JSONDecodeError:
# Fallback result if parsing fails.
return {
"relevance": 1,
"instruction_following": 1,
"clarity": 1,
"overall": 1,
"reason": f"Judge output was not valid JSON: {text}",
}
def main() -> None:
all_scores = []
for case in eval_cases:
candidate = generate_response(case["user_input"])
judgment = judge_response(case["user_input"], candidate)
all_scores.append(judgment["overall"])
print(f"=== {case['id']} ===")
print(f"Input: {case['user_input']}")
print(f"Candidate Output: {candidate}")
print("Judge Result:")
print(json.dumps(judgment, indent=2))
print()
if all_scores:
avg_score = sum(all_scores) / len(all_scores)
print(f"Average overall judge score: {avg_score:.2f}/5.00")
if __name__ == "__main__":
main()
Example Output
=== case_1 ===
Input: Summarize the benefits of unit testing in exactly two bullet points.
Candidate Output: - Helps catch bugs early in development.
- Makes code changes safer and easier to maintain.
Judge Result:
{
"relevance": 5,
"instruction_following": 5,
"clarity": 5,
"overall": 5,
"reason": "The response fully addresses the request in the required format."
}
=== case_2 ===
Input: Return a JSON object with keys name and role for Ada Lovelace.
Candidate Output: {"name": "Ada Lovelace", "role": "Mathematician"}
Judge Result:
{
"relevance": 5,
"instruction_following": 5,
"clarity": 5,
"overall": 5,
"reason": "The response is correct, clear, and follows the requested JSON format."
}
=== case_3 ===
Input: Explain what a Python list comprehension is in one sentence.
Candidate Output: A Python list comprehension is a concise way to create a new list by iterating over an existing iterable and optionally applying a condition.
Judge Result:
{
"relevance": 5,
"instruction_following": 5,
"clarity": 5,
"overall": 5,
"reason": "The response clearly explains the concept in a single sentence."
}
Average overall judge score: 5.00/5.00
Discussion
This gives us a more flexible evaluation signal than strict rules alone.
However, the judging model is still a model, so:
- its decisions may vary
- it may be overly generous
- it can inherit biases from the rubric
This is why production evaluation usually combines:
- deterministic checks
- LLM judging
- targeted human review
13. Hands-On Exercise 4: Compare Two Prompt Variants
Goal
Demonstrate how evaluation helps compare prompt designs instead of relying on intuition.
Scenario
We will compare:
- Prompt A: direct request
- Prompt B: request with added formatting guidance
Code
"""
Exercise 4: Compare two prompt variants using evaluation.
This script evaluates whether Prompt B improves instruction following
for structured output tasks.
Run:
python exercise_4_prompt_comparison.py
"""
import json
from statistics import mean
from openai import OpenAI
client = OpenAI()
base_tasks = [
"Return a JSON object with keys city and country for Tokyo.",
"Return a JSON object with keys language and creator for Python.",
"Return a JSON object with keys planet and type for Earth.",
]
def run_prompt(prompt: str) -> str:
"""Generate a response for a given prompt."""
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt,
)
return response.output_text.strip()
def is_valid_json_with_two_keys(output: str) -> int:
"""
Return 1 if output is valid JSON with exactly two keys, else 0.
"""
try:
parsed = json.loads(output)
return 1 if isinstance(parsed, dict) and len(parsed.keys()) == 2 else 0
except json.JSONDecodeError:
return 0
def main() -> None:
prompt_a_scores = []
prompt_b_scores = []
for task in base_tasks:
prompt_a = task
prompt_b = (
task
+ " Output only valid JSON. Do not include markdown, explanation, or extra text."
)
output_a = run_prompt(prompt_a)
output_b = run_prompt(prompt_b)
score_a = is_valid_json_with_two_keys(output_a)
score_b = is_valid_json_with_two_keys(output_b)
prompt_a_scores.append(score_a)
prompt_b_scores.append(score_b)
print("TASK:", task)
print("Prompt A Output:", output_a)
print("Prompt A Score:", score_a)
print("Prompt B Output:", output_b)
print("Prompt B Score:", score_b)
print("-" * 60)
print(f"Prompt A average JSON compliance: {mean(prompt_a_scores):.2%}")
print(f"Prompt B average JSON compliance: {mean(prompt_b_scores):.2%}")
if __name__ == "__main__":
main()
Example Output
TASK: Return a JSON object with keys city and country for Tokyo.
Prompt A Output: {"city": "Tokyo", "country": "Japan"}
Prompt A Score: 1
Prompt B Output: {"city": "Tokyo", "country": "Japan"}
Prompt B Score: 1
------------------------------------------------------------
TASK: Return a JSON object with keys language and creator for Python.
Prompt A Output: {"language": "Python", "creator": "Guido van Rossum"}
Prompt A Score: 1
Prompt B Output: {"language": "Python", "creator": "Guido van Rossum"}
Prompt B Score: 1
------------------------------------------------------------
TASK: Return a JSON object with keys planet and type for Earth.
Prompt A Output: {"planet": "Earth", "type": "Terrestrial"}
Prompt A Score: 1
Prompt B Output: {"planet": "Earth", "type": "Terrestrial"}
Prompt B Score: 1
------------------------------------------------------------
Prompt A average JSON compliance: 100.00%
Prompt B average JSON compliance: 100.00%
Discussion
In this tiny dataset, both prompts may perform equally well. That does not mean the prompts are equally good in all cases.
The purpose of evaluation is to help you answer questions like:
- Which prompt is more reliable across edge cases?
- Which one gives better formatting compliance?
- Which one is cheaper or shorter?
- Which one fails more gracefully?
Without evaluation, prompt comparison becomes guesswork.
14. Best Practices for GenAI Evaluation
14.1 Start with product goals
Ask:
- What matters most to users?
- What failures are unacceptable?
- What does “good” mean for this task?
14.2 Evaluate the full system, not just the model
A great model can still produce a bad user experience if:
- retrieval is poor
- tool outputs are wrong
- prompts are weak
- formatting breaks downstream consumers
14.3 Use small, high-quality eval sets early
You do not need hundreds of cases to start.
A curated set of 10–20 cases can reveal major issues quickly.
14.4 Track regressions over time
Every prompt or model change should be testable against the same evaluation set.
14.5 Include edge cases
Examples:
- ambiguous prompts
- very short prompts
- contradictory instructions
- formatting-heavy requests
- previously failed user scenarios
14.6 Combine metrics
Do not rely on a single score.
A system can improve in one dimension while getting worse in another.
Example tradeoff:
- Better correctness
- Worse latency
- Higher cost
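One way to avoid a single misleading aggregate score is to report per-metric deltas between versions. A minimal sketch with hypothetical numbers:

```python
def compare_versions(old: dict, new: dict) -> dict:
    """Report per-metric deltas; a single aggregate score would hide regressions."""
    return {metric: round(new[metric] - old[metric], 4) for metric in old}

# Hypothetical metrics for two versions of the same system.
old = {"correctness": 0.82, "latency_s": 1.4, "cost_usd": 0.002}
new = {"correctness": 0.90, "latency_s": 2.1, "cost_usd": 0.003}
print(compare_versions(old, new))
# {'correctness': 0.08, 'latency_s': 0.7, 'cost_usd': 0.001}
```

Here correctness improved, but latency and cost regressed; a combined score could have hidden that tradeoff.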
14.7 Keep humans in the loop
Especially for:
- safety-sensitive applications
- domain expertise
- subjective quality judgments
- calibration of LLM judges
15. Common Pitfalls
Pitfall 1: Evaluating only “happy path” prompts
This creates false confidence.
Pitfall 2: Using vague rubrics
If “good answer” is not defined, scoring becomes inconsistent.
Pitfall 3: Overfitting to the eval set
You may optimize for a small set of examples without improving real user experience.
Pitfall 4: Ignoring cost and latency
A slightly better output may not justify a much slower or more expensive system.
Pitfall 5: Trusting LLM judges blindly
They are useful, but not infallible.
Pitfall 6: Measuring only generation quality
For agentic systems, you must also evaluate:
- tool selection
- tool argument quality
- state transitions
- action ordering
- recovery from failure
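Tool selection and argument quality can be scored against an expected tool call, much like the rule-based checks earlier. A minimal sketch (the tool name and argument shape are hypothetical):

```python
def score_tool_call(expected: dict, observed: dict) -> dict:
    """Score whether the agent picked the right tool and supplied the right arguments."""
    right_tool = expected["tool"] == observed.get("tool")
    scores = {"tool_selection": 1 if right_tool else 0}
    # Argument quality only counts if the right tool was chosen.
    scores["tool_arguments"] = 1 if right_tool and expected["args"] == observed.get("args") else 0
    return scores

expected = {"tool": "get_weather", "args": {"city": "Tokyo"}}
observed = {"tool": "get_weather", "args": {"city": "Tokyo"}}
print(score_tool_call(expected, observed))  # {'tool_selection': 1, 'tool_arguments': 1}
```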
16. Mini Recap
Evaluation matters because GenAI systems are probabilistic, flexible, and failure-prone in ways that traditional software testing alone cannot capture.
A practical evaluation strategy often includes:
- a curated dataset
- rule-based checks
- LLM-based judging
- occasional human review
- repeated measurement over time
If you can measure quality, you can improve it systematically.
17. Suggested 45-Minute Delivery Plan
Part 1: Theory and discussion (~20 minutes)
- 5 min: Why GenAI needs evaluation
- 5 min: Common failure modes
- 5 min: What to evaluate: model vs prompt vs system
- 5 min: Qualitative vs quantitative methods
Part 2: Hands-on coding (~20 minutes)
- 7 min: Exercise 1 generate outputs
- 6 min: Exercise 2 rule-based scoring
- 7 min: Exercise 3 LLM-as-judge scoring
Part 3: Wrap-up (~5 minutes)
- Prompt comparison concept
- Best practices
- Q&A
18. Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompting guide: https://platform.openai.com/docs/guides/text
- JSON basics in Python: https://docs.python.org/3/library/json.html
19. Practice Tasks for Learners
Try these after the session:
- Add 5 more evaluation cases, including at least 2 edge cases.
- Extend the rule-based evaluator to check:
  - maximum word count
  - presence of required keywords
  - valid JSON schema-like structure
- Add a new LLM judge criterion:
  - factuality
  - conciseness
  - tone
- Compare two different prompt versions and store the scores in a CSV file.
- Modify the evaluator so it prints failures first, making debugging easier.
20. Takeaway
In GenAI development, evaluation is how you move from “it seems to work” to “we know how well it works.”
That shift is foundational for building reliable, scalable, and trustworthy GenAI systems.