Session 4: Testing and Iterating on Prompts

Synopsis

Introduces a disciplined process for comparing prompt variations, identifying failure cases, and improving consistency. Learners start thinking like engineers who validate prompt performance rather than relying on one-off success.

Session Content

Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Focus: Learning how to systematically test, evaluate, and improve prompts for GenAI applications using the OpenAI Responses API and Python SDK.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain why prompt iteration is necessary in real-world GenAI applications.
  • Identify common prompt failure modes.
  • Create simple prompt test cases in Python.
  • Compare prompt variants using repeatable evaluation criteria.
  • Use the OpenAI Responses API with gpt-5.4-mini to run prompt experiments.
  • Improve prompts based on observed outputs.

1. Why Prompt Iteration Matters

Prompting is not a one-shot activity. Even when a prompt “works” for one example, it may fail for:

  • different user inputs
  • ambiguous wording
  • edge cases
  • formatting constraints
  • tone/style requirements
  • factual reliability expectations

Key Idea

A good prompt is:

  • clear
  • specific
  • testable
  • robust across examples

Typical Prompt Failure Modes

  1. Too vague: the model gives broad or inconsistent answers.
  2. Missing output format: the model responds in an unexpected structure.
  3. Insufficient constraints: the answer is too long, too short, too technical, or off-topic.
  4. No examples: the model misunderstands the desired style or task.
  5. Conflicting instructions: the prompt asks for mutually incompatible behavior.
  6. Edge-case fragility: the prompt works for normal inputs but breaks on unusual ones.

Example

Weak prompt:

Summarize this email.

Improved prompt:

Summarize the following email in 3 bullet points.
Focus on:
1. the main request
2. deadlines
3. any action items

Email:
...

2. A Simple Prompt Iteration Workflow

A practical workflow for prompt improvement:

Step 1: Define the task clearly

Examples:

  • summarize support tickets
  • classify sentiment
  • rewrite text in simpler language
  • extract structured fields

Step 2: Write an initial prompt

Start simple, but explicit.

Step 3: Build a small test set

Use 5–10 representative examples:

  • typical inputs
  • tricky inputs
  • edge cases

Step 4: Evaluate outputs

Check for:

  • correctness
  • consistency
  • format compliance
  • relevance
  • brevity or detail as required

Step 5: Refine the prompt

Adjust based on failures:

  • add constraints
  • clarify goal
  • specify output schema
  • provide examples
  • narrow scope

Step 6: Re-test

Prompt engineering is iterative.
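
The six steps above can be sketched as a small comparison loop. This is a minimal illustration, not a prescribed implementation: the model call is stubbed with a plain function so it runs without an API key, and the templates and test set are hypothetical.

```python
# A minimal sketch of the six-step loop above. The model call is stubbed
# so the example runs offline; in practice it would wrap an API call.

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (illustrative only)."""
    return "positive" if "fast" in prompt else "neutral"

def evaluate(prompt_template: str, test_set: list[dict]) -> float:
    """Step 4: score a prompt template against expected outputs."""
    correct = sum(
        stub_model(prompt_template.format(text=case["input"])) == case["expected"]
        for case in test_set
    )
    return correct / len(test_set)

# Step 3: a tiny test set
test_set = [
    {"input": "The app is fast.", "expected": "positive"},
    {"input": "The UI changed.", "expected": "neutral"},
]

# Steps 5-6: compare variants and keep the best
variants = [
    "Classify: {text}",
    "Classify the sentiment of: {text}\nReturn one word.",
]
scores = {v: evaluate(v, test_set) for v in variants}
best = max(scores, key=scores.get)
print(f"Best variant scored {scores[best]:.0%}")
```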


3. Designing Better Prompt Tests

When testing prompts, avoid relying on a single “looks good” example.

Good Prompt Test Sets Include

  • Happy path examples
  • Ambiguous examples
  • Noisy/realistic examples
  • Boundary cases
  • Adversarial or confusing inputs

Example Task: Classify Customer Feedback

Possible labels:

  • positive
  • negative
  • neutral

Test inputs:

  1. "The app is fast and easy to use."
  2. "It crashes every time I upload a file."
  3. "The UI changed after the update."
  4. "Great features, but the setup was frustrating."

Notice that #4 may expose ambiguity.

Evaluation Questions

  • Does the model always return one of the expected labels?
  • Does it misclassify mixed sentiment?
  • Does it include extra explanation when only a label is desired?
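
The evaluation questions above can also be checked programmatically. A small sketch, assuming raw model output strings; the helper name and normalization rules are illustrative:

```python
# Check a raw model output against the expected label set: is it one of
# the allowed labels, and did the model add extra explanation?

VALID_LABELS = {"positive", "negative", "neutral"}

def check_output(raw: str) -> dict:
    """Inspect a raw model output string for label validity and extra text."""
    cleaned = raw.strip().lower().rstrip(".")
    return {
        "is_valid_label": cleaned in VALID_LABELS,
        "has_extra_text": len(cleaned.split()) > 1,
        "normalized": cleaned if cleaned in VALID_LABELS else None,
    }

# Outputs you might see from an under-specified prompt:
for raw in ["Positive", "negative.", "The sentiment is mixed."]:
    print(raw, "->", check_output(raw))
```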

4. Hands-On Exercise 1: Run a Baseline Prompt

In this exercise, learners will test an initial prompt for sentiment classification.

Goal

Use gpt-5.4-mini and the Responses API to classify user feedback.

Setup

Install dependencies:

pip install openai python-dotenv

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Python Script: Baseline Prompt Test

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables from .env
load_dotenv()

# Create the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# A small test set for sentiment classification
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
]

# Initial baseline prompt template
def build_prompt(feedback: str) -> str:
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.

Feedback: "{feedback}"
""".strip()

# Run the prompt on each example and print the result
for text in feedback_examples:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(text)
    )

    # Output text from the Responses API
    result = response.output_text.strip()

    print(f"Feedback: {text}")
    print(f"Model output: {result}")
    print("-" * 50)

Example Output

Feedback: The app is fast and easy to use.
Model output: positive
--------------------------------------------------
Feedback: It crashes every time I upload a file.
Model output: negative
--------------------------------------------------
Feedback: The UI changed after the update.
Model output: neutral
--------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Model output: neutral
--------------------------------------------------

Discussion

This may seem acceptable at first, but several issues can appear:

  • extra explanation instead of a single label
  • inconsistent handling of mixed sentiment
  • non-standard outputs like “somewhat negative”

That means the prompt still needs improvement.


5. Improving Prompt Specificity

One of the easiest improvements is to constrain the output more tightly.

Better Prompt Design Principles

  • State the exact allowed outputs.
  • Tell the model what to do if the input is mixed or ambiguous.
  • Specify formatting rules.
  • Keep task instructions concise and unambiguous.

Improved Version

Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only the label.
- If the feedback contains both praise and criticism, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "Great features, but the setup was frustrating."

This prompt is more testable because it reduces ambiguity.


6. Hands-On Exercise 2: Compare Two Prompt Variants

Goal

Compare a baseline prompt with an improved prompt across the same examples.

Python Script: Prompt Comparison

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test examples
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
    "I love the design, but performance is terrible.",
]

def baseline_prompt(feedback: str) -> str:
    """A simple, under-specified prompt."""
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.

Feedback: "{feedback}"
""".strip()

def improved_prompt(feedback: str) -> str:
    """A more constrained and testable prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "{feedback}"
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Send a prompt to the model and return the text output."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()

# Compare both prompt versions
for feedback in feedback_examples:
    baseline_result = run_prompt(baseline_prompt(feedback))
    improved_result = run_prompt(improved_prompt(feedback))

    print(f"Feedback: {feedback}")
    print(f"Baseline : {baseline_result}")
    print(f"Improved : {improved_result}")
    print("-" * 60)

Example Output

Feedback: The app is fast and easy to use.
Baseline : positive
Improved : positive
------------------------------------------------------------
Feedback: It crashes every time I upload a file.
Baseline : negative
Improved : negative
------------------------------------------------------------
Feedback: The UI changed after the update.
Baseline : neutral
Improved : neutral
------------------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Baseline : mixed sentiment
Improved : negative
------------------------------------------------------------
Feedback: I love the design, but performance is terrible.
Baseline : mixed
Improved : negative
------------------------------------------------------------

What Improved?

The second prompt is better because it:

  • limits valid outputs
  • reduces formatting variance
  • handles mixed sentiment explicitly

7. Using Structured Evaluation Criteria

Prompt iteration becomes more effective when you score outputs systematically.

Common Evaluation Dimensions

  • Correctness — Is the answer right?
  • Format compliance — Does it match the required structure?
  • Completeness — Does it include all required parts?
  • Conciseness — Is it appropriately brief?
  • Consistency — Does it behave similarly across similar cases?

Example Scoring Table

Test Case                        | Expected | Actual          | Correct? | Format OK? | Notes
Fast and easy to use             | positive | positive        | Yes      | Yes        | Good
Crashes every time               | negative | negative        | Yes      | Yes        | Good
UI changed after update          | neutral  | neutral         | Yes      | Yes        | Good
Great features, but frustrating  | negative | mixed sentiment | No       | No         | Needs refinement

Even a manual table like this is useful.
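
A scoring table like this can also be kept as data so the tallying is automatic. A sketch using a dataclass; the format rule shown (a single allowed label) is specific to this classification task:

```python
# Keep the scoring table as structured data and compute the columns.

from dataclasses import dataclass

@dataclass
class ScoredCase:
    name: str
    expected: str
    actual: str

    @property
    def correct(self) -> bool:
        return self.actual == self.expected

    @property
    def format_ok(self) -> bool:
        # Format rule for this task: exactly one of the allowed labels
        return self.actual in {"positive", "negative", "neutral"}

rows = [
    ScoredCase("Fast and easy to use", "positive", "positive"),
    ScoredCase("Crashes every time", "negative", "negative"),
    ScoredCase("UI changed after update", "neutral", "neutral"),
    ScoredCase("Great features, but frustrating", "negative", "mixed sentiment"),
]

for r in rows:
    print(f"{r.name:35} correct={r.correct} format_ok={r.format_ok}")

accuracy = sum(r.correct for r in rows) / len(rows)
print(f"Accuracy: {accuracy:.0%}")
```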


8. Hands-On Exercise 3: Build a Mini Prompt Evaluation Harness

Goal

Create a small Python script that tests a prompt against expected results and reports accuracy.

Python Script: Evaluation Harness

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create the API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test dataset with expected labels
test_cases = [
    {
        "input": "The app is fast and easy to use.",
        "expected": "positive",
    },
    {
        "input": "It crashes every time I upload a file.",
        "expected": "negative",
    },
    {
        "input": "The UI changed after the update.",
        "expected": "neutral",
    },
    {
        "input": "Great features, but the setup was frustrating.",
        "expected": "negative",
    },
    {
        "input": "I love the design, but performance is terrible.",
        "expected": "negative",
    },
]

def build_prompt(feedback: str) -> str:
    """Return the improved classification prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "{feedback}"
""".strip()

def get_model_label(feedback: str) -> str:
    """Call the model and normalize the returned label."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(feedback)
    )
    return response.output_text.strip().lower()

def evaluate_prompt(cases: list[dict]) -> None:
    """Evaluate prompt performance on a list of test cases."""
    correct = 0

    print("Running evaluation...\n")

    for i, case in enumerate(cases, start=1):
        prediction = get_model_label(case["input"])
        expected = case["expected"]
        is_correct = prediction == expected

        if is_correct:
            correct += 1

        print(f"Test case {i}")
        print(f"Input     : {case['input']}")
        print(f"Expected  : {expected}")
        print(f"Predicted : {prediction}")
        print(f"Correct   : {is_correct}")
        print("-" * 60)

    accuracy = correct / len(cases) * 100
    print(f"\nFinal accuracy: {accuracy:.1f}%")

if __name__ == "__main__":
    evaluate_prompt(test_cases)

Example Output

Running evaluation...

Test case 1
Input     : The app is fast and easy to use.
Expected  : positive
Predicted : positive
Correct   : True
------------------------------------------------------------
Test case 2
Input     : It crashes every time I upload a file.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------
Test case 3
Input     : The UI changed after the update.
Expected  : neutral
Predicted : neutral
Correct   : True
------------------------------------------------------------
Test case 4
Input     : Great features, but the setup was frustrating.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------
Test case 5
Input     : I love the design, but performance is terrible.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------

Final accuracy: 100.0%

Why This Matters

This is the beginning of prompt evaluation engineering:

  • define expectations
  • run consistent tests
  • measure quality
  • refine based on evidence

9. Iteration Strategies That Work Well

When a prompt underperforms, use these practical strategies.

1. Tighten the output format

Bad:

Tell me the sentiment.

Better:

Return exactly one word: positive, negative, or neutral.

2. Add decision rules

Useful when inputs are ambiguous.

Example:

If the text contains both praise and criticism, choose the stronger sentiment.

3. Add examples

Sometimes showing desired behavior helps.

Example:

Example:
Feedback: "The app works well."
Label: positive

Feedback: "It fails to load."
Label: negative

4. Reduce unnecessary wording

Overly long prompts can introduce confusion.

5. Test edge cases intentionally

Examples:

  • sarcasm
  • mixed sentiment
  • minimal text
  • unclear statements
  • irrelevant input
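
Strategies 1 through 3 can be combined in a single prompt builder: a tight output format, a decision rule, and few-shot examples. A sketch; the example pairs and function name are illustrative:

```python
# Build a constrained few-shot classification prompt from example pairs.

EXAMPLES = [
    ("The app works well.", "positive"),
    ("It fails to load.", "negative"),
]

def build_prompt(feedback: str) -> str:
    """Combine format constraint, decision rule, and few-shot examples."""
    shots = "\n\n".join(
        f'Feedback: "{text}"\nLabel: {label}' for text, label in EXAMPLES
    )
    return (
        "Return exactly one word: positive, negative, or neutral.\n"
        "If the text contains both praise and criticism, choose the "
        "stronger sentiment.\n\n"
        f"{shots}\n\n"
        f'Feedback: "{feedback}"\nLabel:'
    )

print(build_prompt("Great features, but the setup was frustrating."))
```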

10. Common Mistakes in Prompt Iteration

Mistake 1: Changing too many things at once

If you rewrite the entire prompt, it becomes hard to know which change caused the improvement.

Better: change one dimension at a time.

Mistake 2: Testing on too few examples

A single success case proves very little.

Mistake 3: Ignoring formatting failures

Even if the content is correct, formatting errors can break downstream systems.

Mistake 4: Not defining what “good” means

Before testing, decide:

  • correct answer
  • expected format
  • length constraints
  • acceptable variation

Mistake 5: Assuming the first good result is production-ready

Real applications require repeatability.
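
One way to guard against this mistake is a quick repeatability check: run the same input several times and measure how often the answers agree. A sketch with a stubbed, deterministic model function so it runs offline; in practice `run_once` would wrap an API call:

```python
# Measure agreement across repeated runs of the same prompt.

from collections import Counter

def consistency(run_once, prompt: str, n: int = 5) -> float:
    """Fraction of n runs that return the most common answer."""
    answers = [run_once(prompt) for _ in range(n)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n

def stub(prompt: str) -> str:
    """Deterministic stand-in for a real model call."""
    return "negative"

score = consistency(stub, "Classify: setup was frustrating", n=5)
print(f"Agreement: {score:.0%}")
```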


11. Mini Challenge: Improve a Summarization Prompt

Scenario

You want the model to summarize customer support emails for an internal dashboard.

Weak Prompt

Summarize this email:
{email_text}

Problems

  • no length guidance
  • no structure
  • may omit action items
  • may return inconsistent formats

Better Version

Summarize the following customer support email in exactly 3 bullet points.

Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned

Return only the bullet points.

Email:
{email_text}

Reflection Questions

  • What output inconsistencies does this prevent?
  • What edge cases should be tested?
  • Would examples improve reliability?

12. Hands-On Exercise 4: Iterate on a Summarization Prompt

Goal

Test two summarization prompts and inspect output quality.

Python Script: Summarization Prompt Iteration

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

email_text = """
Hi support team,

I upgraded to the Pro plan yesterday, but I still cannot access the reporting dashboard.
I need this fixed before Friday because I have to present results to my manager.
Please let me know if you need any account details from me.

Thanks,
Jordan
""".strip()

def weak_prompt(email: str) -> str:
    """A vague summarization prompt."""
    return f"""
Summarize this email:

{email}
""".strip()

def improved_prompt(email: str) -> str:
    """A more structured summarization prompt."""
    return f"""
Summarize the following customer support email in exactly 3 bullet points.

Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned

Return only the bullet points.

Email:
{email}
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Call the model and return the output text."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()

weak_result = run_prompt(weak_prompt(email_text))
improved_result = run_prompt(improved_prompt(email_text))

print("=== Weak Prompt Output ===")
print(weak_result)
print("\n=== Improved Prompt Output ===")
print(improved_result)

Example Output

=== Weak Prompt Output ===
The customer says they upgraded to the Pro plan but still cannot access the reporting dashboard. They need the issue resolved soon and are willing to provide account details.

=== Improved Prompt Output ===
- Customer upgraded to the Pro plan but still cannot access the reporting dashboard.
- They want support to resolve the access issue and will provide account details if needed.
- The issue is urgent because they need it fixed before Friday for a presentation to their manager.

Discussion

The improved prompt is superior because it is:

  • structured
  • predictable
  • easier to consume downstream
  • aligned to business needs

13. Practical Guidance for Real Projects

In real systems, prompt iteration should be treated like software iteration.

  • keep prompts in version-controlled files
  • maintain test datasets
  • compare prompt versions side by side
  • document why changes were made
  • evaluate with representative user inputs
  • monitor production failures and add them to your test set

Simple Versioning Example

PROMPT_V1 = """
Classify the sentiment of the following feedback.
""".strip()

PROMPT_V2 = """
Classify the sentiment into exactly one label: positive, negative, neutral.
Return only the label.
""".strip()

This makes experiments more reproducible.
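
Prompt versions can also live in separate files under version control rather than as module constants. A minimal sketch, assuming an illustrative `prompts/` directory layout; the files are written here only so the example is self-contained:

```python
# Load prompt versions from files. Normally these files would be
# committed to git, not generated by the script.

from pathlib import Path

PROMPT_DIR = Path("prompts")
PROMPT_DIR.mkdir(exist_ok=True)

(PROMPT_DIR / "sentiment_v1.txt").write_text(
    "Classify the sentiment of the following feedback.\n"
)
(PROMPT_DIR / "sentiment_v2.txt").write_text(
    "Classify the sentiment into exactly one label: "
    "positive, negative, neutral.\nReturn only the label.\n"
)

def load_prompt(name: str, version: int) -> str:
    """Load a specific prompt version from disk."""
    return (PROMPT_DIR / f"{name}_v{version}.txt").read_text().strip()

print(load_prompt("sentiment", 2))
```

Keeping each version in its own file makes diffs between prompt versions visible in code review.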


14. Recap

In this session, you learned that effective prompting requires:

  • repeated testing
  • representative examples
  • clear evaluation criteria
  • controlled iteration

You also practiced:

  • running prompts with the OpenAI Responses API
  • comparing prompt variants
  • creating a small evaluation harness
  • improving summarization and classification prompts

15. Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • python-dotenv: https://pypi.org/project/python-dotenv/

16. Suggested Practice After Class

  1. Create a test set of 10 examples for a task you care about.
  2. Write one baseline prompt and two improved variants.
  3. Evaluate all three prompts using a Python script.
  4. Record accuracy, formatting consistency, and common failure cases.
  5. Refine again based on failures.

17. End-of-Session Takeaway

Prompting improves fastest when you stop guessing and start testing.

