Session 4: Testing and Iterating on Prompts

Synopsis

Introduces a disciplined process for comparing prompt variations, identifying failure cases, and improving consistency. Learners start thinking like engineers who validate prompt performance rather than relying on one-off success.

Session Content

Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Focus: Learning how to systematically test, evaluate, and improve prompts for GenAI applications using the OpenAI Responses API and Python SDK.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain why prompt iteration is necessary in real-world GenAI applications.
  • Identify common prompt failure modes.
  • Create simple prompt test cases in Python.
  • Compare prompt variants using repeatable evaluation criteria.
  • Use the OpenAI Responses API with gpt-5.4-mini to run prompt experiments.
  • Improve prompts based on observed outputs.

1. Why Prompt Iteration Matters

Prompting is not a one-shot activity. Even when a prompt “works” for one example, it may fail for:

  • different user inputs
  • ambiguous wording
  • edge cases
  • formatting constraints
  • tone/style requirements
  • factual reliability expectations

Key Idea

A good prompt is:

  • clear
  • specific
  • testable
  • robust across examples

Typical Prompt Failure Modes

  1. Too vague: the model gives broad or inconsistent answers.
  2. Missing output format: the model responds in an unexpected structure.
  3. Insufficient constraints: the answer is too long, too short, too technical, or off-topic.
  4. No examples: the model misunderstands the desired style or task.
  5. Conflicting instructions: the prompt asks for mutually incompatible behavior.
  6. Edge-case fragility: the prompt works for normal inputs but breaks on unusual ones.

Example

Weak prompt:

Summarize this email.

Improved prompt:

Summarize the following email in 3 bullet points.
Focus on:
1. the main request
2. deadlines
3. any action items

Email:
...

2. A Simple Prompt Iteration Workflow

A practical workflow for prompt improvement:

Step 1: Define the task clearly

Examples:

  • summarize support tickets
  • classify sentiment
  • rewrite text in simpler language
  • extract structured fields

Step 2: Write an initial prompt

Start simple, but explicit.

Step 3: Build a small test set

Use 5–10 representative examples:

  • typical inputs
  • tricky inputs
  • edge cases

Step 4: Evaluate outputs

Check for:

  • correctness
  • consistency
  • format compliance
  • relevance
  • brevity or detail as required

Step 5: Refine the prompt

Adjust based on failures:

  • add constraints
  • clarify goal
  • specify output schema
  • provide examples
  • narrow scope

Step 6: Re-test

Prompt engineering is iterative.
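
The six steps above can be sketched as a small comparison loop. This is a minimal illustration, not a prescribed implementation: the model call is stubbed with a plain function so it runs without an API key, and the templates and test set are hypothetical.

```python
# A minimal sketch of the six-step loop above. The model call is stubbed
# so the example runs offline; in practice it would wrap an API call.

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (illustrative only)."""
    return "positive" if "fast" in prompt else "neutral"

def evaluate(prompt_template: str, test_set: list[dict]) -> float:
    """Step 4: score a prompt template against expected outputs."""
    correct = sum(
        stub_model(prompt_template.format(text=case["input"])) == case["expected"]
        for case in test_set
    )
    return correct / len(test_set)

# Step 3: a tiny test set
test_set = [
    {"input": "The app is fast.", "expected": "positive"},
    {"input": "The UI changed.", "expected": "neutral"},
]

# Steps 5-6: compare variants and keep the best
variants = [
    "Classify: {text}",
    "Classify the sentiment of: {text}\nReturn one word.",
]
scores = {v: evaluate(v, test_set) for v in variants}
best = max(scores, key=scores.get)
print(f"Best variant scored {scores[best]:.0%}")
```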


3. Designing Better Prompt Tests

When testing prompts, avoid relying on a single “looks good” example.

Good Prompt Test Sets Include

  • Happy path examples
  • Ambiguous examples
  • Noisy/realistic examples
  • Boundary cases
  • Adversarial or confusing inputs

Example Task: Classify Customer Feedback

Possible labels:

  • positive
  • negative
  • neutral

Test inputs:

  1. "The app is fast and easy to use."
  2. "It crashes every time I upload a file."
  3. "The UI changed after the update."
  4. "Great features, but the setup was frustrating."

Notice that #4 may expose ambiguity.

Evaluation Questions

  • Does the model always return one of the expected labels?
  • Does it misclassify mixed sentiment?
  • Does it include extra explanation when only a label is desired?
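
The evaluation questions above can also be checked programmatically. A small sketch, assuming raw model output strings; the helper name and normalization rules are illustrative:

```python
# Check a raw model output against the expected label set: is it one of
# the allowed labels, and did the model add extra explanation?

VALID_LABELS = {"positive", "negative", "neutral"}

def check_output(raw: str) -> dict:
    """Inspect a raw model output string for label validity and extra text."""
    cleaned = raw.strip().lower().rstrip(".")
    return {
        "is_valid_label": cleaned in VALID_LABELS,
        "has_extra_text": len(cleaned.split()) > 1,
        "normalized": cleaned if cleaned in VALID_LABELS else None,
    }

# Outputs you might see from an under-specified prompt:
for raw in ["Positive", "negative.", "The sentiment is mixed."]:
    print(raw, "->", check_output(raw))
```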

4. Hands-On Exercise 1: Run a Baseline Prompt

In this exercise, learners will test an initial prompt for sentiment classification.

Goal

Use gpt-5.4-mini and the Responses API to classify user feedback.

Setup

Install dependencies:

pip install openai python-dotenv

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Python Script: Baseline Prompt Test

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables from .env
load_dotenv()

# Create the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# A small test set for sentiment classification
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
]

# Initial baseline prompt template
def build_prompt(feedback: str) -> str:
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.

Feedback: "{feedback}"
""".strip()

# Run the prompt on each example and print the result
for text in feedback_examples:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(text)
    )

    # Output text from the Responses API
    result = response.output_text.strip()

    print(f"Feedback: {text}")
    print(f"Model output: {result}")
    print("-" * 50)

Example Output

Feedback: The app is fast and easy to use.
Model output: positive
--------------------------------------------------
Feedback: It crashes every time I upload a file.
Model output: negative
--------------------------------------------------
Feedback: The UI changed after the update.
Model output: neutral
--------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Model output: neutral
--------------------------------------------------

Discussion

This may seem acceptable at first, but several issues can appear:

  • extra explanation instead of a single label
  • inconsistent handling of mixed sentiment
  • non-standard outputs like “somewhat negative”

That means the prompt still needs improvement.


5. Improving Prompt Specificity

One of the easiest improvements is to constrain the output more tightly.

Better Prompt Design Principles

  • State the exact allowed outputs.
  • Tell the model what to do if the input is mixed or ambiguous.
  • Specify formatting rules.
  • Keep task instructions concise and unambiguous.

Improved Version

Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only the label.
- If the feedback contains both praise and criticism, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "Great features, but the setup was frustrating."

This prompt is more testable because it reduces ambiguity.


6. Hands-On Exercise 2: Compare Two Prompt Variants

Goal

Compare a baseline prompt with an improved prompt across the same examples.

Python Script: Prompt Comparison

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test examples
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
    "I love the design, but performance is terrible.",
]

def baseline_prompt(feedback: str) -> str:
    """A simple, under-specified prompt."""
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.

Feedback: "{feedback}"
""".strip()

def improved_prompt(feedback: str) -> str:
    """A more constrained and testable prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "{feedback}"
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Send a prompt to the model and return the text output."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()

# Compare both prompt versions
for feedback in feedback_examples:
    baseline_result = run_prompt(baseline_prompt(feedback))
    improved_result = run_prompt(improved_prompt(feedback))

    print(f"Feedback: {feedback}")
    print(f"Baseline : {baseline_result}")
    print(f"Improved : {improved_result}")
    print("-" * 60)

Example Output

Feedback: The app is fast and easy to use.
Baseline : positive
Improved : positive
------------------------------------------------------------
Feedback: It crashes every time I upload a file.
Baseline : negative
Improved : negative
------------------------------------------------------------
Feedback: The UI changed after the update.
Baseline : neutral
Improved : neutral
------------------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Baseline : mixed sentiment
Improved : negative
------------------------------------------------------------
Feedback: I love the design, but performance is terrible.
Baseline : mixed
Improved : negative
------------------------------------------------------------

What Improved?

The second prompt is better because it:

  • limits valid outputs
  • reduces formatting variance
  • handles mixed sentiment explicitly

7. Using Structured Evaluation Criteria

Prompt iteration becomes more effective when you score outputs systematically.

Common Evaluation Dimensions

  • Correctness — Is the answer right?
  • Format compliance — Does it match the required structure?
  • Completeness — Does it include all required parts?
  • Conciseness — Is it appropriately brief?
  • Consistency — Does it behave similarly across similar cases?

Example Scoring Table

Test Case                        | Expected | Actual          | Correct? | Format OK? | Notes
Fast and easy to use             | positive | positive        | Yes      | Yes        | Good
Crashes every time               | negative | negative        | Yes      | Yes        | Good
UI changed after update          | neutral  | neutral         | Yes      | Yes        | Good
Great features, but frustrating  | negative | mixed sentiment | No       | No         | Needs refinement

Even a manual table like this is useful.
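
A scoring table like this can also be kept as data so the tallying is automatic. A sketch using a dataclass; the format rule shown (a single allowed label) is specific to this classification task:

```python
# Keep the scoring table as structured data and compute the columns.

from dataclasses import dataclass

@dataclass
class ScoredCase:
    name: str
    expected: str
    actual: str

    @property
    def correct(self) -> bool:
        return self.actual == self.expected

    @property
    def format_ok(self) -> bool:
        # Format rule for this task: exactly one of the allowed labels
        return self.actual in {"positive", "negative", "neutral"}

rows = [
    ScoredCase("Fast and easy to use", "positive", "positive"),
    ScoredCase("Crashes every time", "negative", "negative"),
    ScoredCase("UI changed after update", "neutral", "neutral"),
    ScoredCase("Great features, but frustrating", "negative", "mixed sentiment"),
]

for r in rows:
    print(f"{r.name:35} correct={r.correct} format_ok={r.format_ok}")

accuracy = sum(r.correct for r in rows) / len(rows)
print(f"Accuracy: {accuracy:.0%}")
```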


8. Hands-On Exercise 3: Build a Mini Prompt Evaluation Harness

Goal

Create a small Python script that tests a prompt against expected results and reports accuracy.

Python Script: Evaluation Harness

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create the API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Test dataset with expected labels
test_cases = [
    {
        "input": "The app is fast and easy to use.",
        "expected": "positive",
    },
    {
        "input": "It crashes every time I upload a file.",
        "expected": "negative",
    },
    {
        "input": "The UI changed after the update.",
        "expected": "neutral",
    },
    {
        "input": "Great features, but the setup was frustrating.",
        "expected": "negative",
    },
    {
        "input": "I love the design, but performance is terrible.",
        "expected": "negative",
    },
]

def build_prompt(feedback: str) -> str:
    """Return the improved classification prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral

Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.

Feedback: "{feedback}"
""".strip()

def get_model_label(feedback: str) -> str:
    """Call the model and normalize the returned label."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(feedback)
    )
    return response.output_text.strip().lower()

def evaluate_prompt(cases: list[dict]) -> None:
    """Evaluate prompt performance on a list of test cases."""
    correct = 0

    print("Running evaluation...\n")

    for i, case in enumerate(cases, start=1):
        prediction = get_model_label(case["input"])
        expected = case["expected"]
        is_correct = prediction == expected

        if is_correct:
            correct += 1

        print(f"Test case {i}")
        print(f"Input     : {case['input']}")
        print(f"Expected  : {expected}")
        print(f"Predicted : {prediction}")
        print(f"Correct   : {is_correct}")
        print("-" * 60)

    accuracy = correct / len(cases) * 100
    print(f"\nFinal accuracy: {accuracy:.1f}%")

if __name__ == "__main__":
    evaluate_prompt(test_cases)

Example Output

Running evaluation...

Test case 1
Input     : The app is fast and easy to use.
Expected  : positive
Predicted : positive
Correct   : True
------------------------------------------------------------
Test case 2
Input     : It crashes every time I upload a file.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------
Test case 3
Input     : The UI changed after the update.
Expected  : neutral
Predicted : neutral
Correct   : True
------------------------------------------------------------
Test case 4
Input     : Great features, but the setup was frustrating.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------
Test case 5
Input     : I love the design, but performance is terrible.
Expected  : negative
Predicted : negative
Correct   : True
------------------------------------------------------------

Final accuracy: 100.0%

Why This Matters

This is the beginning of prompt evaluation engineering:

  • define expectations
  • run consistent tests
  • measure quality
  • refine based on evidence

9. Iteration Strategies That Work Well

When a prompt underperforms, use these practical strategies.

1. Tighten the output format

Bad:

Tell me the sentiment.

Better:

Return exactly one word: positive, negative, or neutral.

2. Add decision rules

Useful when inputs are ambiguous.

Example:

If the text contains both praise and criticism, choose the stronger sentiment.

3. Add examples

Sometimes showing desired behavior helps.

Example:

Example:
Feedback: "The app works well."
Label: positive

Feedback: "It fails to load."
Label: negative

4. Reduce unnecessary wording

Overly long prompts can introduce confusion.

5. Test edge cases intentionally

Examples:

  • sarcasm
  • mixed sentiment
  • minimal text
  • unclear statements
  • irrelevant input
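
Strategies 1 through 3 can be combined in a single prompt builder: a tight output format, a decision rule, and few-shot examples. A sketch; the example pairs and function name are illustrative:

```python
# Build a constrained few-shot classification prompt from example pairs.

EXAMPLES = [
    ("The app works well.", "positive"),
    ("It fails to load.", "negative"),
]

def build_prompt(feedback: str) -> str:
    """Combine format constraint, decision rule, and few-shot examples."""
    shots = "\n\n".join(
        f'Feedback: "{text}"\nLabel: {label}' for text, label in EXAMPLES
    )
    return (
        "Return exactly one word: positive, negative, or neutral.\n"
        "If the text contains both praise and criticism, choose the "
        "stronger sentiment.\n\n"
        f"{shots}\n\n"
        f'Feedback: "{feedback}"\nLabel:'
    )

print(build_prompt("Great features, but the setup was frustrating."))
```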

10. Common Mistakes in Prompt Iteration

Mistake 1: Changing too many things at once

If you rewrite the entire prompt, it becomes hard to know which change caused the improvement.

Better: change one dimension at a time.

Mistake 2: Testing on too few examples

A single success case proves very little.

Mistake 3: Ignoring formatting failures

Even if the content is correct, formatting errors can break downstream systems.

Mistake 4: Not defining what “good” means

Before testing, decide:

  • correct answer
  • expected format
  • length constraints
  • acceptable variation

Mistake 5: Assuming the first good result is production-ready

Real applications require repeatability.
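
One way to guard against this mistake is a quick repeatability check: run the same input several times and measure how often the answers agree. A sketch with a stubbed, deterministic model function so it runs offline; in practice `run_once` would wrap an API call:

```python
# Measure agreement across repeated runs of the same prompt.

from collections import Counter

def consistency(run_once, prompt: str, n: int = 5) -> float:
    """Fraction of n runs that return the most common answer."""
    answers = [run_once(prompt) for _ in range(n)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n

def stub(prompt: str) -> str:
    """Deterministic stand-in for a real model call."""
    return "negative"

score = consistency(stub, "Classify: setup was frustrating", n=5)
print(f"Agreement: {score:.0%}")
```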


11. Mini Challenge: Improve a Summarization Prompt

Scenario

You want the model to summarize customer support emails for an internal dashboard.

Weak Prompt

Summarize this email:
{email_text}

Problems

  • no length guidance
  • no structure
  • may omit action items
  • may return inconsistent formats

Better Version

Summarize the following customer support email in exactly 3 bullet points.

Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned

Return only the bullet points.

Email:
{email_text}

Reflection Questions

  • What output inconsistencies does this prevent?
  • What edge cases should be tested?
  • Would examples improve reliability?

12. Hands-On Exercise 4: Iterate on a Summarization Prompt

Goal

Test two summarization prompts and inspect output quality.

Python Script: Summarization Prompt Iteration

from openai import OpenAI
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

email_text = """
Hi support team,

I upgraded to the Pro plan yesterday, but I still cannot access the reporting dashboard.
I need this fixed before Friday because I have to present results to my manager.
Please let me know if you need any account details from me.

Thanks,
Jordan
""".strip()

def weak_prompt(email: str) -> str:
    """A vague summarization prompt."""
    return f"""
Summarize this email:

{email}
""".strip()

def improved_prompt(email: str) -> str:
    """A more structured summarization prompt."""
    return f"""
Summarize the following customer support email in exactly 3 bullet points.

Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned

Return only the bullet points.

Email:
{email}
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Call the model and return the output text."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()

weak_result = run_prompt(weak_prompt(email_text))
improved_result = run_prompt(improved_prompt(email_text))

print("=== Weak Prompt Output ===")
print(weak_result)
print("\n=== Improved Prompt Output ===")
print(improved_result)

Example Output

=== Weak Prompt Output ===
The customer says they upgraded to the Pro plan but still cannot access the reporting dashboard. They need the issue resolved soon and are willing to provide account details.

=== Improved Prompt Output ===
- Customer upgraded to the Pro plan but still cannot access the reporting dashboard.
- They want support to resolve the access issue and will provide account details if needed.
- The issue is urgent because they need it fixed before Friday for a presentation to their manager.

Discussion

The improved prompt is superior because it is:

  • structured
  • predictable
  • easier to consume downstream
  • aligned to business needs

13. Practical Guidance for Real Projects

In real systems, prompt iteration should be treated like software iteration.

  • keep prompts in version-controlled files
  • maintain test datasets
  • compare prompt versions side by side
  • document why changes were made
  • evaluate with representative user inputs
  • monitor production failures and add them to your test set

Simple Versioning Example

PROMPT_V1 = """
Classify the sentiment of the following feedback.
""".strip()

PROMPT_V2 = """
Classify the sentiment into exactly one label: positive, negative, neutral.
Return only the label.
""".strip()

This makes experiments more reproducible.
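
Prompt versions can also live in separate files under version control rather than as module constants. A minimal sketch, assuming an illustrative `prompts/` directory layout; the files are written here only so the example is self-contained:

```python
# Load prompt versions from files. Normally these files would be
# committed to git, not generated by the script.

from pathlib import Path

PROMPT_DIR = Path("prompts")
PROMPT_DIR.mkdir(exist_ok=True)

(PROMPT_DIR / "sentiment_v1.txt").write_text(
    "Classify the sentiment of the following feedback.\n"
)
(PROMPT_DIR / "sentiment_v2.txt").write_text(
    "Classify the sentiment into exactly one label: "
    "positive, negative, neutral.\nReturn only the label.\n"
)

def load_prompt(name: str, version: int) -> str:
    """Load a specific prompt version from disk."""
    return (PROMPT_DIR / f"{name}_v{version}.txt").read_text().strip()

print(load_prompt("sentiment", 2))
```

Keeping each version in its own file makes diffs between prompt versions visible in code review.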


14. Recap

In this session, you learned that effective prompting requires:

  • repeated testing
  • representative examples
  • clear evaluation criteria
  • controlled iteration

You also practiced:

  • running prompts with the OpenAI Responses API
  • comparing prompt variants
  • creating a small evaluation harness
  • improving summarization and classification prompts

15. Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • python-dotenv: https://pypi.org/project/python-dotenv/

16. Suggested Practice After Class

  1. Create a test set of 10 examples for a task you care about.
  2. Write one baseline prompt and two improved variants.
  3. Evaluate all three prompts using a Python script.
  4. Record accuracy, formatting consistency, and common failure cases.
  5. Refine again based on failures.

17. End-of-Session Takeaway

Prompting improves fastest when you stop guessing and start testing.

