Session 4: Testing and Iterating on Prompts
Synopsis
Introduces a disciplined process for comparing prompt variations, identifying failure cases, and improving consistency. Learners start thinking like engineers who validate prompt performance rather than relying on one-off success.
Session Content
Session Overview
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Focus: Learning how to systematically test, evaluate, and improve prompts for GenAI applications using the OpenAI Responses API and Python SDK.
Learning Objectives
By the end of this session, learners will be able to:
- Explain why prompt iteration is necessary in real-world GenAI applications.
- Identify common prompt failure modes.
- Create simple prompt test cases in Python.
- Compare prompt variants using repeatable evaluation criteria.
- Use the OpenAI Responses API with gpt-5.4-mini to run prompt experiments.
- Improve prompts based on observed outputs.
1. Why Prompt Iteration Matters
Prompting is not a one-shot activity. Even when a prompt “works” for one example, it may fail for:
- different user inputs
- ambiguous wording
- edge cases
- formatting constraints
- tone/style requirements
- factual reliability expectations
Key Idea
A good prompt is:
- clear
- specific
- testable
- robust across examples
Typical Prompt Failure Modes
- Too vague: the model gives broad or inconsistent answers.
- Missing output format: the model responds in an unexpected structure.
- Insufficient constraints: the answer is too long, too short, too technical, or off-topic.
- No examples: the model misunderstands the desired style or task.
- Conflicting instructions: the prompt asks for mutually incompatible behavior.
- Edge-case fragility: the prompt works for normal inputs but breaks on unusual ones.
Example
Weak prompt:
Summarize this email.
Improved prompt:
Summarize the following email in 3 bullet points.
Focus on:
1. the main request
2. deadlines
3. any action items
Email:
...
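Prompt text like this is easiest to test when it lives in a small template function, so the same instructions are applied to every email. A minimal sketch (the function name `build_email_summary_prompt` is our own, not part of any SDK):

```python
def build_email_summary_prompt(email_text: str) -> str:
    """Fill the improved summarization template with one email."""
    return f"""Summarize the following email in 3 bullet points.
Focus on:
1. the main request
2. deadlines
3. any action items
Email:
{email_text}""".strip()

# The template stays fixed; only the email changes between calls.
print(build_email_summary_prompt("Please send the Q3 report by Friday."))
```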
2. A Simple Prompt Iteration Workflow
A practical workflow for prompt improvement:
Step 1: Define the task clearly
Examples:
- summarize support tickets
- classify sentiment
- rewrite text in simpler language
- extract structured fields
Step 2: Write an initial prompt
Start simple, but explicit.
Step 3: Build a small test set
Use 5–10 representative examples:
- typical inputs
- tricky inputs
- edge cases
Step 4: Evaluate outputs
Check for:
- correctness
- consistency
- format compliance
- relevance
- brevity or detail as required
Step 5: Refine the prompt
Adjust based on failures:
- add constraints
- clarify goal
- specify output schema
- provide examples
- narrow scope
Step 6: Re-test
Prompt engineering is iterative.
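The six steps above form a loop: run the test set, find failures, revise, and run again. A rough sketch of that loop, with the model call and evaluation passed in as functions (all names here are illustrative, not from any library):

```python
def iterate_on_prompt(prompt, test_set, run_model, evaluate, max_rounds=3):
    """Run the test set repeatedly, reporting failures each round (sketch)."""
    for round_num in range(1, max_rounds + 1):
        failures = []
        for case in test_set:
            output = run_model(prompt, case["input"])
            if not evaluate(output, case["expected"]):
                failures.append(case)
        passed = len(test_set) - len(failures)
        print(f"Round {round_num}: {passed}/{len(test_set)} passed")
        if not failures:
            return prompt, []
        # In a real session you would revise `prompt` here based on the failures.
    return prompt, failures

# Toy demonstration with a fake model that always answers "positive"
cases = [{"input": "Great app!", "expected": "positive"}]
final_prompt, remaining = iterate_on_prompt(
    "v1", cases,
    run_model=lambda p, x: "positive",
    evaluate=lambda out, exp: out == exp,
)
```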
3. Designing Better Prompt Tests
When testing prompts, avoid relying on a single “looks good” example.
Good Prompt Test Sets Include
- Happy path examples
- Ambiguous examples
- Noisy/realistic examples
- Boundary cases
- Adversarial or confusing inputs
Example Task: Classify Customer Feedback
Possible labels:
- positive
- negative
- neutral
Test inputs:
"The app is fast and easy to use.""It crashes every time I upload a file.""The UI changed after the update.""Great features, but the setup was frustrating."
Notice that #4 may expose ambiguity.
Evaluation Questions
- Does the model always return one of the expected labels?
- Does it misclassify mixed sentiment?
- Does it include extra explanation when only a label is desired?
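The first and third questions can be checked mechanically: normalize the model's output and test whether it is one of the allowed labels. A small sketch (the helper name `check_label` is our own):

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def check_label(raw_output: str) -> tuple[bool, str]:
    """Return (is_valid, normalized_label) for one model response."""
    label = raw_output.strip().lower().rstrip(".")
    return label in ALLOWED_LABELS, label

print(check_label("Positive"))         # valid after normalization
print(check_label("mixed sentiment"))  # invalid: not an allowed label
```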
4. Hands-On Exercise 1: Run a Baseline Prompt
In this exercise, learners will test an initial prompt for sentiment classification.
Goal
Use gpt-5.4-mini and the Responses API to classify user feedback.
Setup
Install dependencies:
pip install openai python-dotenv
Create a .env file:
OPENAI_API_KEY=your_api_key_here
Python Script: Baseline Prompt Test
from openai import OpenAI
from dotenv import load_dotenv
import os
# Load environment variables from .env
load_dotenv()
# Create the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# A small test set for sentiment classification
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
]
# Initial baseline prompt template
def build_prompt(feedback: str) -> str:
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.
Feedback: "{feedback}"
""".strip()

# Run the prompt on each example and print the result
for text in feedback_examples:
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(text)
    )
    # Output text from the Responses API
    result = response.output_text.strip()
    print(f"Feedback: {text}")
    print(f"Model output: {result}")
    print("-" * 50)
Example Output
Feedback: The app is fast and easy to use.
Model output: positive
--------------------------------------------------
Feedback: It crashes every time I upload a file.
Model output: negative
--------------------------------------------------
Feedback: The UI changed after the update.
Model output: neutral
--------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Model output: neutral
--------------------------------------------------
Discussion
This may seem acceptable at first, but several issues can appear:
- extra explanation instead of a single label
- inconsistent handling of mixed sentiment
- non-standard outputs like “somewhat negative”
That means the prompt still needs improvement.
5. Improving Prompt Specificity
One of the easiest improvements is to constrain the output more tightly.
Better Prompt Design Principles
- State the exact allowed outputs.
- Tell the model what to do if the input is mixed or ambiguous.
- Specify formatting rules.
- Keep task instructions concise and unambiguous.
Improved Version
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral
Rules:
- Return only the label.
- If the feedback contains both praise and criticism, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.
Feedback: "Great features, but the setup was frustrating."
This prompt is more testable because it reduces ambiguity.
6. Hands-On Exercise 2: Compare Two Prompt Variants
Goal
Compare a baseline prompt with an improved prompt across the same examples.
Python Script: Prompt Comparison
from openai import OpenAI
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
# Create API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Test examples
feedback_examples = [
    "The app is fast and easy to use.",
    "It crashes every time I upload a file.",
    "The UI changed after the update.",
    "Great features, but the setup was frustrating.",
    "I love the design, but performance is terrible.",
]

def baseline_prompt(feedback: str) -> str:
    """A simple, under-specified prompt."""
    return f"""
Classify the sentiment of the following customer feedback as positive, negative, or neutral.
Feedback: "{feedback}"
""".strip()

def improved_prompt(feedback: str) -> str:
    """A more constrained and testable prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral
Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.
Feedback: "{feedback}"
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Send a prompt to the model and return the text output."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()

# Compare both prompt versions
for feedback in feedback_examples:
    baseline_result = run_prompt(baseline_prompt(feedback))
    improved_result = run_prompt(improved_prompt(feedback))
    print(f"Feedback: {feedback}")
    print(f"Baseline : {baseline_result}")
    print(f"Improved : {improved_result}")
    print("-" * 60)
Example Output
Feedback: The app is fast and easy to use.
Baseline : positive
Improved : positive
------------------------------------------------------------
Feedback: It crashes every time I upload a file.
Baseline : negative
Improved : negative
------------------------------------------------------------
Feedback: The UI changed after the update.
Baseline : neutral
Improved : neutral
------------------------------------------------------------
Feedback: Great features, but the setup was frustrating.
Baseline : mixed sentiment
Improved : negative
------------------------------------------------------------
Feedback: I love the design, but performance is terrible.
Baseline : mixed
Improved : negative
------------------------------------------------------------
What Improved?
The second prompt is better because it:
- limits valid outputs
- reduces formatting variance
- handles mixed sentiment explicitly
7. Using Structured Evaluation Criteria
Prompt iteration becomes more effective when you score outputs systematically.
Common Evaluation Dimensions
- Correctness — Is the answer right?
- Format compliance — Does it match the required structure?
- Completeness — Does it include all required parts?
- Conciseness — Is it appropriately brief?
- Consistency — Does it behave similarly across similar cases?
Example Scoring Table
| Test Case | Expected | Actual | Correct? | Format OK? | Notes |
|---|---|---|---|---|---|
| Fast and easy to use | positive | positive | Yes | Yes | Good |
| Crashes every time | negative | negative | Yes | Yes | Good |
| UI changed after update | neutral | neutral | Yes | Yes | Good |
| Great features, but frustrating | negative | mixed sentiment | No | No | Needs refinement |
Even a manual table like this is useful.
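A table like this can also be kept as data and scored automatically, which makes re-running it after each prompt change cheap. A sketch, with field names of our own choosing:

```python
ALLOWED = {"positive", "negative", "neutral"}

# Each row mirrors one line of the scoring table above
rows = [
    {"case": "Fast and easy to use", "expected": "positive", "actual": "positive"},
    {"case": "Great features, but frustrating", "expected": "negative", "actual": "mixed sentiment"},
]

def score_row(row: dict) -> dict:
    """Add correctness and format-compliance columns to one result row."""
    actual = row["actual"].strip().lower()
    return {
        **row,
        "correct": actual == row["expected"],
        "format_ok": actual in ALLOWED,
    }

for scored in map(score_row, rows):
    print(scored)
```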
8. Hands-On Exercise 3: Build a Mini Prompt Evaluation Harness
Goal
Create a small Python script that tests a prompt against expected results and reports accuracy.
Python Script: Evaluation Harness
from openai import OpenAI
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
# Create the API client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Test dataset with expected labels
test_cases = [
    {
        "input": "The app is fast and easy to use.",
        "expected": "positive",
    },
    {
        "input": "It crashes every time I upload a file.",
        "expected": "negative",
    },
    {
        "input": "The UI changed after the update.",
        "expected": "neutral",
    },
    {
        "input": "Great features, but the setup was frustrating.",
        "expected": "negative",
    },
    {
        "input": "I love the design, but performance is terrible.",
        "expected": "negative",
    },
]

def build_prompt(feedback: str) -> str:
    """Return the improved classification prompt."""
    return f"""
Classify the customer feedback into exactly one of these labels:
positive, negative, neutral
Rules:
- Return only one label.
- Return no additional text.
- If the feedback contains both positive and negative sentiment, choose the dominant sentiment.
- If no clear sentiment is expressed, return neutral.
Feedback: "{feedback}"
""".strip()

def get_model_label(feedback: str) -> str:
    """Call the model and normalize the returned label."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=build_prompt(feedback)
    )
    return response.output_text.strip().lower()

def evaluate_prompt(cases: list[dict]) -> None:
    """Evaluate prompt performance on a list of test cases."""
    correct = 0
    print("Running evaluation...\n")
    for i, case in enumerate(cases, start=1):
        prediction = get_model_label(case["input"])
        expected = case["expected"]
        is_correct = prediction == expected
        if is_correct:
            correct += 1
        print(f"Test case {i}")
        print(f"Input : {case['input']}")
        print(f"Expected : {expected}")
        print(f"Predicted : {prediction}")
        print(f"Correct : {is_correct}")
        print("-" * 60)
    accuracy = correct / len(cases) * 100
    print(f"\nFinal accuracy: {accuracy:.1f}%")

if __name__ == "__main__":
    evaluate_prompt(test_cases)
Example Output
Running evaluation...
Test case 1
Input : The app is fast and easy to use.
Expected : positive
Predicted : positive
Correct : True
------------------------------------------------------------
Test case 2
Input : It crashes every time I upload a file.
Expected : negative
Predicted : negative
Correct : True
------------------------------------------------------------
Test case 3
Input : The UI changed after the update.
Expected : neutral
Predicted : neutral
Correct : True
------------------------------------------------------------
Test case 4
Input : Great features, but the setup was frustrating.
Expected : negative
Predicted : negative
Correct : True
------------------------------------------------------------
Test case 5
Input : I love the design, but performance is terrible.
Expected : negative
Predicted : negative
Correct : True
------------------------------------------------------------
Final accuracy: 100.0%
Why This Matters
This is the beginning of prompt evaluation engineering:
- define expectations
- run consistent tests
- measure quality
- refine based on evidence
9. Iteration Strategies That Work Well
When a prompt underperforms, use these practical strategies.
1. Tighten the output format
Bad:
Tell me the sentiment.
Better:
Return exactly one word: positive, negative, or neutral.
2. Add decision rules
Useful when inputs are ambiguous.
Example:
If the text contains both praise and criticism, choose the stronger sentiment.
3. Add examples
Sometimes showing desired behavior helps.
Example:
Example:
Feedback: "The app works well."
Label: positive
Feedback: "It fails to load."
Label: negative
4. Reduce unnecessary wording
Overly long prompts can introduce confusion.
5. Test edge cases intentionally
Examples:
- sarcasm
- mixed sentiment
- minimal text
- unclear statements
- irrelevant input
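Edge cases like these can be collected into their own list and run through the same evaluation harness as the happy-path examples. A sketch, with inputs of our own invention:

```python
# One example per edge-case category from the list above
edge_cases = [
    "Oh great, it crashed again. Fantastic.",  # sarcasm
    "Love the design, hate the performance.",  # mixed sentiment
    "ok",                                      # minimal text
    "The update arrived yesterday.",           # unclear statement
    "What is your refund policy?",             # irrelevant input
]

# Pair each input with the label you expect, then feed the pairs
# to the same evaluation loop used for the standard test set.
for text in edge_cases:
    print(f"- {text}")
```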
10. Common Mistakes in Prompt Iteration
Mistake 1: Changing too many things at once
If you rewrite the entire prompt, it becomes hard to know what caused improvement.
Better: change one dimension at a time.
Mistake 2: Testing on too few examples
A single success case proves very little.
Mistake 3: Ignoring formatting failures
Even if the content is correct, formatting errors can break downstream systems.
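Format compliance is cheap to check programmatically. For the summarization task, a sketch that verifies the output is exactly 3 bullet points (the helper name is our own):

```python
def is_three_bullets(summary: str) -> bool:
    """Check that a summary is exactly 3 non-empty lines starting with '-'."""
    lines = [line.strip() for line in summary.strip().splitlines() if line.strip()]
    return len(lines) == 3 and all(line.startswith("-") for line in lines)

print(is_three_bullets("- issue\n- requested fix\n- deadline Friday"))  # True
print(is_three_bullets("Here is a summary: the app crashed."))          # False
```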
Mistake 4: Not defining what “good” means
Before testing, decide:
- correct answer
- expected format
- length constraints
- acceptable variation
Mistake 5: Assuming the first good result is production-ready
Real applications require repeatability.
11. Mini Challenge: Improve a Summarization Prompt
Scenario
You want the model to summarize customer support emails for an internal dashboard.
Weak Prompt
Summarize this email:
{email_text}
Problems
- no length guidance
- no structure
- may omit action items
- may return inconsistent formats
Better Version
Summarize the following customer support email in exactly 3 bullet points.
Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned
Return only the bullet points.
Email:
{email_text}
Reflection Questions
- What output inconsistencies does this prevent?
- What edge cases should be tested?
- Would examples improve reliability?
12. Hands-On Exercise 4: Iterate on a Summarization Prompt
Goal
Test two summarization prompts and inspect output quality.
Python Script: Summarization Prompt Iteration
from openai import OpenAI
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
# Create client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
email_text = """
Hi support team,
I upgraded to the Pro plan yesterday, but I still cannot access the reporting dashboard.
I need this fixed before Friday because I have to present results to my manager.
Please let me know if you need any account details from me.
Thanks,
Jordan
""".strip()
def weak_prompt(email: str) -> str:
    """A vague summarization prompt."""
    return f"""
Summarize this email:
{email}
""".strip()

def improved_prompt(email: str) -> str:
    """A more structured summarization prompt."""
    return f"""
Summarize the following customer support email in exactly 3 bullet points.
Include:
- the main issue
- the customer’s requested outcome
- any urgency or deadline mentioned
Return only the bullet points.
Email:
{email}
""".strip()

def run_prompt(prompt_text: str) -> str:
    """Call the model and return the output text."""
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text
    )
    return response.output_text.strip()
weak_result = run_prompt(weak_prompt(email_text))
improved_result = run_prompt(improved_prompt(email_text))
print("=== Weak Prompt Output ===")
print(weak_result)
print("\n=== Improved Prompt Output ===")
print(improved_result)
Example Output
=== Weak Prompt Output ===
The customer says they upgraded to the Pro plan but still cannot access the reporting dashboard. They need the issue resolved soon and are willing to provide account details.
=== Improved Prompt Output ===
- Customer upgraded to the Pro plan but still cannot access the reporting dashboard.
- They want support to resolve the access issue and will provide account details if needed.
- The issue is urgent because they need it fixed before Friday for a presentation to their manager.
Discussion
The improved prompt is superior because it is:
- structured
- predictable
- easier to consume downstream
- aligned to business needs
13. Practical Guidance for Real Projects
In real systems, prompt iteration should be treated like software iteration.
Recommended Practices
- keep prompts in version-controlled files
- maintain test datasets
- compare prompt versions side by side
- document why changes were made
- evaluate with representative user inputs
- monitor production failures and add them to your test set
Simple Versioning Example
PROMPT_V1 = """
Classify the sentiment of the following feedback.
""".strip()
PROMPT_V2 = """
Classify the sentiment into exactly one label: positive, negative, neutral.
Return only the label.
""".strip()
This makes experiments more reproducible.
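Once prompts are versioned as constants, an experiment can loop over them by name and build the full model input for each version. A sketch (the `build_input` helper is our own; in the earlier scripts this string would be passed to `client.responses.create`):

```python
PROMPT_V1 = """
Classify the sentiment of the following feedback.
""".strip()

PROMPT_V2 = """
Classify the sentiment into exactly one label: positive, negative, neutral.
Return only the label.
""".strip()

PROMPT_VERSIONS = {"v1": PROMPT_V1, "v2": PROMPT_V2}

def build_input(version: str, feedback: str) -> str:
    """Combine one versioned prompt with one feedback example."""
    return f'{PROMPT_VERSIONS[version]}\nFeedback: "{feedback}"'

# Side-by-side comparison: same input, every prompt version
for name in PROMPT_VERSIONS:
    print(f"--- {name} ---")
    print(build_input(name, "The app is fast and easy to use."))
```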
14. Recap
In this session, you learned that effective prompting requires:
- repeated testing
- representative examples
- clear evaluation criteria
- controlled iteration
You also practiced:
- running prompts with the OpenAI Responses API
- comparing prompt variants
- creating a small evaluation harness
- improving summarization and classification prompts
15. Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
- python-dotenv: https://pypi.org/project/python-dotenv/
16. Suggested Practice After Class
- Create a test set of 10 examples for a task you care about.
- Write one baseline prompt and two improved variants.
- Evaluate all three prompts using a Python script.
- Record:
- accuracy
- formatting consistency
- common failure cases
- Refine again based on failures.
17. End-of-Session Takeaway
Prompting improves fastest when you stop guessing and start testing.