Session 4: Versioning Prompts, Models, and Agent Behaviors
Synopsis
Introduces disciplined change management for prompts, retrieval settings, tools, workflows, and models. Learners understand how to evolve systems safely while preserving reproducibility and auditability.
Session Content
Session 4: Versioning Prompts, Models, and Agent Behaviors
Session Overview
In this session, you will learn how to systematically version and evaluate the moving parts of a GenAI application:
- Prompts: instructions, message structure, and reusable prompt templates
- Models: which model version you use and how model changes affect outputs
- Agent behaviors: tool usage policy, routing logic, memory strategy, and guardrails
By the end of this session, you will be able to:
- Explain why versioning matters in GenAI systems
- Design a simple versioning strategy for prompts, model configs, and agent behavior
- Store prompt and behavior configurations in code-friendly formats
- Run side-by-side comparisons using the OpenAI Responses API
- Build a lightweight regression testing workflow in Python
Learning Objectives
After this session, learners should be able to:
- Define what should be versioned in an LLM-powered system
- Separate prompt content from application logic
- Track model and parameter changes in a reproducible way
- Compare agent behavior versions using representative test cases
- Create a small evaluation harness to detect behavior drift
Agenda for a 45-Minute Session
- 0–10 min: Why versioning matters in GenAI systems
- 10–20 min: What to version: prompts, models, and agent behavior
- 20–30 min: Hands-on Exercise 1 — prompt and model version comparison
- 30–40 min: Hands-on Exercise 2 — agent behavior configs and regression checks
- 40–45 min: Wrap-up and key takeaways
1. Why Versioning Matters
Traditional software versioning usually tracks code changes. In GenAI systems, application behavior depends on more than code:
- The prompt
- The model
- The tooling and agent policy
- The system instructions
- The input formatting
- The post-processing logic
A small wording change in a system prompt can produce very different answers. Changing the model can alter:
- style
- factuality
- tool-use tendencies
- verbosity
- reliability on edge cases
If these changes are not tracked, teams often run into:
- “Why did the agent suddenly become more verbose?”
- “Why did this customer support answer change?”
- “Why is the model now calling tools more often?”
- “Why did tests pass last week but fail today?”
Key idea
In GenAI applications, behavior is configuration plus model plus instructions, not just code.
2. What Should Be Versioned
2.1 Prompt Versioning
Prompts should be treated as first-class artifacts.
Examples:
- system prompts
- reusable task instructions
- response style templates
- tool usage instructions
- safety and refusal policies
A good prompt version should include:
- a name
- a version
- the prompt text
- optionally: metadata such as owner, purpose, and last updated date
Example prompt versions
v1 - “Summarize the customer message in 3 bullet points.”
v2 - “Summarize the customer message in exactly 3 concise bullet points. Include urgency if present.”
That is a behavior change. It should be versioned.
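The fields listed above can be captured as a small record type. A minimal sketch, assuming a frozen dataclass; the `PromptVersion` name and its fields are illustrative, not a required schema:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """A versioned prompt artifact with optional metadata."""
    name: str
    version: str
    text: str
    metadata: dict = field(default_factory=dict)

    @property
    def prompt_id(self) -> str:
        # Stable identifier such as "support_summary:v2".
        return f"{self.name}:{self.version}"


v2 = PromptVersion(
    name="support_summary",
    version="v2",
    text=(
        "Summarize the customer message in exactly 3 concise bullet points. "
        "Include urgency if present."
    ),
    metadata={"owner": "support-ai-team", "reason": "add urgency detection"},
)
```

Because the instance is frozen, a "change" necessarily means creating a new version object rather than mutating the old one.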
2.2 Model Versioning
You should record:
- model name
- relevant generation settings
- date of change
- reason for change
Typical config fields:
- model
- temperature, if applicable in your workflow
- max output length settings
- tool configuration
- reasoning settings if used
- response format expectations
Even if you keep the same prompt, switching models may change output quality and behavior.
2.3 Agent Behavior Versioning
For an agent, prompt versioning alone is not enough.
You should also version:
- tool call policy
- routing rules
- memory strategy
- fallback behavior
- escalation conditions
- formatting policy
- refusal policy
Example
Behavior v1 - Always answer directly unless user explicitly asks for external data
Behavior v2 - Use tools whenever a question requires current or verifiable information
This can fundamentally change agent behavior.
3. Principles for Good Versioning
3.1 Keep Config Separate from Code
Instead of hardcoding prompts deep inside functions, define versioned configs in:
- JSON
- YAML
- Python dictionaries/classes
- a dedicated config module
This makes diffs and comparisons easier.
3.2 Use Stable Names
Use clear identifiers such as:
- support_summary_prompt:v1
- support_summary_prompt:v2
- agent_behavior:v1
- agent_behavior:v2
3.3 Track Why a Change Was Made
Each version should have a short rationale.
Example:
- v2 adds urgency detection because the support team needs prioritization
- v3 reduces verbosity for the mobile app UI
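One lightweight way to keep rationales close to the versions themselves is a changelog mapping. A minimal sketch; the `CHANGELOG` name and the `describe` helper are illustrative:

```python
# Map each version identifier to a one-line rationale for the change.
CHANGELOG = {
    "support_summary:v2": "adds urgency detection because the support team needs prioritization",
    "support_summary:v3": "reduces verbosity for the mobile app UI",
}


def describe(version_id: str) -> str:
    """Return the recorded rationale for a version, or a placeholder."""
    return CHANGELOG.get(version_id, "no rationale recorded")
```

Keeping this mapping in the same file (or commit) as the prompt text means Git diffs show the change and its reason together.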
3.4 Evaluate Before Replacing
Do not update prompts or model settings in production without testing representative examples.
You need:
- a benchmark set of inputs
- expected qualities or outputs
- a comparison process
4. A Simple Versioning Design
Below is a lightweight Python-based structure for versioning prompts and behaviors.
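One possible shape, assuming plain in-memory dictionaries keyed by name:version identifiers; the registry names and the `resolve` helper are illustrative, not a required API:

```python
# Versioned prompt texts, keyed by "name:version".
PROMPTS = {
    "support_summary:v1": {
        "system_prompt": "Summarize the customer message in 3 bullet points.",
        "reason": "initial version",
    },
    "support_summary:v2": {
        "system_prompt": (
            "Summarize the customer message in exactly 3 concise bullet points. "
            "Include urgency if present."
        ),
        "reason": "add urgency detection",
    },
}

# Versioned model configurations.
MODEL_CONFIGS = {
    "summary_model:v1": {"model": "gpt-5.4-mini"},
}

# Versioned agent behavior policies, referencing prompt and model ids.
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "model_id": "summary_model:v1",
        "suggest_escalation": False,
    },
}


def resolve(behavior_id: str) -> dict:
    """Expand a behavior id into its concrete prompt text and model name."""
    behavior = AGENT_BEHAVIORS[behavior_id]
    return {
        "behavior_id": behavior_id,
        "system_prompt": PROMPTS[behavior["prompt_id"]]["system_prompt"],
        "model": MODEL_CONFIGS[behavior["model_id"]]["model"],
        "suggest_escalation": behavior["suggest_escalation"],
    }
```

Because each layer references the others only by id, a behavior change, a prompt change, and a model change each show up as a distinct, diffable edit.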
5. Hands-on Exercise 1: Compare Prompt Versions with the Responses API
Goal
Create two prompt versions and compare their outputs on the same input using gpt-5.4-mini.
What you will build
A small Python script that:
- defines prompt versions
- calls the OpenAI Responses API
- prints side-by-side results
Setup
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Code: Compare Two Prompt Versions
"""
Exercise 1: Compare prompt versions with the OpenAI Responses API.
This script demonstrates how to:
1. Define versioned prompts in Python dictionaries
2. Invoke the OpenAI Responses API
3. Compare outputs across prompt versions
Model used:
- gpt-5.4-mini
"""
from openai import OpenAI
# Create the OpenAI client.
# The SDK automatically reads OPENAI_API_KEY from the environment.
client = OpenAI()
# Versioned prompt definitions.
PROMPTS = {
    "support_summary:v1": {
        "name": "support_summary",
        "version": "v1",
        "description": "Basic customer message summarization",
        "system_prompt": (
            "You are a customer support assistant. "
            "Summarize the customer's message in 3 bullet points."
        ),
    },
    "support_summary:v2": {
        "name": "support_summary",
        "version": "v2",
        "description": "Summarization with urgency detection",
        "system_prompt": (
            "You are a customer support assistant. "
            "Summarize the customer's message in exactly 3 concise bullet points. "
            "If urgency is present, mention it clearly."
        ),
    },
}
# Shared user input for comparison.
customer_message = """
Hi, I placed an order five days ago and paid extra for express shipping,
but it still hasn't arrived. I need the package before tomorrow because it contains
materials for a client presentation. If it won't arrive in time, I need a refund.
"""
def run_prompt(prompt_config: dict, user_text: str) -> str:
    """
    Call the Responses API using a versioned prompt configuration.

    Args:
        prompt_config: A dictionary containing the prompt definition.
        user_text: The customer message to summarize.

    Returns:
        The model's text output.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": prompt_config["system_prompt"],
            },
            {
                "role": "user",
                "content": user_text,
            },
        ],
    )
    # output_text is the easiest way to retrieve the final text response.
    return response.output_text
if __name__ == "__main__":
    for prompt_id, prompt_config in PROMPTS.items():
        print("=" * 80)
        print(f"Prompt ID: {prompt_id}")
        print(f"Description: {prompt_config['description']}")
        print("-" * 80)
        output = run_prompt(prompt_config, customer_message)
        print(output)
        print()
Example Output
================================================================================
Prompt ID: support_summary:v1
Description: Basic customer message summarization
--------------------------------------------------------------------------------
- The customer placed an order five days ago with express shipping, but it has not arrived.
- The package is needed before tomorrow for a client presentation.
- The customer wants a refund if the package will not arrive in time.
================================================================================
Prompt ID: support_summary:v2
Description: Summarization with urgency detection
--------------------------------------------------------------------------------
- The customer ordered a package five days ago with express shipping, but it still has not arrived.
- The package is urgently needed before tomorrow for materials required in a client presentation.
- The customer requests a refund if delivery cannot happen in time.
Discussion
Notice how v2 is not radically different, but it encodes a clearer operational behavior:
- exactly 3 concise bullets
- explicit urgency handling
That is precisely the kind of change worth versioning.
6. Versioning Model Configurations
Prompt versioning alone is incomplete. You also want model configuration versioning.
Example configuration structure
MODEL_CONFIGS = {
    "summary_model:v1": {
        "model": "gpt-5.4-mini",
        "notes": "Default summary model",
    },
    "summary_model:v2": {
        "model": "gpt-5.4-mini",
        "notes": "Same model, reserved for future config changes",
    },
}
In real projects, you may include more settings if your application uses them. The key point is:
- store them explicitly
- name them clearly
- record when they change
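One way to make the config authoritative is to assemble the API call arguments from it, so the model name never appears inline in application code. A sketch under that assumption; `build_request_kwargs` is an illustrative helper, not part of the SDK:

```python
MODEL_CONFIGS = {
    "summary_model:v1": {
        "model": "gpt-5.4-mini",
        "notes": "Default summary model",
    },
}


def build_request_kwargs(config_id: str, system_prompt: str, user_text: str) -> dict:
    """Assemble keyword arguments for client.responses.create from a config id."""
    config = MODEL_CONFIGS[config_id]
    return {
        "model": config["model"],
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


# Usage: client.responses.create(**build_request_kwargs("summary_model:v1", sys_prompt, msg))
```

If the model or its settings change, only the config entry changes, and the change is visible in version control.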
7. Agent Behavior as Versioned Configuration
Agent behavior is more than the prompt text.
Consider a support triage agent. Its behavior may include:
- whether to classify urgency
- whether to suggest escalation
- whether to ask clarifying questions
- whether to prioritize brevity
Example behavior configs
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "requires_urgency_detection": False,
        "suggest_escalation": False,
        "style": "brief",
    },
    "triage_agent:v2": {
        "prompt_id": "support_summary:v2",
        "requires_urgency_detection": True,
        "suggest_escalation": True,
        "style": "brief",
    },
}
This lets you distinguish:
- prompt changes
- behavior policy changes
- model changes
8. Hands-on Exercise 2: Build a Small Regression Harness for Agent Behavior
Goal
Create a simple testing loop that compares two behavior versions on multiple test cases.
What you will build
A script that:
- stores test cases
- runs an agent behavior version on each case
- prints outputs
- performs lightweight checks
Code: Versioned Agent Behavior Runner
"""
Exercise 2: Regression testing for versioned agent behaviors.
This script demonstrates how to:
1. Define prompt versions and agent behavior versions
2. Run multiple test cases through the Responses API
3. Perform lightweight regression checks
Model used:
- gpt-5.4-mini
"""
from openai import OpenAI
client = OpenAI()
PROMPTS = {
    "support_summary:v1": {
        "system_prompt": (
            "You are a customer support triage assistant. "
            "Summarize the user issue in 3 bullet points."
        )
    },
    "support_summary:v2": {
        "system_prompt": (
            "You are a customer support triage assistant. "
            "Summarize the user issue in exactly 3 concise bullet points. "
            "If urgency is present, explicitly label it. "
            "If the issue could affect a deadline, mention that."
        )
    },
}
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "expect_urgency_signal": False,
    },
    "triage_agent:v2": {
        "prompt_id": "support_summary:v2",
        "expect_urgency_signal": True,
    },
}
TEST_CASES = [
    {
        "id": "case_001",
        "input": (
            "My package was supposed to arrive today, but the tracking hasn't updated. "
            "I need it for an event tomorrow morning."
        ),
        "should_contain": ["tomorrow"],
    },
    {
        "id": "case_002",
        "input": (
            "I was charged twice for my subscription renewal. "
            "Please fix the billing issue."
        ),
        "should_contain": ["charged twice", "billing"],
    },
    {
        "id": "case_003",
        "input": (
            "The app crashes when I upload a PDF from my phone. "
            "This started after yesterday's update."
        ),
        "should_contain": ["crashes", "update"],
    },
]
def run_behavior(behavior_id: str, user_text: str) -> str:
    """
    Run a specific behavior version on a user input.

    Args:
        behavior_id: The behavior version identifier.
        user_text: The raw customer message.

    Returns:
        The model-generated response text.
    """
    behavior = AGENT_BEHAVIORS[behavior_id]
    prompt = PROMPTS[behavior["prompt_id"]]
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": prompt["system_prompt"]},
            {"role": "user", "content": user_text},
        ],
    )
    return response.output_text
def simple_check(output: str, expected_terms: list[str]) -> dict:
    """
    Perform a basic keyword presence check.

    Args:
        output: Model output text.
        expected_terms: Terms we hope appear in the response.

    Returns:
        A dictionary with passed status and missing terms.
    """
    normalized = output.lower()
    missing = [term for term in expected_terms if term.lower() not in normalized]
    return {
        "passed": len(missing) == 0,
        "missing_terms": missing,
    }
if __name__ == "__main__":
    for behavior_id in AGENT_BEHAVIORS:
        print("\n" + "#" * 80)
        print(f"Running behavior: {behavior_id}")
        print("#" * 80)
        for case in TEST_CASES:
            print("\n" + "-" * 80)
            print(f"Test case: {case['id']}")
            print(f"Input: {case['input']}")
            output = run_behavior(behavior_id, case["input"])
            result = simple_check(output, case["should_contain"])
            print("\nOutput:")
            print(output)
            print("\nRegression Check:")
            print(f"Passed: {result['passed']}")
            if result["missing_terms"]:
                print(f"Missing terms: {result['missing_terms']}")
            else:
                print("All expected terms found.")
Example Output
################################################################################
Running behavior: triage_agent:v1
################################################################################
--------------------------------------------------------------------------------
Test case: case_001
Input: My package was supposed to arrive today, but the tracking hasn't updated. I need it for an event tomorrow morning.
Output:
- The customer reports that a package expected today has not arrived.
- The tracking information has not been updated.
- The package is needed for an event tomorrow morning.
Regression Check:
Passed: True
All expected terms found.
################################################################################
Running behavior: triage_agent:v2
################################################################################
--------------------------------------------------------------------------------
Test case: case_001
Input: My package was supposed to arrive today, but the tracking hasn't updated. I need it for an event tomorrow morning.
Output:
- Delivery issue: the package expected today has not arrived and tracking is not updating.
- Urgency: the package is needed by tomorrow morning for an event.
- Deadline impact: failure to deliver on time affects a near-term commitment.
Regression Check:
Passed: True
All expected terms found.
9. Designing Better Regression Tests
The previous example used simple keyword checks. In real projects, evaluation can be stronger.
Better evaluation dimensions
You can evaluate for:
- factual coverage
- format compliance
- safety compliance
- tool usage correctness
- escalation correctness
- brevity or verbosity
- presence of required labels such as “Urgency”
Example criteria table
| Criterion | Description | Example Check |
|---|---|---|
| Format | Must return exactly 3 bullets | Count lines beginning with - |
| Urgency detection | Must mention urgency when deadline exists | Search for “urgency” or “urgent” |
| Billing recognition | Must identify payment issue | Search for billing-related concepts |
| Escalation policy | Must recommend escalation for severe issues | Search for “escalate” |
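The checks in the table can be implemented as small predicate functions. A minimal illustration; the function names and the sample output below are invented for the example:

```python
def check_format(output: str) -> bool:
    """Format: must return exactly 3 bullet lines."""
    bullets = [line for line in output.splitlines() if line.strip().startswith("-")]
    return len(bullets) == 3


def check_urgency(output: str) -> bool:
    """Urgency detection: must mention urgency when a deadline exists."""
    text = output.lower()
    return "urgency" in text or "urgent" in text


def check_escalation(output: str) -> bool:
    """Escalation policy: must recommend escalation for severe issues."""
    return "escalate" in output.lower()


# A hypothetical model output used to exercise the checks.
sample = (
    "- Urgency: the package is needed by tomorrow morning.\n"
    "- Tracking has not updated since yesterday.\n"
    "- Recommend we escalate to the shipping team."
)
```

Keyword predicates like these are deliberately crude; they catch regressions cheaply, and stronger semantic checks can be layered on later.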
10. Practical Versioning Strategies
Strategy A: Store versions in Python constants
Good for:
- tutorials
- prototypes
- small internal apps

Pros:
- simple
- readable
- easy to diff in Git

Cons:
- less friendly for non-developers
Strategy B: Store versions in JSON or YAML
Good for:
- teams with prompt editors
- externalized configuration
- CI/CD pipelines
Example YAML shape:
name: support_summary
version: v2
system_prompt: >
  You are a customer support assistant.
  Summarize the customer's message in exactly 3 concise bullet points.
  If urgency is present, mention it clearly.
metadata:
  owner: support-ai-team
  reason: add urgency detection
Strategy C: Version by Directory Structure
Example:
prompts/
  support_summary/
    v1.txt
    v2.txt
behaviors/
  triage_agent/
    v1.json
    v2.json
This works especially well with Git.
11. Recommended Workflow for Teams
A practical workflow:
- Create a new version instead of editing the old one silently
- Record the reason for the change
- Run benchmark cases across old and new versions
- Compare outputs
- Approve and promote only after review
Suggested release note template
Artifact: triage_agent
New version: v2
Changed from: v1
Reason:
- Add urgency detection
- Better mention deadline-related risk
Expected improvements:
- More consistent handling of time-sensitive delivery issues
Risks:
- Slightly more verbose summaries
12. Common Pitfalls
Pitfall 1: Changing multiple variables at once
If you change:
- the prompt
- the model
- the behavior policy
all at once, you will not know which change caused the difference in results.
Better: change one major variable at a time when possible.
Pitfall 2: No representative test set
If you only test one happy-path example, your version comparison is weak.
Better: include:
- straightforward examples
- ambiguous examples
- edge cases
- failure-prone inputs
Pitfall 3: Overfitting prompts to tests
If you optimize too narrowly for benchmark cases, the agent may degrade on real-world data.
Better: maintain a growing and diverse evaluation set.
Pitfall 4: Versioning prompts but not behavior policy
For agents, important behavior often lives outside the prompt:
- routing
- tool use policy
- fallback logic
These should also be versioned.
13. Mini Project Challenge
Task
Build a small “prompt lab” for a support summarization assistant.
Requirements
- Create 2 prompt versions
- Create 2 behavior versions
- Use gpt-5.4-mini
- Add at least 5 test cases
- Print:
- version id
- input
- output
- simple regression result
Stretch Goal
Add a format check that verifies the model returns exactly 3 bullets.
Example helper function
def count_bullets(text: str) -> int:
    """
    Count lines that appear to be markdown-style bullet points.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("-"))
14. Wrap-up
Versioning in GenAI systems is about making behavior reproducible, reviewable, and testable.
Key takeaways
- Prompts should be versioned like code
- Model configuration changes should be tracked explicitly
- Agent behavior includes more than prompts
- Evaluation is essential before promoting a new version
- A simple Python harness can go a long way in preventing regressions
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
End-of-Session Checklist
By the end of this session, you should be able to:
- [ ] Explain why prompts, models, and agent behaviors should be versioned
- [ ] Store prompt versions separately from core application logic
- [ ] Compare two prompt versions using the Responses API
- [ ] Build a simple regression harness for agent behaviors
- [ ] Identify and reduce behavior drift in an LLM application
Suggested Homework
- Take an existing GenAI script you have written and extract the prompt into a versioned config.
- Create v1 and v2 variants of the prompt.
- Write 5 test inputs and compare outputs across versions.
- Document which version you would ship and why.