Session 4: Versioning Prompts, Models, and Agent Behaviors
Synopsis
Introduces disciplined change management for prompts, retrieval settings, tools, workflows, and models. Learners understand how to evolve systems safely while preserving reproducibility and auditability.
Session Content
Session 4: Versioning Prompts, Models, and Agent Behaviors
Session Overview
In this session, you will learn how to systematically version and evaluate the moving parts of a GenAI application:
- Prompts: instructions, message structure, and reusable prompt templates
- Models: which model version you use and how model changes affect outputs
- Agent behaviors: tool usage policy, routing logic, memory strategy, and guardrails
By the end of this session, you will be able to:
- Explain why versioning matters in GenAI systems
- Design a simple versioning strategy for prompts, model configs, and agent behavior
- Store prompt and behavior configurations in code-friendly formats
- Run side-by-side comparisons using the OpenAI Responses API
- Build a lightweight regression testing workflow in Python
Learning Objectives
After this session, learners should be able to:
- Define what should be versioned in an LLM-powered system
- Separate prompt content from application logic
- Track model and parameter changes in a reproducible way
- Compare agent behavior versions using representative test cases
- Create a small evaluation harness to detect behavior drift
Agenda for a 45-Minute Session
- 0–10 min: Why versioning matters in GenAI systems
- 10–20 min: What to version: prompts, models, and agent behavior
- 20–30 min: Hands-on Exercise 1 — prompt and model version comparison
- 30–40 min: Hands-on Exercise 2 — agent behavior configs and regression checks
- 40–45 min: Wrap-up and key takeaways
1. Why Versioning Matters
Traditional software versioning usually tracks code changes. In GenAI systems, application behavior depends on more than code:
- The prompt
- The model
- The tooling and agent policy
- The system instructions
- The input formatting
- The post-processing logic
A small wording change in a system prompt can produce very different answers. Changing the model can alter:
- style
- factuality
- tool-use tendencies
- verbosity
- reliability on edge cases
If these changes are not tracked, teams often run into:
- “Why did the agent suddenly become more verbose?”
- “Why did this customer support answer change?”
- “Why is the model now calling tools more often?”
- “Why did tests pass last week but fail today?”
Key idea
In GenAI applications, behavior is configuration plus model plus instructions, not just code.
2. What Should Be Versioned
2.1 Prompt Versioning
Prompts should be treated as first-class artifacts.
Examples:
- system prompts
- reusable task instructions
- response style templates
- tool usage instructions
- safety and refusal policies
A good prompt version should include:
- a name
- a version
- the prompt text
- optionally: metadata such as owner, purpose, and last updated date
Example prompt versions
v1 - “Summarize the customer message in 3 bullet points.”
v2 - “Summarize the customer message in exactly 3 concise bullet points. Include urgency if present.”
That is a behavior change. It should be versioned.
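The fields listed above can be captured as a small record type. A minimal sketch, assuming a frozen dataclass; the `PromptVersion` name and its fields are illustrative, not a required schema:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """A versioned prompt artifact with optional metadata."""
    name: str
    version: str
    text: str
    metadata: dict = field(default_factory=dict)

    @property
    def prompt_id(self) -> str:
        # Stable identifier such as "support_summary:v2".
        return f"{self.name}:{self.version}"


v2 = PromptVersion(
    name="support_summary",
    version="v2",
    text=(
        "Summarize the customer message in exactly 3 concise bullet points. "
        "Include urgency if present."
    ),
    metadata={"owner": "support-ai-team", "reason": "add urgency detection"},
)
```

Because the instance is frozen, a "change" necessarily means creating a new version object rather than mutating the old one.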
2.2 Model Versioning
You should record:
- model name
- relevant generation settings
- date of change
- reason for change
Typical config fields:
- model
- temperature, if applicable in your workflow
- max output length settings
- tool configuration
- reasoning settings if used
- response format expectations
Even if you keep the same prompt, switching models may change output quality and behavior.
2.3 Agent Behavior Versioning
For an agent, prompt versioning alone is not enough.
You should also version:
- tool call policy
- routing rules
- memory strategy
- fallback behavior
- escalation conditions
- formatting policy
- refusal policy
Example
Behavior v1 - Always answer directly unless user explicitly asks for external data
Behavior v2 - Use tools whenever a question requires current or verifiable information
This can fundamentally change agent behavior.
3. Principles for Good Versioning
3.1 Keep Config Separate from Code
Instead of hardcoding prompts deep inside functions, define versioned configs in:
- JSON
- YAML
- Python dictionaries/classes
- a dedicated config module
This makes diffs and comparisons easier.
3.2 Use Stable Names
Use clear identifiers such as:
- support_summary_prompt:v1
- support_summary_prompt:v2
- agent_behavior:v1
- agent_behavior:v2
3.3 Track Why a Change Was Made
Each version should have a short rationale.
Example:
- v2 adds urgency detection because the support team needs prioritization
- v3 reduces verbosity for the mobile app UI
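One lightweight way to keep rationales close to the versions themselves is a changelog mapping. A minimal sketch; the `CHANGELOG` name and the `describe` helper are illustrative:

```python
# Map each version identifier to a one-line rationale for the change.
CHANGELOG = {
    "support_summary:v2": "adds urgency detection because the support team needs prioritization",
    "support_summary:v3": "reduces verbosity for the mobile app UI",
}


def describe(version_id: str) -> str:
    """Return the recorded rationale for a version, or a placeholder."""
    return CHANGELOG.get(version_id, "no rationale recorded")
```

Keeping this mapping in the same file (or commit) as the prompt text means Git diffs show the change and its reason together.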
3.4 Evaluate Before Replacing
Do not update prompts or model settings in production without testing representative examples.
You need:
- a benchmark set of inputs
- expected qualities or outputs
- a comparison process
4. A Simple Versioning Design
Below is a lightweight Python-based structure for versioning prompts and behaviors.
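One possible shape, assuming plain in-memory dictionaries keyed by name:version identifiers; the registry names and the `resolve` helper are illustrative, not a required API:

```python
# Versioned prompt texts, keyed by "name:version".
PROMPTS = {
    "support_summary:v1": {
        "system_prompt": "Summarize the customer message in 3 bullet points.",
        "reason": "initial version",
    },
    "support_summary:v2": {
        "system_prompt": (
            "Summarize the customer message in exactly 3 concise bullet points. "
            "Include urgency if present."
        ),
        "reason": "add urgency detection",
    },
}

# Versioned model configurations.
MODEL_CONFIGS = {
    "summary_model:v1": {"model": "gpt-5.4-mini"},
}

# Versioned agent behavior policies, referencing prompt and model ids.
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "model_id": "summary_model:v1",
        "suggest_escalation": False,
    },
}


def resolve(behavior_id: str) -> dict:
    """Expand a behavior id into its concrete prompt text and model name."""
    behavior = AGENT_BEHAVIORS[behavior_id]
    return {
        "behavior_id": behavior_id,
        "system_prompt": PROMPTS[behavior["prompt_id"]]["system_prompt"],
        "model": MODEL_CONFIGS[behavior["model_id"]]["model"],
        "suggest_escalation": behavior["suggest_escalation"],
    }
```

Because each layer references the others only by id, a behavior change, a prompt change, and a model change each show up as a distinct, diffable edit.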
5. Hands-on Exercise 1: Compare Prompt Versions with the Responses API
Goal
Create two prompt versions and compare their outputs on the same input using gpt-5.4-mini.
What you will build
A small Python script that:
- defines prompt versions
- calls the OpenAI Responses API
- prints side-by-side results
Setup
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Code: Compare Two Prompt Versions
"""
Exercise 1: Compare prompt versions with the OpenAI Responses API.
This script demonstrates how to:
1. Define versioned prompts in Python dictionaries
2. Invoke the OpenAI Responses API
3. Compare outputs across prompt versions
Model used:
- gpt-5.4-mini
"""
from openai import OpenAI
# Create the OpenAI client.
# The SDK automatically reads OPENAI_API_KEY from the environment.
client = OpenAI()
# Versioned prompt definitions.
PROMPTS = {
    "support_summary:v1": {
        "name": "support_summary",
        "version": "v1",
        "description": "Basic customer message summarization",
        "system_prompt": (
            "You are a customer support assistant. "
            "Summarize the customer's message in 3 bullet points."
        ),
    },
    "support_summary:v2": {
        "name": "support_summary",
        "version": "v2",
        "description": "Summarization with urgency detection",
        "system_prompt": (
            "You are a customer support assistant. "
            "Summarize the customer's message in exactly 3 concise bullet points. "
            "If urgency is present, mention it clearly."
        ),
    },
}
# Shared user input for comparison.
customer_message = """
Hi, I placed an order five days ago and paid extra for express shipping,
but it still hasn't arrived. I need the package before tomorrow because it contains
materials for a client presentation. If it won't arrive in time, I need a refund.
"""
def run_prompt(prompt_config: dict, user_text: str) -> str:
    """
    Call the Responses API using a versioned prompt configuration.

    Args:
        prompt_config: A dictionary containing the prompt definition.
        user_text: The customer message to summarize.

    Returns:
        The model's text output.
    """
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {
                "role": "system",
                "content": prompt_config["system_prompt"],
            },
            {
                "role": "user",
                "content": user_text,
            },
        ],
    )
    # output_text is the easiest way to retrieve the final text response.
    return response.output_text
if __name__ == "__main__":
    for prompt_id, prompt_config in PROMPTS.items():
        print("=" * 80)
        print(f"Prompt ID: {prompt_id}")
        print(f"Description: {prompt_config['description']}")
        print("-" * 80)
        output = run_prompt(prompt_config, customer_message)
        print(output)
        print()
Example Output
================================================================================
Prompt ID: support_summary:v1
Description: Basic customer message summarization
--------------------------------------------------------------------------------
- The customer placed an order five days ago with express shipping, but it has not arrived.
- The package is needed before tomorrow for a client presentation.
- The customer wants a refund if the package will not arrive in time.
================================================================================
Prompt ID: support_summary:v2
Description: Summarization with urgency detection
--------------------------------------------------------------------------------
- The customer ordered a package five days ago with express shipping, but it still has not arrived.
- The package is urgently needed before tomorrow for materials required in a client presentation.
- The customer requests a refund if delivery cannot happen in time.
Discussion
Notice how v2 is not radically different, but it encodes a clearer operational behavior:
- exactly 3 concise bullets
- explicit urgency handling
That is precisely the kind of change worth versioning.
6. Versioning Model Configurations
Prompt versioning alone is incomplete. You also want model configuration versioning.
Example configuration structure
MODEL_CONFIGS = {
    "summary_model:v1": {
        "model": "gpt-5.4-mini",
        "notes": "Default summary model",
    },
    "summary_model:v2": {
        "model": "gpt-5.4-mini",
        "notes": "Same model, reserved for future config changes",
    },
}
In real projects, you may include more settings if your application uses them. The key point is:
- store them explicitly
- name them clearly
- record when they change
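One way to make the config authoritative is to assemble the API call arguments from it, so the model name never appears inline in application code. A sketch under that assumption; `build_request_kwargs` is an illustrative helper, not part of the SDK:

```python
MODEL_CONFIGS = {
    "summary_model:v1": {
        "model": "gpt-5.4-mini",
        "notes": "Default summary model",
    },
}


def build_request_kwargs(config_id: str, system_prompt: str, user_text: str) -> dict:
    """Assemble keyword arguments for client.responses.create from a config id."""
    config = MODEL_CONFIGS[config_id]
    return {
        "model": config["model"],
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


# Usage: client.responses.create(**build_request_kwargs("summary_model:v1", sys_prompt, msg))
```

If the model or its settings change, only the config entry changes, and the change is visible in version control.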
7. Agent Behavior as Versioned Configuration
Agent behavior is more than the prompt text.
Consider a support triage agent. Its behavior may include:
- whether to classify urgency
- whether to suggest escalation
- whether to ask clarifying questions
- whether to prioritize brevity
Example behavior configs
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "requires_urgency_detection": False,
        "suggest_escalation": False,
        "style": "brief",
    },
    "triage_agent:v2": {
        "prompt_id": "support_summary:v2",
        "requires_urgency_detection": True,
        "suggest_escalation": True,
        "style": "brief",
    },
}
This lets you distinguish:
- prompt changes
- behavior policy changes
- model changes
8. Hands-on Exercise 2: Build a Small Regression Harness for Agent Behavior
Goal
Create a simple testing loop that compares two behavior versions on multiple test cases.
What you will build
A script that:
- stores test cases
- runs an agent behavior version on each case
- prints outputs
- performs lightweight checks
Code: Versioned Agent Behavior Runner
"""
Exercise 2: Regression testing for versioned agent behaviors.
This script demonstrates how to:
1. Define prompt versions and agent behavior versions
2. Run multiple test cases through the Responses API
3. Perform lightweight regression checks
Model used:
- gpt-5.4-mini
"""
from openai import OpenAI
client = OpenAI()
PROMPTS = {
    "support_summary:v1": {
        "system_prompt": (
            "You are a customer support triage assistant. "
            "Summarize the user issue in 3 bullet points."
        )
    },
    "support_summary:v2": {
        "system_prompt": (
            "You are a customer support triage assistant. "
            "Summarize the user issue in exactly 3 concise bullet points. "
            "If urgency is present, explicitly label it. "
            "If the issue could affect a deadline, mention that."
        )
    },
}
AGENT_BEHAVIORS = {
    "triage_agent:v1": {
        "prompt_id": "support_summary:v1",
        "expect_urgency_signal": False,
    },
    "triage_agent:v2": {
        "prompt_id": "support_summary:v2",
        "expect_urgency_signal": True,
    },
}
TEST_CASES = [
    {
        "id": "case_001",
        "input": (
            "My package was supposed to arrive today, but the tracking hasn't updated. "
            "I need it for an event tomorrow morning."
        ),
        "should_contain": ["tomorrow"],
    },
    {
        "id": "case_002",
        "input": (
            "I was charged twice for my subscription renewal. "
            "Please fix the billing issue."
        ),
        "should_contain": ["charged twice", "billing"],
    },
    {
        "id": "case_003",
        "input": (
            "The app crashes when I upload a PDF from my phone. "
            "This started after yesterday's update."
        ),
        "should_contain": ["crashes", "update"],
    },
]
def run_behavior(behavior_id: str, user_text: str) -> str:
    """
    Run a specific behavior version on a user input.

    Args:
        behavior_id: The behavior version identifier.
        user_text: The raw customer message.

    Returns:
        The model-generated response text.
    """
    behavior = AGENT_BEHAVIORS[behavior_id]
    prompt = PROMPTS[behavior["prompt_id"]]
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=[
            {"role": "system", "content": prompt["system_prompt"]},
            {"role": "user", "content": user_text},
        ],
    )
    return response.output_text
def simple_check(output: str, expected_terms: list[str]) -> dict:
    """
    Perform a basic keyword presence check.

    Args:
        output: Model output text.
        expected_terms: Terms we hope appear in the response.

    Returns:
        A dictionary with passed status and missing terms.
    """
    normalized = output.lower()
    missing = [term for term in expected_terms if term.lower() not in normalized]
    return {
        "passed": len(missing) == 0,
        "missing_terms": missing,
    }
if __name__ == "__main__":
    for behavior_id in AGENT_BEHAVIORS:
        print("\n" + "#" * 80)
        print(f"Running behavior: {behavior_id}")
        print("#" * 80)
        for case in TEST_CASES:
            print("\n" + "-" * 80)
            print(f"Test case: {case['id']}")
            print(f"Input: {case['input']}")
            output = run_behavior(behavior_id, case["input"])
            result = simple_check(output, case["should_contain"])
            print("\nOutput:")
            print(output)
            print("\nRegression Check:")
            print(f"Passed: {result['passed']}")
            if result["missing_terms"]:
                print(f"Missing terms: {result['missing_terms']}")
            else:
                print("All expected terms found.")
Example Output
################################################################################
Running behavior: triage_agent:v1
################################################################################
--------------------------------------------------------------------------------
Test case: case_001
Input: My package was supposed to arrive today, but the tracking hasn't updated. I need it for an event tomorrow morning.
Output:
- The customer reports that a package expected today has not arrived.
- The tracking information has not been updated.
- The package is needed for an event tomorrow morning.
Regression Check:
Passed: True
All expected terms found.
################################################################################
Running behavior: triage_agent:v2
################################################################################
--------------------------------------------------------------------------------
Test case: case_001
Input: My package was supposed to arrive today, but the tracking hasn't updated. I need it for an event tomorrow morning.
Output:
- Delivery issue: the package expected today has not arrived and tracking is not updating.
- Urgency: the package is needed by tomorrow morning for an event.
- Deadline impact: failure to deliver on time affects a near-term commitment.
Regression Check:
Passed: True
All expected terms found.
9. Designing Better Regression Tests
The previous example used simple keyword checks. In real projects, evaluation can be stronger.
Better evaluation dimensions
You can evaluate for:
- factual coverage
- format compliance
- safety compliance
- tool usage correctness
- escalation correctness
- brevity or verbosity
- presence of required labels such as “Urgency”
Example criteria table
| Criterion | Description | Example Check |
|---|---|---|
| Format | Must return exactly 3 bullets | Count lines beginning with - |
| Urgency detection | Must mention urgency when deadline exists | Search for “urgency” or “urgent” |
| Billing recognition | Must identify payment issue | Search for billing-related concepts |
| Escalation policy | Must recommend escalation for severe issues | Search for “escalate” |
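The checks in the table can be implemented as small predicate functions. A minimal illustration; the function names and the sample output below are invented for the example:

```python
def check_format(output: str) -> bool:
    """Format: must return exactly 3 bullet lines."""
    bullets = [line for line in output.splitlines() if line.strip().startswith("-")]
    return len(bullets) == 3


def check_urgency(output: str) -> bool:
    """Urgency detection: must mention urgency when a deadline exists."""
    text = output.lower()
    return "urgency" in text or "urgent" in text


def check_escalation(output: str) -> bool:
    """Escalation policy: must recommend escalation for severe issues."""
    return "escalate" in output.lower()


# A hypothetical model output used to exercise the checks.
sample = (
    "- Urgency: the package is needed by tomorrow morning.\n"
    "- Tracking has not updated since yesterday.\n"
    "- Recommend we escalate to the shipping team."
)
```

Keyword predicates like these are deliberately crude; they catch regressions cheaply, and stronger semantic checks can be layered on later.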
10. Practical Versioning Strategies
Strategy A: Store versions in Python constants
Good for:
- tutorials
- prototypes
- small internal apps

Pros:
- simple
- readable
- easy to diff in Git

Cons:
- less friendly for non-developers
Strategy B: Store versions in JSON or YAML
Good for:
- teams with prompt editors
- externalized configuration
- CI/CD pipelines
Example YAML shape:
name: support_summary
version: v2
system_prompt: >
  You are a customer support assistant.
  Summarize the customer's message in exactly 3 concise bullet points.
  If urgency is present, mention it clearly.
metadata:
  owner: support-ai-team
  reason: add urgency detection
Strategy C: Version by Directory Structure
Example:
prompts/
  support_summary/
    v1.txt
    v2.txt
behaviors/
  triage_agent/
    v1.json
    v2.json
This works especially well with Git.
11. Recommended Workflow for Teams
A practical workflow:
- Create a new version instead of editing the old one silently
- Record the reason for the change
- Run benchmark cases across old and new versions
- Compare outputs
- Approve and promote only after review
Suggested release note template
Artifact: triage_agent
New version: v2
Changed from: v1
Reason:
- Add urgency detection
- Better mention deadline-related risk
Expected improvements:
- More consistent handling of time-sensitive delivery issues
Risks:
- Slightly more verbose summaries
12. Common Pitfalls
Pitfall 1: Changing multiple variables at once
If you change:
- the prompt
- the model
- the behavior policy
all at once, you will not know which change caused the difference in results.
Better: change one major variable at a time when possible.
Pitfall 2: No representative test set
If you only test one happy-path example, your version comparison is weak.
Better: include:
- straightforward examples
- ambiguous examples
- edge cases
- failure-prone inputs
Pitfall 3: Overfitting prompts to tests
If you optimize too narrowly for benchmark cases, the agent may degrade on real-world data.
Better: maintain a growing and diverse evaluation set.
Pitfall 4: Versioning prompts but not behavior policy
For agents, important behavior often lives outside the prompt:
- routing
- tool use policy
- fallback logic
These should also be versioned.
13. Mini Project Challenge
Task
Build a small “prompt lab” for a support summarization assistant.
Requirements
- Create 2 prompt versions
- Create 2 behavior versions
- Use gpt-5.4-mini
- Add at least 5 test cases
- Print:
- version id
- input
- output
- simple regression result
Stretch Goal
Add a format check that verifies the model returns exactly 3 bullets.
Example helper function
def count_bullets(text: str) -> int:
    """
    Count lines that appear to be markdown-style bullet points.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("-"))
14. Wrap-up
Versioning in GenAI systems is about making behavior reproducible, reviewable, and testable.
Key takeaways
- Prompts should be versioned like code
- Model configuration changes should be tracked explicitly
- Agent behavior includes more than prompts
- Evaluation is essential before promoting a new version
- A simple Python harness can go a long way in preventing regressions
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
End-of-Session Checklist
By the end of this session, you should be able to:
- [ ] Explain why prompts, models, and agent behaviors should be versioned
- [ ] Store prompt versions separately from core application logic
- [ ] Compare two prompt versions using the Responses API
- [ ] Build a simple regression harness for agent behaviors
- [ ] Identify and reduce behavior drift in an LLM application
Suggested Homework
- Take an existing GenAI script you have written and extract the prompt into a versioned config.
- Create v1 and v2 variants of the prompt.
- Write 5 test inputs and compare outputs across versions.
- Document which version you would ship and why.