Session 4: Balancing Flexibility, Cost, and Control

Synopsis

Examines the tradeoffs between open-ended autonomous behavior and tightly controlled workflows. Learners evaluate agent architectures in terms of reliability, latency, traceability, and operational cost.

Session Content

Session Overview

In this session, learners will explore one of the most important practical trade-offs in GenAI application development: how to balance model capability, latency, cost, and output control. By the end of the session, learners will understand how prompt design, model configuration, response constraints, and application architecture affect both user experience and operating cost.

This session combines theory with hands-on Python exercises using the OpenAI Python SDK and the Responses API.

Duration

~45 minutes

Learning Objectives

By the end of this session, learners should be able to:

  • Explain the trade-offs between flexibility, cost, and control in LLM-powered systems
  • Choose appropriate model settings for a given task
  • Use prompt constraints to improve reliability and reduce unnecessary output
  • Build Python scripts that compare different generation strategies
  • Structure prompts and outputs to support predictable application behavior
  • Evaluate when to use stricter output formats versus open-ended generation

Agenda

  1. Why trade-offs matter in GenAI systems
  2. Flexibility vs. control in prompt and output design
  3. Cost drivers: tokens, verbosity, retries, and model selection
  4. Practical strategies for balancing these dimensions
  5. Hands-on Exercise 1: Compare loose vs. constrained prompting
  6. Hands-on Exercise 2: Build a cost-aware summarization utility
  7. Best practices checklist
  8. Mini reflection exercise
  9. Recap and discussion

1. Why Trade-offs Matter in GenAI Systems

When building with LLMs, you are rarely optimizing for just one thing.

A highly flexible system may:

  • generate creative and rich responses
  • adapt well to varied user input
  • be useful in exploratory workflows

But it may also:

  • produce inconsistent formats
  • be harder to validate programmatically
  • use more tokens than necessary
  • require retries or post-processing

A highly controlled system may:

  • return predictable outputs
  • be easier to integrate with software systems
  • reduce parsing failures
  • lower downstream engineering complexity

But it may also:

  • feel less natural
  • be too rigid for ambiguous tasks
  • omit nuance or useful context

A cost-optimized system may:

  • keep token usage lower
  • reduce latency in some workflows
  • support production scalability

But if pushed too far, it may:

  • truncate useful information
  • reduce answer quality
  • increase implementation complexity through over-optimization

Key Design Question

For each AI feature, ask:

How much flexibility do users need, how much control does the application need, and what cost profile is acceptable?


2. Flexibility vs. Control in Prompt and Output Design

Flexible Prompting

Flexible prompting is useful when:

  • users are brainstorming
  • the task is creative
  • multiple good answers are acceptable
  • the output is intended for humans, not software systems

Example:

Summarize this article for a product team.

This is simple and useful, but the output may vary in:

  • length
  • structure
  • tone
  • level of detail

Controlled Prompting

Controlled prompting is useful when:

  • output must be parsed by code
  • a consistent display format is required
  • business rules must be followed
  • the risk of ambiguity must be reduced

Example:

Summarize this article in exactly 3 bullet points:
- Main problem
- Proposed solution
- Key risk

Keep each bullet under 20 words.

This improves consistency and often reduces tokens.
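Constraints like these are most useful when the application can also verify them. A minimal sketch (pure Python, no API call; the function name and thresholds are illustrative, matching the 3-bullet, under-20-word contract above):

```python
def validate_bullet_summary(text: str, expected_bullets: int = 3, max_words: int = 20) -> bool:
    """Return True if `text` contains exactly `expected_bullets` bullet lines,
    each starting with "-" and under `max_words` words."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != expected_bullets:
        return False
    for line in lines:
        if not line.startswith("-"):
            return False
        if len(line.lstrip("- ").split()) >= max_words:
            return False
    return True
```

A failed check can trigger a single retry or a fallback message instead of a silent parsing error downstream.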

Common Control Mechanisms

You can increase control using:

  • explicit output instructions
  • step-by-step task framing
  • format restrictions
  • length constraints
  • role framing
  • examples
  • schema-driven output strategies

Important Principle

More control in the prompt usually means:

  • easier downstream handling
  • fewer retries
  • lower post-processing cost

But sometimes also:

  • less expressive results
  • more prompt engineering effort upfront


3. Cost Drivers in LLM Applications

Cost in LLM systems is driven by more than just “one API call.”

Major Cost Factors

1. Input tokens

Large prompts, long instructions, and excessive context increase cost.

2. Output tokens

Verbose answers often cost more than necessary.

3. Retries

If outputs are inconsistent and need regeneration, cost multiplies quickly.

4. Chained calls

Pipelines with classification, extraction, summarization, and validation may improve quality but increase overall spend.

5. Model selection

Different models have different capability/cost/latency trade-offs.

6. Over-contexting

Passing too much irrelevant context makes the model slower, costlier, and sometimes less accurate.

Typical Cost Mistakes

  • asking for long explanations when short answers are enough
  • using open-ended prompts for structured tasks
  • repeating large instruction blocks in every request unnecessarily
  • failing to constrain output length
  • making separate calls where one carefully designed call would work
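To reason about these drivers concretely, it helps to estimate cost per request. A sketch with hypothetical prices (the numbers below are placeholders for illustration only, not real pricing; check your provider's current price list):

```python
# Hypothetical per-million-token prices, for illustration only.
PRICE_PER_M_INPUT = 0.40
PRICE_PER_M_OUTPUT = 1.60

def estimate_cost(input_tokens: int, output_tokens: int, retries: int = 0) -> float:
    """Estimate request cost in dollars. Each retry repeats the whole request."""
    calls = 1 + retries
    cost = calls * (
        input_tokens * PRICE_PER_M_INPUT / 1_000_000
        + output_tokens * PRICE_PER_M_OUTPUT / 1_000_000
    )
    return round(cost, 6)
```

Note how a single retry doubles the whole request, which is why improving first-pass reliability often saves more than shaving a few input tokens.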

4. Practical Strategies for Balancing Flexibility, Cost, and Control

Strategy 1: Match Prompt Style to Task Type

Task Type                          Recommended Style
---------------------------------  ---------------------
Brainstorming                      Flexible
Classification                     Controlled
Data extraction                    Highly controlled
Summarization for internal users   Moderately controlled
JSON/API integration               Strictly controlled

Strategy 2: Ask for Only What You Need

Bad:

Please provide a comprehensive analysis, detailed explanation, contextual background, strategic insight, examples, and implementation notes.

Better:

Provide a 2-sentence summary and 3 action items.

Strategy 3: Reduce Output Variability

If your application expects a specific shape, say so clearly.

Examples:

  • “Return exactly 3 bullets”
  • “Use one sentence per item”
  • “Respond with a JSON object”
  • “If unknown, say ‘unknown’”

Strategy 4: Design for Fewer Retries

Retries are expensive. Better prompting often saves more than model downgrades.

Add constraints like:

  • required sections
  • max length
  • allowed labels
  • fallback instructions for uncertain cases
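One way to design for fewer retries is to pair generation with validation in a small wrapper. A generic sketch (all names are illustrative; `generate` can wrap any API call, so the usage below substitutes a stub):

```python
from typing import Callable

def generate_with_validation(
    generate: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_retries: int = 2,
) -> str:
    """Call `generate` until `is_valid` accepts the output or retries run out.
    Returns the last attempt either way, so the caller can log or fall back."""
    attempt = ""
    for _ in range(1 + max_retries):
        attempt = generate()
        if is_valid(attempt):
            return attempt
    return attempt

# Stubbed usage: the first "model output" fails validation, the second passes.
outputs = iter(["free-form text without a label", "label: positive"])
result = generate_with_validation(lambda: next(outputs), lambda s: s.startswith("label:"))
print(result)  # label: positive
```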

Strategy 5: Separate Human-Facing and Machine-Facing Outputs

If users need rich prose but your system needs structure, generate both intentionally.

For example:

  • short structured summary for the system
  • optional longer explanation for the user
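One way to get both from a single call is to have the model label each part and split them in code. A sketch that assumes the prompt asked for "SUMMARY:" and "DETAILS:" section labels (a convention invented here for illustration):

```python
def split_sections(text: str) -> dict[str, str]:
    """Split a response into labeled sections. Assumes the prompt instructed
    the model to start each part with 'SUMMARY:' or 'DETAILS:' on its own line."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        upper = stripped.upper()
        if upper.startswith("SUMMARY:"):
            current = "summary"
            sections[current] = stripped[len("SUMMARY:"):].strip()
        elif upper.startswith("DETAILS:"):
            current = "details"
            sections[current] = stripped[len("DETAILS:"):].strip()
        elif current:
            # Continuation lines belong to the most recent section.
            sections[current] = (sections[current] + " " + stripped).strip()
    return sections
```

The structured part can then feed the system while the prose part is shown to the user.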

Strategy 6: Evaluate Trade-offs with Real Examples

The right balance depends on:

  • task complexity
  • user expectations
  • UI constraints
  • budget
  • acceptable error rate

Do not optimize blindly. Measure.


5. Hands-on Exercise 1: Compare Loose vs. Constrained Prompting

Goal

Learn how prompt constraints affect consistency, output shape, and likely token usage.

What You Will Build

A Python script that sends the same source text to the model twice:

  1. once with a loose summarization prompt
  2. once with a constrained summarization prompt

You will compare the outputs.


Step 1: Install Dependencies

pip install openai python-dotenv

Step 2: Set Your API Key

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Step 3: Python Script

"""
Exercise 1: Compare loose vs. constrained prompting using the OpenAI Responses API.

This script demonstrates how prompt specificity affects output style,
consistency, and likely token usage.

Requirements:
    pip install openai python-dotenv

Environment:
    OPENAI_API_KEY must be set in a .env file or environment variable.
"""

from dotenv import load_dotenv
from openai import OpenAI


# Load environment variables from .env
load_dotenv()

# Create the OpenAI client.
# The SDK automatically reads OPENAI_API_KEY from the environment.
client = OpenAI()

# Use the model required for this course session.
MODEL_NAME = "gpt-5.4-mini"

# Example source text for summarization.
ARTICLE_TEXT = """
Acme Analytics launched a new internal reporting assistant for business teams.
The first version was highly flexible and could answer a wide variety of questions,
but users reported inconsistent formatting and occasionally overly long responses.
The engineering team then introduced prompt constraints, output templates, and
length guidance. This improved reliability and reduced the need for manual cleanup.
However, some users felt the system became less expressive for exploratory analysis.
The team now plans to use different prompting strategies depending on whether the
task is reporting, brainstorming, or data extraction.
""".strip()


def get_text_response(response) -> str:
    """
    Safely extract text output from a Responses API result.

    Many examples expose `response.output_text`, which is the simplest way
    to access the model's text when available.
    """
    return getattr(response, "output_text", "").strip()


def run_prompt(prompt_name: str, prompt_text: str) -> str:
    """
    Send a single prompt to the model and return the generated text.
    """
    response = client.responses.create(
        model=MODEL_NAME,
        input=prompt_text,
    )

    output = get_text_response(response)

    print(f"\n--- {prompt_name} ---")
    print(output if output else "[No text output returned]")
    return output


def main() -> None:
    """
    Run both prompt variants and print their outputs.
    """
    loose_prompt = f"""
Summarize the following text for a product team:

{ARTICLE_TEXT}
""".strip()

    constrained_prompt = f"""
Summarize the following text for a product team.

Requirements:
- Return exactly 3 bullet points
- Cover: problem, improvement, trade-off
- Keep each bullet under 18 words
- Be concise and practical

Text:
{ARTICLE_TEXT}
""".strip()

    print("Comparing loose vs. constrained prompting...\n")
    run_prompt("Loose Prompt Output", loose_prompt)
    run_prompt("Constrained Prompt Output", constrained_prompt)


if __name__ == "__main__":
    main()

Example Output

Comparing loose vs. constrained prompting...


--- Loose Prompt Output ---
The team launched an internal reporting assistant that was powerful but inconsistent in formatting and often too verbose. By adding prompt constraints and output templates, they improved reliability and reduced cleanup, though some users found the tool less useful for open-ended exploration. The team now wants to tailor prompting strategies to different task types such as reporting, brainstorming, and extraction.

--- Constrained Prompt Output ---
- Problem: Reporting assistant was flexible but inconsistent and often too verbose
- Improvement: Templates and length constraints improved reliability and reduced cleanup
- Trade-off: Greater control reduced expressiveness for exploratory use

Discussion Questions

  • Which output would be easier to display in a dashboard?
  • Which would be easier to parse in code?
  • Which is likely to use fewer output tokens?
  • When would the looser version be more useful?

Key Takeaway

Prompt constraints often improve consistency and reduce unnecessary verbosity, which helps with both control and cost.


6. Hands-on Exercise 2: Build a Cost-Aware Summarization Utility

Goal

Create a utility that produces summaries in different modes:

  • brief
  • standard
  • structured

This simulates a real product decision: varying output style depending on business needs.

What You Will Learn

  • how to expose flexibility intentionally
  • how to control output based on application context
  • how to avoid one-size-fits-all generation patterns

Python Script

"""
Exercise 2: Build a cost-aware summarization utility.

This script demonstrates how an application can choose different prompt
strategies depending on the desired trade-off between flexibility, brevity,
and structure.

Requirements:
    pip install openai python-dotenv

Environment:
    OPENAI_API_KEY must be set in a .env file or environment variable.
"""

from dotenv import load_dotenv
from openai import OpenAI


# Load environment variables from .env
load_dotenv()

# Initialize API client
client = OpenAI()

MODEL_NAME = "gpt-5.4-mini"

TEXT_TO_SUMMARIZE = """
The customer success team receives hundreds of support conversations every week.
Managers want quick summaries for reporting, while agents want more descriptive
notes they can use during handoff. Finance has also asked the engineering team
to reduce AI costs in high-volume workflows. As a result, the team is testing
multiple summary styles: a one-sentence brief mode, a standard mode for human
readability, and a structured mode for analytics dashboards.
""".strip()


def get_text_response(response) -> str:
    """
    Extract plain text from the Responses API result.
    """
    return getattr(response, "output_text", "").strip()


def build_prompt(text: str, mode: str) -> str:
    """
    Create a mode-specific prompt.

    Modes:
        - brief: cheapest and shortest
        - standard: balanced for human reading
        - structured: predictable format for systems
    """
    if mode == "brief":
        return f"""
Summarize the text below in exactly one sentence.
Keep it under 25 words.

Text:
{text}
""".strip()

    if mode == "standard":
        return f"""
Summarize the text below for an internal product team.
Write 2 to 3 sentences. Be clear and concise.

Text:
{text}
""".strip()

    if mode == "structured":
        return f"""
Summarize the text below using exactly these fields:

Audience:
Need:
Constraint:
Approach:

Keep each field on its own line.
Keep each value under 12 words.

Text:
{text}
""".strip()

    raise ValueError(f"Unsupported mode: {mode}")


def summarize(text: str, mode: str) -> str:
    """
    Generate a summary using the selected prompt mode.
    """
    prompt = build_prompt(text, mode)

    response = client.responses.create(
        model=MODEL_NAME,
        input=prompt,
    )

    return get_text_response(response)


def main() -> None:
    """
    Run all supported summarization modes and display their results.
    """
    modes = ["brief", "standard", "structured"]

    print("Cost-aware summarization utility")
    print("=" * 35)

    for mode in modes:
        print(f"\nMode: {mode}")
        print("-" * 20)
        result = summarize(TEXT_TO_SUMMARIZE, mode)
        print(result if result else "[No text output returned]")


if __name__ == "__main__":
    main()

Example Output

Cost-aware summarization utility
===================================

Mode: brief
--------------------
The team is testing multiple summary formats to balance usability and AI cost across high-volume support workflows.

Mode: standard
--------------------
The customer success organization needs different summary formats for different stakeholders. Managers want quick reporting views, agents need richer handoff notes, and finance wants lower AI costs, so the team is testing brief, standard, and structured summary styles.

Mode: structured
--------------------
Audience: Managers, agents, and finance
Need: Reporting, handoff clarity, cost efficiency
Constraint: High-volume workflow costs
Approach: Test multiple summary styles

Extension Challenge

Modify the script so that:

  • brief is used for high-volume events
  • standard is used for analyst review
  • structured is used when the result will feed another system

You can implement a small routing function such as:

def choose_mode(use_case: str) -> str:
    if use_case == "dashboard":
        return "structured"
    if use_case == "handoff":
        return "standard"
    return "brief"

7. Best Practices Checklist

Use this checklist when designing GenAI features:

For Flexibility

  • Use open-ended prompts only when variability is acceptable
  • Allow richer responses for creative or exploratory tasks
  • Match response style to user expectations

For Cost

  • Keep prompts concise
  • Limit output length where appropriate
  • Avoid unnecessary retries
  • Reuse patterns that produce reliable first-pass outputs
  • Do not send more context than the task needs

For Control

  • Specify required structure
  • Use explicit constraints
  • Make fallback behavior clear
  • Prefer predictable output shapes for software integration
  • Test prompts against realistic input variations

8. Mini Reflection Exercise

Answer these questions individually or in pairs:

  1. Name one feature where flexibility matters more than control.
  2. Name one feature where control matters more than flexibility.
  3. How can better prompt design reduce cost without changing the model?
  4. What are the risks of over-constraining outputs?
  5. If an output will be parsed by code, what prompt changes would you make?

9. Recap

In this session, you learned that GenAI application design is an exercise in trade-offs.

You explored how:

  • flexibility supports creativity and exploration
  • control supports reliability and software integration
  • cost is influenced by prompt size, output length, retries, and architecture

You also practiced using the OpenAI Responses API in Python to:

  • compare loose and constrained prompting
  • build a multi-mode summarization tool aligned with different product needs

The core lesson is simple:

Don’t optimize prompts for “best possible output” in the abstract. Optimize for the needs of the task, the user, and the system.


Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • Python dotenv: https://pypi.org/project/python-dotenv/

Suggested Homework

Build a small Python CLI tool that accepts:

  • an input text file
  • a mode (brief, standard, or structured)

The tool should:

  • read the file
  • generate a summary with gpt-5.4-mini
  • print the result
  • explain in a comment which mode is best for which business scenario

Optional enhancement:

  • log summary mode and output length to compare operating patterns over time
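A possible starting skeleton for the homework (structure only; file and function names are illustrative, and the model call is left as a TODO so you wire in the Exercise 2 pattern yourself):

```python
"""Starter skeleton for the homework CLI (structure only).
Plug the Responses API call from Exercise 2 into `summarize_file`."""
import argparse

MODES = ("brief", "standard", "structured")

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Cost-aware summarization CLI")
    parser.add_argument("input_file", help="Path to a UTF-8 text file to summarize")
    parser.add_argument(
        "--mode",
        choices=MODES,
        default="brief",
        help="brief: high-volume workflows; standard: human review; "
             "structured: output feeds another system",
    )
    return parser

def summarize_file(path: str, mode: str) -> str:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # TODO: build a mode-specific prompt and call client.responses.create(...)
    # exactly as in Exercise 2. Placeholder keeps the skeleton runnable:
    return f"[{mode}] would summarize {len(text)} characters"

def main() -> None:
    args = build_parser().parse_args()
    print(summarize_file(args.input_file, args.mode))

# To run as a script, add:
#     if __name__ == "__main__":
#         main()
# then invoke, e.g.:  python summarize_cli.py notes.txt --mode structured
```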

