
Session 2: Latency, Throughput, and Cost Optimization

Synopsis

Covers practical methods for improving response times and reducing operating costs, including caching, batching, model selection strategies, response truncation, and workflow simplification. Learners balance product quality with operational efficiency.

Session Content


Session Overview

Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Goal: Learn how to optimize GenAI applications for speed, scalability, and cost efficiency using practical techniques with the OpenAI Responses API and the gpt-5.4-mini model.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain the difference between latency, throughput, and cost in GenAI systems
  • Identify common performance bottlenecks in LLM-powered applications
  • Use practical techniques to reduce response time and control token usage
  • Implement batching, caching, and prompt optimization in Python
  • Measure API usage and reason about cost/performance trade-offs
  • Build a simple optimization workflow for iterative improvement

1. Why Optimization Matters in GenAI Applications

Large language model applications are powerful, but they can become:

  • Slow for users if responses take too long
  • Expensive if prompts and outputs are too large
  • Hard to scale if too many requests arrive at once

Optimization is about balancing three competing factors:

1.1 Latency

Latency is how long a single request takes from start to finish.

Examples:

  • A chatbot reply taking 8 seconds feels slow
  • A suggestion generated in 1.2 seconds feels responsive

Latency matters most for:

  • Chat interfaces
  • Real-time assistants
  • Interactive coding tools
  • Customer support workflows

1.2 Throughput

Throughput is how much work your system can handle over time.

Examples:

  • 10 requests per second
  • 5,000 document summaries per hour

Throughput matters most for:

  • Batch processing
  • Content pipelines
  • Document enrichment
  • Background jobs

1.3 Cost

Cost is driven primarily by:

  • Number of requests
  • Input tokens
  • Output tokens
  • Model choice
  • Retries and failed requests

Cost matters for:

  • Production systems
  • High-volume workloads
  • Long prompts or long outputs
  • Multi-step agent workflows

1.4 The Core Trade-off

You usually cannot optimize all three dimensions perfectly at once.

Examples:

  • Lower latency may require smaller prompts or shorter outputs
  • Higher throughput may require batching or asynchronous processing
  • Lower cost may require smaller models or more constrained responses

A good GenAI engineer chooses the right trade-off for the product.


2. Common Performance Bottlenecks

Before optimizing, identify where time and money are being spent.

2.1 Large Prompts

Long prompts increase:

  • request payload size
  • token processing time
  • cost

Common causes:

  • repeated instructions
  • unnecessary context
  • including full documents when only excerpts are needed

2.2 Large Outputs

If you ask for:

  • essays when bullet points would do
  • verbose explanations for internal pipelines
  • full JSON schemas when only a few fields are needed

...you increase latency and cost.

2.3 Too Many Sequential Calls

A common anti-pattern:

  1. classify
  2. summarize
  3. extract entities
  4. rewrite
  5. validate

If done sequentially, latency compounds quickly.

2.4 No Caching

If the same prompt or near-identical request is repeated, you may be paying repeatedly for the same answer.

2.5 Poor Retry or Concurrency Strategy

  • Too many retries increase cost
  • Too little concurrency reduces throughput
  • Too much concurrency can create rate-limit issues

3. Optimization Techniques: Theory

3.1 Prompt Compression

Reduce prompt size without losing meaning.

Instead of:

  • long background paragraphs
  • repeated policy reminders
  • excessive examples

Prefer:

  • short, direct instructions
  • structured input
  • reusable system guidance patterns
  • only relevant context snippets

Example

Verbose prompt:

You are an expert assistant helping our support team classify incoming customer support tickets. Please carefully read the ticket and then determine whether it belongs to billing, technical support, account access, or general inquiry. Make sure your answer is concise and helpful for downstream systems.

Compressed prompt:

Classify the support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.

Same task, lower token use.
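To make "lower token use" concrete before calling any API, you can compare rough token estimates of the two prompt styles. The ~4-characters-per-token rule below is a crude assumption for English prose, not the model's real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English prose).

    This heuristic is an assumption for illustration; use the model's actual
    tokenizer when you need accurate counts.
    """
    return max(1, len(text) // 4)

verbose = (
    "You are an expert assistant helping our support team classify incoming "
    "customer support tickets. Please carefully read the ticket and then "
    "determine whether it belongs to billing, technical support, account "
    "access, or general inquiry. Make sure your answer is concise and "
    "helpful for downstream systems."
)
concise = (
    "Classify the support ticket into one of:\n"
    "billing, technical_support, account_access, general_inquiry.\n"
    "Return only the label."
)

print("verbose ~tokens:", estimate_tokens(verbose))
print("concise ~tokens:", estimate_tokens(concise))
```

Even this rough estimate shows the compressed prompt is several times cheaper per request.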


3.2 Output Constraining

Reduce verbosity by explicitly controlling the expected format.

Useful techniques:

  • “Return only JSON”
  • “Use max 3 bullet points”
  • “Answer in one sentence”
  • “Return only the category label”

This improves:

  • latency
  • cost
  • machine-readability


3.3 Model Selection

Not every task needs the largest or most expensive model.

Use smaller/faster models when:

  • the task is simple
  • the prompt is well-structured
  • perfect nuance is not necessary
  • the workload is high-volume

Examples:

  • classification
  • routing
  • tagging
  • extraction from clean text
  • simple summarization

For this session, exercises use gpt-5.4-mini, which is a practical choice for lightweight production tasks.


3.4 Caching

Cache results when:

  • inputs repeat exactly
  • users retry the same request
  • many documents contain identical text snippets
  • prompts are deterministic enough

Types of cache:

  • in-memory dictionary
  • local file cache
  • Redis
  • application-level memoization

Caching reduces:

  • latency
  • API load
  • cost


3.5 Batching and Parallelism

For high-throughput workloads:

  • process items concurrently
  • avoid waiting for one request before starting the next

Use:

  • asyncio
  • worker pools
  • queue-based processing

Note:

  • More concurrency is not always better
  • Respect rate limits and add backoff logic


3.6 Token Budgeting

A token budget is an explicit limit on:

  • how much input context you send
  • how much output you request

Examples:

  • truncate documents to relevant sections
  • summarize before passing to later stages
  • retrieve top-k relevant chunks only
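A minimal sketch of the first and third ideas: character-based budgeting plus naive top-k chunk selection. The keyword-overlap scoring below is an illustrative stand-in for real retrieval, and all function names are hypothetical:

```python
def top_k_chunks(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by naive keyword overlap with the query; keep the top k."""
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

def fit_to_budget(chunks: list[str], budget_chars: int) -> list[str]:
    """Keep chunks in order until the character budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

document = [
    "Refund policy: duplicate charges are refunded within 5 business days.",
    "Our office dog is named Biscuit and enjoys long walks.",
    "Billing disputes should include the invoice ID.",
]

relevant = top_k_chunks(document, "refund duplicate charge invoice", k=2)
context = fit_to_budget(relevant, budget_chars=200)
print(context)
```

Only the refund and billing chunks survive selection, so the prompt carries relevant context instead of the whole document.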


3.7 Avoiding Unnecessary Multi-Step Pipelines

Sometimes one well-designed prompt can replace multiple model calls.

Instead of:

  • summarize
  • then classify the summary
  • then extract action items

Try:

  • one prompt that returns the summary, category, and action items together in JSON

This can significantly reduce total latency and cost.


4. Measuring Performance in Python

Optimization without measurement is guesswork.

You should measure at least:

  • wall-clock response time
  • approximate input/output size
  • request success/failure
  • cache hit/miss
  • requests per second in batch jobs
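A minimal in-process helper covering these measurements might look like the sketch below. The class name and structure are illustrative choices, not a library API:

```python
import time

class RequestMetrics:
    """Tracks latency, failures, and cache behavior for a batch of calls."""

    def __init__(self) -> None:
        self.latencies: list[float] = []
        self.failures = 0
        self.cache_hits = 0
        self.cache_misses = 0

    def record(self, elapsed_seconds: float, ok: bool = True) -> None:
        self.latencies.append(elapsed_seconds)
        if not ok:
            self.failures += 1

    @property
    def average_latency(self) -> float:
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0

    @property
    def requests_per_second(self) -> float:
        total = sum(self.latencies)
        return len(self.latencies) / total if total else 0.0

metrics = RequestMetrics()
for _ in range(3):
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for an API call
    metrics.record(time.perf_counter() - start)

print(f"avg latency: {metrics.average_latency:.4f}s")
print(f"throughput:  {metrics.requests_per_second:.1f} req/s")
```

Wrap each API call with `time.perf_counter()` and a `record(...)` call, and you get averages and request rates for free at the end of a run.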

5. Hands-On Exercise 1: Measure Latency and Basic Token-Aware Design

Objective

Create a small script that:

  • calls gpt-5.4-mini
  • measures request latency
  • compares a verbose prompt vs a concise prompt

What You’ll Learn

  • How to use the OpenAI Responses API in Python
  • How prompt length affects practical performance
  • How to build a simple benchmark loop

Setup

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Code

import os
import time
from openai import OpenAI

# Create the OpenAI client using the API key from the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# A sample support ticket to classify.
ticket_text = """
Hi team, I was charged twice for my monthly subscription.
Can you refund the extra payment? My invoice ID is INV-4821.
"""

# A verbose version of the prompt.
verbose_prompt = f"""
You are an expert AI assistant working with a customer support operations team.
Your task is to carefully analyze the incoming support message and determine
which category best applies to it. The available categories are:
billing, technical_support, account_access, and general_inquiry.

Please read the customer message thoroughly and think carefully before deciding.
Then provide the single best category for this ticket in a concise form that
can be used by downstream systems.

Customer message:
{ticket_text}
"""

# A concise version of the prompt.
concise_prompt = f"""
Classify this support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.

Ticket:
{ticket_text}
"""

def run_prompt(prompt_name: str, prompt_text: str) -> None:
    """
    Sends a prompt to the Responses API, measures wall-clock latency,
    and prints the result in a compact format.
    """
    start = time.perf_counter()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text,
    )

    elapsed = time.perf_counter() - start

    # output_text provides the model's generated text in a convenient form.
    print(f"--- {prompt_name} ---")
    print("Output:", response.output_text.strip())
    print(f"Latency: {elapsed:.2f} seconds")
    print(f"Prompt length (characters): {len(prompt_text)}")
    print()

if __name__ == "__main__":
    run_prompt("Verbose Prompt", verbose_prompt)
    run_prompt("Concise Prompt", concise_prompt)

Example Output

--- Verbose Prompt ---
Output: billing
Latency: 1.42 seconds
Prompt length (characters): 654

--- Concise Prompt ---
Output: billing
Latency: 1.08 seconds
Prompt length (characters): 186

Discussion

Observe:

  • both prompts solve the same task
  • the concise prompt is easier to parse
  • shorter prompts often improve speed and reduce cost

Mini Challenge

Modify the script to test 5 different tickets and compute average latency for each prompt style.


6. Hands-On Exercise 2: Reduce Output Size with Structured Constraints

Objective

Compare a free-form summarization prompt with a constrained prompt.

What You’ll Learn

  • How output constraints reduce excess verbosity
  • Why structured outputs are useful for downstream systems

Code

import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

article = """
Our team launched a new analytics dashboard for small business users.
The dashboard includes revenue trends, customer retention charts,
and weekly performance alerts. Early users reported that the product
is easy to navigate, but some requested CSV export support and more
customizable reporting filters. The product team plans to release
those enhancements next quarter.
"""

prompts = {
    "free_form": f"""
Summarize the following product update for an internal audience:

{article}
""",
    "constrained": f"""
Summarize the product update below in exactly 3 bullet points.
Each bullet must be under 12 words.

Text:
{article}
"""
}

for name, prompt in prompts.items():
    start = time.perf_counter()

    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )

    elapsed = time.perf_counter() - start

    print(f"--- {name} ---")
    print(response.output_text.strip())
    print(f"Latency: {elapsed:.2f} seconds")
    print()

Example Output

--- free_form ---
The team launched a new analytics dashboard for small business users that includes revenue trends, customer retention charts, and weekly alerts. Early feedback has been positive, especially around usability, though users want CSV export support and more customizable filters. Those improvements are planned for next quarter.
Latency: 1.31 seconds

--- constrained ---
- New dashboard launched for small business analytics.
- Users liked usability and requested export features.
- More filters and CSV export arrive next quarter.
Latency: 1.02 seconds

Key Takeaway

Smaller outputs often mean:

  • lower cost
  • lower latency
  • easier parsing
  • less post-processing


7. Hands-On Exercise 3: Add a Simple Cache

Objective

Avoid repeated API calls for identical requests.

What You’ll Learn

  • How caching reduces cost and latency
  • How to use a deterministic cache key

Code

import os
import hashlib
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple in-memory cache for demonstration.
response_cache = {}

def make_cache_key(prompt: str) -> str:
    """
    Create a stable hash-based cache key from the prompt text.
    This avoids using very long strings directly as dictionary keys.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def get_or_create_response(prompt: str) -> str:
    """
    Return a cached response if available; otherwise call the API and cache it.
    """
    key = make_cache_key(prompt)

    if key in response_cache:
        print("Cache hit")
        return response_cache[key]

    print("Cache miss")
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )
    text = response.output_text.strip()
    response_cache[key] = text
    return text

if __name__ == "__main__":
    prompt = """
Classify the following support issue into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.

Issue:
I cannot log into my account after resetting my password.
"""

    for attempt in range(2):
        start = time.perf_counter()
        result = get_or_create_response(prompt)
        elapsed = time.perf_counter() - start

        print(f"Attempt {attempt + 1}: {result}")
        print(f"Elapsed: {elapsed:.4f} seconds")
        print()

Example Output

Cache miss
Attempt 1: account_access
Elapsed: 1.1842 seconds

Cache hit
Attempt 2: account_access
Elapsed: 0.0001 seconds

Discussion

In production, replace the in-memory cache with:

  • Redis
  • SQLite
  • a persistent key-value store
  • HTTP cache layers for stable tasks

Important Note

Caching is best when:

  • prompts are deterministic
  • repeated requests are common
  • stale outputs are acceptable for some time window


8. Hands-On Exercise 4: Improve Throughput with Async Concurrency

Objective

Process multiple independent tasks concurrently.

What You’ll Learn

  • How throughput differs from single-request latency
  • How to use asyncio with the OpenAI Python SDK
  • Why concurrency helps batch workloads

Code

import os
import time
import asyncio
from openai import AsyncOpenAI

# Async client for concurrent request handling.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

tickets = [
    "I was charged twice for my plan this month.",
    "The app crashes whenever I upload a PDF.",
    "I cannot access my account after changing my email.",
    "Do you offer discounts for annual subscriptions?",
    "My password reset link is not working.",
]

PROMPT_TEMPLATE = """
Classify this support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.

Ticket:
{ticket}
"""

async def classify_ticket(ticket: str) -> str:
    """
    Classify a single ticket asynchronously.
    """
    response = await client.responses.create(
        model="gpt-5.4-mini",
        input=PROMPT_TEMPLATE.format(ticket=ticket),
    )
    return response.output_text.strip()

async def main() -> None:
    start = time.perf_counter()

    # Launch all classification tasks concurrently.
    tasks = [classify_ticket(ticket) for ticket in tickets]
    results = await asyncio.gather(*tasks)

    elapsed = time.perf_counter() - start

    for ticket, label in zip(tickets, results):
        print(f"Ticket: {ticket}")
        print(f"Label:  {label}")
        print()

    print(f"Processed {len(tickets)} tickets in {elapsed:.2f} seconds")

if __name__ == "__main__":
    asyncio.run(main())

Example Output

Ticket: I was charged twice for my plan this month.
Label:  billing

Ticket: The app crashes whenever I upload a PDF.
Label:  technical_support

Ticket: I cannot access my account after changing my email.
Label:  account_access

Ticket: Do you offer discounts for annual subscriptions?
Label:  general_inquiry

Ticket: My password reset link is not working.
Label:  account_access

Processed 5 tickets in 1.96 seconds

Discussion

If the five tickets were classified sequentially, total runtime would be roughly the sum of the individual latencies, several times higher than the concurrent run.
Concurrency improves overall throughput for independent tasks.

Caution

In production, add:

  • retry logic
  • rate-limit handling
  • bounded concurrency with semaphores
  • structured logging
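Bounded concurrency is worth seeing concretely. The sketch below caps in-flight tasks with an `asyncio.Semaphore`; `fake_classify` is a stand-in that sleeps instead of calling the real API, so you can try the pattern without credentials:

```python
import asyncio

async def bounded_gather(coros, limit: int):
    """Run coroutines concurrently, allowing at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros))

async def fake_classify(ticket: str) -> str:
    # Stand-in for a real Responses API call; simulates network latency.
    await asyncio.sleep(0.05)
    return "billing" if "charge" in ticket else "general_inquiry"

async def main() -> None:
    tickets = [f"Ticket {i}: duplicate charge on my invoice" for i in range(10)]
    labels = await bounded_gather(
        [fake_classify(t) for t in tickets], limit=3
    )
    print(labels.count("billing"), "tickets classified as billing")

if __name__ == "__main__":
    asyncio.run(main())
```

With `limit=3`, at most three requests run at once, which keeps throughput high while staying under rate limits.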


9. Hands-On Exercise 5: Replace a Multi-Step Pipeline with a Single Call

Objective

Combine multiple related tasks into one request.

Scenario

Suppose you want to:

  • classify a ticket
  • summarize it
  • extract the next action

Instead of three calls, use one structured prompt.

Code

import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

ticket = """
Hi support, my invoice shows two charges for March.
Please refund the duplicate amount. Also confirm whether my subscription is still active.
"""

prompt = f"""
Analyze the support ticket below.

Return valid JSON with exactly these keys:
- category
- summary
- next_action

Rules:
- category must be one of: billing, technical_support, account_access, general_inquiry
- summary must be one short sentence
- next_action must be one short sentence
- return JSON only

Ticket:
{ticket}
"""

response = client.responses.create(
    model="gpt-5.4-mini",
    input=prompt,
)

raw_output = response.output_text.strip()
print("Raw model output:")
print(raw_output)
print()

# Parse the JSON output for downstream use.
data = json.loads(raw_output)

print("Parsed result:")
print(f"Category:    {data['category']}")
print(f"Summary:     {data['summary']}")
print(f"Next Action: {data['next_action']}")

Example Output

Raw model output:
{"category":"billing","summary":"The customer reports a duplicate March charge and requests a refund.","next_action":"Review the invoice, confirm subscription status, and issue a refund if duplicate billing is verified."}

Parsed result:
Category:    billing
Summary:     The customer reports a duplicate March charge and requests a refund.
Next Action: Review the invoice, confirm subscription status, and issue a refund if duplicate billing is verified.

Key Takeaway

A single well-structured call can reduce: - total latency - orchestration complexity - cumulative cost


10. Best Practices Checklist

Use this checklist when optimizing an LLM application.

Prompt Design

  • Keep instructions short and specific
  • Remove duplicated guidance
  • Include only relevant context
  • Ask for the smallest useful output

Throughput Engineering

  • Use async processing for independent tasks
  • Avoid unnecessary sequential workflows
  • Batch work conceptually where possible
  • Monitor queue sizes and request timing

Cost Control

  • Prefer smaller/faster models for simple tasks
  • Cache deterministic results
  • Limit output length
  • Eliminate redundant retries
  • Merge related tasks when appropriate

Reliability

  • Add retry with exponential backoff
  • Log latency and failures
  • Validate outputs, especially structured ones
  • Guard against malformed JSON or empty outputs
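The first and last items can be sketched together: a generic retry wrapper with exponential backoff plus jitter, and a validator for the structured-output pattern from Exercise 5. Function names and defaults here are illustrative choices, not a library API:

```python
import json
import random
import time

def call_with_retry(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Call fn(); on failure, sleep base_delay * 2^attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

def parse_ticket_json(raw: str) -> dict:
    """Validate that the model returned JSON containing the expected keys."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = [k for k in ("category", "summary", "next_action") if k not in data]
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data

result = call_with_retry(
    lambda: parse_ticket_json(
        '{"category": "billing", "summary": "Duplicate charge.", '
        '"next_action": "Issue refund."}'
    ),
    base_delay=0.1,
)
print(result["category"])
```

Note that retrying a deterministic parse failure will not help; in a real pipeline you would retry the model call itself, or re-prompt with the error message.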

11. Common Mistakes

Mistake 1: Optimizing Before Measuring

Always benchmark first.

Mistake 2: Overstuffing Context

More context is not always better.
Irrelevant context increases cost and may reduce accuracy.

Mistake 3: Asking for Beautiful Prose in Backend Workflows

Backend systems usually need:

  • labels
  • JSON
  • short summaries
  • extracted fields

Not long-form writing.

Mistake 4: Ignoring Cache Opportunities

Repeated prompts are common in real apps.

Mistake 5: Excessive Chaining

Every extra LLM call adds:

  • latency
  • failure surface
  • cost


12. Short Guided Practice

Spend 5–7 minutes improving this prompt:

You are a highly intelligent assistant. Please read the following email from a customer and identify what the issue is about. Then explain your reasoning and provide a final category from the list of possible categories that our team uses internally. The categories are billing, technical support, account access, and general inquiry.

Target Improvement

A better version might be:

Classify the customer email into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.

Reflection Questions

  • What unnecessary text was removed?
  • Did the output format become easier to parse?
  • Would this likely reduce latency and cost?

13. Session Recap

In this session, you learned that GenAI optimization is about balancing:

  • Latency: how fast one request completes
  • Throughput: how much total work the system can process
  • Cost: how efficiently you use model calls and tokens

You practiced:

  • measuring request latency
  • compressing prompts
  • constraining outputs
  • caching repeated requests
  • using async concurrency
  • collapsing multi-step workflows into a single call

These techniques are foundational for production-grade LLM systems.


14. Useful Resources

  • OpenAI Responses API Guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API Reference: https://platform.openai.com/docs/api-reference
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Python asyncio Documentation: https://docs.python.org/3/library/asyncio.html
  • Python time Module: https://docs.python.org/3/library/time.html
  • Python json Module: https://docs.python.org/3/library/json.html
  • Python hashlib Module: https://docs.python.org/3/library/hashlib.html

15. Optional Homework

Build a small “ticket triage optimizer” script that:

  1. reads 20 support tickets from a Python list
  2. classifies them with gpt-5.4-mini
  3. uses async concurrency
  4. caches repeated prompts
  5. records per-request latency
  6. prints:
     • total runtime
     • average latency
     • category counts
     • cache hit rate

Stretch Goal

Add a second version that:

  • uses a more verbose prompt
  • compares total runtime and average output length
  • reports which version is more cost-efficient


16. Quick Quiz

1. What is latency?

A. Total system memory usage
B. Time taken for one request to complete
C. Number of requests handled per hour
D. Price per API key

Answer: B

2. What is throughput?

A. The number of tasks processed over time
B. The length of a prompt
C. The probability of a retry
D. The format of model output

Answer: A

3. Which technique most directly reduces repeated API cost?

A. Increasing verbosity
B. Caching
C. Longer outputs
D. Sequential chaining

Answer: B

4. Why constrain output format?

A. To make responses slower
B. To increase token use
C. To improve parsing and reduce verbosity
D. To disable concurrency

Answer: C

5. When is async concurrency most helpful?

A. For unrelated tasks that can run independently
B. Only when prompts are long
C. Only for JSON parsing
D. Only for interactive chat UIs

Answer: A

