Session 2: Latency, Throughput, and Cost Optimization
Synopsis
Covers practical methods for improving response times and reducing operating costs, including caching, batching, model selection strategies, response truncation, and workflow simplification. Learners balance product quality with operational efficiency.
Session Content
Session Overview
Duration: ~45 minutes
Audience: Python developers with basic programming knowledge
Goal: Learn how to optimize GenAI applications for speed, scalability, and cost efficiency using practical techniques with the OpenAI Responses API and the gpt-5.4-mini model.
Learning Objectives
By the end of this session, learners will be able to:
- Explain the difference between latency, throughput, and cost in GenAI systems
- Identify common performance bottlenecks in LLM-powered applications
- Use practical techniques to reduce response time and control token usage
- Implement batching, caching, and prompt optimization in Python
- Measure API usage and reason about cost/performance trade-offs
- Build a simple optimization workflow for iterative improvement
1. Why Optimization Matters in GenAI Applications
Large language model applications are powerful, but they can become:
- Slow for users if responses take too long
- Expensive if prompts and outputs are too large
- Hard to scale if too many requests arrive at once
Optimization is about balancing three competing factors:
1.1 Latency
Latency is how long a single request takes from start to finish.
Examples:
- A chatbot reply taking 8 seconds feels slow
- A suggestion generated in 1.2 seconds feels responsive
Latency matters most for:
- Chat interfaces
- Real-time assistants
- Interactive coding tools
- Customer support workflows
1.2 Throughput
Throughput is how much work your system can handle over time.
Examples:
- 10 requests per second
- 5,000 document summaries per hour
Throughput matters most for:
- Batch processing
- Content pipelines
- Document enrichment
- Background jobs
1.3 Cost
Cost is driven primarily by:
- Number of requests
- Input tokens
- Output tokens
- Model choice
- Retries and failed requests
Cost matters for:
- Production systems
- High-volume workloads
- Long prompts or long outputs
- Multi-step agent workflows
1.4 The Core Trade-off
You usually cannot optimize all three dimensions perfectly at once.
Examples:
- Lower latency may require smaller prompts or shorter outputs
- Higher throughput may require batching or asynchronous processing
- Lower cost may require smaller models or more constrained responses
A good GenAI engineer chooses the right trade-off for the product.
2. Common Performance Bottlenecks
Before optimizing, identify where time and money are being spent.
2.1 Large Prompts
Long prompts increase:
- request payload size
- token processing time
- cost
Common causes:
- repeated instructions
- unnecessary context
- including full documents when only excerpts are needed
2.2 Large Outputs
If you ask for:
- essays when bullet points would do
- verbose explanations for internal pipelines
- full JSON schemas when only a few fields are needed
...you increase latency and cost.
2.3 Too Many Sequential Calls
A common anti-pattern:
- classify
- summarize
- extract entities
- rewrite
- validate
If done sequentially, latency compounds quickly: five calls at roughly 1.2 seconds each already total about 6 seconds before any retries.
2.4 No Caching
If the same prompt or near-identical request is repeated, you may be paying repeatedly for the same answer.
2.5 Poor Retry or Concurrency Strategy
- Too many retries increase cost
- Too little concurrency reduces throughput
- Too much concurrency can create rate-limit issues
3. Optimization Techniques: Theory
3.1 Prompt Compression
Reduce prompt size without losing meaning.
Instead of:
- long background paragraphs
- repeated policy reminders
- excessive examples
Prefer:
- short, direct instructions
- structured input
- reusable system guidance patterns
- only relevant context snippets
Example
Verbose prompt:
You are an expert assistant helping our support team classify incoming customer support tickets. Please carefully read the ticket and then determine whether it belongs to billing, technical support, account access, or general inquiry. Make sure your answer is concise and helpful for downstream systems.
Compressed prompt:
Classify the support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.
Same task, lower token use.
3.2 Output Constraining
Reduce verbosity by explicitly controlling the expected format.
Useful techniques:
- “Return only JSON”
- “Use max 3 bullet points”
- “Answer in one sentence”
- “Return only the category label”
This improves:
- latency
- cost
- machine-readability
3.3 Model Selection
Not every task needs the largest or most expensive model.
Use smaller/faster models when:
- the task is simple
- the prompt is well-structured
- perfect nuance is not necessary
- the workload is high-volume
Examples:
- classification
- routing
- tagging
- extraction from clean text
- simple summarization
For this session, exercises use gpt-5.4-mini, which is a practical choice for lightweight production tasks.
3.4 Caching
Cache results when:
- inputs repeat exactly
- users retry the same request
- many documents contain identical text snippets
- prompts are deterministic enough
Types of cache:
- in-memory dictionary
- local file cache
- Redis
- application-level memoization
Caching reduces:
- latency
- API load
- cost
3.5 Batching and Parallelism
For high throughput workloads:
- process items concurrently
- avoid waiting for one request before starting the next
Use:
- asyncio
- worker pools
- queue-based processing
Note:
- More concurrency is not always better
- Respect rate limits and add backoff logic
3.6 Token Budgeting
A token budget is an explicit limit on:
- how much input context you send
- how much output you request
Examples:
- truncate documents to relevant sections
- summarize before passing to later stages
- retrieve top-k relevant chunks only
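The truncation idea above can be sketched in a few lines. This is an illustrative helper, not part of the SDK: the 4-characters-per-token ratio is a rough assumption, and a real implementation would count tokens with an actual tokenizer library.

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Clip text to an approximate token budget before sending it to the model.

    Uses a rough ~4 characters-per-token heuristic; an exact version would
    count tokens with a real tokenizer instead.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    # Keep the start of the document; many tasks only need the opening context.
    return text[:max_chars]


document = "word " * 2000                       # ~10,000 characters of filler
clipped = truncate_to_budget(document, max_tokens=500)
print(len(clipped))                             # 2000 characters ≈ 500 tokens
```

In practice you would apply this to each document (or retrieved chunk) before building the prompt, so the input side of the token budget is enforced mechanically rather than by hope.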
3.7 Avoiding Unnecessary Multi-Step Pipelines
Sometimes one well-designed prompt can replace multiple model calls.
Instead of:
- summarize
- then classify summary
- then extract action items
Try:
- one prompt that returns summary, category, and action items together in JSON
This can significantly reduce total latency and cost.
4. Measuring Performance in Python
Optimization without measurement is guesswork.
You should measure at least:
- wall-clock response time
- approximate input/output size
- request success/failure
- cache hit/miss
- requests per second in batch jobs
5. Hands-On Exercise 1: Measure Latency and Basic Token-Aware Design
Objective
Create a small script that:
- calls gpt-5.4-mini
- measures request latency
- compares a verbose prompt vs a concise prompt
What You’ll Learn
- How to use the OpenAI Responses API in Python
- How prompt length affects practical performance
- How to build a simple benchmark loop
Setup
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
Code
import os
import time
from openai import OpenAI
# Create the OpenAI client using the API key from the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# A sample support ticket to classify.
ticket_text = """
Hi team, I was charged twice for my monthly subscription.
Can you refund the extra payment? My invoice ID is INV-4821.
"""
# A verbose version of the prompt.
verbose_prompt = f"""
You are an expert AI assistant working with a customer support operations team.
Your task is to carefully analyze the incoming support message and determine
which category best applies to it. The available categories are:
billing, technical_support, account_access, and general_inquiry.
Please read the customer message thoroughly and think carefully before deciding.
Then provide the single best category for this ticket in a concise form that
can be used by downstream systems.
Customer message:
{ticket_text}
"""
# A concise version of the prompt.
concise_prompt = f"""
Classify this support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.
Ticket:
{ticket_text}
"""
def run_prompt(prompt_name: str, prompt_text: str) -> None:
    """
    Sends a prompt to the Responses API, measures wall-clock latency,
    and prints the result in a compact format.
    """
    start = time.perf_counter()
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt_text,
    )
    elapsed = time.perf_counter() - start
    # output_text provides the model's generated text in a convenient form.
    print(f"--- {prompt_name} ---")
    print("Output:", response.output_text.strip())
    print(f"Latency: {elapsed:.2f} seconds")
    print(f"Prompt length (characters): {len(prompt_text)}")
    print()


if __name__ == "__main__":
    run_prompt("Verbose Prompt", verbose_prompt)
    run_prompt("Concise Prompt", concise_prompt)
Example Output
--- Verbose Prompt ---
Output: billing
Latency: 1.42 seconds
Prompt length (characters): 654
--- Concise Prompt ---
Output: billing
Latency: 1.08 seconds
Prompt length (characters): 186
Discussion
Observe:
- both prompts solve the same task
- the concise prompt is easier to parse
- shorter prompts often improve speed and reduce cost
Mini Challenge
Modify the script to test 5 different tickets and compute average latency for each prompt style.
6. Hands-On Exercise 2: Reduce Output Size with Structured Constraints
Objective
Compare a free-form summarization prompt with a constrained prompt.
What You’ll Learn
- How output constraints reduce excess verbosity
- Why structured outputs are useful for downstream systems
Code
import os
import time
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
article = """
Our team launched a new analytics dashboard for small business users.
The dashboard includes revenue trends, customer retention charts,
and weekly performance alerts. Early users reported that the product
is easy to navigate, but some requested CSV export support and more
customizable reporting filters. The product team plans to release
those enhancements next quarter.
"""
prompts = {
    "free_form": f"""
Summarize the following product update for an internal audience:
{article}
""",
    "constrained": f"""
Summarize the product update below in exactly 3 bullet points.
Each bullet must be under 12 words.
Text:
{article}
""",
}

for name, prompt in prompts.items():
    start = time.perf_counter()
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )
    elapsed = time.perf_counter() - start
    print(f"--- {name} ---")
    print(response.output_text.strip())
    print(f"Latency: {elapsed:.2f} seconds")
    print()
Example Output
--- free_form ---
The team launched a new analytics dashboard for small business users that includes revenue trends, customer retention charts, and weekly alerts. Early feedback has been positive, especially around usability, though users want CSV export support and more customizable filters. Those improvements are planned for next quarter.
Latency: 1.31 seconds
--- constrained ---
- New dashboard launched for small business analytics.
- Users liked usability and requested export features.
- More filters and CSV export arrive next quarter.
Latency: 1.02 seconds
Key Takeaway
Smaller outputs often mean:
- lower cost
- lower latency
- easier parsing
- less post-processing
7. Hands-On Exercise 3: Add a Simple Cache
Objective
Avoid repeated API calls for identical requests.
What You’ll Learn
- How caching reduces cost and latency
- How to use a deterministic cache key
Code
import os
import hashlib
import time
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Simple in-memory cache for demonstration.
response_cache = {}
def make_cache_key(prompt: str) -> str:
    """
    Create a stable hash-based cache key from the prompt text.
    This avoids using very long strings directly as dictionary keys.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def get_or_create_response(prompt: str) -> str:
    """
    Return a cached response if available; otherwise call the API and cache it.
    """
    key = make_cache_key(prompt)
    if key in response_cache:
        print("Cache hit")
        return response_cache[key]
    print("Cache miss")
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
    )
    text = response.output_text.strip()
    response_cache[key] = text
    return text


if __name__ == "__main__":
    prompt = """
Classify the following support issue into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.
Issue:
I cannot log into my account after resetting my password.
"""
    for attempt in range(2):
        start = time.perf_counter()
        result = get_or_create_response(prompt)
        elapsed = time.perf_counter() - start
        print(f"Attempt {attempt + 1}: {result}")
        print(f"Elapsed: {elapsed:.4f} seconds")
        print()
Example Output
Cache miss
Attempt 1: account_access
Elapsed: 1.1842 seconds
Cache hit
Attempt 2: account_access
Elapsed: 0.0001 seconds
Discussion
In production, replace the in-memory cache with:
- Redis
- SQLite
- a persistent key-value store
- HTTP cache layers for stable tasks
Important Note
Caching is best when:
- prompts are deterministic
- repeated requests are common
- stale outputs are acceptable for some time window
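As one sketch of a persistent option, the in-memory dictionary from the exercise could be swapped for SQLite from the standard library. The SqliteCache class below is illustrative only (the class name and schema are assumptions, not a production design), but it survives process restarts when given a real file path.

```python
import hashlib
import sqlite3


class SqliteCache:
    """Tiny persistent prompt cache; pass a real file path in production."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def _key(prompt: str) -> str:
        # Same hashing idea as the in-memory example above.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (self._key(prompt),)
        ).fetchone()
        return row[0] if row else None

    def set(self, prompt: str, value: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (self._key(prompt), value),
        )
        self.conn.commit()


cache = SqliteCache()
cache.set("classify: login issue", "account_access")
print(cache.get("classify: login issue"))   # account_access
print(cache.get("unseen prompt"))           # None
```

The lookup/insert pattern is identical to the dictionary version, so get_or_create_response only needs its two cache lines changed to adopt it.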
8. Hands-On Exercise 4: Improve Throughput with Async Concurrency
Objective
Process multiple independent tasks concurrently.
What You’ll Learn
- How throughput differs from single-request latency
- How to use asyncio with the OpenAI Python SDK
- Why concurrency helps batch workloads
Code
import os
import time
import asyncio
from openai import AsyncOpenAI
# Async client for concurrent request handling.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
tickets = [
"I was charged twice for my plan this month.",
"The app crashes whenever I upload a PDF.",
"I cannot access my account after changing my email.",
"Do you offer discounts for annual subscriptions?",
"My password reset link is not working.",
]
PROMPT_TEMPLATE = """
Classify this support ticket into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.
Ticket:
{ticket}
"""
async def classify_ticket(ticket: str) -> str:
    """
    Classify a single ticket asynchronously.
    """
    response = await client.responses.create(
        model="gpt-5.4-mini",
        input=PROMPT_TEMPLATE.format(ticket=ticket),
    )
    return response.output_text.strip()


async def main() -> None:
    start = time.perf_counter()
    # Launch all classification tasks concurrently.
    tasks = [classify_ticket(ticket) for ticket in tickets]
    results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    for ticket, label in zip(tickets, results):
        print(f"Ticket: {ticket}")
        print(f"Label: {label}")
        print()
    print(f"Processed {len(tickets)} tickets in {elapsed:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())
Example Output
Ticket: I was charged twice for my plan this month.
Label: billing
Ticket: The app crashes whenever I upload a PDF.
Label: technical_support
Ticket: I cannot access my account after changing my email.
Label: account_access
Ticket: Do you offer discounts for annual subscriptions?
Label: general_inquiry
Ticket: My password reset link is not working.
Label: account_access
Processed 5 tickets in 1.96 seconds
Discussion
If run sequentially, total time might be much higher.
Concurrency improves overall throughput for independent tasks.
Caution
In production, add:
- retry logic
- rate-limit handling
- bounded concurrency with semaphores
- structured logging
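These safeguards can be sketched together. The fake_api_call, with_retries, and bounded helpers below are hypothetical stand-ins (an asyncio.sleep replaces the real AsyncOpenAI request), showing how a semaphore bounds concurrency and how exponential backoff wraps each call:

```python
import asyncio
import random


async def fake_api_call(item: str) -> str:
    # Stand-in for a real AsyncOpenAI request.
    await asyncio.sleep(0.01)
    return item.upper()


async def with_retries(item: str, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            return await fake_api_call(item)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with a little jitter: ~0.5s, ~1s, ...
            await asyncio.sleep(0.5 * (2 ** attempt) + random.random() * 0.1)
    raise RuntimeError("unreachable")


async def main() -> list[str]:
    semaphore = asyncio.Semaphore(3)   # at most 3 requests in flight

    async def bounded(item: str) -> str:
        async with semaphore:
            return await with_retries(item)

    items = ["a", "b", "c", "d"]
    labels = await asyncio.gather(*(bounded(i) for i in items))
    print(labels)                       # ['A', 'B', 'C', 'D']
    return labels


results = asyncio.run(main())
```

Raising the semaphore limit trades higher throughput for a greater chance of hitting rate limits, which is exactly the tuning knob the caution above is about.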
9. Hands-On Exercise 5: Replace a Multi-Step Pipeline with a Single Call
Objective
Combine multiple related tasks into one request.
Scenario
Suppose you want to:
- classify a ticket
- summarize it
- extract next action
Instead of three calls, use one structured prompt.
Code
import os
import json
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
ticket = """
Hi support, my invoice shows two charges for March.
Please refund the duplicate amount. Also confirm whether my subscription is still active.
"""
prompt = f"""
Analyze the support ticket below.
Return valid JSON with exactly these keys:
- category
- summary
- next_action
Rules:
- category must be one of: billing, technical_support, account_access, general_inquiry
- summary must be one short sentence
- next_action must be one short sentence
- return JSON only
Ticket:
{ticket}
"""
response = client.responses.create(
    model="gpt-5.4-mini",
    input=prompt,
)
raw_output = response.output_text.strip()
print("Raw model output:")
print(raw_output)
print()
# Parse the JSON output for downstream use.
# Note: json.loads raises json.JSONDecodeError if the model returns anything
# other than valid JSON; production code should validate or retry.
data = json.loads(raw_output)
print("Parsed result:")
print(f"Category: {data['category']}")
print(f"Summary: {data['summary']}")
print(f"Next Action: {data['next_action']}")
Example Output
Raw model output:
{"category":"billing","summary":"The customer reports a duplicate March charge and requests a refund.","next_action":"Review the invoice, confirm subscription status, and issue a refund if duplicate billing is verified."}
Parsed result:
Category: billing
Summary: The customer reports a duplicate March charge and requests a refund.
Next Action: Review the invoice, confirm subscription status, and issue a refund if duplicate billing is verified.
Key Takeaway
A single well-structured call can reduce: - total latency - orchestration complexity - cumulative cost
10. Best Practices Checklist
Use this checklist when optimizing an LLM application.
Prompt Design
- Keep instructions short and specific
- Remove duplicated guidance
- Include only relevant context
- Ask for the smallest useful output
Throughput Engineering
- Use async processing for independent tasks
- Avoid unnecessary sequential workflows
- Batch work conceptually where possible
- Monitor queue sizes and request timing
Cost Control
- Prefer smaller/faster models for simple tasks
- Cache deterministic results
- Limit output length
- Eliminate redundant retries
- Merge related tasks when appropriate
Reliability
- Add retry with exponential backoff
- Log latency and failures
- Validate outputs, especially structured ones
- Guard against malformed JSON or empty outputs
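The last two checklist items can be sketched as one defensive parsing helper. parse_model_json is a hypothetical name, and the fence-stripping heuristic is an assumption about common failure modes; a fuller solution might validate against a schema library.

```python
import json


def parse_model_json(raw: str, required_keys: set[str]):
    """Validate model output before trusting it downstream.

    Returns None on malformed JSON or missing keys, so callers can retry
    or fall back instead of crashing.
    """
    raw = raw.strip()
    # Models sometimes wrap JSON in markdown fences; strip them defensively.
    if raw.startswith("```"):
        raw = raw.strip("`")
        raw = raw.removeprefix("json").strip()
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys <= data.keys():
        return None
    return data


good = '{"category": "billing", "summary": "Duplicate charge."}'
bad = "Sure! Here is the JSON you asked for..."
print(parse_model_json(good, {"category", "summary"}))
print(parse_model_json(bad, {"category", "summary"}))   # None
```

Returning None rather than raising keeps the retry decision with the caller, which pairs naturally with the backoff logic recommended above.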
11. Common Mistakes
Mistake 1: Optimizing Before Measuring
Always benchmark first.
Mistake 2: Overstuffing Context
More context is not always better.
Irrelevant context increases cost and may reduce accuracy.
Mistake 3: Asking for Beautiful Prose in Backend Workflows
Backend systems usually need:
- labels
- JSON
- short summaries
- extracted fields
Not long-form writing.
Mistake 4: Ignoring Cache Opportunities
Repeated prompts are common in real apps.
Mistake 5: Excessive Chaining
Every extra LLM call adds:
- latency
- failure surface
- cost
12. Short Guided Practice
Spend 5–7 minutes improving this prompt:
You are a highly intelligent assistant. Please read the following email from a customer and identify what the issue is about. Then explain your reasoning and provide a final category from the list of possible categories that our team uses internally. The categories are billing, technical support, account access, and general inquiry.
Target Improvement
A better version might be:
Classify the customer email into one of:
billing, technical_support, account_access, general_inquiry.
Return only the label.
Reflection Questions
- What unnecessary text was removed?
- Did the output format become easier to parse?
- Would this likely reduce latency and cost?
13. Session Recap
In this session, you learned that GenAI optimization is about balancing:
- Latency: how fast one request completes
- Throughput: how much total work the system can process
- Cost: how efficiently you use model calls and tokens
You practiced:
- measuring request latency
- compressing prompts
- constraining outputs
- caching repeated requests
- using async concurrency
- collapsing multi-step workflows into a single call
These techniques are foundational for production-grade LLM systems.
14. Useful Resources
- OpenAI Responses API Guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API Reference: https://platform.openai.com/docs/api-reference
- OpenAI Python SDK: https://github.com/openai/openai-python
- Python asyncio Documentation: https://docs.python.org/3/library/asyncio.html
- Python time Module: https://docs.python.org/3/library/time.html
- Python json Module: https://docs.python.org/3/library/json.html
- Python hashlib Module: https://docs.python.org/3/library/hashlib.html
15. Optional Homework
Build a small “ticket triage optimizer” script that:
- reads 20 support tickets from a Python list
- classifies them with gpt-5.4-mini
- uses async concurrency
- caches repeated prompts
- records per-request latency
- prints:
- total runtime
- average latency
- category counts
- cache hit rate
Stretch Goal
Add a second version that:
- uses a more verbose prompt
- compares total runtime and average output length
- reports which version is more cost-efficient
16. Quick Quiz
1. What is latency?
A. Total system memory usage
B. Time taken for one request to complete
C. Number of requests handled per hour
D. Price per API key
Answer: B
2. What is throughput?
A. The number of tasks processed over time
B. The length of a prompt
C. The probability of a retry
D. The format of model output
Answer: A
3. Which technique most directly reduces repeated API cost?
A. Increasing verbosity
B. Caching
C. Longer outputs
D. Sequential chaining
Answer: B
4. Why constrain output format?
A. To make responses slower
B. To increase token use
C. To improve parsing and reduce verbosity
D. To disable concurrency
Answer: C
5. When is async concurrency most helpful?
A. For unrelated tasks that can run independently
B. Only when prompts are long
C. Only for JSON parsing
D. Only for interactive chat UIs
Answer: A