Session 4: Managing Tokens, Rate Limits, and Errors
Synopsis
Introduces common operational constraints when working with model APIs, including token budgeting, request throttling, retries, and failure handling. Learners gain the foundation needed to build robust applications rather than simple demos.
Session Content
Session Overview
In this session, you will learn how to build more reliable GenAI applications by managing three practical concerns:
- Tokens: how model input and output are measured and controlled
- Rate limits: how to design apps that behave well under API usage constraints
- Errors: how to detect, handle, and recover from common failures
By the end of this session, you will be able to:
- Explain what tokens are and why they matter
- Control output size and reduce unnecessary token usage
- Implement retry logic for transient failures
- Handle common API errors cleanly in Python
- Build a small, resilient wrapper around the OpenAI Responses API
Learning Objectives
After this session, learners should be able to:
- Describe the relationship between prompts, completions, and token usage.
- Reduce token waste through prompt design and output constraints.
- Recognize common API reliability issues such as rate limiting and temporary service failures.
- Implement exponential backoff retries in Python.
- Build a reusable function for making robust model calls with logging and error handling.
Recommended Prerequisites
Before starting this session, learners should already be comfortable with:
- basic Python functions
- try/except blocks
- installing packages with pip
- working with environment variables
- making simple OpenAI API calls
Session Timing (~45 Minutes)
- 0–8 min: Why tokens and limits matter
- 8–18 min: Token management strategies
- 18–28 min: Rate limits and retry patterns
- 28–38 min: Handling errors in production-style code
- 38–45 min: Hands-on mini project: resilient API wrapper
1. Why Tokens, Rate Limits, and Errors Matter
When building GenAI applications, it is not enough to get a successful response once. Real applications must be:
- cost-aware
- predictable
- resilient
- safe under load
Three practical constraints affect this:
1.1 Tokens
Models process text as tokens, not raw characters or words.
Tokens affect:
- how much input the model can consider
- how much output it can produce
- how much a request may cost
- how long responses may take
A longer prompt usually means:
- more tokens consumed
- potentially higher cost
- potentially slower responses
A longer response also consumes tokens.
1.2 Rate Limits
APIs often enforce usage limits to ensure fairness and stability. Your app may be limited by:
- requests per minute
- tokens per minute
- concurrent requests
- account tier limits
If you exceed limits, your request may fail temporarily.
1.3 Errors
Even correct code can encounter errors such as:
- invalid API key
- malformed request
- timeout
- rate limit error
- temporary server issue
- network interruption
Good applications expect failures and handle them gracefully.
2. Token Management Fundamentals
2.1 What Is a Token?
A token is a chunk of text used internally by the model. A token may be:
- a whole short word
- part of a longer word
- punctuation
- whitespace patterns
Example intuition:
- "Hello" might be one token
- "unbelievable" may be split into multiple tokens
- code often tokenizes differently than plain English
You usually do not need to manually count every token, but you should understand that:
- verbose prompts consume more tokens
- repeated instructions waste tokens
- large context windows can become expensive
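A quick way to build intuition without calling an API is a rough character-based estimate. This is a crude rule of thumb (English prose averages roughly 4 characters per token), not an actual tokenizer; real counts vary by model and language, and the function name here is just for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: English text averages about 4 characters per token."""
    return max(1, len(text) // 4)


verbose = "Please provide a comprehensive but also concise answer with a good summary."
concise = "Answer in 3 bullet points."

print(estimate_tokens(verbose))  # noticeably larger
print(estimate_tokens(concise))
```

Even this crude estimate makes the cost difference between verbose and concise prompts visible before you send a single request.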
2.2 Practical Ways to Reduce Token Usage
Here are common techniques:
A. Be specific, not verbose
Instead of:
Please provide a comprehensive but also concise answer, and think carefully about the structure, and make sure to include a good summary, and do not be too long, but also not too short...
Use:
Answer in 3 bullet points. Keep it under 80 words.
B. Limit output length
Use output constraints in your prompt and API parameters where appropriate.
C. Avoid repeating system instructions unnecessarily
If your app always uses the same behavior, centralize it in one place.
D. Summarize long conversation history
Instead of sending every prior message forever, periodically summarize.
E. Retrieve only relevant context
If using retrieval or documents, include only the sections needed for the current query.
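Technique D above can be sketched as a small helper. This is a minimal sketch with hypothetical names (`trim_history`, `keep_last`); in a real application the placeholder stub would be replaced by a model-generated summary of the dropped messages.

```python
def trim_history(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep only the most recent messages, collapsing older ones into a placeholder.

    In practice, the placeholder content would come from asking the model to
    summarize the dropped messages, so context is compressed rather than lost.
    """
    if len(messages) <= keep_last:
        return messages
    dropped = len(messages) - keep_last
    summary_stub = {
        "role": "system",
        "content": f"[Summary of {dropped} earlier messages goes here]",
    }
    return [summary_stub] + messages[-keep_last:]


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
trimmed = trim_history(history)
print(len(trimmed))  # 5: one summary stub plus the last four messages
```

The key design choice is that history length is bounded no matter how long the conversation runs, so token usage per request stays predictable.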
2.3 Controlling Output Length
With the Responses API, you can constrain output size using max_output_tokens.
Example: Short vs uncontrolled response
from openai import OpenAI
client = OpenAI()
prompt = "Explain what an API is for a beginner."
response = client.responses.create(
    model="gpt-5.4-mini",
    input=prompt,
    max_output_tokens=60
)
print(response.output_text)
Example output
An API is a way for one software program to talk to another. It defines rules for requesting data or actions. For example, a weather app may use an API to fetch forecast data from a weather service.
Why this helps
- prevents overly long responses
- improves predictability
- helps control token consumption
3. Hands-On Exercise 1: Compare Prompt Styles for Token Efficiency
Goal
Observe how prompt design influences response size and clarity.
Task
Run two prompts:
- a verbose prompt
- a concise prompt with explicit output constraints
Compare:
- readability
- output length
- usefulness
Code
from openai import OpenAI
client = OpenAI()
verbose_prompt = """
Please explain what Python decorators are in a way that is easy for a beginner to understand.
Make sure your answer is helpful, educational, and includes enough detail to be understandable.
Also try to provide examples if useful, and make the explanation balanced so it is not too short
but also not too long.
"""
concise_prompt = """
Explain Python decorators for a beginner.
Return:
- 3 bullet points
- 1 tiny code example
- under 120 words
"""
for label, prompt in [("VERBOSE", verbose_prompt), ("CONCISE", concise_prompt)]:
    print(f"\n--- {label} PROMPT ---")
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
        max_output_tokens=120
    )
    print(response.output_text)
Example output
--- VERBOSE PROMPT ---
Python decorators are a way to modify or extend the behavior of a function without changing its actual code directly...
--- CONCISE PROMPT ---
- A decorator wraps a function to add behavior.
- It is written with @decorator_name above a function.
- Common uses include logging, authentication, and timing.
Example:
def log_call(fn):
    def wrapper():
        print("Calling function")
        return fn()
    return wrapper
Reflection Questions
- Which prompt produced the more controlled answer?
- Which one is easier to use in an application UI?
- How would output constraints help reduce cost over many requests?
4. Rate Limits: What They Are and How to Respond
4.1 What Is a Rate Limit?
A rate limit is a temporary cap on API usage. If too many requests are sent too quickly, the server may reject some requests until usage drops.
This is normal and expected in many APIs.
4.2 Common Strategies
When rate-limited:
- do not spam retries immediately
- wait before retrying
- use exponential backoff
- add jitter to avoid retry storms
- log the failure
- keep retry counts bounded
4.3 Exponential Backoff
Exponential backoff means each retry waits longer than the previous one.
Example sequence:
- retry 1: wait 1 second
- retry 2: wait 2 seconds
- retry 3: wait 4 seconds
- retry 4: wait 8 seconds
Adding jitter means adding small randomness, such as:
- 1.2s
- 2.3s
- 4.1s
This helps when many clients retry at the same time.
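The backoff-plus-jitter sequence above can be written as a small helper. This is a sketch; the base, cap, and jitter values are illustrative choices, not fixed requirements.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, jitter: float = 0.5) -> float:
    """Return the wait time for a retry attempt: base * 2^attempt plus jitter, capped.

    attempt is 0-based, so attempt 0 waits ~1s, attempt 1 waits ~2s, and so on.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)


for attempt in range(4):
    print(f"retry {attempt + 1}: wait ~{backoff_delay(attempt):.2f}s")
```

The cap matters: without it, a few retries of exponential growth can produce waits of minutes, which is rarely what you want in an interactive application.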
5. Error Handling Basics in Python API Calls
5.1 Errors You Should Expect
In production-style code, expect at least:
- authentication failures
- bad request errors
- rate limit errors
- API status/server errors
- connection issues
- timeouts
5.2 Principles of Good Error Handling
- catch specific exception types when possible
- avoid hiding errors silently
- log enough context to debug
- retry only transient failures
- fail fast on permanent request problems
- return useful messages to the caller
6. Hands-On Exercise 2: Build a Retry Wrapper with Exponential Backoff
Goal
Create a reusable function that retries transient failures safely.
What this example demonstrates
- clean function design
- retry loop
- exponential backoff with jitter
- basic logging
- bounded retries
Code
import random
import time
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def generate_with_retry(prompt: str, max_retries: int = 4) -> str:
    """
    Send a prompt to the OpenAI Responses API with retry handling.

    Retries transient errors such as:
    - rate limits
    - connection failures
    - temporary server-side status errors

    Args:
        prompt: The input prompt to send to the model.
        max_retries: Maximum number of retry attempts before failing.

    Returns:
        The model's text output.

    Raises:
        Exception: Re-raises the final exception if all retries fail.
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=100
            )
            return response.output_text
        except RateLimitError:
            if attempt == max_retries:
                print(f"[ERROR] Rate limit persisted after {max_retries} retries.")
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            print(
                f"[WARN] Rate limited. Attempt {attempt + 1}/{max_retries}. "
                f"Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
        except APIConnectionError as exc:
            if attempt == max_retries:
                print(f"[ERROR] Connection issue persisted after {max_retries} retries.")
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            print(
                f"[WARN] Connection error: {exc}. "
                f"Attempt {attempt + 1}/{max_retries}. "
                f"Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
        except APIStatusError as exc:
            status_code = exc.status_code
            # Retry only for likely transient server-side failures.
            if status_code in (500, 502, 503, 504) and attempt < max_retries:
                wait_time = (2 ** attempt) + random.uniform(0, 0.5)
                print(
                    f"[WARN] Server error {status_code}. "
                    f"Attempt {attempt + 1}/{max_retries}. "
                    f"Retrying in {wait_time:.2f} seconds..."
                )
                time.sleep(wait_time)
            else:
                print(f"[ERROR] Non-retriable API status error: {status_code}")
                raise
        except Exception as exc:
            # Unexpected errors should usually not be retried blindly.
            print(f"[ERROR] Unexpected error: {exc}")
            raise


if __name__ == "__main__":
    prompt = "Give me a 2-sentence explanation of why retry logic matters in APIs."
    result = generate_with_retry(prompt)
    print("\nModel output:")
    print(result)
Example output
Model output:
Retry logic helps applications recover from temporary failures such as rate limits or network interruptions. It improves reliability by allowing a request to succeed without immediately failing the whole user workflow.
Discussion
This function retries transient failures but does not retry everything.
That is important because:
- invalid requests will not be fixed by retrying
- bad API keys will not be fixed by retrying
- malformed input should fail clearly
7. Common Error Categories and Recommended Actions
| Error Type | Example Cause | Retry? | Recommended Action |
|---|---|---|---|
| Authentication error | bad or missing API key | No | fix credentials |
| Bad request | invalid parameter or malformed input | No | fix code or input |
| Rate limit error | too many requests too quickly | Yes | backoff and retry |
| Connection error | network interruption | Yes | retry with backoff |
| Server error (5xx) | temporary service problem | Usually yes | retry with backoff |
| Timeout | slow network or overloaded system | Usually yes | retry carefully |
| Unexpected exception | bug in app logic | No | inspect logs and fix code |
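The table above can also be encoded directly as data, which keeps the retry decision in one place instead of scattered across except blocks. The category names here are illustrative labels for this sketch, not SDK exception names.

```python
# Retry decisions mirroring the error table: True = transient, worth retrying.
RETRY_POLICY = {
    "authentication_error": False,  # fix credentials
    "bad_request": False,           # fix code or input
    "rate_limit": True,             # backoff and retry
    "connection_error": True,       # retry with backoff
    "server_error": True,           # 5xx: retry with backoff
    "timeout": True,                # retry carefully
}


def should_retry(category: str) -> bool:
    """Unknown categories default to no retry, so new failure modes fail loudly."""
    return RETRY_POLICY.get(category, False)


print(should_retry("rate_limit"))   # True
print(should_retry("bad_request"))  # False
```

Defaulting unknown categories to False is a deliberate choice: it is safer to surface a new failure mode immediately than to retry it blindly.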
8. Hands-On Exercise 3: Build a Safer Request Function with Structured Results
Goal
Return structured success/error information instead of crashing immediately.
This pattern is helpful when your application needs to:
- show user-friendly messages
- log failures centrally
- continue processing other tasks
Code
import random
import time
from typing import Any, Dict
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def safe_generate(prompt: str, max_retries: int = 3) -> Dict[str, Any]:
    """
    Generate text using the OpenAI Responses API and return a structured result.

    Args:
        prompt: Prompt text to send to the model.
        max_retries: Maximum number of retries for transient errors.

    Returns:
        A dictionary with:
        - ok: bool
        - output: str | None
        - error_type: str | None
        - message: str
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=80
            )
            return {
                "ok": True,
                "output": response.output_text,
                "error_type": None,
                "message": "success"
            }
        except RateLimitError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "rate_limit",
                    "message": "Rate limit exceeded after retries."
                }
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(wait_time)
        except APIConnectionError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "connection_error",
                    "message": "Network or connection issue after retries."
                }
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(wait_time)
        except APIStatusError as exc:
            if exc.status_code in (500, 502, 503, 504):
                if attempt == max_retries:
                    return {
                        "ok": False,
                        "output": None,
                        "error_type": "server_error",
                        "message": f"Server error {exc.status_code} after retries."
                    }
                wait_time = (2 ** attempt) + random.uniform(0, 0.5)
                time.sleep(wait_time)
            else:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "api_status_error",
                    "message": f"Non-retriable API status error: {exc.status_code}"
                }
        except Exception as exc:
            return {
                "ok": False,
                "output": None,
                "error_type": "unexpected_error",
                "message": str(exc)
            }


if __name__ == "__main__":
    result = safe_generate("Summarize why error handling matters in AI apps in 2 sentences.")
    if result["ok"]:
        print("Success!")
        print(result["output"])
    else:
        print("Request failed.")
        print(f"Type: {result['error_type']}")
        print(f"Message: {result['message']}")
Example output
Success!
Error handling matters in AI apps because network issues, rate limits, and server problems can happen even when your code is correct. Good handling improves reliability and gives users clearer feedback when something goes wrong.
9. Mini Project: Resilient Prompt Runner
Goal
Build a small utility that:
- accepts multiple prompts
- sends them one by one
- retries transient failures
- logs success/failure
- keeps outputs short and predictable
This exercise simulates a small batch-processing tool.
Code
import random
import time
from typing import List, Dict, Any
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def robust_generate(prompt: str, max_retries: int = 3, max_output_tokens: int = 80) -> Dict[str, Any]:
    """
    Generate a short response for a prompt with retry handling.

    Args:
        prompt: Text prompt to send.
        max_retries: Number of retries for transient failures.
        max_output_tokens: Upper bound on generated output tokens.

    Returns:
        Dictionary containing success/failure information.
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=max_output_tokens
            )
            return {
                "ok": True,
                "prompt": prompt,
                "output": response.output_text,
                "attempts_used": attempt + 1
            }
        except RateLimitError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": "rate_limit"
                }
            time.sleep((2 ** attempt) + random.uniform(0, 0.5))
        except APIConnectionError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": "connection_error"
                }
            time.sleep((2 ** attempt) + random.uniform(0, 0.5))
        except APIStatusError as exc:
            if exc.status_code in (500, 502, 503, 504) and attempt < max_retries:
                time.sleep((2 ** attempt) + random.uniform(0, 0.5))
            else:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": f"api_status_{exc.status_code}"
                }
        except Exception as exc:
            return {
                "ok": False,
                "prompt": prompt,
                "output": None,
                "attempts_used": attempt + 1,
                "error": f"unexpected: {exc}"
            }


def process_prompts(prompts: List[str]) -> List[Dict[str, Any]]:
    """
    Process a list of prompts sequentially.

    Args:
        prompts: List of input prompts.

    Returns:
        List of result dictionaries.
    """
    results = []
    for index, prompt in enumerate(prompts, start=1):
        print(f"\nProcessing prompt {index}/{len(prompts)}...")
        result = robust_generate(prompt)
        results.append(result)
        if result["ok"]:
            print(" Status : SUCCESS")
            print(f" Attempts: {result['attempts_used']}")
            print(f" Output : {result['output']}")
        else:
            print(" Status : FAILED")
            print(f" Attempts: {result['attempts_used']}")
            print(f" Error : {result['error']}")
    return results


if __name__ == "__main__":
    prompts = [
        "Explain token limits in one sentence.",
        "Why should apps use exponential backoff? Answer in one sentence.",
        "What is a transient API error? Answer in one sentence."
    ]
    all_results = process_prompts(prompts)
    print("\nFinal summary:")
    success_count = sum(1 for item in all_results if item["ok"])
    failure_count = len(all_results) - success_count
    print(f"Successful requests: {success_count}")
    print(f"Failed requests : {failure_count}")
Example output
Processing prompt 1/3...
Status : SUCCESS
Attempts: 1
Output : Token limits cap how much text a model can read and generate in a single request.
Processing prompt 2/3...
Status : SUCCESS
Attempts: 1
Output : Exponential backoff reduces repeated pressure on an API and improves the chance that retries succeed.
Processing prompt 3/3...
Status : SUCCESS
Attempts: 1
Output : A transient API error is a temporary failure, such as a rate limit or brief server issue, that may succeed if retried later.
Final summary:
Successful requests: 3
Failed requests : 0
10. Best Practices Checklist
Use this checklist when building GenAI apps:
Token Management
- Keep prompts concise
- Ask for structured outputs
- Set max_output_tokens when appropriate
- Trim or summarize long history
- Include only relevant context
Rate Limit Handling
- expect temporary rate limits
- retry with exponential backoff
- add jitter
- cap retry attempts
- avoid aggressive loops
Error Handling
- catch specific exceptions
- retry only transient failures
- fail clearly for invalid requests
- log useful context
- return structured error info to callers
11. Common Mistakes to Avoid
Mistake 1: Retrying every error
Do not retry:
- bad API key
- malformed input
- invalid parameter names
These are code/configuration problems, not transient problems.
Mistake 2: No output limits
If you do not constrain outputs, responses may become:
- longer than needed
- more expensive
- harder to display cleanly
Mistake 3: Swallowing exceptions silently
This makes debugging difficult. At minimum, log:
- what operation failed
- which prompt or request triggered it
- what exception occurred
Mistake 4: Sending unnecessary context
Large prompts increase token usage, raise cost, and can slow responses.
12. Quick Knowledge Check
Answer these questions before moving on:
- Why is concise prompting often better than verbose prompting?
- What problem does exponential backoff solve?
- Which errors should generally not be retried?
- Why might structured error results be useful in an application?
- How does max_output_tokens help control behavior?
13. Wrap-Up
In this session, you learned how to make GenAI applications more practical and robust by managing:
- tokens for cost and output control
- rate limits through bounded retries and backoff
- errors using targeted exception handling and structured results
These skills are essential for moving from simple demos to dependable applications.
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs overview: https://platform.openai.com/docs
- Python SDK usage: https://github.com/openai/openai-python
- Python time module: https://docs.python.org/3/library/time.html
- Python random module: https://docs.python.org/3/library/random.html
- Python exception handling: https://docs.python.org/3/tutorial/errors.html
Suggested Homework
- Modify the retry wrapper to log timestamps for each retry.
- Add a parameter that lets users choose between short, medium, and long response styles.
- Extend the mini project to save results to a JSON file.
- Add prompt truncation or summarization before sending very large inputs.
- Create a CLI script that reads prompts from a text file and processes them safely.