Session 4: Managing Tokens, Rate Limits, and Errors
Synopsis
Introduces common operational constraints when working with model APIs, including token budgeting, request throttling, retries, and failure handling. Learners gain the foundation needed to build robust applications rather than simple demos.
Session Content
Session Overview
In this session, you will learn how to build more reliable GenAI applications by managing three practical concerns:
- Tokens: how model input and output are measured and controlled
- Rate limits: how to design apps that behave well under API usage constraints
- Errors: how to detect, handle, and recover from common failures
By the end of this session, you will be able to:
- Explain what tokens are and why they matter
- Control output size and reduce unnecessary token usage
- Implement retry logic for transient failures
- Handle common API errors cleanly in Python
- Build a small, resilient wrapper around the OpenAI Responses API
Learning Objectives
After this session, learners should be able to:
- Describe the relationship between prompts, completions, and token usage.
- Reduce token waste through prompt design and output constraints.
- Recognize common API reliability issues such as rate limiting and temporary service failures.
- Implement exponential backoff retries in Python.
- Build a reusable function for making robust model calls with logging and error handling.
Recommended Prerequisites
Before starting this session, learners should already be comfortable with:
- basic Python functions
- try/except blocks
- installing packages with pip
- working with environment variables
- making simple OpenAI API calls
Session Timing (~45 Minutes)
- 0–8 min: Why tokens and limits matter
- 8–18 min: Token management strategies
- 18–28 min: Rate limits and retry patterns
- 28–38 min: Handling errors in production-style code
- 38–45 min: Hands-on mini project: resilient API wrapper
1. Why Tokens, Rate Limits, and Errors Matter
When building GenAI applications, it is not enough to get a successful response once. Real applications must be:
- cost-aware
- predictable
- resilient
- safe under load
Three practical constraints affect this:
1.1 Tokens
Models process text as tokens, not raw characters or words.
Tokens affect:
- how much input the model can consider
- how much output it can produce
- how much a request may cost
- how long responses may take
A longer prompt usually means:
- more tokens consumed
- potentially higher cost
- potentially slower responses
A longer response also consumes tokens.
1.2 Rate Limits
APIs often enforce usage limits to ensure fairness and stability. Your app may be limited by:
- requests per minute
- tokens per minute
- concurrent requests
- account tier limits
If you exceed limits, your request may fail temporarily.
1.3 Errors
Even correct code can encounter errors such as:
- invalid API key
- malformed request
- timeout
- rate limit error
- temporary server issue
- network interruption
Good applications expect failures and handle them gracefully.
2. Token Management Fundamentals
2.1 What Is a Token?
A token is a chunk of text used internally by the model. A token may be:
- a whole short word
- part of a longer word
- punctuation
- whitespace patterns
Example intuition:
- "Hello" might be one token
- "unbelievable" may be split into multiple tokens
- code often tokenizes differently than plain English
You usually do not need to manually count every token, but you should understand that:
- verbose prompts consume more tokens
- repeated instructions waste tokens
- large context windows can become expensive
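A quick way to build intuition without calling an API is a rough character-based estimate. This is a crude rule of thumb (English prose averages roughly 4 characters per token), not an actual tokenizer; real counts vary by model and language, and the function name here is just for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: English text averages about 4 characters per token."""
    return max(1, len(text) // 4)


verbose = "Please provide a comprehensive but also concise answer with a good summary."
concise = "Answer in 3 bullet points."

print(estimate_tokens(verbose))  # noticeably larger
print(estimate_tokens(concise))
```

Even this crude estimate makes the cost difference between verbose and concise prompts visible before you send a single request.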
2.2 Practical Ways to Reduce Token Usage
Here are common techniques:
A. Be specific, not verbose
Instead of:
Please provide a comprehensive but also concise answer, and think carefully about the structure, and make sure to include a good summary, and do not be too long, but also not too short...
Use:
Answer in 3 bullet points. Keep it under 80 words.
B. Limit output length
Use output constraints in your prompt and API parameters where appropriate.
C. Avoid repeating system instructions unnecessarily
If your app always uses the same behavior, centralize it in one place.
D. Summarize long conversation history
Instead of sending every prior message forever, periodically summarize.
E. Retrieve only relevant context
If using retrieval or documents, include only the sections needed for the current query.
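Technique D above can be sketched as a small helper. This is a minimal sketch with hypothetical names (`trim_history`, `keep_last`); in a real application the placeholder stub would be replaced by a model-generated summary of the dropped messages.

```python
def trim_history(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep only the most recent messages, collapsing older ones into a placeholder.

    In practice, the placeholder content would come from asking the model to
    summarize the dropped messages, so context is compressed rather than lost.
    """
    if len(messages) <= keep_last:
        return messages
    dropped = len(messages) - keep_last
    summary_stub = {
        "role": "system",
        "content": f"[Summary of {dropped} earlier messages goes here]",
    }
    return [summary_stub] + messages[-keep_last:]


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
trimmed = trim_history(history)
print(len(trimmed))  # 5: one summary stub plus the last four messages
```

The key design choice is that history length is bounded no matter how long the conversation runs, so token usage per request stays predictable.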
2.3 Controlling Output Length
With the Responses API, you can constrain output size using max_output_tokens.
Example: Short vs uncontrolled response
from openai import OpenAI
client = OpenAI()
prompt = "Explain what an API is for a beginner."
response = client.responses.create(
    model="gpt-5.4-mini",
    input=prompt,
    max_output_tokens=60
)
print(response.output_text)
Example output
An API is a way for one software program to talk to another. It defines rules for requesting data or actions. For example, a weather app may use an API to fetch forecast data from a weather service.
Why this helps
- prevents overly long responses
- improves predictability
- helps control token consumption
3. Hands-On Exercise 1: Compare Prompt Styles for Token Efficiency
Goal
Observe how prompt design influences response size and clarity.
Task
Run two prompts:
- a verbose prompt
- a concise prompt with explicit output constraints
Compare:
- readability
- output length
- usefulness
Code
from openai import OpenAI
client = OpenAI()
verbose_prompt = """
Please explain what Python decorators are in a way that is easy for a beginner to understand.
Make sure your answer is helpful, educational, and includes enough detail to be understandable.
Also try to provide examples if useful, and make the explanation balanced so it is not too short
but also not too long.
"""
concise_prompt = """
Explain Python decorators for a beginner.
Return:
- 3 bullet points
- 1 tiny code example
- under 120 words
"""
for label, prompt in [("VERBOSE", verbose_prompt), ("CONCISE", concise_prompt)]:
    print(f"\n--- {label} PROMPT ---")
    response = client.responses.create(
        model="gpt-5.4-mini",
        input=prompt,
        max_output_tokens=120
    )
    print(response.output_text)
Example output
--- VERBOSE PROMPT ---
Python decorators are a way to modify or extend the behavior of a function without changing its actual code directly...
--- CONCISE PROMPT ---
- A decorator wraps a function to add behavior.
- It is written with @decorator_name above a function.
- Common uses include logging, authentication, and timing.
Example:
def log_call(fn):
    def wrapper():
        print("Calling function")
        return fn()
    return wrapper
Reflection Questions
- Which prompt produced the more controlled answer?
- Which one is easier to use in an application UI?
- How would output constraints help reduce cost over many requests?
4. Rate Limits: What They Are and How to Respond
4.1 What Is a Rate Limit?
A rate limit is a temporary cap on API usage. If too many requests are sent too quickly, the server may reject some requests until usage drops.
This is normal and expected in many APIs.
4.2 Common Strategies
When rate-limited:
- do not spam retries immediately
- wait before retrying
- use exponential backoff
- add jitter to avoid retry storms
- log the failure
- keep retry counts bounded
4.3 Exponential Backoff
Exponential backoff means each retry waits longer than the previous one.
Example sequence:
- retry 1: wait 1 second
- retry 2: wait 2 seconds
- retry 3: wait 4 seconds
- retry 4: wait 8 seconds
Adding jitter means adding small randomness, such as:
- 1.2s
- 2.3s
- 4.1s
This helps when many clients retry at the same time.
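The backoff-plus-jitter sequence above can be written as a small helper. This is a sketch; the base, cap, and jitter values are illustrative choices, not fixed requirements.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, jitter: float = 0.5) -> float:
    """Return the wait time for a retry attempt: base * 2^attempt plus jitter, capped.

    attempt is 0-based, so attempt 0 waits ~1s, attempt 1 waits ~2s, and so on.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)


for attempt in range(4):
    print(f"retry {attempt + 1}: wait ~{backoff_delay(attempt):.2f}s")
```

The cap matters: without it, a few retries of exponential growth can produce waits of minutes, which is rarely what you want in an interactive application.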
5. Error Handling Basics in Python API Calls
5.1 Errors You Should Expect
In production-style code, expect at least:
- authentication failures
- bad request errors
- rate limit errors
- API status/server errors
- connection issues
- timeouts
5.2 Principles of Good Error Handling
- catch specific exception types when possible
- avoid hiding errors silently
- log enough context to debug
- retry only transient failures
- fail fast on permanent request problems
- return useful messages to the caller
6. Hands-On Exercise 2: Build a Retry Wrapper with Exponential Backoff
Goal
Create a reusable function that retries transient failures safely.
What this example demonstrates
- clean function design
- retry loop
- exponential backoff with jitter
- basic logging
- bounded retries
Code
import random
import time
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def generate_with_retry(prompt: str, max_retries: int = 4) -> str:
    """
    Send a prompt to the OpenAI Responses API with retry handling.

    Retries transient errors such as:
    - rate limits
    - connection failures
    - temporary server-side status errors

    Args:
        prompt: The input prompt to send to the model.
        max_retries: Maximum number of retry attempts before failing.

    Returns:
        The model's text output.

    Raises:
        Exception: Re-raises the final exception if all retries fail.
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=100
            )
            return response.output_text
        except RateLimitError:
            if attempt == max_retries:
                print(f"[ERROR] Rate limit persisted after {max_retries} retries.")
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            print(
                f"[WARN] Rate limited. Attempt {attempt + 1}/{max_retries}. "
                f"Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
        except APIConnectionError as exc:
            if attempt == max_retries:
                print(f"[ERROR] Connection issue persisted after {max_retries} retries.")
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            print(
                f"[WARN] Connection error: {exc}. "
                f"Attempt {attempt + 1}/{max_retries}. "
                f"Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
        except APIStatusError as exc:
            status_code = exc.status_code
            # Retry only for likely transient server-side failures.
            if status_code in (500, 502, 503, 504) and attempt < max_retries:
                wait_time = (2 ** attempt) + random.uniform(0, 0.5)
                print(
                    f"[WARN] Server error {status_code}. "
                    f"Attempt {attempt + 1}/{max_retries}. "
                    f"Retrying in {wait_time:.2f} seconds..."
                )
                time.sleep(wait_time)
            else:
                print(f"[ERROR] Non-retriable API status error: {status_code}")
                raise
        except Exception as exc:
            # Unexpected errors should usually not be retried blindly.
            print(f"[ERROR] Unexpected error: {exc}")
            raise


if __name__ == "__main__":
    prompt = "Give me a 2-sentence explanation of why retry logic matters in APIs."
    result = generate_with_retry(prompt)
    print("\nModel output:")
    print(result)
Example output
Model output:
Retry logic helps applications recover from temporary failures such as rate limits or network interruptions. It improves reliability by allowing a request to succeed without immediately failing the whole user workflow.
Discussion
This function retries transient failures but does not retry everything.
That is important because:
- invalid requests will not be fixed by retrying
- bad API keys will not be fixed by retrying
- malformed input should fail clearly
7. Common Error Categories and Recommended Actions
| Error Type | Example Cause | Retry? | Recommended Action |
|---|---|---|---|
| Authentication error | bad or missing API key | No | fix credentials |
| Bad request | invalid parameter or malformed input | No | fix code or input |
| Rate limit error | too many requests too quickly | Yes | backoff and retry |
| Connection error | network interruption | Yes | retry with backoff |
| Server error (5xx) | temporary service problem | Usually yes | retry with backoff |
| Timeout | slow network or overloaded system | Usually yes | retry carefully |
| Unexpected exception | bug in app logic | No | inspect logs and fix code |
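The table above can also be encoded directly as data, which keeps the retry decision in one place instead of scattered across except blocks. The category names here are illustrative labels for this sketch, not SDK exception names.

```python
# Retry decisions mirroring the error table: True = transient, worth retrying.
RETRY_POLICY = {
    "authentication_error": False,  # fix credentials
    "bad_request": False,           # fix code or input
    "rate_limit": True,             # backoff and retry
    "connection_error": True,       # retry with backoff
    "server_error": True,           # 5xx: retry with backoff
    "timeout": True,                # retry carefully
}


def should_retry(category: str) -> bool:
    """Unknown categories default to no retry, so new failure modes fail loudly."""
    return RETRY_POLICY.get(category, False)


print(should_retry("rate_limit"))   # True
print(should_retry("bad_request"))  # False
```

Defaulting unknown categories to False is a deliberate choice: it is safer to surface a new failure mode immediately than to retry it blindly.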
8. Hands-On Exercise 3: Build a Safer Request Function with Structured Results
Goal
Return structured success/error information instead of crashing immediately.
This pattern is helpful when your application needs to:
- show user-friendly messages
- log failures centrally
- continue processing other tasks
Code
import random
import time
from typing import Any, Dict
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def safe_generate(prompt: str, max_retries: int = 3) -> Dict[str, Any]:
    """
    Generate text using the OpenAI Responses API and return a structured result.

    Args:
        prompt: Prompt text to send to the model.
        max_retries: Maximum number of retries for transient errors.

    Returns:
        A dictionary with:
        - ok: bool
        - output: str | None
        - error_type: str | None
        - message: str
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=80
            )
            return {
                "ok": True,
                "output": response.output_text,
                "error_type": None,
                "message": "success"
            }
        except RateLimitError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "rate_limit",
                    "message": "Rate limit exceeded after retries."
                }
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(wait_time)
        except APIConnectionError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "connection_error",
                    "message": "Network or connection issue after retries."
                }
            wait_time = (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(wait_time)
        except APIStatusError as exc:
            if exc.status_code in (500, 502, 503, 504):
                if attempt == max_retries:
                    return {
                        "ok": False,
                        "output": None,
                        "error_type": "server_error",
                        "message": f"Server error {exc.status_code} after retries."
                    }
                wait_time = (2 ** attempt) + random.uniform(0, 0.5)
                time.sleep(wait_time)
            else:
                return {
                    "ok": False,
                    "output": None,
                    "error_type": "api_status_error",
                    "message": f"Non-retriable API status error: {exc.status_code}"
                }
        except Exception as exc:
            return {
                "ok": False,
                "output": None,
                "error_type": "unexpected_error",
                "message": str(exc)
            }


if __name__ == "__main__":
    result = safe_generate("Summarize why error handling matters in AI apps in 2 sentences.")
    if result["ok"]:
        print("Success!")
        print(result["output"])
    else:
        print("Request failed.")
        print(f"Type: {result['error_type']}")
        print(f"Message: {result['message']}")
Example output
Success!
Error handling matters in AI apps because network issues, rate limits, and server problems can happen even when your code is correct. Good handling improves reliability and gives users clearer feedback when something goes wrong.
9. Mini Project: Resilient Prompt Runner
Goal
Build a small utility that:
- accepts multiple prompts
- sends them one by one
- retries transient failures
- logs success/failure
- keeps outputs short and predictable
This exercise simulates a small batch-processing tool.
Code
import random
import time
from typing import List, Dict, Any
from openai import OpenAI
from openai import APIConnectionError, APIStatusError, RateLimitError
client = OpenAI()
def robust_generate(prompt: str, max_retries: int = 3, max_output_tokens: int = 80) -> Dict[str, Any]:
    """
    Generate a short response for a prompt with retry handling.

    Args:
        prompt: Text prompt to send.
        max_retries: Number of retries for transient failures.
        max_output_tokens: Upper bound on generated output tokens.

    Returns:
        Dictionary containing success/failure information.
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.responses.create(
                model="gpt-5.4-mini",
                input=prompt,
                max_output_tokens=max_output_tokens
            )
            return {
                "ok": True,
                "prompt": prompt,
                "output": response.output_text,
                "attempts_used": attempt + 1
            }
        except RateLimitError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": "rate_limit"
                }
            time.sleep((2 ** attempt) + random.uniform(0, 0.5))
        except APIConnectionError:
            if attempt == max_retries:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": "connection_error"
                }
            time.sleep((2 ** attempt) + random.uniform(0, 0.5))
        except APIStatusError as exc:
            if exc.status_code in (500, 502, 503, 504) and attempt < max_retries:
                time.sleep((2 ** attempt) + random.uniform(0, 0.5))
            else:
                return {
                    "ok": False,
                    "prompt": prompt,
                    "output": None,
                    "attempts_used": attempt + 1,
                    "error": f"api_status_{exc.status_code}"
                }
        except Exception as exc:
            return {
                "ok": False,
                "prompt": prompt,
                "output": None,
                "attempts_used": attempt + 1,
                "error": f"unexpected: {exc}"
            }


def process_prompts(prompts: List[str]) -> List[Dict[str, Any]]:
    """
    Process a list of prompts sequentially.

    Args:
        prompts: List of input prompts.

    Returns:
        List of result dictionaries.
    """
    results = []
    for index, prompt in enumerate(prompts, start=1):
        print(f"\nProcessing prompt {index}/{len(prompts)}...")
        result = robust_generate(prompt)
        results.append(result)
        if result["ok"]:
            print(" Status : SUCCESS")
            print(f" Attempts: {result['attempts_used']}")
            print(f" Output : {result['output']}")
        else:
            print(" Status : FAILED")
            print(f" Attempts: {result['attempts_used']}")
            print(f" Error : {result['error']}")
    return results


if __name__ == "__main__":
    prompts = [
        "Explain token limits in one sentence.",
        "Why should apps use exponential backoff? Answer in one sentence.",
        "What is a transient API error? Answer in one sentence."
    ]
    all_results = process_prompts(prompts)
    print("\nFinal summary:")
    success_count = sum(1 for item in all_results if item["ok"])
    failure_count = len(all_results) - success_count
    print(f"Successful requests: {success_count}")
    print(f"Failed requests : {failure_count}")
Example output
Processing prompt 1/3...
Status : SUCCESS
Attempts: 1
Output : Token limits cap how much text a model can read and generate in a single request.
Processing prompt 2/3...
Status : SUCCESS
Attempts: 1
Output : Exponential backoff reduces repeated pressure on an API and improves the chance that retries succeed.
Processing prompt 3/3...
Status : SUCCESS
Attempts: 1
Output : A transient API error is a temporary failure, such as a rate limit or brief server issue, that may succeed if retried later.
Final summary:
Successful requests: 3
Failed requests : 0
10. Best Practices Checklist
Use this checklist when building GenAI apps:
Token Management
- Keep prompts concise
- Ask for structured outputs
- Set max_output_tokens when appropriate
- Trim or summarize long history
- Include only relevant context
Rate Limit Handling
- expect temporary rate limits
- retry with exponential backoff
- add jitter
- cap retry attempts
- avoid aggressive loops
Error Handling
- catch specific exceptions
- retry only transient failures
- fail clearly for invalid requests
- log useful context
- return structured error info to callers
11. Common Mistakes to Avoid
Mistake 1: Retrying every error
Do not retry:
- bad API key
- malformed input
- invalid parameter names
These are code/configuration problems, not transient problems.
Mistake 2: No output limits
If you do not constrain outputs, responses may become:
- longer than needed
- more expensive
- harder to display cleanly
Mistake 3: Swallowing exceptions silently
This makes debugging difficult. At minimum, log:
- what operation failed
- which prompt or request triggered it
- what exception occurred
Mistake 4: Sending unnecessary context
Large prompts increase token usage, raise cost, and can slow responses.
12. Quick Knowledge Check
Answer these questions before moving on:
- Why is concise prompting often better than verbose prompting?
- What problem does exponential backoff solve?
- Which errors should generally not be retried?
- Why might structured error results be useful in an application?
- How does max_output_tokens help control behavior?
13. Wrap-Up
In this session, you learned how to make GenAI applications more practical and robust by managing:
- tokens for cost and output control
- rate limits through bounded retries and backoff
- errors using targeted exception handling and structured results
These skills are essential for moving from simple demos to dependable applications.
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs overview: https://platform.openai.com/docs
- Python SDK usage: https://github.com/openai/openai-python
- Python time module: https://docs.python.org/3/library/time.html
- Python random module: https://docs.python.org/3/library/random.html
- Python exception handling: https://docs.python.org/3/tutorial/errors.html
Suggested Homework
- Modify the retry wrapper to log timestamps for each retry.
- Add a parameter that lets users choose between short, medium, and long response styles.
- Extend the mini project to save results to a JSON file.
- Add prompt truncation or summarization before sending very large inputs.
- Create a CLI script that reads prompts from a text file and processes them safely.