
Session 3: Choosing the Right Model for the Task

Synopsis

Explores how to select models based on task requirements such as accuracy, reasoning ability, speed, context window, structured-output reliability, and pricing. This session builds practical judgment for balancing performance and cost.

Session Content

Session Overview

In this session, learners will explore how to choose the most appropriate model for different GenAI use cases. The focus is on understanding practical trade-offs such as quality, speed, cost, context window, and tool usage. By the end of the session, learners will be able to evaluate a task and select a model intentionally rather than defaulting to a single model for everything.

Duration: ~45 minutes

Learning Objectives

By the end of this session, learners should be able to:

  • Explain the main factors involved in model selection.
  • Compare models based on reasoning ability, latency, cost, and task fit.
  • Recognize when a smaller/faster model is sufficient.
  • Use the OpenAI Python SDK with the Responses API to test models on the same prompt.
  • Build a simple evaluation harness to compare outputs across tasks.

Agenda

  1. Why model choice matters
  2. Core model-selection criteria
  3. Matching model types to task categories
  4. Hands-on: compare models on the same task
  5. Hands-on: build a lightweight evaluation script
  6. Wrap-up and model selection checklist

1. Why Model Choice Matters

Many beginners start by using a single model for every task. While this works for experimentation, it is often suboptimal in production.

Different tasks have different requirements:

  • Fast customer support classification may need low latency and low cost.
  • Complex multi-step reasoning may need stronger reasoning capability.
  • Long-document analysis may need a larger context window.
  • Structured extraction may benefit from reliable instruction following and predictable output formatting.
  • Agentic workflows may need strong tool use, planning, and function-calling behavior.

Common Trade-offs

When choosing a model, you are usually balancing:

  • Quality: How good is the answer?
  • Latency: How quickly does it return?
  • Cost: How expensive is each request?
  • Reliability: Does it follow instructions consistently?
  • Context length: Can it handle large inputs?
  • Tool use: Can it effectively decide when and how to call tools?

Key Principle

Do not ask, “What is the best model?”
Ask, “What is the best model for this specific task and constraint set?”


2. Core Model-Selection Criteria

2.1 Task Complexity

Some tasks are simple:

  • Sentiment classification
  • Short summarization
  • Keyword extraction
  • Rewriting text in a specific tone

Some tasks are more complex:

  • Multi-step planning
  • Code generation with constraints
  • Long-context synthesis
  • Agent orchestration
  • Ambiguous reasoning tasks

As task complexity increases, stronger models often perform better.


2.2 Latency Requirements

Latency matters when:

  • Users are waiting in a chat UI
  • You are processing high volumes of requests
  • Your application has strict response-time SLAs

For interactive systems:

  • Lower latency often improves user experience.
  • A slightly weaker but faster model may be preferred.

For offline workflows:

  • Batch jobs can often tolerate slower responses if quality is significantly better.

2.3 Cost Sensitivity

If your application serves many users or processes many documents, cost can dominate architectural decisions.

Examples:

  • A daily batch pipeline over 100,000 support tickets
  • Automatic tagging for thousands of products
  • Large-scale monitoring and triage systems

A common strategy:

  • Use a smaller/faster model for most requests.
  • Escalate only difficult cases to a more capable model.

This is often called routing or tiered inference.
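
A minimal sketch of tiered inference, assuming a hypothetical `looks_confident` heuristic and a caller-supplied `ask_model(model_name, prompt) -> str` function (in practice, a thin wrapper around `client.responses.create(...).output_text`). The model names mirror those used elsewhere in this session:

```python
def looks_confident(answer: str) -> bool:
    """Toy heuristic: treat empty or heavily hedged answers as low confidence.
    Real systems use richer signals (validation results, logprobs, classifiers)."""
    hedges = ("i'm not sure", "i am not sure", "it depends", "unclear")
    text = answer.strip().lower()
    return bool(text) and not any(h in text for h in hedges)


def route(prompt: str, ask_model) -> tuple[str, str]:
    """Try the small model first; escalate only when the cheap answer looks weak.
    Returns (model_used, answer)."""
    small, large = "gpt-5.4-mini", "gpt-5.4"
    answer = ask_model(small, prompt)
    if looks_confident(answer):
        return small, answer
    return large, ask_model(large, prompt)
```

Because most requests never reach the larger model, average cost and latency stay close to the small-model baseline while hard cases still get the stronger tier.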


2.4 Output Reliability and Structure

If your application needs:

  • JSON output
  • Strict classification labels
  • Deterministic formatting
  • Tool calling with consistent arguments

then instruction-following and formatting reliability may matter more than “creative quality.”

Good model choice depends on what failure looks like in your system.

For example:

  • In marketing copy generation, some variation is acceptable.
  • In an invoice extraction pipeline, malformed output can break the workflow.

2.5 Context Window

Some tasks require sending large inputs:

  • Entire PDFs
  • Meeting transcripts
  • Long conversation histories
  • Multi-document comparisons

Questions to ask:

  • Can the model handle the full input?
  • Should you chunk the document instead?
  • Can you preprocess before sending to the model?

A larger context window can simplify workflows, but it may also increase cost and latency.
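
One common preprocessing option is to split an oversized input into overlapping chunks and process each chunk separately. A rough sketch, using character counts as a crude stand-in for tokens (the window and overlap sizes below are arbitrary examples; production systems usually chunk by tokens or by document structure):

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.
    The overlap reduces the chance that a sentence relevant to the task
    is cut in half at a chunk boundary."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```

Each chunk can then be summarized or analyzed independently, with a final pass combining the per-chunk results.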


2.6 Tool Use and Agentic Behavior

For agentic applications, you should care about whether the model can:

  • Understand when a tool is needed
  • Select the correct tool
  • Produce good tool arguments
  • Continue reasoning after tool results return

Examples:

  • Calendar assistants
  • Retrieval-augmented workflows
  • Data-analysis agents
  • Customer support agents with backend APIs

In these cases, model selection includes more than text quality.


3. Matching Model Types to Task Categories

This section gives a practical mental model for choosing among models.

3.1 Small/Fast Models

Best for:

  • Classification
  • Routing
  • Basic extraction
  • Rewriting
  • Simple Q&A
  • High-volume automation

Benefits:

  • Lower cost
  • Lower latency
  • Good enough for many structured tasks

Risks:

  • May struggle with nuanced reasoning
  • May be less robust on ambiguous prompts

3.2 More Capable Models

Best for:

  • Complex reasoning
  • Multi-step generation
  • Detailed analysis
  • Difficult coding tasks
  • Agentic workflows with planning/tool use

Benefits:

  • Better quality on hard tasks
  • Better reasoning and instruction following
  • Better at handling ambiguity

Risks:

  • Higher cost
  • Potentially slower responses

3.3 Practical Rule of Thumb

Start with these questions:

  1. Is the task simple or complex?
  2. Is latency critical?
  3. Is cost critical?
  4. Is the output highly structured?
  5. Does the workflow require tools?
  6. Do we need long-context handling?

Then test at least two models on realistic prompts and compare results empirically.


4. Hands-On Exercise 1: Compare Models on the Same Prompt

Goal

Run the same prompt through different models and compare quality, speed, and practical usefulness.

What You Will Learn

  • How to call the OpenAI Responses API from Python
  • How to test multiple models with the same input
  • How to inspect outputs side by side

Setup

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

Example 1: Basic Model Comparison Script

"""
Session 3 - Exercise 1
Compare multiple models on the same prompt using the OpenAI Responses API.

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

from time import perf_counter
from openai import OpenAI

# Create the API client. The API key is read from the OPENAI_API_KEY environment variable.
client = OpenAI()

# We will compare models on the exact same prompt.
# Note: gpt-5.4-mini is required for this course and is included below.
MODELS = [
    "gpt-5.4-mini",
    # Add other models available in your environment/account for comparison.
    # Example:
    # "gpt-5.4"
]

PROMPT = """
You are helping a project manager.
Summarize the following update in 3 bullet points, then list 2 risks.

Update:
The mobile app redesign is mostly complete. The login page, dashboard,
and notifications screen have passed QA. However, the billing page still
has known issues on smaller Android devices. The analytics integration
was delayed because the vendor changed their API. The release is still
planned for next Friday, but the team may need to disable analytics in
the first version if the integration is not stable.
""".strip()


def get_text_response(model_name: str, prompt: str) -> tuple[str, float]:
    """
    Send a prompt to a model and return:
      - The extracted text output
      - The elapsed time in seconds
    """
    start = perf_counter()

    response = client.responses.create(
        model=model_name,
        input=prompt
    )

    elapsed = perf_counter() - start

    # output_text provides a convenient text aggregation for standard text responses.
    text = response.output_text
    return text, elapsed


def main() -> None:
    print("=== Model Comparison ===\n")

    for model in MODELS:
        print(f"--- Testing model: {model} ---")
        try:
            text, elapsed = get_text_response(model, PROMPT)
            print(f"Elapsed time: {elapsed:.2f} seconds")
            print("Response:")
            print(text)
        except Exception as exc:
            print(f"Error while calling model '{model}': {exc}")

        print("\n" + "=" * 60 + "\n")


if __name__ == "__main__":
    main()

Example Output

=== Model Comparison ===

--- Testing model: gpt-5.4-mini ---
Elapsed time: 1.42 seconds
Response:
- The mobile app redesign is nearly complete, with the login page, dashboard, and notifications screen already passing QA.
- The billing page still has unresolved issues on smaller Android devices.
- The release is planned for next Friday, though analytics integration is delayed due to a vendor API change.

Risks:
1. Billing page issues may affect usability on smaller Android devices at launch.
2. Analytics may need to be disabled in the initial release if integration remains unstable.

============================================================

Discussion Prompts

After running the script, discuss:

  • Which model gave the clearest summary?
  • Which model was fastest?
  • Did any model miss an important risk?
  • Which output would you trust in a production workflow?

5. Theory Interlude: Evaluation Before Standardization

Before choosing a default model, evaluate on a small representative task set.

A good evaluation set includes:

  • Easy tasks
  • Medium tasks
  • Edge cases
  • Ambiguous examples
  • Inputs that resemble production data

Example Evaluation Dimensions

You can score outputs on:

  • Correctness
  • Completeness
  • Clarity
  • Format adherence
  • Latency
  • Cost estimate

Even a manual score from 1 to 5 is useful early on.


6. Hands-On Exercise 2: Build a Lightweight Evaluation Harness

Goal

Create a small Python script that runs multiple prompts against multiple models and records outputs for review.

What You Will Learn

  • How to structure task-based comparisons
  • How to collect results consistently
  • How to build the foundation for model benchmarking

Example 2: Simple Evaluation Harness

"""
Session 3 - Exercise 2
A lightweight evaluation harness for comparing models on multiple tasks.

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import csv
from time import perf_counter
from openai import OpenAI

client = OpenAI()

MODELS = [
    "gpt-5.4-mini",
    # Add more models if available to your account/environment.
]

TASKS = [
    {
        "task_name": "classification",
        "prompt": (
            "Classify the sentiment of this customer message as Positive, Neutral, or Negative. "
            "Reply with only one label.\n\n"
            "Message: I like the new dashboard, but it takes too long to load."
        ),
    },
    {
        "task_name": "structured_extraction",
        "prompt": (
            "Extract the following information and return valid JSON with keys: "
            "name, company, meeting_date.\n\n"
            "Text: Sarah Chen from BrightPath Analytics confirmed the meeting for March 18."
        ),
    },
    {
        "task_name": "summarization",
        "prompt": (
            "Summarize this in 2 sentences:\n\n"
            "Our infrastructure migration reduced server costs by 18 percent, "
            "but deployment times increased because the CI pipeline now runs extra checks. "
            "The team plans to optimize the pipeline over the next sprint."
        ),
    },
]


def run_task(model_name: str, prompt: str) -> tuple[str, float]:
    """
    Execute a single prompt against a model and return:
      - text output
      - elapsed time in seconds
    """
    start = perf_counter()

    response = client.responses.create(
        model=model_name,
        input=prompt
    )

    elapsed = perf_counter() - start
    return response.output_text, elapsed


def main() -> None:
    output_file = "model_eval_results.csv"

    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(
            csvfile,
            fieldnames=["model", "task_name", "latency_seconds", "output"]
        )
        writer.writeheader()

        for model in MODELS:
            for task in TASKS:
                print(f"Running model={model}, task={task['task_name']}")

                try:
                    output_text, elapsed = run_task(model, task["prompt"])
                    writer.writerow(
                        {
                            "model": model,
                            "task_name": task["task_name"],
                            "latency_seconds": f"{elapsed:.2f}",
                            "output": output_text,
                        }
                    )
                except Exception as exc:
                    writer.writerow(
                        {
                            "model": model,
                            "task_name": task["task_name"],
                            "latency_seconds": "ERROR",
                            "output": f"ERROR: {exc}",
                        }
                    )

    print(f"\nEvaluation complete. Results saved to: {output_file}")


if __name__ == "__main__":
    main()

Example Output in Terminal

Running model=gpt-5.4-mini, task=classification
Running model=gpt-5.4-mini, task=structured_extraction
Running model=gpt-5.4-mini, task=summarization

Evaluation complete. Results saved to: model_eval_results.csv

Example CSV Output

model,task_name,latency_seconds,output
gpt-5.4-mini,classification,0.88,Neutral
gpt-5.4-mini,structured_extraction,1.02,"{""name"": ""Sarah Chen"", ""company"": ""BrightPath Analytics"", ""meeting_date"": ""March 18""}"
gpt-5.4-mini,summarization,1.11,"The infrastructure migration reduced server costs by 18 percent. Deployment times increased due to extra CI checks, and the team plans to optimize the pipeline next sprint."

7. Hands-On Exercise 3: Add Simple Manual Scoring

Goal

Add a human-review step to make model comparison more systematic.

Why This Matters

Automated benchmarking is helpful, but many real-world criteria still need human judgment:

  • Is the answer genuinely useful?
  • Is the format easy to consume downstream?
  • Did the model capture subtle issues?
  • Would a user trust this response?

Example 3: Evaluation Results with Reviewer Scores

"""
Session 3 - Exercise 3
Add simple manual scoring fields to the evaluation pipeline.

This script writes outputs to a CSV file and includes blank columns
for a human reviewer to score later.

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import csv
from time import perf_counter
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-5.4-mini"]

TASKS = [
    {
        "task_name": "support_reply",
        "prompt": (
            "Write a short, polite reply to this customer:\n\n"
            "I was charged twice for my subscription this month and I need help immediately."
        ),
    },
    {
        "task_name": "bug_summary",
        "prompt": (
            "Summarize this bug report in 2 bullet points:\n\n"
            "Users on iOS 17 report that after uploading a profile picture, the app freezes on "
            "the confirmation screen. Restarting the app usually resolves the freeze, but the "
            "uploaded image is sometimes lost."
        ),
    },
]


def call_model(model_name: str, prompt: str) -> tuple[str, float]:
    """Call the model and return text output and latency."""
    start = perf_counter()
    response = client.responses.create(
        model=model_name,
        input=prompt
    )
    elapsed = perf_counter() - start
    return response.output_text, elapsed


def main() -> None:
    output_file = "scored_eval_template.csv"

    with open(output_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(
            file,
            fieldnames=[
                "model",
                "task_name",
                "latency_seconds",
                "output",
                "score_correctness",
                "score_clarity",
                "score_format",
                "reviewer_notes",
            ],
        )
        writer.writeheader()

        for model in MODELS:
            for task in TASKS:
                print(f"Evaluating {model} on {task['task_name']}...")
                try:
                    output_text, elapsed = call_model(model, task["prompt"])
                    writer.writerow(
                        {
                            "model": model,
                            "task_name": task["task_name"],
                            "latency_seconds": f"{elapsed:.2f}",
                            "output": output_text,
                            "score_correctness": "",
                            "score_clarity": "",
                            "score_format": "",
                            "reviewer_notes": "",
                        }
                    )
                except Exception as exc:
                    writer.writerow(
                        {
                            "model": model,
                            "task_name": task["task_name"],
                            "latency_seconds": "ERROR",
                            "output": f"ERROR: {exc}",
                            "score_correctness": "",
                            "score_clarity": "",
                            "score_format": "",
                            "reviewer_notes": "",
                        }
                    )

    print(f"Saved review template to {output_file}")


if __name__ == "__main__":
    main()

Suggested Review Rubric

Score each from 1 to 5:

  • Correctness: Is the output factually aligned with the prompt?
  • Clarity: Is it easy to understand?
  • Format: Did it follow the requested format?
  • Notes: What was strong or weak?

8. Patterns for Real-World Model Selection

Pattern 1: Use a Small Model First

Use a smaller/faster model for:

  • Triage
  • Tagging
  • Simple extraction
  • “Good enough” first-pass summaries

Escalate only when needed.

Example escalation rules:

  • The model says confidence is low
  • Output fails validation
  • Input exceeds complexity threshold
  • User asks a complex follow-up

Pattern 2: Route by Task Type

Different tasks use different default models:

  • Classification → small/fast model
  • Summarization → mid-tier or fast model depending on quality needs
  • Complex planning → more capable model
  • Tool-heavy agent loop → stronger reasoning/tool-use model
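
Routing by task type can be as simple as a lookup table. A sketch, where the task categories and tier assignments are illustrative defaults (not recommendations), using the model names from this session:

```python
# Illustrative mapping from task category to a default model tier.
# Substitute whatever models your account actually has access to.
DEFAULT_MODEL_BY_TASK = {
    "classification": "gpt-5.4-mini",
    "summarization": "gpt-5.4-mini",
    "planning": "gpt-5.4",
    "agent": "gpt-5.4",
}


def model_for(task_type: str, fallback: str = "gpt-5.4") -> str:
    """Pick a default model for a task type, falling back to the
    stronger tier for unknown task types."""
    return DEFAULT_MODEL_BY_TASK.get(task_type, fallback)
```

Keeping the mapping in one place makes it easy to revisit tier assignments after each round of evaluation.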

Pattern 3: Validate Structured Outputs

When you need structured data:

  • Constrain output instructions clearly
  • Validate returned JSON
  • Retry or escalate on malformed output

Model choice should consider not only intelligence but also consistency.
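
A minimal sketch of the validate-and-retry loop, assuming a caller-supplied `ask_model(prompt) -> str` function (hypothetical; in practice it would wrap `client.responses.create`):

```python
import json


def get_valid_json(ask_model, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, check that the reply parses as a JSON object,
    and retry with a corrective instruction on failure.
    Escalating to a stronger model on the final attempt is another option."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = ask_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                return data
        except json.JSONDecodeError:
            pass
        # Append a corrective instruction and try again.
        attempt_prompt = prompt + "\n\nReturn ONLY a valid JSON object."
    raise ValueError("Model did not return valid JSON after retries.")
```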


Pattern 4: Benchmark With Real Inputs

Toy prompts can be misleading.

Always test with:

  • Real support tickets
  • Actual internal notes
  • Production-like documents
  • Realistic user questions

The more realistic your evaluation set, the better your model decisions.


9. Mini Case Study

Scenario

You are building an internal assistant for a software company. It has three responsibilities:

  1. Classify incoming support tickets
  2. Summarize incident reports
  3. Help an operations agent investigate tricky issues with multiple tools

Likely Strategy

  • Support ticket classification: small/fast model
  • Incident summarization: evaluate fast vs more capable models depending on report complexity
  • Operations investigation agent: more capable model for tool use and multi-step reasoning

Why This Works

Not every part of a system needs the same model.
A good architecture often combines models based on task requirements.


10. Common Mistakes

Mistake 1: Using the strongest model for everything

This can create unnecessary cost and latency.

Mistake 2: Choosing only by benchmark reputation

A model may score well generally but still be a poor fit for your exact workflow.

Mistake 3: Ignoring output format consistency

A “smart” answer that breaks your parser is still a failure.

Mistake 4: Evaluating on too few examples

One or two prompts are not enough to choose a production model.

Mistake 5: Forgetting user experience

If your app feels slow, users may prefer a slightly weaker but faster model.


11. Practical Checklist for Choosing a Model

Use this checklist before selecting a model:

  • What is the exact task?
  • How important is accuracy?
  • How important is speed?
  • How important is cost?
  • Does the task require structured output?
  • Does the task require tool use?
  • How long are the inputs?
  • What does failure look like?
  • Can a smaller model handle most requests?
  • Do we need fallback or escalation logic?

12. In-Session Activity

Pair Exercise

Work in pairs and choose one of these tasks:

  • Generate a support reply
  • Extract structured data from text
  • Summarize a long status update
  • Classify feedback sentiment

For the chosen task:

  1. Define what matters most: quality, speed, cost, or format.
  2. Write 3 realistic prompts.
  3. Test the task with gpt-5.4-mini and, if another model is available to you, run the same prompts against it for comparison.
  4. Discuss which model you would choose and why.

13. Summary

In this session, you learned that model selection is not about picking one universally “best” model. It is about choosing the right tool for the job based on:

  • Task complexity
  • Response quality
  • Latency
  • Cost
  • Context size
  • Structure requirements
  • Tool-use needs

You also practiced comparing models using the OpenAI Responses API and built a lightweight evaluation workflow in Python.


14. Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompting guide: https://platform.openai.com/docs/guides/prompt-engineering

15. Homework

Task

Create a small benchmark for a use case you care about.

Requirements

  • Use gpt-5.4-mini
  • Include at least 5 prompts
  • Cover at least 3 task types
  • Save outputs to CSV
  • Add manual scoring columns
  • Write 5 sentences summarizing which model characteristics mattered most

Stretch Goal

Add simple automatic validation, such as:

  • Check if classification output matches allowed labels
  • Check if JSON parses successfully
  • Check if the response contains exactly the requested number of bullet points
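
These stretch-goal checks can be sketched as small validators. The allowed labels and the `-` bullet style below are examples taken from this session's tasks; adapt them to your own prompts:

```python
import json

# Example label set matching the sentiment task used earlier in this session.
ALLOWED_LABELS = {"Positive", "Neutral", "Negative"}


def valid_label(output: str) -> bool:
    """Check that a classification reply is exactly one allowed label."""
    return output.strip() in ALLOWED_LABELS


def valid_json(output: str) -> bool:
    """Check that the reply parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_bullet_count(output: str, expected: int) -> bool:
    """Check that the reply contains exactly the requested number of '-' bullets."""
    bullets = [line for line in output.splitlines() if line.lstrip().startswith("-")]
    return len(bullets) == expected
```

Validators like these can feed directly into the CSV from Exercise 2, turning a manual review into a partially automated one.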

16. Next Session Preview

In the next session, we will explore prompt engineering fundamentals, including how prompt structure, examples, constraints, and formatting instructions can dramatically improve model output quality.

