Session 2: Understanding Large Language Models
Synopsis
Provides a high-level explanation of how LLMs are trained and how they generate outputs. Learners become familiar with concepts such as tokens, context windows, inference, reasoning patterns, and model limitations without diving into mathematical depth.
Session Content
Session Overview
In this session, learners will build a practical understanding of what Large Language Models (LLMs) are, how they work at a conceptual level, what they are good at, and where they fail. The goal is not deep mathematical theory, but a Python-developer-friendly mental model that helps learners use LLMs effectively in applications.
By the end of this session, learners will be able to:
- Explain what an LLM is in simple technical terms
- Describe tokens, context windows, and next-token prediction
- Understand why prompting matters
- Recognize common strengths and limitations of LLMs
- Use the OpenAI Responses API with Python to interact with an LLM
- Experiment with prompt design and compare outputs systematically
Session Duration
~45 minutes
Suggested breakdown:
- 5 min: Introduction and motivation
- 12 min: Core concepts of LLMs
- 8 min: How LLMs generate text
- 8 min: Strengths, limitations, and failure modes
- 10 min: Hands-on exercises with Python and the Responses API
- 2 min: Recap
1. Introduction: What Is a Large Language Model?
A Large Language Model is a machine learning model trained on vast amounts of text to predict and generate language.
At a practical level, an LLM:
- Reads input text
- Identifies patterns from training
- Predicts what text should come next
- Produces useful outputs such as explanations, summaries, code, classifications, or structured data
For developers, the most important thing to understand is this:
An LLM does not “think” like a human. It predicts likely continuations of text based on patterns it has learned.
This simple idea explains a lot of both its power and its limitations.
Why this matters for developers
If you are building applications with LLMs, you are not writing traditional deterministic logic. Instead, you are:
- Providing instructions in natural language
- Supplying context
- Constraining outputs
- Evaluating probabilistic behavior
This means application design shifts from only writing code to also designing prompts, context, evaluation, and guardrails.
2. Core Concepts
2.1 Tokens
LLMs do not process text as whole sentences. They process tokens, which are chunks of text.
A token may be:
- A word
- Part of a word
- Punctuation
- Whitespace-related chunks
- Code fragments
For example, the sentence:
Python developers love clean code.
might be split into tokens such as "Python", " developers", " love", " clean", " code", and "." rather than exactly one token per word.
Why tokens matter
Tokens affect:
- Input size limits
- Cost
- Latency
- How much context the model can consider at once
When building applications, you should think in terms of token budgets, not just characters or words.
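To make token budgets concrete, here is a minimal sketch using the common (and very rough) rule of thumb of about 4 characters per English token. This is only an estimate for planning purposes; a real tokenizer such as tiktoken gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate using the ~4 characters-per-token
    rule of thumb for English text. Real tokenizers vary."""
    return max(1, round(len(text) / 4))


def fits_in_budget(text: str, budget_tokens: int) -> bool:
    """Check whether a text roughly fits within a token budget."""
    return estimate_tokens(text) <= budget_tokens


sentence = "Python developers love clean code."
print(estimate_tokens(sentence))      # roughly 8-9 estimated tokens for 34 characters
print(fits_in_budget(sentence, 100))  # True
```

Thinking in estimates like this is enough for early prototyping; switch to exact token counting before you rely on hard limits.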
2.2 Context Window
The context window is the amount of text the model can consider in a single request.
This includes:
- Your instructions
- User input
- Retrieved documents
- Conversation history
- The model’s generated output
Why context matters
If the important information is missing, the model cannot use it.
If the context is noisy, conflicting, or too large, quality may drop.
Good LLM applications carefully choose what goes into the prompt.
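One common pattern is trimming conversation history so that the prompt stays inside the window. The sketch below keeps the system prompt plus as many of the most recent turns as fit; the 4-characters-per-token estimate is an assumption for illustration, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)


def build_context(system_prompt: str, history: list[str], budget_tokens: int) -> str:
    """Keep the system prompt plus as many of the newest turns as fit the budget."""
    remaining = budget_tokens - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return "\n".join([system_prompt] + list(reversed(kept)))


history = ["User: hi", "Bot: hello", "User: explain tokens please"]
context = build_context("You are a helpful assistant.", history, budget_tokens=50)
print(context)
```

Dropping the oldest turns first is the simplest policy; real applications often summarize old turns instead of discarding them.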
2.3 Next-Token Prediction
At the heart of an LLM is a surprisingly simple training objective:
Given previous tokens, predict the next token.
For example:
- Input: "The capital of France is"
- Likely next token: "Paris"
By repeating this process token by token, the model can generate:
- Answers
- Explanations
- Stories
- Code
- JSON
- Summaries
Important mental model
LLMs do not retrieve “facts” the way a database does. They generate outputs based on learned statistical patterns.
That is why they can sound fluent even when they are wrong.
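The training objective can be illustrated with a toy model that is nothing like a real transformer but captures the loop: count which token follows which in a tiny corpus, then greedily pick the most frequent continuation, token by token:

```python
from collections import Counter, defaultdict

corpus = "the capital of france is paris . the capital of italy is rome .".split()

# Count which token follows each token in the corpus.
follows: defaultdict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1


def predict_next(token: str) -> str:
    """Greedy prediction: return the most frequent observed continuation."""
    return follows[token].most_common(1)[0][0]


# Generate a few tokens autoregressively, starting from "the".
token = "the"
generated = [token]
for _ in range(5):
    token = predict_next(token)
    generated.append(token)
print(" ".join(generated))
```

Real LLMs use learned neural representations instead of raw counts, but the generation loop is the same shape: predict one token, append it, repeat.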
2.4 Parameters and “Large”
The word “large” in LLM refers to the scale of the model, especially the number of learned parameters.
You do not need to understand the full math to use LLMs effectively, but conceptually:
- Parameters are the learned internal values of the model
- More parameters generally allow richer pattern representation
- Scale often improves capability, but not perfectly
Bigger models are often more capable, but they may also be:
- More expensive
- Slower
- Still imperfect
3. How LLMs Generate Text
3.1 Prompt In, Tokens Out
When you send a prompt to an LLM, the model:
- Reads the input tokens
- Builds an internal representation of the context
- Predicts a likely next token
- Repeats until a stop condition is reached (an end-of-sequence token, a length limit, or a stop sequence)
This process is autoregressive generation.
3.2 Why Prompting Matters
Prompting matters because the model is highly sensitive to:
- Wording
- Specificity
- Examples
- Output format instructions
- Role or task framing
Compare these prompts:
Weak prompt
Tell me about Python.
Better prompt
Explain Python to a beginner web developer in 5 bullet points.
Focus on syntax readability, ecosystem, and common use cases.
The second prompt gives the model stronger constraints, which usually leads to more useful output.
3.3 Deterministic vs Probabilistic Behavior
Traditional software often behaves deterministically:
- Same input
- Same logic
- Same output
LLMs are probabilistic:
- Same input may produce variation
- Small wording changes may affect results
- Output quality depends heavily on context and instruction design
This does not make them unreliable by default, but it means you must design for variability.
4. What LLMs Are Good At
LLMs are especially good at tasks involving language patterns, including:
- Summarization
- Rewriting
- Classification
- Information extraction
- Code generation
- Q&A over provided context
- Brainstorming
- Translation
- Drafting structured text
Developer takeaway
Use LLMs when the task benefits from flexible language understanding or generation.
Examples:
- Summarize a support ticket
- Extract fields from user messages
- Convert free text into structured JSON
- Draft documentation from code comments
- Explain an error message to a user
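Tasks like "convert free text into structured JSON" pair a format-constrained prompt with strict parsing in code. This sketch shows the parsing side; the field names and the sample reply are illustrative, not a fixed schema:

```python
import json


def parse_extraction(raw: str) -> dict:
    """Parse a model's JSON reply and enforce the fields we asked for.

    Raises ValueError if the output does not match the expected shape,
    so downstream code never sees malformed data.
    """
    data = json.loads(raw)
    required = {"product", "sentiment"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


# Pretend this string came back from the model.
model_reply = '{"product": "SuperWidget 3000", "sentiment": "negative"}'
print(parse_extraction(model_reply))
```

The division of labor is the key idea: the LLM handles the messy language, and deterministic code enforces the contract.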
5. Limitations and Failure Modes
Understanding limitations is essential when building real systems.
5.1 Hallucinations
A hallucination occurs when the model generates plausible-sounding but false information.
Example:
- Inventing a library function that does not exist
- Making up a citation
- Misstating a fact confidently
Why it happens
Because the model is optimizing for plausible text generation, not guaranteed truth.
Mitigation strategies
- Provide trusted context
- Ask for grounded answers
- Use retrieval when facts matter
- Validate outputs in code
- Avoid blind trust
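"Validate outputs in code" can be as simple as checking that a model-suggested name actually exists before trusting it. A small sketch for the invented-library-function case (the module and attribute names here are just examples):

```python
import importlib


def function_exists(module_name: str, function_name: str) -> bool:
    """Return True only if the module imports and exposes a callable attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, function_name, None))


# A real function vs. a plausible-sounding hallucination.
print(function_exists("json", "loads"))        # True
print(function_exists("json", "load_string"))  # False: no such function
```

The same pattern generalizes: whenever a model names a file, API route, or config key, cheap deterministic checks catch hallucinations before they reach users.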
5.2 Prompt Sensitivity
Small prompt changes can produce different results.
This means:
- Prompt design matters
- Testing matters
- Evaluation matters
A prompt that works once is not enough. You want prompts that work consistently across many cases.
5.3 Context Dependence
The model can only use what is in the current context plus what it learned during training.
If your application needs:
- Fresh news
- Private company data
- Product catalogs
- User-specific state
you must provide that information explicitly through system design.
5.4 Overconfidence
LLMs often respond fluently and confidently even when uncertain.
This is one reason why polished language should never be confused with correctness.
5.5 Non-Deterministic Output
Outputs may vary in:
- Wording
- Structure
- Level of detail
- Completeness
If you need stable downstream processing, constrain outputs clearly and validate them.
6. A Practical Mental Model for Python Developers
Think of an LLM as a powerful text transformation engine.
You provide:
- Instructions
- Context
- Examples
- Output constraints
The model returns:
- Generated text
- Structured responses
- Candidate reasoning artifacts
- Reformatted or extracted information
Analogy
A traditional function might look like this:

```python
result = transform(data)
```

An LLM-powered function feels more like this:

```python
result = transform_with_probabilistic_language_model(
    instructions="Summarize this bug report in 3 bullet points",
    context=data,
    constraints="Use plain English",
)
```
The interface is simple, but quality depends on prompt and context design.
7. Hands-On Setup
7.1 Install the OpenAI Python SDK
pip install openai
7.2 Set Your API Key
macOS/Linux
export OPENAI_API_KEY="your_api_key_here"
Windows PowerShell
setx OPENAI_API_KEY "your_api_key_here"
After setting it, restart your terminal if needed.
8. Hands-On Exercise 1: Your First LLM Call with the Responses API
Goal
Make a basic request to the gpt-5.4-mini model and print the response.
Code
"""
Exercise 1: Basic call to the OpenAI Responses API.
What this demonstrates:
- Creating an OpenAI client
- Sending a simple prompt to gpt-5.4-mini
- Reading the text output safely
Requirements:
- pip install openai
- Set OPENAI_API_KEY in your environment
"""
from openai import OpenAI
def main() -> None:
# Create the API client. The SDK will automatically read OPENAI_API_KEY.
client = OpenAI()
# Send a simple prompt using the Responses API.
response = client.responses.create(
model="gpt-5.4-mini",
input="In 3 short bullet points, explain what a large language model is."
)
# output_text is the easiest way to get the generated text from the response.
print("Model response:\n")
print(response.output_text)
if __name__ == "__main__":
main()
Example output
Model response:
- A large language model is an AI system trained on huge amounts of text data.
- It predicts and generates text based on patterns learned during training.
- It can help with tasks like answering questions, summarizing, writing, and coding.
Discussion
Observe that:
- The code is very small
- The behavior is powerful but probabilistic
- The quality depends on prompt wording
9. Hands-On Exercise 2: Compare Weak vs Strong Prompts
Goal
See how prompt quality affects output quality.
Code
"""
Exercise 2: Compare a weak prompt with a stronger, more constrained prompt.
This helps learners see that LLM output quality often improves when:
- The audience is specified
- The format is constrained
- The goal is clearly stated
"""
from openai import OpenAI
def get_response(client: OpenAI, prompt: str) -> str:
"""Send a prompt to the model and return plain text output."""
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt
)
return response.output_text
def main() -> None:
client = OpenAI()
weak_prompt = "Tell me about Python."
strong_prompt = (
"Explain Python to a beginner programmer in exactly 5 bullet points. "
"Cover readability, common use cases, libraries, learning curve, "
"and why it is popular."
)
print("=== Weak Prompt ===")
print(weak_prompt)
print()
print(get_response(client, weak_prompt))
print("\n" + "=" * 60 + "\n")
print("=== Strong Prompt ===")
print(strong_prompt)
print()
print(get_response(client, strong_prompt))
if __name__ == "__main__":
main()
Example output
=== Weak Prompt ===
Tell me about Python.
Python is a high-level programming language known for its readability and simplicity. It is widely used in web development, data science, automation, artificial intelligence, and more. Python has a large community and a rich ecosystem of libraries.
============================================================
=== Strong Prompt ===
Explain Python to a beginner programmer in exactly 5 bullet points. Cover readability, common use cases, libraries, learning curve, and why it is popular.
- Python has a clean, readable syntax, which makes it easier for beginners to learn.
- It is used for many tasks, including web development, data analysis, automation, and AI.
- Python has a huge library ecosystem that helps developers build applications faster.
- Its learning curve is gentle compared to many other programming languages.
- Python is popular because it is versatile, productive, and supported by a large community.
Reflection questions
- Which response is easier to use in an application?
- Which prompt is more testable?
- Which output would be easier to show directly in a UI?
10. Hands-On Exercise 3: Summarization and Controlled Output
Goal
Use an LLM for summarization and force a useful format.
Scenario
You have a long bug report and want a concise engineering summary.
Code
"""
Exercise 3: Summarize a bug report in a structured format.
This demonstrates:
- Practical use of LLMs for summarization
- How output constraints improve usefulness
- How to turn messy text into a cleaner artifact
"""
from openai import OpenAI
BUG_REPORT = """
Customer reports that the checkout page freezes after clicking 'Pay Now'.
The issue seems to happen mostly on mobile Safari, but one user also reported
it on Chrome for iPhone. The problem started after the release deployed on
Tuesday evening. Several users said the loading spinner keeps spinning and
the payment never completes, although in two cases the card was actually charged.
Support marked this as high priority because it affects purchases and creates
confusion about whether orders succeeded.
"""
def main() -> None:
client = OpenAI()
prompt = f"""
You are helping an engineering team triage a bug report.
Summarize the following bug report using this exact format:
Summary:
<one sentence>
Impact:
<one sentence>
Suspected clues:
- <bullet>
- <bullet>
- <bullet>
Bug report:
{BUG_REPORT}
"""
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt
)
print(response.output_text)
if __name__ == "__main__":
main()
Example output
Summary:
The checkout page may freeze after users tap "Pay Now," especially on mobile Safari, preventing clear payment completion.
Impact:
This is a high-priority revenue-impacting issue because users may be charged without receiving confirmation that their order succeeded.
Suspected clues:
- The issue began after the Tuesday evening release.
- Mobile Safari appears to be the most frequently affected platform.
- In some cases, payment processing may succeed even though the UI remains stuck on a spinner.
Discussion
This is a strong use case for LLMs:
- Input is unstructured
- Desired output is concise
- Format matters
- Perfect factual precision is less critical than useful synthesis
11. Hands-On Exercise 4: Observe Variability and Prompt Sensitivity
Goal
Explore how small input changes can change results.
Code
"""
Exercise 4: Compare outputs across similar prompts.
This exercise helps learners understand:
- Prompt sensitivity
- Non-deterministic behavior
- Why evaluation matters
"""
from openai import OpenAI
PROMPTS = [
"Explain what an API is.",
"Explain what an API is to a 12-year-old.",
"Explain what an API is in one sentence.",
"Explain what an API is using a restaurant analogy.",
]
def main() -> None:
client = OpenAI()
for index, prompt in enumerate(PROMPTS, start=1):
response = client.responses.create(
model="gpt-5.4-mini",
input=prompt
)
print(f"--- Prompt {index} ---")
print(prompt)
print()
print(response.output_text)
print("\n" + "-" * 60 + "\n")
if __name__ == "__main__":
main()
Example output
--- Prompt 1 ---
Explain what an API is.
An API is a way for different software systems to communicate and share functionality with each other.
------------------------------------------------------------
--- Prompt 2 ---
Explain what an API is to a 12-year-old.
An API is like a messenger that helps two apps talk to each other without needing to know how everything works inside.
------------------------------------------------------------
--- Prompt 3 ---
Explain what an API is in one sentence.
An API is a defined way for software applications to communicate with each other.
------------------------------------------------------------
--- Prompt 4 ---
Explain what an API is using a restaurant analogy.
An API is like a waiter in a restaurant: you place an order, the waiter carries it to the kitchen, and then brings back your food without you needing to know how the kitchen works.
Reflection
This illustrates an important lesson:
Prompting is part of programming when building LLM applications.
12. Guided Discussion: When Should You Use an LLM?
Ask learners to classify these tasks as either:
- Good LLM use case
- Possible but needs validation
- Better solved with traditional code
Task list
- Convert a support email into a short summary
- Compute tax from fixed business rules
- Extract product names from messy customer text
- Generate SQL directly against production without checks
- Rephrase documentation for beginners
- Validate whether a UUID matches a required format
Suggested answers
- Good LLM use case:
  - Convert a support email into a short summary
  - Rephrase documentation for beginners
- Possible but needs validation:
  - Extract product names from messy customer text
  - Generate SQL directly against production without checks
- Better solved with traditional code:
  - Compute tax from fixed business rules
  - Validate whether a UUID matches a required format
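The UUID case is a good example of "better solved with traditional code": Python's standard library already validates the format deterministically, with no prompt, no cost, and no variability:

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Deterministic format validation: no LLM required."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


print(is_valid_uuid("123e4567-e89b-12d3-a456-426614174000"))  # True
print(is_valid_uuid("not-a-uuid"))                            # False
```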
13. Best Practices Introduced in This Session
As you start working with LLMs, keep these habits:
- Be explicit in prompts
- Specify audience, format, and constraints
- Prefer structured outputs when possible
- Keep important context close to the request
- Do not trust fluent output blindly
- Validate outputs when correctness matters
- Test prompts across multiple inputs, not just one example
14. Common Misconceptions
Misconception 1: “The model understands exactly like a human.”
Not quite. It is better to think of it as pattern-based language generation with powerful emergent capabilities.
Misconception 2: “If the answer sounds confident, it is probably correct.”
Confidence in wording is not evidence of factual accuracy.
Misconception 3: “A single good demo means the prompt is production-ready.”
Production use requires repeatability, evaluation, and safeguards.
Misconception 4: “LLMs replace all traditional programming.”
They complement traditional programming. In many systems, the best design combines deterministic code with LLM-based language capabilities.
15. Mini Quiz
1. What is the core training objective of an LLM?
Answer: Predict the next token based on previous tokens.
2. Why do tokens matter?
Answer: They affect context limits, cost, and how much text the model can process.
3. What is a hallucination?
Answer: A plausible-sounding but false or invented output from the model.
4. Why does prompt design matter?
Answer: Because LLMs are sensitive to wording, structure, and constraints.
5. Name one task that is often better handled by traditional code.
Answer: Deterministic validation, such as checking whether a UUID matches a format.
16. Recap
In this session, learners explored:
- What an LLM is
- Tokens and context windows
- Next-token prediction
- Why prompting matters
- Strengths and limitations of LLMs
- How to call gpt-5.4-mini using the OpenAI Responses API
- How better prompts lead to better outputs
The key practical takeaway is:
LLMs are powerful probabilistic tools for language tasks, but they must be guided with clear prompts, good context, and careful validation.
17. Useful Resources
- OpenAI Responses API guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs overview: https://developers.openai.com/api/
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompting guide: https://platform.openai.com/docs/guides/prompt-engineering
18. Suggested Homework
- Rewrite one weak prompt into three stronger versions and compare outputs.
- Build a small Python script that:
  - Reads a paragraph from a file
  - Sends it to gpt-5.4-mini
  - Returns:
    - a one-sentence summary
    - three bullet points
- Collect three examples where the model gives a strong answer and three where it gives a weak or misleading answer.
19. Instructor Notes
Key teaching emphasis
Focus on giving learners a practical mental model, not a research-level treatment.
Watch for confusion around:
- “The model knows everything”
- “The model is always factual”
- “Prompting is just wording, not engineering”
Good discussion prompt
Ask learners:
If a model gives beautiful output that is factually wrong, is the problem the model, the prompt, the system design, or all three?
This sets up future sessions on evaluation, grounding, and agentic workflows.