Session 2: Reflection, Self-Critique, and Retry Strategies

Synopsis

Introduces techniques for having systems review intermediate outputs, detect weaknesses, and attempt improved solutions. Learners explore how controlled reflection can improve quality without creating unstable loops.

Session Content

Session Overview

In this session, learners will explore how to make LLM-powered applications more reliable by introducing reflection, self-critique, and retry loops. These patterns are especially useful when building agentic systems that must improve outputs iteratively rather than relying on a single model response.

By the end of this session, learners will be able to:

  • Explain the role of reflection in GenAI workflows
  • Implement self-critique prompts using the OpenAI Responses API
  • Build retry strategies that improve output quality
  • Compare single-pass generation vs. critique-and-revise pipelines
  • Apply structured prompting patterns for safer iterative improvement

Learning Objectives

After this session, you should be able to:

  1. Define reflection and self-critique in the context of LLM applications
  2. Describe when retry loops are beneficial and when they can be wasteful
  3. Build a Python workflow that:
      • Generates an initial answer
      • Critiques that answer
      • Produces a revised version
  4. Add simple stopping rules to iterative improvement loops
  5. Evaluate tradeoffs between latency, cost, and quality

Session Timing (~45 minutes)

  • 0–8 min: Introduction to reflection and critique patterns
  • 8–18 min: Theory: self-critique, retry strategies, stopping criteria
  • 18–30 min: Hands-on Exercise 1: Generate → Critique → Revise
  • 30–40 min: Hands-on Exercise 2: Retry loop with quality checks
  • 40–45 min: Recap, discussion, and next steps

1. Why Reflection Matters in Agentic Systems

A simple prompt can often produce a reasonable answer, but many real applications need more than “reasonable.” They need outputs that are:

  • More accurate
  • Better structured
  • Safer
  • More complete
  • Better aligned with user intent

Reflection patterns help by creating a second pass over the model’s own output.

Common Reflection Patterns

1. Single-pass generation

The model answers once.

Use when:

  • Speed matters most
  • Task complexity is low
  • Minor mistakes are acceptable

2. Generate → Critique → Revise

The model first creates an answer, then reviews it, then improves it.

Use when:

  • Quality matters more than latency
  • Outputs must be clearer or more reliable
  • You want a lightweight agentic workflow

3. Retry with feedback

The application detects issues, then asks the model to try again with guidance.

Use when:

  • You can define simple quality checks
  • The first answer may fail formatting or completeness requirements
  • You want bounded improvement

4. Multi-step reflective loop

The model iteratively improves its output until it meets some criteria.

Use when:

  • You can define stopping conditions
  • You are willing to trade latency and cost for quality
  • The task benefits from refinement


2. Key Concepts

2.1 Reflection

Reflection is the process of reviewing an output before returning it to the user or before moving to the next workflow step.

Reflection can be:

  • Internal to the model via prompting
  • External in application code via validation, retries, and scoring

Examples:

  • “Check whether your answer includes all requested sections.”
  • “Find weaknesses in the draft.”
  • “Revise the answer to be shorter and clearer.”


2.2 Self-Critique

Self-critique is when the model evaluates its own answer against explicit criteria.

Typical critique criteria:

  • Accuracy
  • Completeness
  • Clarity
  • Tone
  • Constraint adherence
  • Formatting
  • Safety

A good critique prompt is:

  • Specific
  • Criteria-based
  • Focused on actionable feedback
  • Separated from final answer generation


2.3 Retry Strategies

Retries are useful when:

  • Output violates format requirements
  • The answer is too vague
  • Required information is missing
  • The output is low quality based on a rule or score

Retry loops should be:

  • Bounded with a max retry count
  • Directed with explicit feedback
  • Measured for quality improvement

Bad retry loop: “Try again” with no guidance.

Better retry loop: “The answer is missing two bullet points and does not include a summary. Revise accordingly.”
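A helper along these lines can turn detected issues into directed feedback. The function name and message template below are illustrative, not part of any API:

```python
def build_retry_feedback(errors: list[str]) -> str:
    """Turn a list of detected issues into a directed retry instruction.

    The error strings come from your own validation checks; the wording
    below is just one reasonable template.
    """
    issues = "\n".join(f"- {error}" for error in errors)
    return (
        "Your previous answer had these problems:\n"
        f"{issues}\n"
        "Revise the answer to fix each issue and keep everything else unchanged."
    )
```

Appending this text to the next generation prompt gives the model something concrete to act on, rather than a bare “try again.”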


2.4 Stopping Criteria

Without stopping criteria, loops can waste time and tokens.

Common stopping conditions:

  • Maximum number of attempts reached
  • Quality threshold met
  • No substantial improvement detected
  • Output passes all validation checks
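These stopping conditions can be combined in a single loop. The sketch below assumes you supply your own `generate(feedback)` and `score(text)` functions, and the threshold values are illustrative:

```python
def improve_until_done(generate, score, max_attempts=3,
                       threshold=0.8, min_gain=0.05):
    """Iterate until one of the stopping conditions above is met.

    `generate(feedback)` and `score(text)` are placeholders for your own
    generation and scoring functions. Returns (best_text, best_score).
    """
    best_text, best_score = None, float("-inf")
    feedback = None
    for attempt in range(1, max_attempts + 1):  # maximum attempts
        text = generate(feedback)
        current = score(text)
        if current >= threshold:                # quality threshold met
            return text, current
        if attempt > 1 and current - best_score < min_gain:
            break                               # no substantial improvement
        if current > best_score:
            best_text, best_score = text, current
        feedback = (
            f"The previous answer scored {current:.2f} out of 1.0. "
            "Improve clarity and completeness."
        )
    return best_text, best_score
```

Returning the best attempt so far keeps the loop useful even when no attempt crosses the threshold.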


3. Prompting Patterns for Reflection

3.1 Initial generation prompt

The first prompt should clearly define the task and desired output format.

Example: “Write a concise project summary in 5 bullet points for a technical audience.”


3.2 Critique prompt

A critique prompt should ask for evaluation only, not revision.

Example: “Review the following summary. Identify missing details, unclear phrasing, and format violations. Return only a short critique.”


3.3 Revision prompt

The revision prompt should combine:

  • The original task
  • The original answer
  • Critique feedback
  • Constraints for the improved output

Example: “Revise the summary using the critique below. Keep it to 5 bullets and improve clarity.”


4. Architecture Pattern: Generate → Critique → Revise

A practical architecture looks like this:

  1. User submits a task
  2. Model generates an initial draft
  3. Model critiques the draft
  4. Model revises based on critique
  5. Application optionally validates the final result

Benefits

  • Higher quality than single-pass generation
  • Easy to implement
  • Transparent workflow

Limitations

  • Increased latency
  • Higher cost
  • Critique may be shallow if poorly prompted
  • Improvement is not guaranteed unless criteria are clear

5. Hands-on Exercise 1: Generate → Critique → Revise

Goal

Build a Python script using the OpenAI Responses API that:

  1. Generates a draft answer
  2. Critiques it
  3. Revises it

We will use a practical task: generating a short explanation of retry strategies for junior developers.


Prerequisites

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

On Windows PowerShell:

setx OPENAI_API_KEY "your_api_key_here"

Code Example

"""
Exercise 1: Generate -> Critique -> Revise using the OpenAI Responses API.

This script demonstrates a basic reflection workflow:
1. Generate an initial answer
2. Ask the model to critique that answer
3. Ask the model to revise based on the critique

Model used:
- gpt-5.4-mini

Before running:
- pip install openai
- Set OPENAI_API_KEY in your environment
"""

from openai import OpenAI

client = OpenAI()

MODEL = "gpt-5.4-mini"


def generate_draft(topic: str) -> str:
    """
    Generate an initial draft on the requested topic.
    """
    prompt = f"""
You are a helpful assistant for Python developers learning GenAI.

Write a short explanation of: "{topic}"

Requirements:
- Audience: junior Python developers
- Length: 120 to 180 words
- Tone: clear, practical, encouraging
- Include one real-world use case
"""

    response = client.responses.create(
        model=MODEL,
        input=prompt,
    )

    return response.output_text.strip()


def critique_draft(topic: str, draft: str) -> str:
    """
    Ask the model to critique the draft without rewriting it.
    """
    prompt = f"""
You are a careful reviewer.

Task:
Review the draft below about "{topic}".

Evaluate it on:
1. Clarity
2. Completeness
3. Practical usefulness
4. Whether it includes a real-world use case
5. Whether it fits the junior Python developer audience

Instructions:
- Do NOT rewrite the draft
- Return only a concise critique
- Include 3 to 5 bullet points
- Be specific and actionable

Draft:
\"\"\"
{draft}
\"\"\"
"""

    response = client.responses.create(
        model=MODEL,
        input=prompt,
    )

    return response.output_text.strip()


def revise_draft(topic: str, draft: str, critique: str) -> str:
    """
    Revise the original draft using the critique.
    """
    prompt = f"""
You are a helpful assistant improving a draft for junior Python developers.

Original task:
Write a short explanation of "{topic}" for junior Python developers.

Constraints:
- Length: 120 to 180 words
- Tone: clear, practical, encouraging
- Include one real-world use case

Original draft:
\"\"\"
{draft}
\"\"\"

Critique:
{critique}

Instructions:
- Produce an improved final version
- Address the critique directly
- Keep the explanation concise and practical
- Return only the revised answer
"""

    response = client.responses.create(
        model=MODEL,
        input=prompt,
    )

    return response.output_text.strip()


def main() -> None:
    topic = "retry strategies in LLM applications"

    draft = generate_draft(topic)
    critique = critique_draft(topic, draft)
    revised = revise_draft(topic, draft, critique)

    print("\n=== INITIAL DRAFT ===\n")
    print(draft)

    print("\n=== CRITIQUE ===\n")
    print(critique)

    print("\n=== REVISED VERSION ===\n")
    print(revised)


if __name__ == "__main__":
    main()

Example Output

=== INITIAL DRAFT ===

Retry strategies in LLM applications help your program recover when the first model response is incomplete, unclear, or incorrectly formatted. Instead of failing immediately, your app can ask the model to try again with better instructions. This is useful when you need outputs in a strict structure, such as JSON or bullet points.

For junior developers, a good way to think about retries is like validating user input in Python. If the answer does not meet your rules, you try again with guidance. A real-world use case is a customer support bot that must return a short summary and action items. If the summary is missing action items, the app can retry with feedback so the next response is more useful.

=== CRITIQUE ===

- The explanation is clear, but it could mention that retries should be limited to avoid unnecessary cost and latency.
- Completeness is good, but reflection and feedback-driven retries could be explained more explicitly.
- The real-world use case is helpful, though it could connect more directly to why retries improve reliability.
- The tone fits junior developers well, but one sentence about simple stopping rules would make it more practical.

=== REVISED VERSION ===

Retry strategies in LLM applications help your software recover when a model’s first answer is incomplete, unclear, or in the wrong format. Instead of accepting a weak result, your app can check the response, give feedback, and ask the model to try again. This makes the system more reliable, especially when you need structured output like JSON, bullet points, or required sections.

A useful way to think about retries is like validating input in Python. If the result does not pass your checks, you retry with clear instructions about what is missing. For example, a customer support bot might need to return both a summary and action items. If the first answer misses the action items, the app can retry with that specific feedback. In practice, retries should be limited so you improve quality without adding too much cost or delay.

Exercise Tasks

  1. Run the script as provided
  2. Change the topic to:
      • "self-critique in AI systems"
      • "why stopping criteria matter in agent loops"
  3. Compare the draft and revised outputs
  4. Identify whether the critique was specific enough to improve the answer
  5. Modify the critique prompt to require:
      • one strength
      • three weaknesses
      • one priority recommendation

Discussion Questions

  • Did the revision actually improve the draft?
  • What kind of critique produced the best revision?
  • What happens if the critique is too generic?
  • When is a second pass worth the extra cost?

6. Hands-on Exercise 2: Retry Loop with Quality Checks

Goal

Create a bounded retry loop that checks whether a generated answer satisfies explicit quality rules.

We will ask the model to produce a study note with:

  • Exactly 3 bullet points
  • A final one-sentence summary
  • Beginner-friendly wording

If the answer fails, we retry with specific feedback.


Design Strategy

This exercise combines:

  • LLM generation
  • Programmatic validation
  • Retry with targeted instructions
  • Bounded attempts

This is a core agentic pattern: the system evaluates progress and adapts.


Code Example

"""
Exercise 2: Retry loop with validation and targeted feedback.

This script asks the model to generate a study note and checks whether:
1. It has exactly 3 bullet points
2. It ends with a summary line starting with 'Summary:'
3. It is reasonably beginner-friendly

If the output fails validation, the script retries with explicit feedback.

Model used:
- gpt-5.4-mini
"""

from openai import OpenAI

client = OpenAI()

MODEL = "gpt-5.4-mini"


def generate_study_note(topic: str, feedback: str | None = None) -> str:
    """
    Generate a study note. Optionally include feedback from a previous failed attempt.
    """
    prompt = f"""
You are creating study notes for beginner Python developers.

Task:
Create a short study note about "{topic}".

Requirements:
- Exactly 3 bullet points
- Each bullet should be one sentence
- After the bullet points, include one final line that starts with "Summary:"
- Use beginner-friendly language
- Keep the content practical and concise
"""

    if feedback:
        prompt += f"""

Feedback from the previous attempt:
{feedback}

Please fix these issues in the new answer.
"""

    response = client.responses.create(
        model=MODEL,
        input=prompt,
    )

    return response.output_text.strip()


def validate_study_note(text: str) -> list[str]:
    """
    Validate the study note against simple formatting rules.
    Returns a list of validation errors. If empty, the note passed.
    """
    errors = []

    lines = [line.strip() for line in text.splitlines() if line.strip()]
    bullet_lines = [line for line in lines if line.startswith("- ")]
    summary_lines = [line for line in lines if line.startswith("Summary:")]

    if len(bullet_lines) != 3:
        errors.append(f"Expected exactly 3 bullet points, found {len(bullet_lines)}.")

    if len(summary_lines) != 1:
        errors.append(f"Expected exactly 1 summary line starting with 'Summary:', found {len(summary_lines)}.")

    # Simple beginner-friendly heuristic:
    # Flag overly long bullet lines as possibly too complex.
    for idx, bullet in enumerate(bullet_lines, start=1):
        word_count = len(bullet.split())
        if word_count > 22:
            errors.append(
                f"Bullet point {idx} may be too long for beginners ({word_count} words)."
            )

    return errors


def generate_with_retries(topic: str, max_attempts: int = 3) -> str:
    """
    Generate a valid study note with up to max_attempts attempts.
    Raises RuntimeError if all attempts fail.
    """
    feedback = None

    for attempt in range(1, max_attempts + 1):
        result = generate_study_note(topic, feedback)
        errors = validate_study_note(result)

        print(f"\n--- Attempt {attempt} ---\n")
        print(result)

        if not errors:
            print("\nValidation passed.")
            return result

        feedback = " ".join(errors)
        print("\nValidation errors:")
        for error in errors:
            print(f"- {error}")

    raise RuntimeError("Failed to generate a valid study note within the retry limit.")


def main() -> None:
    topic = "self-critique and reflection in LLM apps"

    try:
        final_note = generate_with_retries(topic, max_attempts=3)
        print("\n=== FINAL ACCEPTED NOTE ===\n")
        print(final_note)
    except RuntimeError as exc:
        print(f"\nError: {exc}")


if __name__ == "__main__":
    main()

Example Output

--- Attempt 1 ---

- Self-critique means the model reviews its own answer to look for mistakes or missing details.
- Reflection helps an AI system improve results before showing them to the user.
- Retry strategies let the application ask for a better answer if the first one does not meet the rules.
Summary: These techniques help make LLM apps more reliable and easier to control.

Validation passed.

=== FINAL ACCEPTED NOTE ===

- Self-critique means the model reviews its own answer to look for mistakes or missing details.
- Reflection helps an AI system improve results before showing them to the user.
- Retry strategies let the application ask for a better answer if the first one does not meet the rules.
Summary: These techniques help make LLM apps more reliable and easier to control.

Suggested Extensions

Modify the validator to also check:

  • Maximum total word count
  • Presence of a required keyword
  • No markdown headings
  • Summary length under 15 words

Then update the retry feedback to mention all detected issues.
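One way those extra checks might look. The keyword and word limits below are example values, and `extended_checks` is a hypothetical companion to `validate_study_note`, not part of the exercise specification:

```python
import re


def extended_checks(text: str, required_keyword: str = "reflection",
                    max_words: int = 80, max_summary_words: int = 15) -> list[str]:
    """Extra rules layered on top of the basic study-note validator.

    The keyword and word limits are illustrative defaults.
    """
    errors = []

    total_words = len(text.split())
    if total_words > max_words:
        errors.append(f"Note has {total_words} words; keep it under {max_words}.")

    if required_keyword.lower() not in text.lower():
        errors.append(f"Note must mention the keyword '{required_keyword}'.")

    if re.search(r"^#{1,6}\s", text, flags=re.MULTILINE):
        errors.append("Remove markdown headings from the note.")

    summary_lines = [line for line in text.splitlines() if line.startswith("Summary:")]
    if summary_lines:
        summary_words = len(summary_lines[0].split()) - 1  # drop the 'Summary:' label
        if summary_words > max_summary_words:
            errors.append(
                f"Summary has {summary_words} words; keep it under {max_summary_words}."
            )

    return errors
```

Combining these errors with the existing ones gives the retry prompt a complete picture of what to fix.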


7. Best Practices for Reflection and Retry Design

7.1 Separate roles clearly

Use different prompt intentions for:

  • generation
  • critique
  • revision
  • validation feedback

This helps avoid muddled outputs.


7.2 Make critique actionable

Weak critique: “This could be better.”

Strong critique: “The answer lacks a real-world example and does not explain why retry limits matter.”


7.3 Keep retries bounded

Always set:

  • maximum attempts
  • acceptance criteria
  • failure behavior

Example failure behaviors:

  • return the best attempt so far
  • escalate to human review
  • log for inspection
  • provide a fallback response
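The first of those failure behaviors, returning the best attempt so far, could be sketched like this. `generate` and `validate` are placeholders for your own functions; `validate` is assumed to return a list of error strings, where an empty list means the attempt passed:

```python
def generate_best_effort(generate, validate, max_attempts=3):
    """Bounded retry that falls back to the best attempt so far.

    Instead of raising on failure, this keeps whichever attempt had the
    fewest validation errors and returns it with its remaining errors.
    """
    best, fewest_errors = None, None
    feedback = None
    for _ in range(max_attempts):
        attempt = generate(feedback)
        errors = validate(attempt)
        if not errors:
            return attempt, []                 # acceptance criteria met
        if fewest_errors is None or len(errors) < len(fewest_errors):
            best, fewest_errors = attempt, errors
        feedback = " ".join(errors)
    return best, fewest_errors                 # failure behavior: best attempt so far
```

The caller can then decide whether a partially valid result is usable, should be escalated, or should trigger a fallback response.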


7.4 Use code-based validation where possible

Prefer deterministic checks for:

  • JSON validity
  • exact section counts
  • word or line counts
  • required headers
  • regex-based patterns

Use model-based critique for:

  • clarity
  • helpfulness
  • tone
  • conceptual completeness
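A small sketch of deterministic checks, assuming an illustrative output contract (a JSON object with a capitalized `summary` sentence — this contract is invented for the example, not taken from the exercises above):

```python
import json
import re


def deterministic_checks(text: str) -> list[str]:
    """Code-based checks that need no model call.

    Validates JSON structure and a simple regex-based format rule.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ["Output is not valid JSON."]

    errors = []
    if "summary" not in data:
        errors.append("Missing required 'summary' key.")

    summary = str(data.get("summary", ""))
    if not re.fullmatch(r"[A-Z].*\.", summary):
        errors.append("Summary should be a capitalized sentence ending in a period.")

    return errors
```

Because these checks are deterministic, they cost nothing per run and never disagree with themselves, which makes them a reliable gate before any model-based critique.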


7.5 Track quality vs. cost

Reflection can improve results, but every extra pass adds:

  • latency
  • token usage
  • operational cost

Measure whether the quality gains are worth it.


8. Common Pitfalls

Pitfall 1: Critique is too vague

If the critique lacks specifics, the revision may not improve meaningfully.

Pitfall 2: Too many retries

Unlimited retries can create expensive loops with little benefit.

Pitfall 3: No validation

Without checks, you may assume revision helped when it did not.

Pitfall 4: Confusing critique with revision

If you ask for both at once, the output may be less structured and harder to use programmatically.

Pitfall 5: Overengineering simple tasks

Not every problem needs reflection. For easy tasks, a single pass may be enough.


9. Mini Challenge

Build a script that produces a short tutorial paragraph on a GenAI topic and improves it in up to 2 revision rounds.

Requirements

  • Initial draft under 100 words
  • Critique must mention:
      • one missing concept
      • one clarity issue
      • one improvement suggestion
  • Final version must:
      • stay under 120 words
      • include one example
      • be suitable for beginners

Stretch Goal

Add a scoring function in Python that checks:

  • whether the text includes the word "example"
  • whether the word count is below 120
  • whether the text contains at least 2 sentences

Use the score to decide whether to stop early.
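A possible version of that scoring function, using the three checks listed above (the sentence-splitting heuristic is deliberately simple):

```python
import re


def score_text(text: str) -> int:
    """Score the paragraph against the three stretch-goal checks (0 to 3)."""
    score = 0
    if "example" in text.lower():
        score += 1
    if len(text.split()) < 120:
        score += 1
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    if len(sentences) >= 2:
        score += 1
    return score


def should_stop(text: str) -> bool:
    """Stop the revision loop early once every check passes."""
    return score_text(text) == 3
```

Calling `should_stop` after each revision round lets the loop exit as soon as all three checks pass, instead of always running both rounds.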


10. Recap

In this session, you learned that:

  • Reflection is a practical way to improve LLM outputs
  • Self-critique works best when guided by explicit criteria
  • Retry loops should be bounded and feedback-driven
  • Validation can be programmatic, model-based, or both
  • Agentic systems often improve quality by combining generation, checking, and revision

A powerful mental model is:

Generate → Check → Improve → Stop

That pattern appears again and again in real GenAI and agentic applications.


Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs overview: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompting guide: https://platform.openai.com/docs/guides/text
  • Python environment variables: https://docs.python.org/3/library/os.html#os.environ

Take-Home Assignment

Create a Python program that generates a short technical explanation on a topic of your choice and improves it using a reflection loop.

Requirements

  1. Use gpt-5.4-mini
  2. Use the OpenAI Responses API
  3. Include:
      • initial generation
      • critique step
      • revised answer
  4. Add at least 2 validation rules in Python
  5. Retry at most 3 times
  6. Print:
      • each attempt
      • critique feedback
      • final accepted output

Suggested Topics

  • prompt engineering basics
  • what makes an AI agent “agentic”
  • tool use in LLM applications
  • why structured outputs matter

End of Session

In the next session, learners can build on these ideas by adding tool use, planning, or multi-step agent workflows where reflection becomes part of a larger decision loop.

