Session 2: Reflection, Self-Critique, and Retry Strategies
Synopsis
Introduces techniques for having systems review intermediate outputs, detect weaknesses, and attempt improved solutions. Learners explore how controlled reflection can improve quality without creating unstable loops.
Session Content
Session 2: Reflection, Self-Critique, and Retry Strategies
Session Overview
In this session, learners will explore how to make LLM-powered applications more reliable by introducing reflection, self-critique, and retry loops. These patterns are especially useful when building agentic systems that must improve outputs iteratively rather than relying on a single model response.
By the end of this session, learners will be able to:
- Explain the role of reflection in GenAI workflows
- Implement self-critique prompts using the OpenAI Responses API
- Build retry strategies that improve output quality
- Compare single-pass generation vs. critique-and-revise pipelines
- Apply structured prompting patterns for safer iterative improvement
Learning Objectives
After this session, you should be able to:
- Define reflection and self-critique in the context of LLM applications
- Describe when retry loops are beneficial and when they can be wasteful
- Build a Python workflow that:
- Generates an initial answer
- Critiques that answer
- Produces a revised version
- Add simple stopping rules to iterative improvement loops
- Evaluate tradeoffs between latency, cost, and quality
Session Timing (~45 minutes)
- 0–8 min: Introduction to reflection and critique patterns
- 8–18 min: Theory: self-critique, retry strategies, stopping criteria
- 18–30 min: Hands-on Exercise 1: Generate → Critique → Revise
- 30–40 min: Hands-on Exercise 2: Retry loop with quality checks
- 40–45 min: Recap, discussion, and next steps
1. Why Reflection Matters in Agentic Systems
A simple prompt can often produce a reasonable answer, but many real applications need more than “reasonable.” They need outputs that are:
- More accurate
- Better structured
- Safer
- More complete
- Better aligned with user intent
Reflection patterns help by creating a second pass over the model’s own output.
Common Reflection Patterns
1. Single-pass generation
The model answers once.
Use when:
- Speed matters most
- Task complexity is low
- Minor mistakes are acceptable
2. Generate → Critique → Revise
The model first creates an answer, then reviews it, then improves it.
Use when:
- Quality matters more than latency
- Outputs must be clearer or more reliable
- You want a lightweight agentic workflow
3. Retry with feedback
The application detects issues, then asks the model to try again with guidance.
Use when:
- You can define simple quality checks
- The first answer may fail formatting or completeness requirements
- You want bounded improvement
4. Multi-step reflective loop
The model iteratively improves its output until it meets some criteria.
Use when:
- You can define stopping conditions
- You are willing to trade latency and cost for quality
- The task benefits from refinement
2. Key Concepts
2.1 Reflection
Reflection is the process of reviewing an output before returning it to the user or before moving to the next workflow step.
Reflection can be:
- Internal to the model, via prompting
- External, in application code, via validation, retries, and scoring

Example reflection instructions:
- "Check whether your answer includes all requested sections."
- "Find weaknesses in the draft."
- "Revise the answer to be shorter and clearer."
2.2 Self-Critique
Self-critique is when the model evaluates its own answer against explicit criteria.
Typical critique criteria:
- Accuracy
- Completeness
- Clarity
- Tone
- Constraint adherence
- Formatting
- Safety

A good critique prompt is:
- Specific
- Criteria-based
- Focused on actionable feedback
- Separated from final answer generation
2.3 Retry Strategies
Retries are useful when:
- Output violates format requirements
- The answer is too vague
- Required information is missing
- The output is low quality based on a rule or score

Retry loops should be:
- Bounded with a max retry count
- Directed with explicit feedback
- Measured for quality improvement

Bad retry loop:
- "Try again" with no guidance

Better retry loop:
- "The answer is missing two bullet points and does not include a summary. Revise accordingly."
2.4 Stopping Criteria
Without stopping criteria, loops can waste time and tokens.
Common stopping conditions:
- Maximum number of attempts reached
- Quality threshold met
- No substantial improvement detected
- Output passes all validation checks
3. Prompting Patterns for Reflection
3.1 Initial generation prompt
The first prompt should clearly define the task and desired output format.
Example: “Write a concise project summary in 5 bullet points for a technical audience.”
3.2 Critique prompt
A critique prompt should ask for evaluation only, not revision.
Example: “Review the following summary. Identify missing details, unclear phrasing, and format violations. Return only a short critique.”
3.3 Revision prompt
The revision prompt should combine:
- The original task
- The original answer
- The critique feedback
- Constraints for the improved output

Example: “Revise the summary using the critique below. Keep it to 5 bullets and improve clarity.”
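As a sketch, those four ingredients can be assembled programmatically. `build_revision_prompt` is a hypothetical helper, not part of any SDK:

```python
def build_revision_prompt(task: str, draft: str, critique: str,
                          constraints: list[str]) -> str:
    """Combine task, draft, critique, and constraints into one revision prompt."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Original task:\n{task}\n\n"
        f"Original draft:\n{draft}\n\n"
        f"Critique:\n{critique}\n\n"
        f"Constraints for the improved output:\n{constraint_lines}\n\n"
        "Revise the draft to address the critique. "
        "Return only the revised answer."
    )


prompt = build_revision_prompt(
    task="Write a concise project summary in 5 bullet points.",
    draft="- point one\n- point two",
    critique="Only 2 of the 5 required bullets are present.",
    constraints=["Exactly 5 bullets", "Technical audience"],
)
print(prompt)
```

Keeping the assembly in one function makes it easy to log the exact revision prompt each attempt received.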
4. Architecture Pattern: Generate → Critique → Revise
A practical architecture looks like this:
- User submits a task
- Model generates an initial draft
- Model critiques the draft
- Model revises based on critique
- Application optionally validates the final result
Benefits
- Higher quality than single-pass generation
- Easy to implement
- Transparent workflow
Limitations
- Increased latency
- Higher cost
- Critique may be shallow if poorly prompted
- Improvement is not guaranteed unless criteria are clear
5. Hands-on Exercise 1: Generate → Critique → Revise
Goal
Build a Python script using the OpenAI Responses API that:
1. Generates a draft answer
2. Critiques it
3. Revises it
We will use a practical task: generating a short explanation of retry strategies for junior developers.
Prerequisites
Install the OpenAI Python SDK:

```bash
pip install openai
```

Set your API key (macOS/Linux):

```bash
export OPENAI_API_KEY="your_api_key_here"
```

On Windows, `setx` persists the variable for future terminal sessions (restart your terminal afterwards):

```powershell
setx OPENAI_API_KEY "your_api_key_here"
```
Code Example
"""
Exercise 1: Generate -> Critique -> Revise using the OpenAI Responses API.
This script demonstrates a basic reflection workflow:
1. Generate an initial answer
2. Ask the model to critique that answer
3. Ask the model to revise based on the critique
Model used:
- gpt-5.4-mini
Before running:
- pip install openai
- Set OPENAI_API_KEY in your environment
"""
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-5.4-mini"
def generate_draft(topic: str) -> str:
"""
Generate an initial draft on the requested topic.
"""
prompt = f"""
You are a helpful assistant for Python developers learning GenAI.
Write a short explanation of: "{topic}"
Requirements:
- Audience: junior Python developers
- Length: 120 to 180 words
- Tone: clear, practical, encouraging
- Include one real-world use case
"""
response = client.responses.create(
model=MODEL,
input=prompt,
)
return response.output_text.strip()
def critique_draft(topic: str, draft: str) -> str:
"""
Ask the model to critique the draft without rewriting it.
"""
prompt = f"""
You are a careful reviewer.
Task:
Review the draft below about "{topic}".
Evaluate it on:
1. Clarity
2. Completeness
3. Practical usefulness
4. Whether it includes a real-world use case
5. Whether it fits the junior Python developer audience
Instructions:
- Do NOT rewrite the draft
- Return only a concise critique
- Include 3 to 5 bullet points
- Be specific and actionable
Draft:
\"\"\"
{draft}
\"\"\"
"""
response = client.responses.create(
model=MODEL,
input=prompt,
)
return response.output_text.strip()
def revise_draft(topic: str, draft: str, critique: str) -> str:
"""
Revise the original draft using the critique.
"""
prompt = f"""
You are a helpful assistant improving a draft for junior Python developers.
Original task:
Write a short explanation of "{topic}" for junior Python developers.
Constraints:
- Length: 120 to 180 words
- Tone: clear, practical, encouraging
- Include one real-world use case
Original draft:
\"\"\"
{draft}
\"\"\"
Critique:
{critique}
Instructions:
- Produce an improved final version
- Address the critique directly
- Keep the explanation concise and practical
- Return only the revised answer
"""
response = client.responses.create(
model=MODEL,
input=prompt,
)
return response.output_text.strip()
def main() -> None:
topic = "retry strategies in LLM applications"
draft = generate_draft(topic)
critique = critique_draft(topic, draft)
revised = revise_draft(topic, draft, critique)
print("\n=== INITIAL DRAFT ===\n")
print(draft)
print("\n=== CRITIQUE ===\n")
print(critique)
print("\n=== REVISED VERSION ===\n")
print(revised)
if __name__ == "__main__":
main()
Example Output
```text
=== INITIAL DRAFT ===

Retry strategies in LLM applications help your program recover when the first model response is incomplete, unclear, or incorrectly formatted. Instead of failing immediately, your app can ask the model to try again with better instructions. This is useful when you need outputs in a strict structure, such as JSON or bullet points.

For junior developers, a good way to think about retries is like validating user input in Python. If the answer does not meet your rules, you try again with guidance. A real-world use case is a customer support bot that must return a short summary and action items. If the summary is missing action items, the app can retry with feedback so the next response is more useful.

=== CRITIQUE ===

- The explanation is clear, but it could mention that retries should be limited to avoid unnecessary cost and latency.
- Completeness is good, but reflection and feedback-driven retries could be explained more explicitly.
- The real-world use case is helpful, though it could connect more directly to why retries improve reliability.
- The tone fits junior developers well, but one sentence about simple stopping rules would make it more practical.

=== REVISED VERSION ===

Retry strategies in LLM applications help your software recover when a model’s first answer is incomplete, unclear, or in the wrong format. Instead of accepting a weak result, your app can check the response, give feedback, and ask the model to try again. This makes the system more reliable, especially when you need structured output like JSON, bullet points, or required sections.

A useful way to think about retries is like validating input in Python. If the result does not pass your checks, you retry with clear instructions about what is missing. For example, a customer support bot might need to return both a summary and action items. If the first answer misses the action items, the app can retry with that specific feedback. In practice, retries should be limited so you improve quality without adding too much cost or delay.
```
Exercise Tasks
- Run the script as provided
- Change the topic to:
  - "self-critique in AI systems"
  - "why stopping criteria matter in agent loops"
- Compare the draft and revised outputs
- Identify whether the critique was specific enough to improve the answer
- Modify the critique prompt to require:
- one strength
- three weaknesses
- one priority recommendation
Discussion Questions
- Did the revision actually improve the draft?
- What kind of critique produced the best revision?
- What happens if the critique is too generic?
- When is a second pass worth the extra cost?
6. Hands-on Exercise 2: Retry Loop with Quality Checks
Goal
Create a bounded retry loop that checks whether a generated answer satisfies explicit quality rules.
We will ask the model to produce a study note with:
- Exactly 3 bullet points
- A final one-sentence summary
- Beginner-friendly wording
If the answer fails, we retry with specific feedback.
Design Strategy
This exercise combines:
- LLM generation
- Programmatic validation
- Retry with targeted instructions
- Bounded attempts
This is a core agentic pattern: the system evaluates progress and adapts.
Code Example
"""
Exercise 2: Retry loop with validation and targeted feedback.
This script asks the model to generate a study note and checks whether:
1. It has exactly 3 bullet points
2. It ends with a summary line starting with 'Summary:'
3. It is reasonably beginner-friendly
If the output fails validation, the script retries with explicit feedback.
Model used:
- gpt-5.4-mini
"""
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-5.4-mini"
def generate_study_note(topic: str, feedback: str | None = None) -> str:
"""
Generate a study note. Optionally include feedback from a previous failed attempt.
"""
prompt = f"""
You are creating study notes for beginner Python developers.
Task:
Create a short study note about "{topic}".
Requirements:
- Exactly 3 bullet points
- Each bullet should be one sentence
- After the bullet points, include one final line that starts with "Summary:"
- Use beginner-friendly language
- Keep the content practical and concise
"""
if feedback:
prompt += f"""
Feedback from the previous attempt:
{feedback}
Please fix these issues in the new answer.
"""
response = client.responses.create(
model=MODEL,
input=prompt,
)
return response.output_text.strip()
def validate_study_note(text: str) -> list[str]:
"""
Validate the study note against simple formatting rules.
Returns a list of validation errors. If empty, the note passed.
"""
errors = []
lines = [line.strip() for line in text.splitlines() if line.strip()]
bullet_lines = [line for line in lines if line.startswith("- ")]
summary_lines = [line for line in lines if line.startswith("Summary:")]
if len(bullet_lines) != 3:
errors.append(f"Expected exactly 3 bullet points, found {len(bullet_lines)}.")
if len(summary_lines) != 1:
errors.append(f"Expected exactly 1 summary line starting with 'Summary:', found {len(summary_lines)}.")
# Simple beginner-friendly heuristic:
# Flag overly long bullet lines as possibly too complex.
for idx, bullet in enumerate(bullet_lines, start=1):
word_count = len(bullet.split())
if word_count > 22:
errors.append(
f"Bullet point {idx} may be too long for beginners ({word_count} words)."
)
return errors
def generate_with_retries(topic: str, max_attempts: int = 3) -> str:
"""
Generate a valid study note with up to max_attempts attempts.
Raises RuntimeError if all attempts fail.
"""
feedback = None
for attempt in range(1, max_attempts + 1):
result = generate_study_note(topic, feedback)
errors = validate_study_note(result)
print(f"\n--- Attempt {attempt} ---\n")
print(result)
if not errors:
print("\nValidation passed.")
return result
feedback = " ".join(errors)
print("\nValidation errors:")
for error in errors:
print(f"- {error}")
raise RuntimeError("Failed to generate a valid study note within the retry limit.")
def main() -> None:
topic = "self-critique and reflection in LLM apps"
try:
final_note = generate_with_retries(topic, max_attempts=3)
print("\n=== FINAL ACCEPTED NOTE ===\n")
print(final_note)
except RuntimeError as exc:
print(f"\nError: {exc}")
if __name__ == "__main__":
main()
Example Output
```text
--- Attempt 1 ---

- Self-critique means the model reviews its own answer to look for mistakes or missing details.
- Reflection helps an AI system improve results before showing them to the user.
- Retry strategies let the application ask for a better answer if the first one does not meet the rules.
Summary: These techniques help make LLM apps more reliable and easier to control.

Validation passed.

=== FINAL ACCEPTED NOTE ===

- Self-critique means the model reviews its own answer to look for mistakes or missing details.
- Reflection helps an AI system improve results before showing them to the user.
- Retry strategies let the application ask for a better answer if the first one does not meet the rules.
Summary: These techniques help make LLM apps more reliable and easier to control.
```
Suggested Extensions
Modify the validator to also check:
- Maximum total word count
- Presence of a required keyword
- No markdown headings
- Summary length under 15 words
Then update the retry feedback to mention all detected issues.
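One possible shape for the extended validator, with illustrative thresholds (the keyword and limits below are assumptions you would tune for your task):

```python
import re


def validate_extended(text: str,
                      max_words: int = 80,
                      required_keyword: str = "reflection",
                      max_summary_words: int = 15) -> list[str]:
    """Extra checks layered on top of the basic study-note validator."""
    errors = []

    if len(text.split()) > max_words:
        errors.append(f"Note exceeds {max_words} words.")

    if required_keyword.lower() not in text.lower():
        errors.append(f"Missing required keyword '{required_keyword}'.")

    # Reject markdown headings anywhere in the note.
    if re.search(r"^#{1,6}\s", text, flags=re.MULTILINE):
        errors.append("Markdown headings are not allowed.")

    for line in text.splitlines():
        if line.startswith("Summary:"):
            summary_words = len(line.removeprefix("Summary:").split())
            if summary_words > max_summary_words:
                errors.append(
                    f"Summary has {summary_words} words "
                    f"(limit {max_summary_words})."
                )

    return errors
```

Joining the returned errors into the retry feedback string, as Exercise 2 does, gives the model every detected issue at once.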
7. Best Practices for Reflection and Retry Design
7.1 Separate roles clearly
Use separate, clearly scoped prompts for:
- Generation
- Critique
- Revision
- Validation feedback
This helps avoid muddled outputs.
7.2 Make critique actionable
Weak critique:
- “This could be better.”

Strong critique:
- “The answer lacks a real-world example and does not explain why retry limits matter.”
7.3 Keep retries bounded
Always set:
- A maximum number of attempts
- Acceptance criteria
- Failure behavior

Example failure behaviors:
- Return the best attempt so far
- Escalate to human review
- Log for inspection
- Provide a fallback response
7.4 Use code-based validation where possible
Prefer deterministic checks for:
- JSON validity
- Exact section counts
- Word or line counts
- Required headers
- Regex-based patterns

Use model-based critique for:
- Clarity
- Helpfulness
- Tone
- Conceptual completeness
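For instance, JSON validity and required keys can be checked deterministically with the standard library. The `summary` and `action_items` keys below are illustrative, echoing the support-bot example earlier in this session:

```python
import json


def check_json_output(raw: str) -> list[str]:
    """Deterministic validation: parse the output and check required keys."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A parse failure is a single, unambiguous error.
        return [f"Invalid JSON: {exc.msg} at position {exc.pos}."]

    for key in ("summary", "action_items"):
        if key not in data:
            errors.append(f"Missing required key '{key}'.")

    return errors
```

These errors are exact and cheap to compute, which makes them ideal feedback for a bounded retry loop.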
7.5 Track quality vs. cost
Reflection can improve results, but every extra pass adds:
- Latency
- Token usage
- Operational cost
Measure whether the quality gains are worth it.
8. Common Pitfalls
Pitfall 1: Critique is too vague
If the critique lacks specifics, the revision may not improve meaningfully.
Pitfall 2: Too many retries
Unlimited retries can create expensive loops with little benefit.
Pitfall 3: No validation
Without checks, you may assume revision helped when it did not.
Pitfall 4: Confusing critique with revision
If you ask for both at once, the output may be less structured and harder to use programmatically.
Pitfall 5: Overengineering simple tasks
Not every problem needs reflection. For easy tasks, a single pass may be enough.
9. Mini Challenge
Build a script that produces a short tutorial paragraph on a GenAI topic and improves it in up to 2 revision rounds.
Requirements
- Initial draft under 100 words
- Critique must mention:
- one missing concept
- one clarity issue
- one improvement suggestion
- Final version must:
- stay under 120 words
- include one example
- be suitable for beginners
Stretch Goal
Add a scoring function in Python that checks:
- whether the text includes the word "example"
- whether word count is below 120
- whether it contains at least 2 sentences
Use the score to decide whether to stop early.
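A minimal version of that scoring function might look like this; the sentence count uses a rough punctuation heuristic, and the early-stop helper is one possible way to use the score:

```python
import re


def score_text(text: str) -> int:
    """One point per stretch-goal criterion; the maximum score is 3."""
    score = 0
    if "example" in text.lower():
        score += 1
    if len(text.split()) < 120:
        score += 1
    # Count sentence-ending punctuation as a rough sentence heuristic.
    if len(re.findall(r"[.!?]", text)) >= 2:
        score += 1
    return score


def good_enough(text: str) -> bool:
    """Stop the revision loop early when all three checks pass."""
    return score_text(text) == 3
```

Checking `good_enough` after each revision round lets the loop finish in one round when the draft is already acceptable.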
10. Recap
In this session, you learned that:
- Reflection is a practical way to improve LLM outputs
- Self-critique works best when guided by explicit criteria
- Retry loops should be bounded and feedback-driven
- Validation can be programmatic, model-based, or both
- Agentic systems often improve quality by combining generation, checking, and revision
A powerful mental model is:
Generate → Check → Improve → Stop
That pattern appears again and again in real GenAI and agentic applications.
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs overview: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompting guide: https://platform.openai.com/docs/guides/text
- Python environment variables: https://docs.python.org/3/library/os.html#os.environ
Take-Home Assignment
Create a Python program that generates a short technical explanation on a topic of your choice and improves it using a reflection loop.
Requirements
- Use gpt-5.4-mini
- Use the OpenAI Responses API
- Include:
- initial generation
- critique step
- revised answer
- Add at least 2 validation rules in Python
- Retry at most 3 times
- Print:
- each attempt
- critique feedback
- final accepted output
Suggested Topics
- prompt engineering basics
- what makes an AI agent “agentic”
- tool use in LLM applications
- why structured outputs matter
End of Session
In the next session, learners can build on these ideas by adding tool use, planning, or multi-step agent workflows where reflection becomes part of a larger decision loop.