Session 4: Evaluating Multi-Agent Performance
Synopsis
Focuses on how to assess collaborative quality, communication overhead, task completion, and system efficiency. Learners apply earlier evaluation methods to more complex distributed agent behaviors.
Session Content
Session 4: Evaluating Multi-Agent Performance
Session Overview
In this session, learners will explore how to evaluate multi-agent systems in a structured, repeatable, and practical way. The focus is on measuring the quality, reliability, and efficiency of agent collaboration rather than evaluating only single-model outputs. By the end of the session, learners will understand evaluation dimensions for multi-agent workflows, common failure modes, and how to build a lightweight evaluation harness in Python using the OpenAI Responses API with gpt-5.4-mini.
Learning Objectives
By the end of this session, learners will be able to:
- Explain why multi-agent evaluation is different from single-agent or single-prompt evaluation
- Identify key evaluation dimensions such as task success, coordination quality, latency, cost, and robustness
- Recognize common failure patterns in multi-agent systems
- Design simple test cases and scoring rubrics for multi-agent workflows
- Implement a basic evaluator in Python using the OpenAI SDK and Responses API
- Analyze agent outputs and compare alternative multi-agent designs
Agenda for a 45-Minute Session
- 0–5 min: Why evaluating multi-agent systems is hard
- 5–15 min: Core evaluation dimensions and failure modes
- 15–25 min: Evaluation strategies: human review, rubric scoring, LLM-as-judge, and automated checks
- 25–40 min: Hands-on: build a simple evaluation harness for a two-agent workflow
- 40–45 min: Wrap-up, discussion, and extensions
1. Why Multi-Agent Evaluation Is Different
A single-agent system usually produces one output for one task. Evaluation often asks:
- Was the answer correct?
- Was it helpful?
- Was it safe?
- Was it concise?
A multi-agent system introduces more complexity:
- Multiple intermediate outputs exist
- Agents may depend on each other’s work
- Errors can propagate across steps
- Coordination may be successful even when the final answer is weak, or vice versa
- Two systems may achieve the same result with very different costs and latencies
Example
Suppose you have:
- Research Agent: gathers facts
- Writer Agent: drafts a summary
- Reviewer Agent: checks for omissions or inaccuracies
A weak final answer could be caused by:
- Poor fact gathering by the research agent
- Good research but poor synthesis by the writer
- Good draft but ineffective review
- Correct review feedback that the writer ignored
- Excessive back-and-forth causing latency/cost issues
So evaluation must consider:
- Final output quality
- Intermediate output quality
- Coordination effectiveness
- Resource usage
- Stability across repeated runs
2. Core Evaluation Dimensions
When evaluating a multi-agent system, define metrics across several categories.
2.1 Task Success
This is the most important dimension.
Questions to ask:
- Did the system solve the task?
- Was the answer accurate?
- Did it satisfy user constraints?
- Was the requested format followed?
Example metrics
- Binary success/failure
- Accuracy score from 1–5
- Format compliance: pass/fail
- Coverage score: how many required points were addressed
2.2 Coordination Quality
The agents in a multi-agent system should work together effectively, not merely produce individually acceptable outputs.
Questions to ask:
- Did agents pass useful context to one another?
- Did downstream agents use upstream outputs correctly?
- Was work duplicated unnecessarily?
- Did the workflow reduce confusion or increase it?
Indicators of poor coordination
- Repeating the same work in multiple agents
- Contradictory outputs between agents
- Missing handoff details
- A reviewer flags issues that were already resolved
- A planner creates steps that executors ignore
2.3 Latency
Multi-agent systems often trade speed for quality: extra agents and refinement rounds can improve results, but each adds delay.
Questions to ask:
- How long did the full workflow take?
- Which agent caused the bottleneck?
- Did multiple rounds of refinement improve results enough to justify the delay?
Common measurements
- Total wall-clock time
- Time per agent step
- Number of model calls
- Number of sequential rounds
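The measurements above can be collected with a small timing wrapper around each agent step. This is a minimal sketch using hypothetical stand-in agent functions (`research`, `write`) rather than real API calls; in the exercises later in this session, the same wrapper would go around `research_agent` and `writer_agent`.

```python
import time

def timed_step(name: str, fn, *args, timings: dict):
    """Run one agent step and record its wall-clock time under `name`."""
    start = time.perf_counter()
    result = fn(*args)
    timings[name] = round(time.perf_counter() - start, 3)
    return result

# Hypothetical stand-ins for real agent calls, used only for illustration.
def research(topic: str) -> str:
    return f"notes on {topic}"

def write(notes: str) -> str:
    return f"summary of {notes}"

timings: dict = {}
notes = timed_step("research", research, "vertical farming", timings=timings)
summary = timed_step("writer", write, notes, timings=timings)
total_seconds = round(sum(timings.values()), 3)
```

Per-step timings make it easy to see which agent is the bottleneck, rather than only knowing the total wall-clock time.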
2.4 Cost
More agents usually mean more tokens and more API calls.
Questions to ask:
- Is the quality improvement worth the added cost?
- Which steps consume the most tokens?
- Can some agents be simplified or removed?
Common measurements
- Number of requests
- Input/output token estimates
- Relative cost per run
- Cost per successful task
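Cost per successful task is worth spelling out, because failed runs still consume budget. The figures below are illustrative, not real pricing:

```python
def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by successful runs; failures still cost money."""
    return float("inf") if successes == 0 else total_cost / successes

# Illustrative numbers: 20 runs at roughly $0.01 each, 16 of which passed.
total_cost = 20 * 0.01
per_success = round(cost_per_success(total_cost, 16), 4)
```

With these numbers the effective cost is $0.0125 per success, 25% above the nominal per-run cost, which is the kind of gap a raw per-run figure hides.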
2.5 Robustness
A good multi-agent system should behave reasonably across different inputs.
Questions to ask:
- Does it work only on easy cases?
- Does it fail badly on ambiguous tasks?
- How sensitive is it to prompt wording?
- Does one agent collapse when upstream output is slightly malformed?
What robustness testing looks like
- Easy/medium/hard task sets
- Noisy or incomplete inputs
- Edge cases
- Repeated runs with minor prompt variations
2.6 Interpretability and Debuggability
You should be able to understand why the system failed.
Questions to ask:
- Can you inspect each agent’s output?
- Can you trace where the failure began?
- Do prompts and outputs make debugging possible?
This is one of the major benefits of multi-agent design: failures can be localized if the workflow is transparent.
3. Common Failure Modes in Multi-Agent Systems
Below are recurring patterns you should watch for.
3.1 Error Propagation
A mistake made early is amplified later.
Example:
The planner misunderstands the task, and all subsequent agents execute the wrong plan.
3.2 Hallucinated Handoffs
An agent claims that another agent found something that was never actually produced.
Example:
The writer says “Based on the researcher’s evidence…” even though the researcher provided no such evidence.
3.3 Over-Decomposition
The task is split into too many agents or too many steps.
Symptoms:
- High latency
- High cost
- Minimal quality gains
- Excessive repetition
3.4 Under-Specification
Agents are not given clear roles, boundaries, or output schemas.
Symptoms:
- Overlapping work
- Inconsistent formats
- Missing key information in handoffs
3.5 Conflicting Objectives
Different agents optimize for different things.
Example:
A summarizer aims to be concise while a reviewer demands exhaustive detail.
3.6 Reviewer Ineffectiveness
The reviewer agent exists but adds little value.
Symptoms:
- Generic feedback only
- Misses important errors
- Suggests changes that do not improve outcomes
4. Evaluation Strategies
No single evaluation method is enough. In practice, combine several.
4.1 Human Evaluation
Humans review outputs against a rubric.
Strengths
- Good for nuanced quality judgments
- Useful for early-stage systems
- Can detect real-world usefulness
Weaknesses
- Slow
- Expensive
- Hard to scale consistently
4.2 Rubric-Based Scoring
Define explicit criteria and assign scores.
Example rubric for a research-summary workflow
- Accuracy: 0–5
- Coverage: 0–5
- Clarity: 0–5
- Proper structure: 0–3
- Evidence usage: 0–5
This improves consistency and can later be used by human reviewers or LLM judges.
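One simple way to make rubric scores comparable across rubrics with different maximums is to normalize the total. This sketch encodes the example rubric above; the category names and maximums mirror that rubric:

```python
# Maximum score per rubric category, matching the example rubric.
RUBRIC_MAX = {
    "accuracy": 5,
    "coverage": 5,
    "clarity": 5,
    "proper_structure": 3,
    "evidence_usage": 5,
}

def normalized_score(scores: dict) -> float:
    """Sum category scores and normalize by the rubric's maximum total (0-1)."""
    total = sum(scores[name] for name in RUBRIC_MAX)
    return round(total / sum(RUBRIC_MAX.values()), 3)

example = {"accuracy": 4, "coverage": 5, "clarity": 4,
           "proper_structure": 3, "evidence_usage": 4}
```

A normalized score lets you compare runs even if you later add or drop rubric categories.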
4.3 Automated Checks
Useful when outputs must satisfy strict requirements.
Examples
- JSON schema validation
- Presence/absence of required sections
- Keyword or entity matching
- Unit-test-like assertions
- Regex checks for format compliance
Automated checks are fast and objective, but they usually capture only part of quality.
4.4 LLM-as-Judge
A model evaluates outputs according to a prompt and rubric.
Good uses
- Comparing two system variants
- Scoring qualitative dimensions
- Classifying whether requirements were met
Risks
- Judge bias
- Inconsistent scoring
- Overly generous assessments
- Prompt sensitivity
Best practice
Use LLM judges together with:
- fixed rubrics
- spot-checked human review
- simple automated checks
4.5 Comparative Evaluation
Instead of asking “Is this output good?”, ask:
- Which of these two systems performed better?
- Which version was more accurate, useful, or concise?
Pairwise comparison often produces more reliable signal than absolute scoring.
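A pairwise judge can reuse the same `call_model` helper from the exercises below; the testable parts are the prompt builder and the verdict parser. This is a sketch, and in practice you should randomize which system is labeled A versus B to reduce position bias:

```python
def build_pairwise_prompt(task: str, output_a: str, output_b: str) -> str:
    """Ask a judge model to pick the better of two anonymized outputs."""
    return (
        f"Task: {task}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output better satisfies the task? "
        "Answer with exactly one word: A, B, or TIE."
    )

def parse_verdict(raw: str) -> str:
    """Map the judge's raw reply to a verdict, defaulting to TIE on noise."""
    token = raw.strip().upper()
    return token if token in {"A", "B", "TIE"} else "TIE"
```

Defaulting to TIE on an unparseable reply is a conservative choice: a noisy judge response should not be counted as a win for either variant.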
5. Designing an Evaluation Plan
A practical evaluation plan usually includes the following pieces.
5.1 Define the Workflow
Write down:
- agent roles
- inputs and outputs
- handoff format
- number of turns or rounds
- expected final deliverable
5.2 Create a Test Set
Build a small but diverse benchmark.
Example categories
- straightforward tasks
- ambiguous tasks
- tasks with incomplete information
- tasks requiring strict formatting
- failure-triggering edge cases
Start with 5–10 cases before scaling up.
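A test set like this can be a plain list of dictionaries, one per case, tagged by category so you can slice results later. The field names here are illustrative, not a required schema:

```python
# A small benchmark: each case carries a category tag and simple expectations.
TEST_CASES = [
    {"id": "easy-01", "category": "straightforward",
     "topic": "Benefits of dedicated bike lanes",
     "must_include": ["Title:", "Summary:", "Key Takeaways:"]},
    {"id": "ambig-01", "category": "ambiguous",
     "topic": "Is automation good?",
     "must_include": ["Title:"]},
    {"id": "edge-01", "category": "edge_case",
     "topic": "",  # empty topic, intended to trigger failures
     "must_include": []},
]

def cases_by_category(cases: list, category: str) -> list:
    """Select the test cases tagged with a given category."""
    return [c for c in cases if c["category"] == category]
```

Keeping the category on each case means aggregate metrics can be broken down by difficulty, which is how edge-case weaknesses show up.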
5.3 Define Success Criteria
For each task, specify:
- must-have requirements
- optional nice-to-have qualities
- expected structure
- acceptable failure boundaries
5.4 Decide Metrics
Typical metrics include:
- final score
- success rate
- average latency
- average number of calls
- pass rate on automated checks
- reviewer usefulness score
5.5 Log Everything
To debug and compare systems, log:
- prompts
- outputs
- timestamps
- agent names
- evaluation scores
- failures and exceptions
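A minimal way to log all of this is one JSON object per line (JSONL), appended as each step completes. This sketch writes to a temporary directory for demonstration; in a real harness you would pick a persistent log path:

```python
import json
import tempfile
import time
from pathlib import Path

def log_run(path: Path, record: dict) -> None:
    """Append one run record as a single JSON line, stamped with the time."""
    stamped = {"timestamp": time.time(), **record}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(stamped) + "\n")

# Demo: write two records to a temporary log and read them back.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_path, {"agent": "research", "output": "- point one", "score": 1})
log_run(log_path, {"agent": "writer", "output": "Title: ...", "score": 3})
records = [json.loads(line) for line in log_path.read_text().splitlines()]
```

JSONL is append-only and line-oriented, so logs survive crashes mid-run and can be loaded later for comparison across system variants.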
6. Hands-On Exercise 1: Build a Simple Two-Agent Workflow
In this exercise, learners will implement:
- a Research Agent
- a Writer Agent
- a lightweight evaluation setup
The task: produce a short structured summary from a topic prompt.
This is intentionally simple so the evaluation concepts stay clear.
6.1 Setup
Install the OpenAI Python SDK:
pip install openai
Set your API key:
export OPENAI_API_KEY="your_api_key_here"
6.2 Python Script: Two-Agent Workflow
"""
two_agent_workflow.py
A simple multi-agent workflow using the OpenAI Responses API.
Agents:
1. Research Agent: extracts key points on a topic
2. Writer Agent: turns key points into a structured summary
Model used:
- gpt-5.4-mini
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
from openai import OpenAI
# Create a reusable client instance.
client = OpenAI()
MODEL = "gpt-5.4-mini"
def call_model(system_prompt: str, user_prompt: str) -> str:
"""
Calls the OpenAI Responses API and returns plain text output.
Args:
system_prompt: The system instruction defining the agent role.
user_prompt: The task-specific prompt.
Returns:
Model-generated text as a string.
"""
response = client.responses.create(
model=MODEL,
input=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return response.output_text.strip()
def research_agent(topic: str) -> str:
"""
Produces concise bullet points about a topic.
This simulates an upstream information-gathering agent.
"""
system_prompt = (
"You are a research assistant. "
"Produce 4 to 6 concise factual bullet points about the given topic. "
"Be clear, avoid unsupported claims, and do not write a full essay."
)
user_prompt = f"Topic: {topic}"
return call_model(system_prompt, user_prompt)
def writer_agent(topic: str, research_notes: str) -> str:
"""
Produces a structured summary using the research notes.
This simulates a downstream synthesis agent.
"""
system_prompt = (
"You are a technical writer. "
"Using the provided research notes, write a short summary with exactly "
"these sections:\n"
"Title:\n"
"Summary:\n"
"Key Takeaways:\n"
"The writing should be clear and concise."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
"Write the final structured summary."
)
return call_model(system_prompt, user_prompt)
def run_workflow(topic: str) -> dict:
"""
Runs the two-agent pipeline and returns all intermediate and final outputs.
Args:
topic: The topic to summarize.
Returns:
A dictionary containing the topic, research notes, and final summary.
"""
research_notes = research_agent(topic)
final_summary = writer_agent(topic, research_notes)
return {
"topic": topic,
"research_notes": research_notes,
"final_summary": final_summary,
}
if __name__ == "__main__":
topic = "Benefits and risks of autonomous delivery robots in cities"
result = run_workflow(topic)
print("=== TOPIC ===")
print(result["topic"])
print("\n=== RESEARCH NOTES ===")
print(result["research_notes"])
print("\n=== FINAL SUMMARY ===")
print(result["final_summary"])
6.3 Example Output
=== TOPIC ===
Benefits and risks of autonomous delivery robots in cities
=== RESEARCH NOTES ===
- Autonomous delivery robots can reduce the cost of last-mile delivery.
- They may improve delivery speed for short urban routes.
- These robots can help reduce human workload for repetitive local deliveries.
- Challenges include pedestrian safety, sidewalk congestion, and navigation in complex environments.
- Regulatory uncertainty can slow adoption across different cities.
- Weather, vandalism, and theft can affect operational reliability.
=== FINAL SUMMARY ===
Title:
Autonomous Delivery Robots in Cities
Summary:
Autonomous delivery robots can make last-mile logistics more efficient by lowering costs and handling short urban deliveries. They may also reduce routine workload for human workers. However, their use introduces challenges related to pedestrian safety, sidewalk congestion, regulation, and operational reliability.
Key Takeaways:
- They can improve efficiency in urban delivery operations.
- Adoption depends on safety, regulation, and public-space management.
- Real-world reliability is affected by weather and security risks.
7. Hands-On Exercise 2: Add Evaluation with Automated Checks
Now we evaluate the workflow using simple deterministic checks.
We will test for:
- presence of required sections
- summary length
- whether research notes contain enough bullet points
7.1 Python Script: Automated Evaluation
"""
automated_evaluation.py
Adds simple automated checks for a two-agent workflow.
Checks:
- Research agent produced at least 4 bullet points
- Final summary contains required sections
- Summary section is not empty
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
import re
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-5.4-mini"
def call_model(system_prompt: str, user_prompt: str) -> str:
"""
Calls the OpenAI Responses API and returns plain text output.
"""
response = client.responses.create(
model=MODEL,
input=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return response.output_text.strip()
def research_agent(topic: str) -> str:
"""
Produces concise bullet points about a topic.
"""
system_prompt = (
"You are a research assistant. "
"Produce 4 to 6 concise factual bullet points about the given topic. "
"Use one bullet per line starting with '- '."
)
user_prompt = f"Topic: {topic}"
return call_model(system_prompt, user_prompt)
def writer_agent(topic: str, research_notes: str) -> str:
"""
Produces a structured summary with fixed sections.
"""
system_prompt = (
"You are a technical writer. "
"Using the provided research notes, write a short summary with exactly "
"these sections:\n"
"Title:\n"
"Summary:\n"
"Key Takeaways:\n"
"Under 'Key Takeaways', include 2 or 3 bullet points."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
"Write the final structured summary."
)
return call_model(system_prompt, user_prompt)
def count_bullets(text: str) -> int:
"""
Counts lines that start with '- '.
"""
return sum(1 for line in text.splitlines() if line.strip().startswith("- "))
def contains_required_sections(text: str) -> bool:
"""
Checks whether all required section headers exist.
"""
required_headers = ["Title:", "Summary:", "Key Takeaways:"]
return all(header in text for header in required_headers)
def extract_summary_section(text: str) -> str:
"""
Extracts the content after 'Summary:' and before 'Key Takeaways:'.
Returns an empty string if parsing fails.
"""
match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
return match.group(1).strip() if match else ""
def evaluate_output(research_notes: str, final_summary: str) -> dict:
"""
Runs deterministic checks and returns a score summary.
"""
checks = {
"research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
"final_has_required_sections": contains_required_sections(final_summary),
"summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
}
score = sum(checks.values())
max_score = len(checks)
return {
"checks": checks,
"score": score,
"max_score": max_score,
}
def run_workflow(topic: str) -> dict:
"""
Runs the workflow and evaluates the output.
"""
research_notes = research_agent(topic)
final_summary = writer_agent(topic, research_notes)
evaluation = evaluate_output(research_notes, final_summary)
return {
"topic": topic,
"research_notes": research_notes,
"final_summary": final_summary,
"evaluation": evaluation,
}
if __name__ == "__main__":
topic = "How electric buses affect public transportation systems"
result = run_workflow(topic)
print("=== TOPIC ===")
print(result["topic"])
print("\n=== RESEARCH NOTES ===")
print(result["research_notes"])
print("\n=== FINAL SUMMARY ===")
print(result["final_summary"])
print("\n=== EVALUATION ===")
for check_name, passed in result["evaluation"]["checks"].items():
print(f"{check_name}: {'PASS' if passed else 'FAIL'}")
print(
f"\nOverall Score: "
f"{result['evaluation']['score']}/{result['evaluation']['max_score']}"
)
7.2 Example Output
=== EVALUATION ===
research_has_at_least_4_bullets: PASS
final_has_required_sections: PASS
summary_section_not_empty: PASS
Overall Score: 3/3
8. Hands-On Exercise 3: Use an LLM Judge for Rubric Scoring
Now we add a model-based evaluator. This is useful for dimensions such as:
- clarity
- coverage
- faithfulness to research notes
- usefulness of final summary
The evaluator model will score the output against a rubric.
8.1 Rubric Design
We will score each category from 1 to 5:
- Clarity: Is the summary easy to understand?
- Coverage: Does it include important points from the research notes?
- Faithfulness: Does it avoid introducing unsupported claims?
- Structure: Does it follow the requested format well?
8.2 Python Script: LLM-as-Judge
"""
llm_judge_evaluation.py
Uses the OpenAI Responses API both for generation and for evaluation.
Workflow:
1. Research Agent creates notes
2. Writer Agent creates a structured summary
3. Judge Agent scores the final summary using a rubric
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
import json
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-5.4-mini"
def call_model(system_prompt: str, user_prompt: str) -> str:
"""
Calls the OpenAI Responses API and returns plain text output.
"""
response = client.responses.create(
model=MODEL,
input=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return response.output_text.strip()
def research_agent(topic: str) -> str:
"""
Produces concise factual notes on the topic.
"""
system_prompt = (
"You are a research assistant. "
"Produce 4 to 6 concise factual bullet points about the given topic. "
"Use one bullet per line starting with '- '."
)
user_prompt = f"Topic: {topic}"
return call_model(system_prompt, user_prompt)
def writer_agent(topic: str, research_notes: str) -> str:
"""
Produces the final structured summary.
"""
system_prompt = (
"You are a technical writer. "
"Using the provided research notes, write a short summary with exactly "
"these sections:\n"
"Title:\n"
"Summary:\n"
"Key Takeaways:\n"
"Under 'Key Takeaways', include 2 or 3 bullet points."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
"Write the final structured summary."
)
return call_model(system_prompt, user_prompt)
def judge_output(topic: str, research_notes: str, final_summary: str) -> dict:
"""
Uses an LLM judge to score the final summary against a rubric.
The model is instructed to return valid JSON only.
"""
system_prompt = (
"You are an evaluator for a multi-agent workflow. "
"Score the final summary using this rubric from 1 to 5:\n"
"- clarity\n"
"- coverage\n"
"- faithfulness_to_notes\n"
"- structure\n\n"
"Return ONLY valid JSON with this exact schema:\n"
"{\n"
' "clarity": <int>,\n'
' "coverage": <int>,\n'
' "faithfulness_to_notes": <int>,\n'
' "structure": <int>,\n'
' "overall_comment": "<short string>"\n'
"}\n"
"Do not include markdown fences."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
f"Final Summary:\n{final_summary}\n\n"
"Evaluate the summary."
)
raw_result = call_model(system_prompt, user_prompt)
try:
return json.loads(raw_result)
except json.JSONDecodeError:
return {
"clarity": 0,
"coverage": 0,
"faithfulness_to_notes": 0,
"structure": 0,
"overall_comment": f"Failed to parse judge output: {raw_result}",
}
def run_workflow(topic: str) -> dict:
"""
Runs generation and judge evaluation.
"""
research_notes = research_agent(topic)
final_summary = writer_agent(topic, research_notes)
judge_scores = judge_output(topic, research_notes, final_summary)
return {
"topic": topic,
"research_notes": research_notes,
"final_summary": final_summary,
"judge_scores": judge_scores,
}
if __name__ == "__main__":
topic = "The role of telemedicine in rural healthcare access"
result = run_workflow(topic)
print("=== TOPIC ===")
print(result["topic"])
print("\n=== RESEARCH NOTES ===")
print(result["research_notes"])
print("\n=== FINAL SUMMARY ===")
print(result["final_summary"])
print("\n=== JUDGE SCORES ===")
print(json.dumps(result["judge_scores"], indent=2))
8.3 Example Output
{
"clarity": 5,
"coverage": 4,
"faithfulness_to_notes": 5,
"structure": 5,
"overall_comment": "Clear and well-structured summary that uses the research notes appropriately, though one additional point from the notes could be incorporated."
}
9. Hands-On Exercise 4: Evaluate Multiple Test Cases
A good evaluation should not rely on a single example. In this exercise, learners will run the workflow across multiple topics and collect summary metrics.
9.1 Python Script: Batch Evaluation Harness
"""
batch_evaluation_harness.py
Runs a simple two-agent workflow over multiple test cases and evaluates
each result with both automated checks and an LLM judge.
Before running:
pip install openai
export OPENAI_API_KEY="your_api_key_here"
"""
import json
import re
import time
from statistics import mean
from openai import OpenAI
client = OpenAI()
MODEL = "gpt-5.4-mini"
def call_model(system_prompt: str, user_prompt: str) -> str:
"""
Calls the OpenAI Responses API and returns plain text output.
"""
response = client.responses.create(
model=MODEL,
input=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return response.output_text.strip()
def research_agent(topic: str) -> str:
"""
Produces bullet-point research notes.
"""
system_prompt = (
"You are a research assistant. "
"Produce 4 to 6 concise factual bullet points about the given topic. "
"Use one bullet per line starting with '- '."
)
user_prompt = f"Topic: {topic}"
return call_model(system_prompt, user_prompt)
def writer_agent(topic: str, research_notes: str) -> str:
"""
Produces the final structured summary.
"""
system_prompt = (
"You are a technical writer. "
"Using the provided research notes, write a short summary with exactly "
"these sections:\n"
"Title:\n"
"Summary:\n"
"Key Takeaways:\n"
"Under 'Key Takeaways', include 2 or 3 bullet points."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
"Write the final structured summary."
)
return call_model(system_prompt, user_prompt)
def count_bullets(text: str) -> int:
"""
Counts lines that start with '- '.
"""
return sum(1 for line in text.splitlines() if line.strip().startswith("- "))
def contains_required_sections(text: str) -> bool:
"""
Verifies presence of required section headers.
"""
required_headers = ["Title:", "Summary:", "Key Takeaways:"]
return all(header in text for header in required_headers)
def extract_summary_section(text: str) -> str:
"""
Extracts the content between 'Summary:' and 'Key Takeaways:'.
"""
match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
return match.group(1).strip() if match else ""
def automated_eval(research_notes: str, final_summary: str) -> dict:
"""
Runs deterministic validation checks.
"""
checks = {
"research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
"final_has_required_sections": contains_required_sections(final_summary),
"summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
}
return {
"checks": checks,
"score": sum(checks.values()),
"max_score": len(checks),
}
def judge_eval(topic: str, research_notes: str, final_summary: str) -> dict:
"""
Uses an LLM judge to score the output.
"""
system_prompt = (
"You are an evaluator for a multi-agent workflow. "
"Score the final summary using this rubric from 1 to 5:\n"
"- clarity\n"
"- coverage\n"
"- faithfulness_to_notes\n"
"- structure\n\n"
"Return ONLY valid JSON with this exact schema:\n"
"{\n"
' "clarity": <int>,\n'
' "coverage": <int>,\n'
' "faithfulness_to_notes": <int>,\n'
' "structure": <int>,\n'
' "overall_comment": "<short string>"\n'
"}\n"
"Do not include markdown fences."
)
user_prompt = (
f"Topic: {topic}\n\n"
f"Research Notes:\n{research_notes}\n\n"
f"Final Summary:\n{final_summary}\n\n"
"Evaluate the summary."
)
raw_result = call_model(system_prompt, user_prompt)
try:
return json.loads(raw_result)
except json.JSONDecodeError:
return {
"clarity": 0,
"coverage": 0,
"faithfulness_to_notes": 0,
"structure": 0,
"overall_comment": f"Failed to parse judge output: {raw_result}",
}
def evaluate_topic(topic: str) -> dict:
"""
Runs the full workflow for a single topic and records latency.
"""
start_time = time.time()
research_notes = research_agent(topic)
final_summary = writer_agent(topic, research_notes)
auto_scores = automated_eval(research_notes, final_summary)
judge_scores = judge_eval(topic, research_notes, final_summary)
elapsed_seconds = round(time.time() - start_time, 2)
return {
"topic": topic,
"research_notes": research_notes,
"final_summary": final_summary,
"automated_evaluation": auto_scores,
"judge_evaluation": judge_scores,
"latency_seconds": elapsed_seconds,
}
def summarize_results(results: list[dict]) -> dict:
"""
Aggregates scores across test cases.
"""
auto_scores = [r["automated_evaluation"]["score"] for r in results]
auto_max = results[0]["automated_evaluation"]["max_score"] if results else 0
clarity_scores = [r["judge_evaluation"]["clarity"] for r in results]
coverage_scores = [r["judge_evaluation"]["coverage"] for r in results]
faithfulness_scores = [r["judge_evaluation"]["faithfulness_to_notes"] for r in results]
structure_scores = [r["judge_evaluation"]["structure"] for r in results]
latencies = [r["latency_seconds"] for r in results]
return {
"num_cases": len(results),
"avg_automated_score": round(mean(auto_scores), 2) if auto_scores else 0,
"automated_score_max": auto_max,
"avg_clarity": round(mean(clarity_scores), 2) if clarity_scores else 0,
"avg_coverage": round(mean(coverage_scores), 2) if coverage_scores else 0,
"avg_faithfulness": round(mean(faithfulness_scores), 2) if faithfulness_scores else 0,
"avg_structure": round(mean(structure_scores), 2) if structure_scores else 0,
"avg_latency_seconds": round(mean(latencies), 2) if latencies else 0,
}
if __name__ == "__main__":
topics = [
"Benefits and challenges of vertical farming",
"How wearable health devices support preventive care",
"The impact of remote work on urban transportation",
"Why data quality matters in machine learning projects",
]
all_results = []
for topic in topics:
print(f"Running evaluation for: {topic}")
result = evaluate_topic(topic)
all_results.append(result)
summary = summarize_results(all_results)
print("\n=== AGGREGATE SUMMARY ===")
print(json.dumps(summary, indent=2))
print("\n=== SAMPLE RESULT ===")
print(json.dumps(all_results[0], indent=2))
9.2 Example Aggregate Output
{
"num_cases": 4,
"avg_automated_score": 3.0,
"automated_score_max": 3,
"avg_clarity": 4.75,
"avg_coverage": 4.25,
"avg_faithfulness": 4.75,
"avg_structure": 5.0,
"avg_latency_seconds": 3.18
}
10. Interpreting Results
Once you have data, the next step is deciding what it means.
Example interpretation patterns
High structure, low coverage
- The writer follows formatting rules
- But important research points are missing
- Possible fix: improve handoff quality or strengthen writer instructions
High coverage, low faithfulness
- The final answer includes many points
- But adds unsupported details
- Possible fix: instruct the writer to use only supplied notes
Good quality, high latency
- The workflow is effective but slow
- Possible fix: simplify agent steps or reduce review rounds
Strong average, weak edge-case performance
- The system works on standard tasks
- But fails on ambiguous or noisy inputs
- Possible fix: expand test coverage and harden prompts
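The interpretation patterns above can be encoded as simple triage rules over average rubric scores. The thresholds here are illustrative starting points, not fixed rules:

```python
def diagnose(avg: dict) -> list:
    """Turn average rubric scores into triage hints (illustrative thresholds)."""
    hints = []
    if avg["structure"] >= 4 and avg["coverage"] <= 3:
        hints.append("Improve handoff quality or strengthen writer instructions.")
    if avg["coverage"] >= 4 and avg["faithfulness"] <= 3:
        hints.append("Instruct the writer to use only the supplied notes.")
    if not hints:
        hints.append("No obvious pattern; inspect individual failing cases.")
    return hints
```

Rules like these do not replace reading the transcripts, but they point a reviewer at the most likely root cause first.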
11. Discussion Prompts
Use these for live discussion or reflection:
- What should count as success in a multi-agent system: final answer quality or collaboration quality?
- When is a reviewer agent actually useful?
- How would you compare a two-agent and three-agent architecture fairly?
- What kinds of tasks benefit most from intermediate-output evaluation?
- What are the dangers of relying only on LLM judges?
12. Best Practices for Evaluating Multi-Agent Systems
- Start with a small, representative benchmark
- Score both final outputs and intermediate artifacts
- Combine automated checks with qualitative review
- Measure cost and latency alongside quality
- Use pairwise comparisons when testing variants
- Keep prompts, inputs, and outputs logged for debugging
- Evaluate across multiple runs and task types
- Treat evaluator prompts as part of the system design
13. Suggested Instructor-Led Walkthrough
A practical classroom flow for the hands-on portion:
- Run the two-agent pipeline on one topic
- Inspect intermediate research notes
- Apply automated checks
- Add LLM judge scoring
- Run across multiple topics
- Compare results and diagnose where failures originate
- Brainstorm prompt or workflow improvements
14. Wrap-Up
In this session, learners moved from the idea of “Did the model answer correctly?” to a more realistic question: “How well did the multi-agent system work as a coordinated process?” That shift is essential in agentic development.
Key takeaway:
Evaluating multi-agent systems requires measuring both outcome quality and interaction quality.
A good evaluation setup helps you improve prompts, redesign workflows, reduce cost, and make agent systems more reliable.
Useful Resources
- OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI API docs: https://platform.openai.com/docs
- OpenAI Python SDK: https://github.com/openai/openai-python
- Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
- Python re module docs: https://docs.python.org/3/library/re.html
- Python json module docs: https://docs.python.org/3/library/json.html
- Python statistics module docs: https://docs.python.org/3/library/statistics.html
Optional Homework
- Add a Reviewer Agent and evaluate whether it improves final quality enough to justify added latency.
- Compare two prompt versions for the Writer Agent using the same benchmark.
- Add CSV or JSONL logging to store all runs for later analysis.
- Create edge-case tasks designed to trigger coordination failures.
- Build a pairwise evaluator that compares outputs from two workflow variants.
Quick Recap
- Multi-agent evaluation is broader than final-answer evaluation
- Key dimensions include success, coordination, latency, cost, and robustness
- Automated checks are useful but limited
- LLM judges help evaluate nuanced qualities
- Batch evaluation reveals patterns hidden by single examples
- Logged intermediate outputs make debugging much easier