Session 4: Evaluating Multi-Agent Performance

Synopsis

Focuses on how to assess collaborative quality, communication overhead, task completion, and system efficiency. Learners apply earlier evaluation methods to more complex distributed agent behaviors.

Session Content
Session Overview

In this session, learners will explore how to evaluate multi-agent systems in a structured, repeatable, and practical way. The focus is on measuring the quality, reliability, and efficiency of agent collaboration rather than evaluating only single-model outputs. By the end of the session, learners will understand evaluation dimensions for multi-agent workflows, common failure modes, and how to build a lightweight evaluation harness in Python using the OpenAI Responses API with gpt-5.4-mini.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain why multi-agent evaluation is different from single-agent or single-prompt evaluation
  • Identify key evaluation dimensions such as task success, coordination quality, latency, cost, and robustness
  • Recognize common failure patterns in multi-agent systems
  • Design simple test cases and scoring rubrics for multi-agent workflows
  • Implement a basic evaluator in Python using the OpenAI SDK and Responses API
  • Analyze agent outputs and compare alternative multi-agent designs

Agenda for a 45-Minute Session

  • 0–5 min: Why evaluating multi-agent systems is hard
  • 5–15 min: Core evaluation dimensions and failure modes
  • 15–25 min: Evaluation strategies: human review, rubric scoring, LLM-as-judge, and automated checks
  • 25–40 min: Hands-on: build a simple evaluation harness for a two-agent workflow
  • 40–45 min: Wrap-up, discussion, and extensions

1. Why Multi-Agent Evaluation Is Different

A single-agent system usually produces one output for one task. Evaluation often asks:

  • Was the answer correct?
  • Was it helpful?
  • Was it safe?
  • Was it concise?

A multi-agent system introduces more complexity:

  • Multiple intermediate outputs exist
  • Agents may depend on each other’s work
  • Errors can propagate across steps
  • Coordination may be successful even when the final answer is weak, or vice versa
  • Two systems may achieve the same result with very different costs and latencies

Example

Suppose you have:

  • Research Agent: gathers facts
  • Writer Agent: drafts a summary
  • Reviewer Agent: checks for omissions or inaccuracies

A weak final answer could be caused by:

  • Poor fact gathering by the research agent
  • Good research but poor synthesis by the writer
  • Good draft but ineffective review
  • Correct review feedback that the writer ignored
  • Excessive back-and-forth causing latency/cost issues

So evaluation must consider:

  • Final output quality
  • Intermediate output quality
  • Coordination effectiveness
  • Resource usage
  • Stability across repeated runs

2. Core Evaluation Dimensions

When evaluating a multi-agent system, define metrics across several categories.

2.1 Task Success

This is the most important dimension.

Questions to ask:

  • Did the system solve the task?
  • Was the answer accurate?
  • Did it satisfy user constraints?
  • Was the requested format followed?

Example metrics

  • Binary success/failure
  • Accuracy score from 1–5
  • Format compliance: pass/fail
  • Coverage score: how many required points were addressed
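The metrics above can be collected into one record per test case. A minimal sketch, assuming a simple substring match for coverage; the field names and the 0.8 success threshold are illustrative, not part of any standard:

```python
def coverage_score(output: str, required_points: list[str]) -> float:
    """Fraction of required points mentioned (case-insensitive substring match)."""
    if not required_points:
        return 1.0
    hits = sum(1 for point in required_points if point.lower() in output.lower())
    return hits / len(required_points)


def task_success_record(output: str, required_points: list[str], format_ok: bool) -> dict:
    """Bundles the task-success metrics above into one record per test case."""
    coverage = coverage_score(output, required_points)
    return {
        "success": format_ok and coverage >= 0.8,  # binary success/failure (threshold is arbitrary)
        "coverage": round(coverage, 2),            # how many required points were addressed
        "format_compliance": format_ok,            # pass/fail
    }
```

Substring matching is crude (it misses paraphrases), but it keeps the scoring deterministic; an LLM judge can cover the semantic cases later.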

2.2 Coordination Quality

The agents in a multi-agent system should work together effectively.

Questions to ask:

  • Did agents pass useful context to one another?
  • Did downstream agents use upstream outputs correctly?
  • Was work duplicated unnecessarily?
  • Did the workflow reduce confusion or increase it?

Indicators of poor coordination

  • Repeating the same work in multiple agents
  • Contradictory outputs between agents
  • Missing handoff details
  • A reviewer flagging issues that were already resolved
  • A planner creating steps that executors ignore

2.3 Latency


Multi-agent systems often trade speed for quality: extra agents and review rounds can improve results, but each one adds delay.

Questions to ask:

  • How long did the full workflow take?
  • Which agent caused the bottleneck?
  • Did multiple rounds of refinement improve results enough to justify the delay?

Common measurements

  • Total wall-clock time
  • Time per agent step
  • Number of model calls
  • Number of sequential rounds

2.4 Cost

More agents usually mean more tokens and more API calls.

Questions to ask:

  • Is the quality improvement worth the added cost?
  • Which steps consume the most tokens?
  • Can some agents be simplified or removed?

Common measurements

  • Number of requests
  • Input/output token estimates
  • Relative cost per run
  • Cost per successful task
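These measurements reduce to simple arithmetic once token counts are available. A minimal sketch; the per-1K-token rates are purely hypothetical placeholders, not real pricing:

```python
def estimate_run_cost(input_tokens: int, output_tokens: int,
                      input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Relative cost of one run from token counts and per-1K-token rates."""
    return (input_tokens / 1000) * input_rate_per_1k + (output_tokens / 1000) * output_rate_per_1k


def cost_per_successful_task(run_costs: list[float], successes: list[bool]) -> float:
    """Total spend divided by the number of successful runs."""
    wins = sum(successes)
    return sum(run_costs) / wins if wins else float("inf")
```

Cost per successful task is often the most revealing metric: a cheaper-per-run system that fails half the time can still be the more expensive one overall.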

2.5 Robustness

A good multi-agent system should behave reasonably across different inputs.

Questions to ask:

  • Does it work only on easy cases?
  • Does it fail badly on ambiguous tasks?
  • How sensitive is it to prompt wording?
  • Does one agent collapse when upstream output is slightly malformed?

What robustness testing looks like

  • Easy/medium/hard task sets
  • Noisy or incomplete inputs
  • Edge cases
  • Repeated runs with minor prompt variations
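Repeated runs can be summarized with a small stability check. A sketch under the assumption that the workflow is wrapped in a `run_task` callable returning pass/fail; the "stable" definition here (all-pass or all-fail) is deliberately strict and illustrative:

```python
def stability_check(run_task, n_runs: int = 5) -> dict:
    """Repeats a pass/fail task and reports how consistent the outcomes are."""
    outcomes = [bool(run_task()) for _ in range(n_runs)]
    pass_rate = sum(outcomes) / n_runs
    return {
        "pass_rate": pass_rate,
        # Stable means deterministic across runs; anything in between is flaky.
        "stable": pass_rate in (0.0, 1.0),
    }
```

A flaky result (pass rate strictly between 0 and 1) usually signals prompt sensitivity or a brittle handoff rather than a single reproducible bug.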

2.6 Interpretability and Debuggability

You should be able to understand why the system failed.

Questions to ask:

  • Can you inspect each agent’s output?
  • Can you trace where the failure began?
  • Do prompts and outputs make debugging possible?

This is one of the major benefits of multi-agent design: failures can be localized if the workflow is transparent.


3. Common Failure Modes in Multi-Agent Systems

Below are recurring patterns you should watch for.

3.1 Error Propagation

A mistake made early is amplified later.

Example:
The planner misunderstands the task, and all subsequent agents execute the wrong plan.


3.2 Hallucinated Handoffs

An agent claims that another agent found something that was never actually produced.

Example:
The writer says “Based on the researcher’s evidence…” even though the researcher provided no such evidence.


3.3 Over-Decomposition

The task is split into too many agents or too many steps.

Symptoms:

  • High latency
  • High cost
  • Minimal quality gains
  • Excessive repetition

3.4 Under-Specification

Agents are not given clear roles, boundaries, or output schemas.

Symptoms:

  • Overlapping work
  • Inconsistent formats
  • Missing key information in handoffs

3.5 Conflicting Objectives

Different agents optimize for different things.

Example:
A summarizer aims to be concise while a reviewer demands exhaustive detail.


3.6 Reviewer Ineffectiveness

The reviewer agent exists but adds little value.

Symptoms:

  • Generic feedback only
  • Misses important errors
  • Suggests changes that do not improve outcomes

4. Evaluation Strategies

No single evaluation method is enough. In practice, combine several.

4.1 Human Evaluation

Humans review outputs against a rubric.

Strengths

  • Good for nuanced quality judgments
  • Useful for early-stage systems
  • Can detect real-world usefulness

Weaknesses

  • Slow
  • Expensive
  • Hard to scale consistently

4.2 Rubric-Based Scoring

Define explicit criteria and assign scores.

Example rubric for a research-summary workflow

  • Accuracy: 0–5
  • Coverage: 0–5
  • Clarity: 0–5
  • Proper structure: 0–3
  • Evidence usage: 0–5

This improves consistency and can later be used by human reviewers or LLM judges.


4.3 Automated Checks

Useful when outputs must satisfy strict requirements.

Examples

  • JSON schema validation
  • Presence/absence of required sections
  • Keyword or entity matching
  • Unit-test-like assertions
  • Regex checks for format compliance

Automated checks are fast and objective, but they usually capture only part of quality.
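For example, when an agent must return JSON with known fields, a lightweight validator can act as an automated check. This is a hand-rolled sketch (a library such as jsonschema could do the same more thoroughly):

```python
import json


def validate_json_output(raw: str, required_fields: dict[str, type]) -> list[str]:
    """Returns a list of problems; an empty list means the check passed."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    for field, expected_type in required_fields.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems
```

Returning a list of problems instead of a bare boolean makes failures debuggable: the log shows which field broke, not just that something did.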


4.4 LLM-as-Judge

A model evaluates outputs according to a prompt and rubric.

Good uses

  • Comparing two system variants
  • Scoring qualitative dimensions
  • Classifying whether requirements were met

Risks

  • Judge bias
  • Inconsistent scoring
  • Overly generous assessments
  • Prompt sensitivity

Best practice

Use LLM judges together with:

  • fixed rubrics
  • spot-checked human review
  • simple automated checks

4.5 Comparative Evaluation

Instead of asking “Is this output good?”, ask:

  • Which of these two systems performed better?
  • Which version was more accurate, useful, or concise?

Pairwise comparison often produces a more reliable signal than absolute scoring, because judges (human or LLM) are better at ranking two outputs than at assigning consistent standalone scores.
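Pairwise judges are sensitive to presentation order, so a common mitigation is to judge each pair twice with the order swapped and only declare a winner when both rounds agree. A sketch with the judge supplied as a callable, so any judge (LLM-based or otherwise, with the interface assumed here) can plug in:

```python
def pairwise_verdict(judge, output_a: str, output_b: str) -> str:
    """
    Judges A vs. B twice with the presentation order swapped to reduce
    position bias. `judge(first, second)` must return "first" or "second".
    Returns "A", "B", or "tie".
    """
    first_round = judge(output_a, output_b)   # A shown first
    second_round = judge(output_b, output_a)  # B shown first
    a_wins = (first_round == "first") + (second_round == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the judge flipped with the order: treat as no signal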


5. Designing an Evaluation Plan

A practical evaluation plan usually includes the following pieces.

5.1 Define the Workflow

Write down:

  • agent roles
  • inputs and outputs
  • handoff format
  • number of turns or rounds
  • expected final deliverable

5.2 Create a Test Set

Build a small but diverse benchmark.

Example categories

  • straightforward tasks
  • ambiguous tasks
  • tasks with incomplete information
  • tasks requiring strict formatting
  • failure-triggering edge cases

Start with 5–10 cases before scaling up.


5.3 Define Success Criteria

For each task, specify:

  • must-have requirements
  • optional nice-to-have qualities
  • expected structure
  • acceptable failure boundaries
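One way to make these criteria executable is a per-task spec that a check function can enforce. A minimal sketch; the spec fields and the example task are hypothetical:

```python
TEST_CASE = {
    "task": "Summarize the benefits and risks of vertical farming",
    "must_have": ["benefit", "risk"],             # must-have requirements
    "expected_sections": ["Title:", "Summary:"],  # expected structure
    "max_words": 250,                             # acceptable failure boundary
}


def meets_must_haves(output: str, spec: dict) -> bool:
    """Pass only if every required keyword and section header appears and the length bound holds."""
    text = output.lower()
    if any(word not in text for word in spec["must_have"]):
        return False
    if any(section not in output for section in spec["expected_sections"]):
        return False
    return len(output.split()) <= spec["max_words"]
```

Keeping the spec as data rather than code means the same checker runs unchanged as the test set grows.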

5.4 Decide Metrics

Typical metrics include:

  • final score
  • success rate
  • average latency
  • average number of calls
  • pass rate on automated checks
  • reviewer usefulness score

5.5 Log Everything

To debug and compare systems, log:

  • prompts
  • outputs
  • timestamps
  • agent names
  • evaluation scores
  • failures and exceptions

6. Hands-On Exercise 1: Build a Simple Two-Agent Workflow

In this exercise, learners will implement:

  • a Research Agent
  • a Writer Agent
  • a lightweight evaluation setup

The task: produce a short structured summary from a topic prompt.

This is intentionally simple so the evaluation concepts stay clear.


6.1 Setup

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

6.2 Python Script: Two-Agent Workflow

"""
two_agent_workflow.py

A simple multi-agent workflow using the OpenAI Responses API.
Agents:
1. Research Agent: extracts key points on a topic
2. Writer Agent: turns key points into a structured summary

Model used:
- gpt-5.4-mini

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

from openai import OpenAI

# Create a reusable client instance.
client = OpenAI()

MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.

    Args:
        system_prompt: The system instruction defining the agent role.
        user_prompt: The task-specific prompt.

    Returns:
        Model-generated text as a string.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise bullet points about a topic.
    This simulates an upstream information-gathering agent.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Be clear, avoid unsupported claims, and do not write a full essay."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces a structured summary using the research notes.
    This simulates a downstream synthesis agent.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "The writing should be clear and concise."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def run_workflow(topic: str) -> dict:
    """
    Runs the two-agent pipeline and returns all intermediate and final outputs.

    Args:
        topic: The topic to summarize.

    Returns:
        A dictionary containing the topic, research notes, and final summary.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
    }


if __name__ == "__main__":
    topic = "Benefits and risks of autonomous delivery robots in cities"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])
    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])
    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

6.3 Example Output

=== TOPIC ===
Benefits and risks of autonomous delivery robots in cities

=== RESEARCH NOTES ===
- Autonomous delivery robots can reduce the cost of last-mile delivery.
- They may improve delivery speed for short urban routes.
- These robots can help reduce human workload for repetitive local deliveries.
- Challenges include pedestrian safety, sidewalk congestion, and navigation in complex environments.
- Regulatory uncertainty can slow adoption across different cities.
- Weather, vandalism, and theft can affect operational reliability.

=== FINAL SUMMARY ===
Title:
Autonomous Delivery Robots in Cities

Summary:
Autonomous delivery robots can make last-mile logistics more efficient by lowering costs and handling short urban deliveries. They may also reduce routine workload for human workers. However, their use introduces challenges related to pedestrian safety, sidewalk congestion, regulation, and operational reliability.

Key Takeaways:
- They can improve efficiency in urban delivery operations.
- Adoption depends on safety, regulation, and public-space management.
- Real-world reliability is affected by weather and security risks.

7. Hands-On Exercise 2: Add Evaluation with Automated Checks

Now we evaluate the workflow using simple deterministic checks.

We will test for:

  • presence of required sections
  • summary length
  • whether research notes contain enough bullet points

7.1 Python Script: Automated Evaluation

"""
automated_evaluation.py

Adds simple automated checks for a two-agent workflow.

Checks:
- Research agent produced at least 4 bullet points
- Final summary contains required sections
- Summary section is not empty

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise bullet points about a topic.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces a structured summary with fixed sections.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def count_bullets(text: str) -> int:
    """
    Counts lines that start with '- '.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("- "))


def contains_required_sections(text: str) -> bool:
    """
    Checks whether all required section headers exist.
    """
    required_headers = ["Title:", "Summary:", "Key Takeaways:"]
    return all(header in text for header in required_headers)


def extract_summary_section(text: str) -> str:
    """
    Extracts the content after 'Summary:' and before 'Key Takeaways:'.
    Returns an empty string if parsing fails.
    """
    match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def evaluate_output(research_notes: str, final_summary: str) -> dict:
    """
    Runs deterministic checks and returns a score summary.
    """
    checks = {
        "research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
        "final_has_required_sections": contains_required_sections(final_summary),
        "summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
    }

    score = sum(checks.values())
    max_score = len(checks)

    return {
        "checks": checks,
        "score": score,
        "max_score": max_score,
    }


def run_workflow(topic: str) -> dict:
    """
    Runs the workflow and evaluates the output.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    evaluation = evaluate_output(research_notes, final_summary)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "evaluation": evaluation,
    }


if __name__ == "__main__":
    topic = "How electric buses affect public transportation systems"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])

    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])

    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

    print("\n=== EVALUATION ===")
    for check_name, passed in result["evaluation"]["checks"].items():
        print(f"{check_name}: {'PASS' if passed else 'FAIL'}")

    print(
        f"\nOverall Score: "
        f"{result['evaluation']['score']}/{result['evaluation']['max_score']}"
    )

7.2 Example Output

=== EVALUATION ===
research_has_at_least_4_bullets: PASS
final_has_required_sections: PASS
summary_section_not_empty: PASS

Overall Score: 3/3

8. Hands-On Exercise 3: Use an LLM Judge for Rubric Scoring

Now we add a model-based evaluator. This is useful for dimensions such as:

  • clarity
  • coverage
  • faithfulness to research notes
  • usefulness of final summary

The evaluator model will score the output against a rubric.


8.1 Rubric Design

We will score each category from 1 to 5:

  • Clarity: Is the summary easy to understand?
  • Coverage: Does it include important points from the research notes?
  • Faithfulness: Does it avoid introducing unsupported claims?
  • Structure: Does it follow the requested format well?

8.2 Python Script: LLM-as-Judge

"""
llm_judge_evaluation.py

Uses the OpenAI Responses API both for generation and for evaluation.

Workflow:
1. Research Agent creates notes
2. Writer Agent creates a structured summary
3. Judge Agent scores the final summary using a rubric

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise factual notes on the topic.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces the final structured summary.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def judge_output(topic: str, research_notes: str, final_summary: str) -> dict:
    """
    Uses an LLM judge to score the final summary against a rubric.
    The model is instructed to return valid JSON only.
    """
    system_prompt = (
        "You are an evaluator for a multi-agent workflow. "
        "Score the final summary using this rubric from 1 to 5:\n"
        "- clarity\n"
        "- coverage\n"
        "- faithfulness_to_notes\n"
        "- structure\n\n"
        "Return ONLY valid JSON with this exact schema:\n"
        "{\n"
        '  "clarity": <int>,\n'
        '  "coverage": <int>,\n'
        '  "faithfulness_to_notes": <int>,\n'
        '  "structure": <int>,\n'
        '  "overall_comment": "<short string>"\n'
        "}\n"
        "Do not include markdown fences."
    )

    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        f"Final Summary:\n{final_summary}\n\n"
        "Evaluate the summary."
    )

    raw_result = call_model(system_prompt, user_prompt)

    try:
        return json.loads(raw_result)
    except json.JSONDecodeError:
        return {
            "clarity": 0,
            "coverage": 0,
            "faithfulness_to_notes": 0,
            "structure": 0,
            "overall_comment": f"Failed to parse judge output: {raw_result}",
        }


def run_workflow(topic: str) -> dict:
    """
    Runs generation and judge evaluation.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    judge_scores = judge_output(topic, research_notes, final_summary)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "judge_scores": judge_scores,
    }


if __name__ == "__main__":
    topic = "The role of telemedicine in rural healthcare access"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])

    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])

    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

    print("\n=== JUDGE SCORES ===")
    print(json.dumps(result["judge_scores"], indent=2))

8.3 Example Output

{
  "clarity": 5,
  "coverage": 4,
  "faithfulness_to_notes": 5,
  "structure": 5,
  "overall_comment": "Clear and well-structured summary that uses the research notes appropriately, though one additional point from the notes could be incorporated."
}

9. Hands-On Exercise 4: Evaluate Multiple Test Cases

A good evaluation should not rely on a single example. In this exercise, learners will run the workflow across multiple topics and collect summary metrics.


9.1 Python Script: Batch Evaluation Harness

"""
batch_evaluation_harness.py

Runs a simple two-agent workflow over multiple test cases and evaluates
each result with both automated checks and an LLM judge.

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
import re
import time
from statistics import mean
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces bullet-point research notes.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces the final structured summary.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def count_bullets(text: str) -> int:
    """
    Counts lines that start with '- '.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("- "))


def contains_required_sections(text: str) -> bool:
    """
    Verifies presence of required section headers.
    """
    required_headers = ["Title:", "Summary:", "Key Takeaways:"]
    return all(header in text for header in required_headers)


def extract_summary_section(text: str) -> str:
    """
    Extracts the content between 'Summary:' and 'Key Takeaways:'.
    """
    match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def automated_eval(research_notes: str, final_summary: str) -> dict:
    """
    Runs deterministic validation checks.
    """
    checks = {
        "research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
        "final_has_required_sections": contains_required_sections(final_summary),
        "summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
    }
    return {
        "checks": checks,
        "score": sum(checks.values()),
        "max_score": len(checks),
    }


def judge_eval(topic: str, research_notes: str, final_summary: str) -> dict:
    """
    Uses an LLM judge to score the output.
    """
    system_prompt = (
        "You are an evaluator for a multi-agent workflow. "
        "Score the final summary using this rubric from 1 to 5:\n"
        "- clarity\n"
        "- coverage\n"
        "- faithfulness_to_notes\n"
        "- structure\n\n"
        "Return ONLY valid JSON with this exact schema:\n"
        "{\n"
        '  "clarity": <int>,\n'
        '  "coverage": <int>,\n'
        '  "faithfulness_to_notes": <int>,\n'
        '  "structure": <int>,\n'
        '  "overall_comment": "<short string>"\n'
        "}\n"
        "Do not include markdown fences."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        f"Final Summary:\n{final_summary}\n\n"
        "Evaluate the summary."
    )

    raw_result = call_model(system_prompt, user_prompt)

    try:
        return json.loads(raw_result)
    except json.JSONDecodeError:
        return {
            "clarity": 0,
            "coverage": 0,
            "faithfulness_to_notes": 0,
            "structure": 0,
            "overall_comment": f"Failed to parse judge output: {raw_result}",
        }


def evaluate_topic(topic: str) -> dict:
    """
    Runs the full workflow for a single topic and records latency.
    """
    start_time = time.time()

    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    auto_scores = automated_eval(research_notes, final_summary)
    judge_scores = judge_eval(topic, research_notes, final_summary)

    elapsed_seconds = round(time.time() - start_time, 2)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "automated_evaluation": auto_scores,
        "judge_evaluation": judge_scores,
        "latency_seconds": elapsed_seconds,
    }


def summarize_results(results: list[dict]) -> dict:
    """
    Aggregates scores across test cases.
    """
    auto_scores = [r["automated_evaluation"]["score"] for r in results]
    auto_max = results[0]["automated_evaluation"]["max_score"] if results else 0

    clarity_scores = [r["judge_evaluation"]["clarity"] for r in results]
    coverage_scores = [r["judge_evaluation"]["coverage"] for r in results]
    faithfulness_scores = [r["judge_evaluation"]["faithfulness_to_notes"] for r in results]
    structure_scores = [r["judge_evaluation"]["structure"] for r in results]
    latencies = [r["latency_seconds"] for r in results]

    return {
        "num_cases": len(results),
        "avg_automated_score": round(mean(auto_scores), 2) if auto_scores else 0,
        "automated_score_max": auto_max,
        "avg_clarity": round(mean(clarity_scores), 2) if clarity_scores else 0,
        "avg_coverage": round(mean(coverage_scores), 2) if coverage_scores else 0,
        "avg_faithfulness": round(mean(faithfulness_scores), 2) if faithfulness_scores else 0,
        "avg_structure": round(mean(structure_scores), 2) if structure_scores else 0,
        "avg_latency_seconds": round(mean(latencies), 2) if latencies else 0,
    }


if __name__ == "__main__":
    topics = [
        "Benefits and challenges of vertical farming",
        "How wearable health devices support preventive care",
        "The impact of remote work on urban transportation",
        "Why data quality matters in machine learning projects",
    ]

    all_results = []
    for topic in topics:
        print(f"Running evaluation for: {topic}")
        result = evaluate_topic(topic)
        all_results.append(result)

    summary = summarize_results(all_results)

    print("\n=== AGGREGATE SUMMARY ===")
    print(json.dumps(summary, indent=2))

    print("\n=== SAMPLE RESULT ===")
    print(json.dumps(all_results[0], indent=2))

9.2 Example Aggregate Output

{
  "num_cases": 4,
  "avg_automated_score": 3.0,
  "automated_score_max": 3,
  "avg_clarity": 4.75,
  "avg_coverage": 4.25,
  "avg_faithfulness": 4.75,
  "avg_structure": 5.0,
  "avg_latency_seconds": 3.18
}

10. Interpreting Results

Once you have data, the next step is deciding what it means.

Example interpretation patterns

High structure, low coverage

  • The writer follows formatting rules
  • But important research points are missing
  • Possible fix: improve handoff quality or strengthen writer instructions

High coverage, low faithfulness

  • The final answer includes many points
  • But adds unsupported details
  • Possible fix: instruct the writer to use only supplied notes

Good quality, high latency

  • The workflow is effective but slow
  • Possible fix: simplify agent steps or reduce review rounds

Strong average, weak edge-case performance

  • The system works on standard tasks
  • But fails on ambiguous or noisy inputs
  • Possible fix: expand test coverage and harden prompts

11. Discussion Prompts

Use these for live discussion or reflection:

  1. What should count as success in a multi-agent system: final answer quality or collaboration quality?
  2. When is a reviewer agent actually useful?
  3. How would you compare a two-agent and three-agent architecture fairly?
  4. What kinds of tasks benefit most from intermediate-output evaluation?
  5. What are the dangers of relying only on LLM judges?

12. Best Practices for Evaluating Multi-Agent Systems

  • Start with a small, representative benchmark
  • Score both final outputs and intermediate artifacts
  • Combine automated checks with qualitative review
  • Measure cost and latency alongside quality
  • Use pairwise comparisons when testing variants
  • Keep prompts, inputs, and outputs logged for debugging
  • Evaluate across multiple runs and task types
  • Treat evaluator prompts as part of the system design

13. Suggested Instructor-Led Walkthrough

A practical classroom flow for the hands-on portion:

  1. Run the two-agent pipeline on one topic
  2. Inspect intermediate research notes
  3. Apply automated checks
  4. Add LLM judge scoring
  5. Run across multiple topics
  6. Compare results and diagnose where failures originate
  7. Brainstorm prompt or workflow improvements

14. Wrap-Up

In this session, learners moved from the idea of “Did the model answer correctly?” to a more realistic question: “How well did the multi-agent system work as a coordinated process?” That shift is essential in agentic development.

Key takeaway:

Evaluating multi-agent systems requires measuring both outcome quality and interaction quality.
A good evaluation setup helps you improve prompts, redesign workflows, reduce cost, and make agent systems more reliable.


Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • Python re module docs: https://docs.python.org/3/library/re.html
  • Python json module docs: https://docs.python.org/3/library/json.html
  • Python statistics module docs: https://docs.python.org/3/library/statistics.html

Optional Homework

  1. Add a Reviewer Agent and evaluate whether it improves final quality enough to justify added latency.
  2. Compare two prompt versions for the Writer Agent using the same benchmark.
  3. Add CSV or JSONL logging to store all runs for later analysis.
  4. Create edge-case tasks designed to trigger coordination failures.
  5. Build a pairwise evaluator that compares outputs from two workflow variants.

Quick Recap

  • Multi-agent evaluation is broader than final-answer evaluation
  • Key dimensions include success, coordination, latency, cost, and robustness
  • Automated checks are useful but limited
  • LLM judges help evaluate nuanced qualities
  • Batch evaluation reveals patterns hidden by single examples
  • Logged intermediate outputs make debugging much easier
