Session 4: Evaluating Multi-Agent Performance

Synopsis

Focuses on how to assess collaborative quality, communication overhead, task completion, and system efficiency. Learners apply earlier evaluation methods to more complex distributed agent behaviors.

Session Content
Session Overview

In this session, learners will explore how to evaluate multi-agent systems in a structured, repeatable, and practical way. The focus is on measuring the quality, reliability, and efficiency of agent collaboration rather than evaluating only single-model outputs. By the end of the session, learners will understand evaluation dimensions for multi-agent workflows, common failure modes, and how to build a lightweight evaluation harness in Python using the OpenAI Responses API with gpt-5.4-mini.

Learning Objectives

By the end of this session, learners will be able to:

  • Explain why multi-agent evaluation is different from single-agent or single-prompt evaluation
  • Identify key evaluation dimensions such as task success, coordination quality, latency, cost, and robustness
  • Recognize common failure patterns in multi-agent systems
  • Design simple test cases and scoring rubrics for multi-agent workflows
  • Implement a basic evaluator in Python using the OpenAI SDK and Responses API
  • Analyze agent outputs and compare alternative multi-agent designs

Agenda for a 45-Minute Session

  • 0–5 min: Why evaluating multi-agent systems is hard
  • 5–15 min: Core evaluation dimensions and failure modes
  • 15–25 min: Evaluation strategies: human review, rubric scoring, LLM-as-judge, and automated checks
  • 25–40 min: Hands-on: build a simple evaluation harness for a two-agent workflow
  • 40–45 min: Wrap-up, discussion, and extensions

1. Why Multi-Agent Evaluation Is Different

A single-agent system usually produces one output for one task. Evaluation often asks:

  • Was the answer correct?
  • Was it helpful?
  • Was it safe?
  • Was it concise?

A multi-agent system introduces more complexity:

  • Multiple intermediate outputs exist
  • Agents may depend on each other’s work
  • Errors can propagate across steps
  • Coordination may be successful even when the final answer is weak, or vice versa
  • Two systems may achieve the same result with very different costs and latencies

Example

Suppose you have:

  • Research Agent: gathers facts
  • Writer Agent: drafts a summary
  • Reviewer Agent: checks for omissions or inaccuracies

A weak final answer could be caused by:

  • Poor fact gathering by the research agent
  • Good research but poor synthesis by the writer
  • Good draft but ineffective review
  • Correct review feedback that the writer ignored
  • Excessive back-and-forth causing latency/cost issues

So evaluation must consider:

  • Final output quality
  • Intermediate output quality
  • Coordination effectiveness
  • Resource usage
  • Stability across repeated runs

2. Core Evaluation Dimensions

When evaluating a multi-agent system, define metrics across several categories.

2.1 Task Success

This is the most important dimension.

Questions to ask:

  • Did the system solve the task?
  • Was the answer accurate?
  • Did it satisfy user constraints?
  • Was the requested format followed?

Example metrics

  • Binary success/failure
  • Accuracy score from 1–5
  • Format compliance: pass/fail
  • Coverage score: how many required points were addressed
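The metrics above can be collected into one record per test case. A minimal sketch, assuming a simple substring match for coverage; the field names and the 0.8 success threshold are illustrative, not part of any standard:

```python
def coverage_score(output: str, required_points: list[str]) -> float:
    """Fraction of required points mentioned (case-insensitive substring match)."""
    if not required_points:
        return 1.0
    hits = sum(1 for point in required_points if point.lower() in output.lower())
    return hits / len(required_points)


def task_success_record(output: str, required_points: list[str], format_ok: bool) -> dict:
    """Bundles the task-success metrics above into one record per test case."""
    coverage = coverage_score(output, required_points)
    return {
        "success": format_ok and coverage >= 0.8,  # binary success/failure (threshold is arbitrary)
        "coverage": round(coverage, 2),            # how many required points were addressed
        "format_compliance": format_ok,            # pass/fail
    }
```

Substring matching is crude (it misses paraphrases), but it keeps the scoring deterministic; an LLM judge can cover the semantic cases later.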

2.2 Coordination Quality

The agents in a multi-agent system should work together effectively.

Questions to ask:

  • Did agents pass useful context to one another?
  • Did downstream agents use upstream outputs correctly?
  • Was work duplicated unnecessarily?
  • Did the workflow reduce confusion or increase it?

Indicators of poor coordination

  • Repeating the same work in multiple agents
  • Contradictory outputs between agents
  • Missing handoff details
  • A reviewer flagging issues that were already resolved
  • A planner creating steps that executors ignore

2.3 Latency


Multi-agent systems often trade speed for quality: extra agents and review rounds can improve results, but each one adds delay.

Questions to ask:

  • How long did the full workflow take?
  • Which agent caused the bottleneck?
  • Did multiple rounds of refinement improve results enough to justify the delay?

Common measurements

  • Total wall-clock time
  • Time per agent step
  • Number of model calls
  • Number of sequential rounds

2.4 Cost

More agents usually mean more tokens and more API calls.

Questions to ask:

  • Is the quality improvement worth the added cost?
  • Which steps consume the most tokens?
  • Can some agents be simplified or removed?

Common measurements

  • Number of requests
  • Input/output token estimates
  • Relative cost per run
  • Cost per successful task
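These measurements reduce to simple arithmetic once token counts are available. A minimal sketch; the per-1K-token rates are purely hypothetical placeholders, not real pricing:

```python
def estimate_run_cost(input_tokens: int, output_tokens: int,
                      input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Relative cost of one run from token counts and per-1K-token rates."""
    return (input_tokens / 1000) * input_rate_per_1k + (output_tokens / 1000) * output_rate_per_1k


def cost_per_successful_task(run_costs: list[float], successes: list[bool]) -> float:
    """Total spend divided by the number of successful runs."""
    wins = sum(successes)
    return sum(run_costs) / wins if wins else float("inf")
```

Cost per successful task is often the most revealing metric: a cheaper-per-run system that fails half the time can still be the more expensive one overall.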

2.5 Robustness

A good multi-agent system should behave reasonably across different inputs.

Questions to ask:

  • Does it work only on easy cases?
  • Does it fail badly on ambiguous tasks?
  • How sensitive is it to prompt wording?
  • Does one agent collapse when upstream output is slightly malformed?

What robustness testing looks like

  • Easy/medium/hard task sets
  • Noisy or incomplete inputs
  • Edge cases
  • Repeated runs with minor prompt variations
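Repeated runs can be summarized with a small stability check. A sketch under the assumption that the workflow is wrapped in a `run_task` callable returning pass/fail; the "stable" definition here (all-pass or all-fail) is deliberately strict and illustrative:

```python
def stability_check(run_task, n_runs: int = 5) -> dict:
    """Repeats a pass/fail task and reports how consistent the outcomes are."""
    outcomes = [bool(run_task()) for _ in range(n_runs)]
    pass_rate = sum(outcomes) / n_runs
    return {
        "pass_rate": pass_rate,
        # Stable means deterministic across runs; anything in between is flaky.
        "stable": pass_rate in (0.0, 1.0),
    }
```

A flaky result (pass rate strictly between 0 and 1) usually signals prompt sensitivity or a brittle handoff rather than a single reproducible bug.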

2.6 Interpretability and Debuggability

You should be able to understand why the system failed.

Questions to ask:

  • Can you inspect each agent’s output?
  • Can you trace where the failure began?
  • Do prompts and outputs make debugging possible?

This is one of the major benefits of multi-agent design: failures can be localized if the workflow is transparent.


3. Common Failure Modes in Multi-Agent Systems

Below are recurring patterns you should watch for.

3.1 Error Propagation

A mistake made early is amplified later.

Example:
The planner misunderstands the task, and all subsequent agents execute the wrong plan.


3.2 Hallucinated Handoffs

An agent claims that another agent found something that was never actually produced.

Example:
The writer says “Based on the researcher’s evidence…” even though the researcher provided no such evidence.


3.3 Over-Decomposition

The task is split into too many agents or too many steps.

Symptoms:

  • High latency
  • High cost
  • Minimal quality gains
  • Excessive repetition

3.4 Under-Specification

Agents are not given clear roles, boundaries, or output schemas.

Symptoms:

  • Overlapping work
  • Inconsistent formats
  • Missing key information in handoffs

3.5 Conflicting Objectives

Different agents optimize for different things.

Example:
A summarizer aims to be concise while a reviewer demands exhaustive detail.


3.6 Reviewer Ineffectiveness

The reviewer agent exists but adds little value.

Symptoms:

  • Generic feedback only
  • Misses important errors
  • Suggests changes that do not improve outcomes

4. Evaluation Strategies

No single evaluation method is enough. In practice, combine several.

4.1 Human Evaluation

Humans review outputs against a rubric.

Strengths

  • Good for nuanced quality judgments
  • Useful for early-stage systems
  • Can detect real-world usefulness

Weaknesses

  • Slow
  • Expensive
  • Hard to scale consistently

4.2 Rubric-Based Scoring

Define explicit criteria and assign scores.

Example rubric for a research-summary workflow

  • Accuracy: 0–5
  • Coverage: 0–5
  • Clarity: 0–5
  • Proper structure: 0–3
  • Evidence usage: 0–5

This improves consistency and can later be used by human reviewers or LLM judges.


4.3 Automated Checks

Useful when outputs must satisfy strict requirements.

Examples

  • JSON schema validation
  • Presence/absence of required sections
  • Keyword or entity matching
  • Unit-test-like assertions
  • Regex checks for format compliance

Automated checks are fast and objective, but they usually capture only part of quality.
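For example, when an agent must return JSON with known fields, a lightweight validator can act as an automated check. This is a hand-rolled sketch (a library such as jsonschema could do the same more thoroughly):

```python
import json


def validate_json_output(raw: str, required_fields: dict[str, type]) -> list[str]:
    """Returns a list of problems; an empty list means the check passed."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    for field, expected_type in required_fields.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems
```

Returning a list of problems instead of a bare boolean makes failures debuggable: the log shows which field broke, not just that something did.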


4.4 LLM-as-Judge

A model evaluates outputs according to a prompt and rubric.

Good uses

  • Comparing two system variants
  • Scoring qualitative dimensions
  • Classifying whether requirements were met

Risks

  • Judge bias
  • Inconsistent scoring
  • Overly generous assessments
  • Prompt sensitivity

Best practice

Use LLM judges together with:

  • fixed rubrics
  • spot-checked human review
  • simple automated checks

4.5 Comparative Evaluation

Instead of asking “Is this output good?”, ask:

  • Which of these two systems performed better?
  • Which version was more accurate, useful, or concise?

Pairwise comparison often produces a more reliable signal than absolute scoring, because judges (human or LLM) are better at ranking two outputs than at assigning consistent standalone scores.
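Pairwise judges are sensitive to presentation order, so a common mitigation is to judge each pair twice with the order swapped and only declare a winner when both rounds agree. A sketch with the judge supplied as a callable, so any judge (LLM-based or otherwise, with the interface assumed here) can plug in:

```python
def pairwise_verdict(judge, output_a: str, output_b: str) -> str:
    """
    Judges A vs. B twice with the presentation order swapped to reduce
    position bias. `judge(first, second)` must return "first" or "second".
    Returns "A", "B", or "tie".
    """
    first_round = judge(output_a, output_b)   # A shown first
    second_round = judge(output_b, output_a)  # B shown first
    a_wins = (first_round == "first") + (second_round == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the judge flipped with the order: treat as no signal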


5. Designing an Evaluation Plan

A practical evaluation plan usually includes the following pieces.

5.1 Define the Workflow

Write down:

  • agent roles
  • inputs and outputs
  • handoff format
  • number of turns or rounds
  • expected final deliverable

5.2 Create a Test Set

Build a small but diverse benchmark.

Example categories

  • straightforward tasks
  • ambiguous tasks
  • tasks with incomplete information
  • tasks requiring strict formatting
  • failure-triggering edge cases

Start with 5–10 cases before scaling up.


5.3 Define Success Criteria

For each task, specify:

  • must-have requirements
  • optional nice-to-have qualities
  • expected structure
  • acceptable failure boundaries
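One way to make these criteria executable is a per-task spec that a check function can enforce. A minimal sketch; the spec fields and the example task are hypothetical:

```python
TEST_CASE = {
    "task": "Summarize the benefits and risks of vertical farming",
    "must_have": ["benefit", "risk"],             # must-have requirements
    "expected_sections": ["Title:", "Summary:"],  # expected structure
    "max_words": 250,                             # acceptable failure boundary
}


def meets_must_haves(output: str, spec: dict) -> bool:
    """Pass only if every required keyword and section header appears and the length bound holds."""
    text = output.lower()
    if any(word not in text for word in spec["must_have"]):
        return False
    if any(section not in output for section in spec["expected_sections"]):
        return False
    return len(output.split()) <= spec["max_words"]
```

Keeping the spec as data rather than code means the same checker runs unchanged as the test set grows.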

5.4 Decide Metrics

Typical metrics include:

  • final score
  • success rate
  • average latency
  • average number of calls
  • pass rate on automated checks
  • reviewer usefulness score

5.5 Log Everything

To debug and compare systems, log:

  • prompts
  • outputs
  • timestamps
  • agent names
  • evaluation scores
  • failures and exceptions

6. Hands-On Exercise 1: Build a Simple Two-Agent Workflow

In this exercise, learners will implement:

  • a Research Agent
  • a Writer Agent
  • a lightweight evaluation setup

The task: produce a short structured summary from a topic prompt.

This is intentionally simple so the evaluation concepts stay clear.


6.1 Setup

Install the OpenAI Python SDK:

pip install openai

Set your API key:

export OPENAI_API_KEY="your_api_key_here"

6.2 Python Script: Two-Agent Workflow

"""
two_agent_workflow.py

A simple multi-agent workflow using the OpenAI Responses API.
Agents:
1. Research Agent: extracts key points on a topic
2. Writer Agent: turns key points into a structured summary

Model used:
- gpt-5.4-mini

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

from openai import OpenAI

# Create a reusable client instance.
client = OpenAI()

MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.

    Args:
        system_prompt: The system instruction defining the agent role.
        user_prompt: The task-specific prompt.

    Returns:
        Model-generated text as a string.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise bullet points about a topic.
    This simulates an upstream information-gathering agent.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Be clear, avoid unsupported claims, and do not write a full essay."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces a structured summary using the research notes.
    This simulates a downstream synthesis agent.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "The writing should be clear and concise."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def run_workflow(topic: str) -> dict:
    """
    Runs the two-agent pipeline and returns all intermediate and final outputs.

    Args:
        topic: The topic to summarize.

    Returns:
        A dictionary containing the topic, research notes, and final summary.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
    }


if __name__ == "__main__":
    topic = "Benefits and risks of autonomous delivery robots in cities"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])
    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])
    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

6.3 Example Output

=== TOPIC ===
Benefits and risks of autonomous delivery robots in cities

=== RESEARCH NOTES ===
- Autonomous delivery robots can reduce the cost of last-mile delivery.
- They may improve delivery speed for short urban routes.
- These robots can help reduce human workload for repetitive local deliveries.
- Challenges include pedestrian safety, sidewalk congestion, and navigation in complex environments.
- Regulatory uncertainty can slow adoption across different cities.
- Weather, vandalism, and theft can affect operational reliability.

=== FINAL SUMMARY ===
Title:
Autonomous Delivery Robots in Cities

Summary:
Autonomous delivery robots can make last-mile logistics more efficient by lowering costs and handling short urban deliveries. They may also reduce routine workload for human workers. However, their use introduces challenges related to pedestrian safety, sidewalk congestion, regulation, and operational reliability.

Key Takeaways:
- They can improve efficiency in urban delivery operations.
- Adoption depends on safety, regulation, and public-space management.
- Real-world reliability is affected by weather and security risks.

7. Hands-On Exercise 2: Add Evaluation with Automated Checks

Now we evaluate the workflow using simple deterministic checks.

We will test for:

  • presence of required sections
  • summary length
  • whether research notes contain enough bullet points

7.1 Python Script: Automated Evaluation

"""
automated_evaluation.py

Adds simple automated checks for a two-agent workflow.

Checks:
- Research agent produced at least 4 bullet points
- Final summary contains required sections
- Summary section is not empty

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise bullet points about a topic.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces a structured summary with fixed sections.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def count_bullets(text: str) -> int:
    """
    Counts lines that start with '- '.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("- "))


def contains_required_sections(text: str) -> bool:
    """
    Checks whether all required section headers exist.
    """
    required_headers = ["Title:", "Summary:", "Key Takeaways:"]
    return all(header in text for header in required_headers)


def extract_summary_section(text: str) -> str:
    """
    Extracts the content after 'Summary:' and before 'Key Takeaways:'.
    Returns an empty string if parsing fails.
    """
    match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def evaluate_output(research_notes: str, final_summary: str) -> dict:
    """
    Runs deterministic checks and returns a score summary.
    """
    checks = {
        "research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
        "final_has_required_sections": contains_required_sections(final_summary),
        "summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
    }

    score = sum(checks.values())
    max_score = len(checks)

    return {
        "checks": checks,
        "score": score,
        "max_score": max_score,
    }


def run_workflow(topic: str) -> dict:
    """
    Runs the workflow and evaluates the output.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    evaluation = evaluate_output(research_notes, final_summary)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "evaluation": evaluation,
    }


if __name__ == "__main__":
    topic = "How electric buses affect public transportation systems"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])

    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])

    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

    print("\n=== EVALUATION ===")
    for check_name, passed in result["evaluation"]["checks"].items():
        print(f"{check_name}: {'PASS' if passed else 'FAIL'}")

    print(
        f"\nOverall Score: "
        f"{result['evaluation']['score']}/{result['evaluation']['max_score']}"
    )

7.2 Example Output

=== EVALUATION ===
research_has_at_least_4_bullets: PASS
final_has_required_sections: PASS
summary_section_not_empty: PASS

Overall Score: 3/3

8. Hands-On Exercise 3: Use an LLM Judge for Rubric Scoring

Now we add a model-based evaluator. This is useful for dimensions such as:

  • clarity
  • coverage
  • faithfulness to research notes
  • usefulness of final summary

The evaluator model will score the output against a rubric.


8.1 Rubric Design

We will score each category from 1 to 5:

  • Clarity: Is the summary easy to understand?
  • Coverage: Does it include important points from the research notes?
  • Faithfulness: Does it avoid introducing unsupported claims?
  • Structure: Does it follow the requested format well?

8.2 Python Script: LLM-as-Judge

"""
llm_judge_evaluation.py

Uses the OpenAI Responses API both for generation and for evaluation.

Workflow:
1. Research Agent creates notes
2. Writer Agent creates a structured summary
3. Judge Agent scores the final summary using a rubric

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces concise factual notes on the topic.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces the final structured summary.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def judge_output(topic: str, research_notes: str, final_summary: str) -> dict:
    """
    Uses an LLM judge to score the final summary against a rubric.
    The model is instructed to return valid JSON only.
    """
    system_prompt = (
        "You are an evaluator for a multi-agent workflow. "
        "Score the final summary using this rubric from 1 to 5:\n"
        "- clarity\n"
        "- coverage\n"
        "- faithfulness_to_notes\n"
        "- structure\n\n"
        "Return ONLY valid JSON with this exact schema:\n"
        "{\n"
        '  "clarity": <int>,\n'
        '  "coverage": <int>,\n'
        '  "faithfulness_to_notes": <int>,\n'
        '  "structure": <int>,\n'
        '  "overall_comment": "<short string>"\n'
        "}\n"
        "Do not include markdown fences."
    )

    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        f"Final Summary:\n{final_summary}\n\n"
        "Evaluate the summary."
    )

    raw_result = call_model(system_prompt, user_prompt)

    try:
        return json.loads(raw_result)
    except json.JSONDecodeError:
        return {
            "clarity": 0,
            "coverage": 0,
            "faithfulness_to_notes": 0,
            "structure": 0,
            "overall_comment": f"Failed to parse judge output: {raw_result}",
        }


def run_workflow(topic: str) -> dict:
    """
    Runs generation and judge evaluation.
    """
    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    judge_scores = judge_output(topic, research_notes, final_summary)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "judge_scores": judge_scores,
    }


if __name__ == "__main__":
    topic = "The role of telemedicine in rural healthcare access"
    result = run_workflow(topic)

    print("=== TOPIC ===")
    print(result["topic"])

    print("\n=== RESEARCH NOTES ===")
    print(result["research_notes"])

    print("\n=== FINAL SUMMARY ===")
    print(result["final_summary"])

    print("\n=== JUDGE SCORES ===")
    print(json.dumps(result["judge_scores"], indent=2))

8.3 Example Output

{
  "clarity": 5,
  "coverage": 4,
  "faithfulness_to_notes": 5,
  "structure": 5,
  "overall_comment": "Clear and well-structured summary that uses the research notes appropriately, though one additional point from the notes could be incorporated."
}

9. Hands-On Exercise 4: Evaluate Multiple Test Cases

A good evaluation should not rely on a single example. In this exercise, learners will run the workflow across multiple topics and collect summary metrics.


9.1 Python Script: Batch Evaluation Harness

"""
batch_evaluation_harness.py

Runs a simple two-agent workflow over multiple test cases and evaluates
each result with both automated checks and an LLM judge.

Before running:
    pip install openai
    export OPENAI_API_KEY="your_api_key_here"
"""

import json
import re
import time
from statistics import mean
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.4-mini"


def call_model(system_prompt: str, user_prompt: str) -> str:
    """
    Calls the OpenAI Responses API and returns plain text output.
    """
    response = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.output_text.strip()


def research_agent(topic: str) -> str:
    """
    Produces bullet-point research notes.
    """
    system_prompt = (
        "You are a research assistant. "
        "Produce 4 to 6 concise factual bullet points about the given topic. "
        "Use one bullet per line starting with '- '."
    )
    user_prompt = f"Topic: {topic}"
    return call_model(system_prompt, user_prompt)


def writer_agent(topic: str, research_notes: str) -> str:
    """
    Produces the final structured summary.
    """
    system_prompt = (
        "You are a technical writer. "
        "Using the provided research notes, write a short summary with exactly "
        "these sections:\n"
        "Title:\n"
        "Summary:\n"
        "Key Takeaways:\n"
        "Under 'Key Takeaways', include 2 or 3 bullet points."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        "Write the final structured summary."
    )
    return call_model(system_prompt, user_prompt)


def count_bullets(text: str) -> int:
    """
    Counts lines that start with '- '.
    """
    return sum(1 for line in text.splitlines() if line.strip().startswith("- "))


def contains_required_sections(text: str) -> bool:
    """
    Verifies presence of required section headers.
    """
    required_headers = ["Title:", "Summary:", "Key Takeaways:"]
    return all(header in text for header in required_headers)


def extract_summary_section(text: str) -> str:
    """
    Extracts the content between 'Summary:' and 'Key Takeaways:'.
    """
    match = re.search(r"Summary:\s*(.*?)\s*Key Takeaways:", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def automated_eval(research_notes: str, final_summary: str) -> dict:
    """
    Runs deterministic validation checks.
    """
    checks = {
        "research_has_at_least_4_bullets": count_bullets(research_notes) >= 4,
        "final_has_required_sections": contains_required_sections(final_summary),
        "summary_section_not_empty": len(extract_summary_section(final_summary)) > 0,
    }
    return {
        "checks": checks,
        "score": sum(checks.values()),
        "max_score": len(checks),
    }


def judge_eval(topic: str, research_notes: str, final_summary: str) -> dict:
    """
    Uses an LLM judge to score the output.
    """
    system_prompt = (
        "You are an evaluator for a multi-agent workflow. "
        "Score the final summary using this rubric from 1 to 5:\n"
        "- clarity\n"
        "- coverage\n"
        "- faithfulness_to_notes\n"
        "- structure\n\n"
        "Return ONLY valid JSON with this exact schema:\n"
        "{\n"
        '  "clarity": <int>,\n'
        '  "coverage": <int>,\n'
        '  "faithfulness_to_notes": <int>,\n'
        '  "structure": <int>,\n'
        '  "overall_comment": "<short string>"\n'
        "}\n"
        "Do not include markdown fences."
    )
    user_prompt = (
        f"Topic: {topic}\n\n"
        f"Research Notes:\n{research_notes}\n\n"
        f"Final Summary:\n{final_summary}\n\n"
        "Evaluate the summary."
    )

    raw_result = call_model(system_prompt, user_prompt)

    try:
        return json.loads(raw_result)
    except json.JSONDecodeError:
        return {
            "clarity": 0,
            "coverage": 0,
            "faithfulness_to_notes": 0,
            "structure": 0,
            "overall_comment": f"Failed to parse judge output: {raw_result}",
        }


def evaluate_topic(topic: str) -> dict:
    """
    Runs the full workflow for a single topic and records latency.
    """
    start_time = time.time()

    research_notes = research_agent(topic)
    final_summary = writer_agent(topic, research_notes)
    auto_scores = automated_eval(research_notes, final_summary)
    judge_scores = judge_eval(topic, research_notes, final_summary)

    elapsed_seconds = round(time.time() - start_time, 2)

    return {
        "topic": topic,
        "research_notes": research_notes,
        "final_summary": final_summary,
        "automated_evaluation": auto_scores,
        "judge_evaluation": judge_scores,
        "latency_seconds": elapsed_seconds,
    }


def summarize_results(results: list[dict]) -> dict:
    """
    Aggregates scores across test cases.
    """
    auto_scores = [r["automated_evaluation"]["score"] for r in results]
    auto_max = results[0]["automated_evaluation"]["max_score"] if results else 0

    clarity_scores = [r["judge_evaluation"]["clarity"] for r in results]
    coverage_scores = [r["judge_evaluation"]["coverage"] for r in results]
    faithfulness_scores = [r["judge_evaluation"]["faithfulness_to_notes"] for r in results]
    structure_scores = [r["judge_evaluation"]["structure"] for r in results]
    latencies = [r["latency_seconds"] for r in results]

    return {
        "num_cases": len(results),
        "avg_automated_score": round(mean(auto_scores), 2) if auto_scores else 0,
        "automated_score_max": auto_max,
        "avg_clarity": round(mean(clarity_scores), 2) if clarity_scores else 0,
        "avg_coverage": round(mean(coverage_scores), 2) if coverage_scores else 0,
        "avg_faithfulness": round(mean(faithfulness_scores), 2) if faithfulness_scores else 0,
        "avg_structure": round(mean(structure_scores), 2) if structure_scores else 0,
        "avg_latency_seconds": round(mean(latencies), 2) if latencies else 0,
    }


if __name__ == "__main__":
    topics = [
        "Benefits and challenges of vertical farming",
        "How wearable health devices support preventive care",
        "The impact of remote work on urban transportation",
        "Why data quality matters in machine learning projects",
    ]

    all_results = []
    for topic in topics:
        print(f"Running evaluation for: {topic}")
        result = evaluate_topic(topic)
        all_results.append(result)

    summary = summarize_results(all_results)

    print("\n=== AGGREGATE SUMMARY ===")
    print(json.dumps(summary, indent=2))

    print("\n=== SAMPLE RESULT ===")
    print(json.dumps(all_results[0], indent=2))

9.2 Example Aggregate Output

{
  "num_cases": 4,
  "avg_automated_score": 3.0,
  "automated_score_max": 3,
  "avg_clarity": 4.75,
  "avg_coverage": 4.25,
  "avg_faithfulness": 4.75,
  "avg_structure": 5.0,
  "avg_latency_seconds": 3.18
}

10. Interpreting Results

Once you have data, the next step is deciding what it means.

Example interpretation patterns

High structure, low coverage

  • The writer follows formatting rules
  • But important research points are missing
  • Possible fix: improve handoff quality or strengthen writer instructions

High coverage, low faithfulness

  • The final answer includes many points
  • But adds unsupported details
  • Possible fix: instruct the writer to use only supplied notes

Good quality, high latency

  • The workflow is effective but slow
  • Possible fix: simplify agent steps or reduce review rounds

Strong average, weak edge-case performance

  • The system works on standard tasks
  • But fails on ambiguous or noisy inputs
  • Possible fix: expand test coverage and harden prompts

11. Discussion Prompts

Use these for live discussion or reflection:

  1. What should count as success in a multi-agent system: final answer quality or collaboration quality?
  2. When is a reviewer agent actually useful?
  3. How would you compare a two-agent and three-agent architecture fairly?
  4. What kinds of tasks benefit most from intermediate-output evaluation?
  5. What are the dangers of relying only on LLM judges?

12. Best Practices for Evaluating Multi-Agent Systems

  • Start with a small, representative benchmark
  • Score both final outputs and intermediate artifacts
  • Combine automated checks with qualitative review
  • Measure cost and latency alongside quality
  • Use pairwise comparisons when testing variants
  • Keep prompts, inputs, and outputs logged for debugging
  • Evaluate across multiple runs and task types
  • Treat evaluator prompts as part of the system design

13. Suggested Instructor-Led Walkthrough

A practical classroom flow for the hands-on portion:

  1. Run the two-agent pipeline on one topic
  2. Inspect intermediate research notes
  3. Apply automated checks
  4. Add LLM judge scoring
  5. Run across multiple topics
  6. Compare results and diagnose where failures originate
  7. Brainstorm prompt or workflow improvements

14. Wrap-Up

In this session, learners moved from the idea of “Did the model answer correctly?” to a more realistic question: “How well did the multi-agent system work as a coordinated process?” That shift is essential in agentic development.

Key takeaway:

Evaluating multi-agent systems requires measuring both outcome quality and interaction quality.
A good evaluation setup helps you improve prompts, redesign workflows, reduce cost, and make agent systems more reliable.


Useful Resources

  • OpenAI Responses API migration guide: https://developers.openai.com/api/docs/guides/migrate-to-responses
  • OpenAI API docs: https://platform.openai.com/docs
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Prompt engineering guide: https://platform.openai.com/docs/guides/prompt-engineering
  • Python re module docs: https://docs.python.org/3/library/re.html
  • Python json module docs: https://docs.python.org/3/library/json.html
  • Python statistics module docs: https://docs.python.org/3/library/statistics.html

Optional Homework

  1. Add a Reviewer Agent and evaluate whether it improves final quality enough to justify added latency.
  2. Compare two prompt versions for the Writer Agent using the same benchmark.
  3. Add CSV or JSONL logging to store all runs for later analysis.
  4. Create edge-case tasks designed to trigger coordination failures.
  5. Build a pairwise evaluator that compares outputs from two workflow variants.

Quick Recap

  • Multi-agent evaluation is broader than final-answer evaluation
  • Key dimensions include success, coordination, latency, cost, and robustness
  • Automated checks are useful but limited
  • LLM judges help evaluate nuanced qualities
  • Batch evaluation reveals patterns hidden by single examples
  • Logged intermediate outputs make debugging much easier
