
FAQs about AI evaluation agents

  • Which LLMs should be used to set up the agents?

We advise using cheap LLMs as the evaluation agents and a more powerful LLM as the meta-evaluation agent. Research has shown that drawing the agents from different LLM families (e.g., GPT, Llama, Gemini, Mistral) leads to better performance, and that this setup is not only cheaper but also more accurate than using a single LLM as a judge.
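
For illustration, here is a minimal configuration sketch of that setup. The `EvalAgent` structure and the model identifiers are hypothetical placeholders, not tied to any particular provider or library.

```python
# Illustrative setup: several cheap evaluator agents from different LLM families,
# plus one more capable model acting as the meta-evaluation agent.
# Model identifiers below are examples only.
from dataclasses import dataclass

@dataclass
class EvalAgent:
    name: str
    model: str  # placeholder model identifier

# Cheap judges drawn from different LLM families
EVALUATION_AGENTS = [
    EvalAgent("judge_gpt", "gpt-4o-mini"),
    EvalAgent("judge_llama", "llama-3.1-8b-instruct"),
    EvalAgent("judge_mistral", "mistral-small"),
]

# A stronger model aggregates the judges' verdicts into the final decision
META_EVALUATOR = EvalAgent("meta_judge", "gpt-4o")
```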

  • How many rounds of discussion between agents should be used?

We recommend at least one round of discussion for “easy” tasks and at least two rounds for “hard” tasks. One round is usually enough for the agents to correct their mistakes on easy tasks, while on hard tasks they need to discuss for longer to resolve disagreements and settle on a label.
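
A rough sketch of the discussion loop follows, assuming the `EVALUATION_AGENTS` list from the previous example and a generic `call_llm(model, prompt)` helper that wraps whatever inference API you use; the prompts and round counts are illustrative only.

```python
from typing import Callable, Dict, List

def discuss(task: str, agents: List, call_llm: Callable[[str, str], str],
            difficulty: str = "easy") -> Dict[str, str]:
    """Run a multi-round discussion and return each agent's final answer."""
    rounds = 1 if difficulty == "easy" else 2  # >=1 for easy, >=2 for hard tasks

    # Initial, independent evaluations
    answers = {a.name: call_llm(a.model, f"Evaluate the following output:\n{task}")
               for a in agents}

    for _ in range(rounds):
        # In each round, every agent sees the others' answers and may revise its own.
        for agent in agents:
            others = "\n".join(f"{name}: {ans}" for name, ans in answers.items()
                               if name != agent.name)
            prompt = (f"Task:\n{task}\n\nOther evaluators said:\n{others}\n\n"
                      f"Your previous answer: {answers[agent.name]}\n"
                      "Revise your evaluation if the discussion changed your mind.")
            answers[agent.name] = call_llm(agent.model, prompt)
    return answers
```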

  • Why are these called agents, and not just LLMs?

Unlike simple LLMs, these agents debate with each other, can change their outputs over multiple rounds, learn from human feedback, use tools, and improve over time.

  • How are multi-agent evaluators different from LLM-as-a-judge?

Multi-agent evaluators use two or more LLM-based agents to perform the evaluation and one meta-evaluation agent to make the final decision. Research has shown this to be more accurate than using a single LLM as a judge. Unlike simple LLMs, these agents debate with each other, can change their outputs over multiple rounds, learn from human feedback, use tools, and improve over time.

You may note that the most basic version of a multi-agent system, with only a single judge agent and no meta-evaluation agent, defaults to LLM-as-a-judge; multi-agent systems are therefore a superset of LLM-as-a-judge.
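
Continuing the sketch above, the meta-evaluation step might look like the following; with a single evaluation agent and no meta step, the same pipeline reduces to plain LLM-as-a-judge. Again, the prompt wording is an assumption, not a prescribed template.

```python
def evaluate(task, agents, meta_agent, call_llm, difficulty="easy"):
    """Aggregate the agents' discussion into a final verdict via the meta-evaluation agent."""
    answers = discuss(task, agents, call_llm, difficulty)

    if len(agents) == 1:
        # Degenerate case: one judge and no meta step is just LLM-as-a-judge.
        return next(iter(answers.values()))

    transcript = "\n".join(f"{name}: {ans}" for name, ans in answers.items())
    prompt = (f"Task:\n{task}\n\nEvaluator verdicts after discussion:\n{transcript}\n\n"
              "Weigh the arguments above and give the final evaluation label.")
    return call_llm(meta_agent.model, prompt)
```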

  • How are multi-agent evaluators different from juries of LLMs?

Multi-agent evaluators are like juries on steroids: these agents debate with each other over multiple rounds, can correct their outputs during the discussion, learn from human feedback, use tools, and improve over time.