Using AI agents to conduct scalable evaluations

You can use AI agents to evaluate your AI application on a dataset by heading to the Evaluation Studio.

Just as with a single LLM, you select the benchmark dataset, the AI app to be evaluated, and now the agents to be used as evaluators. Each test case in the benchmark is sent to the AI app to get a response, which is then assessed by the AI agent evaluators. The result is a scorecard along with a detailed breakdown of the test cases where the AI app failed.
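
To make the flow concrete, here is a minimal sketch of the underlying loop: each test case goes to the app, each agent evaluator scores the response, and failures are collected into a scorecard. This is illustrative only; names such as `run_app`, `AgentEvaluator`, and `TestCase` are hypothetical placeholders, not part of the Evaluation Studio API, which handles all of this for you in the UI.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    input: str
    expected: str

@dataclass
class AgentEvaluator:
    name: str

    def score(self, case: TestCase, response: str) -> float:
        # A real agent evaluator would reason over the response with an LLM;
        # a trivial substring check stands in for that here.
        return 1.0 if case.expected.lower() in response.lower() else 0.0

def run_app(prompt: str) -> str:
    # Placeholder for the AI application under evaluation.
    return f"Echo: {prompt}"

def evaluate(dataset: list[TestCase], evaluators: list[AgentEvaluator]) -> dict:
    scorecard = {"total": len(dataset), "passed": 0, "failed": []}
    for case in dataset:
        response = run_app(case.input)  # send each test case to the AI app
        scores = [e.score(case, response) for e in evaluators]  # each agent scores the response
        if sum(scores) / len(scores) >= 0.5:
            scorecard["passed"] += 1
        else:
            # keep a detailed record of failing cases for the breakdown
            scorecard["failed"].append({"input": case.input, "response": response})
    return scorecard

if __name__ == "__main__":
    dataset = [TestCase("What is 2 + 2?", "4"), TestCase("Capital of France?", "Paris")]
    evaluators = [AgentEvaluator("correctness"), AgentEvaluator("helpfulness")]
    print(evaluate(dataset, evaluators))
```

In the Evaluation Studio this loop runs behind the scenes across the whole benchmark, so you only need to pick the dataset, the app, and the evaluator agents.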