RAG Hallucination Evaluation: Labels and Metrics
This page describes the procedure, labels, and metrics for evaluating an AI chatbot's RAG capability. The goal of the evaluation is to quantify the correctness and quality of the chatbot's responses, including its propensity to hallucinate.
Evaluation Procedure
Lighthouz requires an expected correct response and the generated LLM response to conduct an evaluation.
Lighthouz takes a two-pronged evaluation approach: semantic and syntactic.
- Semantic evaluation is based on the meaning of the responses, i.e., the scores will be high as long as the meanings are the same, regardless of the exact words used.
- Syntactic evaluation is based on the words used in the responses, i.e., the scores will be high only if the words in the expected and generated responses match (see the sketch below).
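A minimal sketch of the distinction, using difflib for the syntactic comparison and a sentence-transformers embedding model for the semantic comparison; this is not Lighthouz's internal implementation, and the model name and example texts are assumptions.

```python
# Contrast syntactic (surface-form) and semantic (meaning-based) comparison
# of an expected vs. generated response. Illustrative only.
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

expected = "The warranty covers parts and labor for two years."
generated = "Parts and labor are covered under warranty for a period of 24 months."

# Syntactic: depends on the exact words used, so paraphrases score low.
syntactic_score = SequenceMatcher(None, expected, generated).ratio()

# Semantic: depends on meaning, so paraphrases can still score high.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode([expected, generated])
semantic_score = float(util.cos_sim(embeddings[0], embeddings[1]))

print(f"syntactic={syntactic_score:.2f}, semantic={semantic_score:.2f}")
```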
Semantic evaluation with multiple AI agents
Lighthouz pioneered the use of multi-AI-agent architectures to semantically evaluate AI applications, including RAG applications.
Lighthouz lets you launch bespoke multi-AI-agent evaluators within a minute. You can configure the agents to evaluate correctness, completeness, helpfulness, creativity, and more.
By default, the agents are set to evaluate correctness using the following labels (a label-mapping sketch follows the list):
- Correct: The information in the generated answer semantically matches the correct answer.
- Partially correct: The generated answer partially matches the ground-truth answer, or it contains the correct answer along with additional information.
- No answer: The generated response does not provide any information that matches the content of the ground truth, effectively offering no answer.
- Hallucination or incorrect: The generated response provides information that does not match the ground truth, showing clear discrepancies in factual content or details. This includes both complete fabrications and minor inaccuracies that lead to a fundamental misalignment with the ground truth.
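As a rough illustration of how an agent verdict could be reduced to the labels above, here is a minimal sketch; the verdict fields (answered, semantic_match, has_extra_info) are hypothetical and do not reflect Lighthouz's actual agent output schema.

```python
# Hypothetical mapping from a multi-agent verdict to the default correctness labels.
from enum import Enum

class CorrectnessLabel(str, Enum):
    CORRECT = "Correct"
    PARTIALLY_CORRECT = "Partially correct"
    NO_ANSWER = "No answer"
    HALLUCINATION = "Hallucination or incorrect"

def label_from_verdict(answered: bool, semantic_match: str, has_extra_info: bool) -> CorrectnessLabel:
    # semantic_match is assumed to be one of "full", "partial", or "none"
    if not answered:
        return CorrectnessLabel.NO_ANSWER
    if semantic_match == "full":
        return CorrectnessLabel.PARTIALLY_CORRECT if has_extra_info else CorrectnessLabel.CORRECT
    if semantic_match == "partial":
        return CorrectnessLabel.PARTIALLY_CORRECT
    return CorrectnessLabel.HALLUCINATION

print(label_from_verdict(answered=True, semantic_match="partial", has_extra_info=False))
```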
Semantic evaluation with an LLM as a judge
You can use a default LLM as a judge on Lighthouz to measure accuracy and hallucinations. The generated response is semantically compared to the expected response and placed into one of the following categories, which measure the correctness and completeness of the response (a prompt sketch follows the list):
- Correct and complete: The generated response is correct and it contains all the information present in the expected response. This represents a perfect response.
- Correct but incomplete: The generated response contains correct information, but it misses some information present in the expected response.
- Correct plus extra information: The generated response contains correct information, but it also includes additional information that is not present in the expected response.
- Hallucination or incorrect: The generated response is completely incorrect, made up, or different from the expected response.
- No answer: The generated response does not contain an answer to the query.
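For illustration, a minimal LLM-as-a-judge sketch along these lines is shown below, assuming the OpenAI Python SDK; the model name and prompt wording are assumptions, not Lighthouz's actual judge.

```python
# Minimal LLM-as-a-judge sketch; category names mirror the list above.
from openai import OpenAI  # pip install openai

JUDGE_PROMPT = """You are grading a RAG chatbot response against an expected response.
Reply with exactly one of these labels:
- Correct and complete
- Correct but incomplete
- Correct plus extra information
- Hallucination or incorrect
- No answer

Query: {query}
Expected response: {expected}
Generated response: {generated}
Label:"""

def judge(query: str, expected: str, generated: str, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, expected=expected, generated=generated)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```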
Semantic evaluation with RAGAS
To provide a one-stop solution for evaluation needs, Lighthouz integrates the RAGAS evaluation criteria, namely faithfulness, answer relevance, and context relevance.
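For reference, the same criteria can be computed directly with the open-source ragas package, roughly as sketched below; column and metric names follow ragas 0.1.x and may differ in other versions, and the sample data is invented.

```python
# Sketch of computing RAGAS metrics outside Lighthouz. Illustrative only.
from datasets import Dataset            # pip install datasets
from ragas import evaluate              # pip install ragas
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["Parts and labor are covered for 24 months."],
    "contexts": [["The warranty covers parts and labor for two years."]],
    "ground_truth": ["The warranty covers parts and labor for two years."],
})

# evaluate() calls an LLM under the hood, so an API key (e.g. OPENAI_API_KEY) is expected.
# Each metric is scored in the 0-1 range; higher is better.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```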
Syntactic metrics
To complement the semantic labels, Lighthouz calculates the following syntactic metrics (illustrated in the sketch after the list):
- Similarity score: measures how similar the generated response is to the expected response. Range is 0 to 1; higher is better.
- Conciseness score: measures the ratio of the length of the generated response to the length of the expected response. Range is 0 to infinity; the ideal score is 1.
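The sketch below shows one simple way to compute scores of this shape; the exact formulas Lighthouz uses are not specified on this page, so difflib's character-level ratio and a plain length ratio stand in as illustrations.

```python
# Illustrative syntactic metrics; not Lighthouz's actual formulas.
from difflib import SequenceMatcher

def similarity_score(expected: str, generated: str) -> float:
    # 0 to 1; higher means the generated text is closer to the expected text.
    return SequenceMatcher(None, expected, generated).ratio()

def conciseness_score(expected: str, generated: str) -> float:
    # Ratio of generated length to expected length; 1.0 is ideal,
    # values above 1 indicate a wordier-than-expected response.
    return len(generated) / max(len(expected), 1)
```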
In addition, the following metrics are also calculated (a heuristic sketch follows the list):
- Toxicity: the generated response is assigned a toxicity score; a higher value indicates a more toxic response.
- Privacy metrics: the generated response is scored for the presence of personally identifiable information (PII). A higher score indicates that the response leaks PII.
- Security metrics: the input query is scored for prompt injection. A higher score indicates that the query contains a prompt injection attack.
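For intuition only, the heuristic sketch below flags obvious PII and prompt-injection patterns; Lighthouz's actual detectors are not documented here, and production systems typically rely on trained classifiers rather than regexes.

```python
# Hypothetical heuristics illustrating what privacy and security metrics look for.
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
INJECTION_PHRASES = ("ignore previous instructions", "disregard the system prompt")

def pii_score(response: str) -> float:
    # Crude 0/1 flag: 1.0 if the response appears to leak an email or phone number.
    return 1.0 if EMAIL.search(response) or PHONE.search(response) else 0.0

def prompt_injection_score(query: str) -> float:
    # Crude 0/1 flag: 1.0 if the query contains a known injection phrase.
    lowered = query.lower()
    return 1.0 if any(p in lowered for p in INJECTION_PHRASES) else 0.0
```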