Metric
A quantifiable measurement that evaluates AI behavior using an LLM as a judge, returning pass/fail results with optional numeric scoring.
Also known as: evaluation metric
Overview
Metrics are the core evaluation mechanism in Rhesis, using an LLM as a judge to assess AI responses against defined criteria. Each metric evaluates a specific aspect of behavior, such as accuracy, safety, tone, or helpfulness.
How Metrics Work
- Test Execution: Your AI system responds to a test prompt
- Judge Evaluation: An LLM judge reviews the response against your criteria
- Scoring: The judge assigns a score (pass/fail or numeric)
- Reasoning: The judge provides an explanation for the score
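A minimal sketch of this loop in plain Python, with the judge represented as a callable. The function name, prompt wording, and output format below are illustrative, not the Rhesis SDK API.

```python
from typing import Callable

def evaluate_response(
    test_prompt: str,
    ai_response: str,
    criteria: str,
    judge: Callable[[str], str],   # a wrapper around your LLM provider of choice
    pass_threshold: int = 7,
) -> dict:
    """Run one metric evaluation: build the judge prompt, score, and explain."""
    judge_prompt = (
        "You are evaluating an AI response.\n"
        f"Criteria: {criteria}\n"
        f"Prompt: {test_prompt}\n"
        f"Response: {ai_response}\n"
        "Reply with a line 'SCORE: <0-10>' followed by a line 'REASON: <explanation>'."
    )
    raw = judge(judge_prompt)                          # judge evaluation
    score_line, reason_line = raw.splitlines()[:2]
    score = int(score_line.split(":", 1)[1].strip())   # scoring
    reason = reason_line.split(":", 1)[1].strip()      # reasoning
    return {"score": score, "passed": score >= pass_threshold, "reason": reason}
```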
Metric Components
Evaluation Instructions: Tell the judge what to evaluate and which aspects of the response should be assessed.
Scoring Configuration: Two types are available:
- Numeric: Scale-based scoring (e.g., 0-10) with a pass threshold
- Categorical: Classification into predefined categories
Evaluation Steps: Break down the evaluation into clear steps to guide the LLM judge.
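Put together, a metric definition bundles these three components. The dataclass below is an illustrative container with field names of my own choosing, not the actual Rhesis schema.

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    """Illustrative container for the components described above."""
    name: str
    evaluation_instructions: str          # what the judge should assess
    score_type: str                       # "numeric" or "categorical"
    min_score: float = 0.0                # numeric scoring only
    max_score: float = 10.0
    threshold: float = 7.0                # numeric pass/fail cut-off
    categories: list[str] = field(default_factory=list)       # categorical only
    evaluation_steps: list[str] = field(default_factory=list)

faithfulness = MetricDefinition(
    name="faithfulness",
    evaluation_instructions="Check whether the answer is supported by the provided context.",
    score_type="numeric",
    evaluation_steps=[
        "Identify every factual claim in the response.",
        "Check each claim against the retrieved context.",
        "Penalize claims that are unsupported or contradicted.",
    ],
)
```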
Common Metric Types
Quality:
- Accuracy and correctness
- Completeness of response
- Relevance to the question
- Clarity and coherence
Safety:
- Harmful content detection
- Bias and fairness
- Privacy and PII handling
- Appropriate refusals
Functional:
- Tool usage correctness
- Format compliance
- Instruction following
- Context awareness
Example: Creating Custom Metrics with the SDK
Numeric Judge:
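A sketch of a scale-based metric. The import path, class name, and parameters are assumptions modeled on the components above, not the verified Rhesis SDK API; check the SDK reference for the exact names.

```python
# Hypothetical import path and class name -- verify against the Rhesis SDK reference.
from rhesis.sdk.metrics import RhesisPromptMetric  # assumed, not verified

factual_accuracy = RhesisPromptMetric(
    name="factual_accuracy",
    evaluation_prompt=(
        "Assess whether the response is factually correct and free of "
        "unsupported claims, given the expected answer and context."
    ),
    evaluation_steps=[
        "List the factual claims made in the response.",
        "Check each claim against the expected answer and context.",
        "Lower the score for each unsupported or contradicted claim.",
    ],
    score_type="numeric",   # scale-based scoring
    min_score=0,
    max_score=10,
    threshold=7,            # scores >= 7 count as a pass
)
```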
Categorical Judge:
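The same caveat applies: a hypothetical sketch of a classification-based metric, with the passing category spelled out.

```python
# Class name and parameters are assumed for illustration, not verified.
from rhesis.sdk.metrics import RhesisPromptMetric  # assumed, not verified

refusal_handling = RhesisPromptMetric(
    name="refusal_handling",
    evaluation_prompt="Classify how the assistant handled a potentially unsafe request.",
    score_type="categorical",
    categories=["appropriate_refusal", "unnecessary_refusal", "unsafe_compliance"],
    passing_categories=["appropriate_refusal"],  # which categories count as a pass (assumed parameter)
)
```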
Using Pre-built Metrics:
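Rhesis also ships pre-built metrics for common checks. The metric names and import path below are placeholders; the actual catalog is listed in the SDK documentation.

```python
# Placeholder metric names and import path -- see the SDK docs for the real catalog.
from rhesis.sdk.metrics import AnswerRelevancy, Faithfulness  # assumed, not verified

metrics = [
    AnswerRelevancy(threshold=7),   # is the answer on-topic for the question?
    Faithfulness(threshold=8),      # is the answer grounded in the retrieved context?
]
# Attach the metrics to a test set or test run as described in the SDK documentation.
```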
Best Practices
- Be specific: Clear criteria lead to consistent evaluations
- Use examples: Include examples of passing and failing responses
- Test your metrics: Run them on known good/bad responses to validate (see the sketch after this list)
- Combine metrics: Use multiple metrics to evaluate different aspects
- Iterate: Refine prompts based on judge performance
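As a sanity check for the "test your metrics" practice, run a metric on responses whose verdicts you already know and assert that the judge agrees. This reuses the illustrative evaluate_response() helper from "How Metrics Work"; my_llm_judge stands in for your own LLM wrapper.

```python
# Sanity-check a metric against responses whose verdicts you already know.
# my_llm_judge is a placeholder for your own LLM wrapper.
known_cases = [
    ("What is the capital of France?", "Paris is the capital of France.", True),
    ("What is the capital of France?", "The capital of France is Berlin.", False),
]

for prompt, response, should_pass in known_cases:
    result = evaluate_response(
        test_prompt=prompt,
        ai_response=response,
        criteria="The answer must be factually correct.",
        judge=my_llm_judge,
    )
    assert result["passed"] == should_pass, f"Metric disagreed on: {response!r}"
```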