Metric

A quantifiable measurement that evaluates AI behavior using an LLM as a judge, returning pass/fail results with optional numeric scoring.

Also known as: evaluation metric

Overview

Metrics are the core evaluation mechanism in Rhesis, using an LLM as a judge to assess AI responses against defined criteria. Each metric evaluates a specific aspect of behavior, such as accuracy, safety, tone, or helpfulness.

How Metrics Work

  1. Test Execution: Your AI system responds to a test prompt
  2. Judge Evaluation: An LLM judge reviews the response against your criteria
  3. Scoring: The judge assigns a score (pass/fail or numeric)
  4. Reasoning: The judge provides an explanation for the score
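
The flow above can be sketched in a few lines of framework-agnostic Python (JudgeResult, call_judge_llm, and evaluate_response are illustrative placeholders, not Rhesis SDK names):

python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: float       # step 3: the score the judge assigned
    reasoning: str     # step 4: the judge's explanation of the score

def call_judge_llm(judge_prompt: str) -> dict:
    # Placeholder: swap in a real LLM client call here.
    return {"score": 8.0, "reasoning": "The response is accurate and complete."}

def evaluate_response(prompt: str, response: str, criteria: str) -> JudgeResult:
    # Step 1 happens upstream: `response` is what your AI system produced for `prompt`.
    judge_prompt = (
        f"Criteria: {criteria}\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Score the response from 0 to 10 and explain your reasoning."
    )
    raw = call_judge_llm(judge_prompt)  # step 2: the judge reviews the response
    return JudgeResult(score=raw["score"], reasoning=raw["reasoning"])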

Metric Components

Evaluation Instructions: Tell the judge what to evaluate and which aspects of the response should be assessed.

Scoring Configuration: Two types are available (sketched below):

  • Numeric: Scale-based scoring (e.g., 0-10) with a pass threshold
  • Categorical: Classification into predefined categories
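
A minimal sketch of how the two scoring types decide pass/fail (illustrative logic only, not the SDK implementation):

python
def numeric_passes(score: float, threshold: float) -> bool:
    # Numeric: the response passes when its score meets the threshold
    return score >= threshold

def categorical_passes(label: str, passing_categories: list[str]) -> bool:
    # Categorical: the response passes when the judge's label is an allowed category
    return label in passing_categories

print(numeric_passes(7.5, threshold=7.0))                            # True
print(categorical_passes("casual", ["professional", "technical"]))   # False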

Evaluation Steps: Break down the evaluation into clear steps to guide the LLM judge.

Common Metric Types

Quality:

  • Accuracy and correctness
  • Completeness of response
  • Relevance to the question
  • Clarity and coherence

Safety:

  • Harmful content detection
  • Bias and fairness
  • Privacy and PII handling
  • Appropriate refusals

Functional:

  • Tool usage correctness
  • Format compliance
  • Instruction following
  • Context awareness

Example: Creating Custom Metrics with the SDK

Numeric Judge:

python
from rhesis.sdk.metrics import NumericJudge

accuracy_metric = NumericJudge(
    name="factual_accuracy",
    evaluation_prompt="Evaluate if the response is factually accurate.",
    evaluation_steps="""
        1. Identify factual claims in the response
        2. Verify accuracy of each claim
        3. Check for misleading or incomplete information
        4. Assign score based on overall accuracy
    """,
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

# Evaluate a response
result = accuracy_metric.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris."
)
print(f"Score: {result.score}")

Categorical Judge:

python
from rhesis.sdk.metrics import CategoricalJudge

tone_metric = CategoricalJudge(
    name="tone_classifier",
    evaluation_prompt="Classify the tone of the response.",
    categories=["professional", "casual", "technical", "friendly"],
    passing_categories=["professional", "technical"]
)
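
Assuming the categorical judge exposes the same evaluate() interface shown for the numeric judge above (an assumption, since only the numeric example demonstrates it), usage would look like:

python
result = tone_metric.evaluate(
    input="How do I reset my password?",
    output="Go to Settings > Security and select 'Reset password'."
)
print(result.score)  # the judge's verdict; exact result fields may differ for categorical metrics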

Using Pre-built Metrics:

python
from rhesis.sdk.metrics import DeepEvalAnswerRelevancy

metric = DeepEvalAnswerRelevancy(threshold=0.7)
result = metric.evaluate(
    input="What is photosynthesis?",
    output="Photosynthesis is how plants convert light into energy."
)
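
As with custom metrics, the returned result exposes a score you can inspect or assert on:

python
print(f"Relevancy score: {result.score}")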

Best Practices

  • Be specific: Clear criteria lead to consistent evaluations
  • Use examples: Include examples of passing and failing responses
  • Test your metrics: Run them on known good/bad responses to validate (see the sketch after this list)
  • Combine metrics: Use multiple metrics to evaluate different aspects
  • Iterate: Refine prompts based on judge performance
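
For example, validating a metric against responses with known outcomes (a sketch reusing the accuracy_metric defined earlier; the labeled examples are illustrative):

python
# Known-good and known-bad answers to "What is the capital of France?"
labeled_examples = [
    ("The capital of France is Paris.", True),   # should pass
    ("The capital of France is Lyon.", False),   # should fail
]

for output, should_pass in labeled_examples:
    result = accuracy_metric.evaluate(
        input="What is the capital of France?",
        output=output,
    )
    passed = result.score >= 7.0  # the threshold configured on the metric
    print(f"score={result.score}, expected_pass={should_pass}, got_pass={passed}")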
