Metric

A quantifiable measurement that evaluates AI behavior using an LLM as a judge, returning pass/fail results with optional numeric scoring.

Also known as: evaluation metric

Overview

Metrics are the core evaluation mechanism in Rhesis, using an LLM as a judge to assess AI responses against defined criteria. Each metric evaluates a specific aspect of behavior, such as accuracy, safety, tone, or helpfulness.

How Metrics Work

  1. Test Execution: Your AI system responds to a test prompt
  2. Judge Evaluation: An LLM judge reviews the response against your criteria
  3. Scoring: The judge assigns a score (pass/fail or numeric)
  4. Reasoning: The judge provides an explanation for the score
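
The flow above can be sketched in a few lines of framework-agnostic Python (JudgeResult, call_judge_llm, and evaluate_response are illustrative placeholders, not Rhesis SDK names):

python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: float       # step 3: the score the judge assigned
    reasoning: str     # step 4: the judge's explanation of the score

def call_judge_llm(judge_prompt: str) -> dict:
    # Placeholder: swap in a real LLM client call here.
    return {"score": 8.0, "reasoning": "The response is accurate and complete."}

def evaluate_response(prompt: str, response: str, criteria: str) -> JudgeResult:
    # Step 1 happens upstream: `response` is what your AI system produced for `prompt`.
    judge_prompt = (
        f"Criteria: {criteria}\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Score the response from 0 to 10 and explain your reasoning."
    )
    raw = call_judge_llm(judge_prompt)  # step 2: the judge reviews the response
    return JudgeResult(score=raw["score"], reasoning=raw["reasoning"])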

Metric Components

Evaluation Instructions: Tell the judge what to evaluate and which aspects of the response should be assessed.

Scoring Configuration: Two types are available (sketched below):

  • Numeric: Scale-based scoring (e.g., 0-10) with a pass threshold
  • Categorical: Classification into predefined categories
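
A minimal sketch of how the two scoring types decide pass/fail (illustrative logic only, not the SDK implementation):

python
def numeric_passes(score: float, threshold: float) -> bool:
    # Numeric: the response passes when its score meets the threshold
    return score >= threshold

def categorical_passes(label: str, passing_categories: list[str]) -> bool:
    # Categorical: the response passes when the judge's label is an allowed category
    return label in passing_categories

print(numeric_passes(7.5, threshold=7.0))                            # True
print(categorical_passes("casual", ["professional", "technical"]))   # False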

Evaluation Steps: Break down the evaluation into clear steps to guide the LLM judge.

Common Metric Types

Quality:

  • Accuracy and correctness
  • Completeness of response
  • Relevance to the question
  • Clarity and coherence

Safety:

  • Harmful content detection
  • Bias and fairness
  • Privacy and PII handling
  • Appropriate refusals

Functional:

  • Tool usage correctness
  • Format compliance
  • Instruction following
  • Context awareness

Example: Creating Custom Metrics with the SDK

Numeric Judge:

python
from rhesis.sdk.metrics import NumericJudge

accuracy_metric = NumericJudge(
    name="factual_accuracy",
    evaluation_prompt="Evaluate if the response is factually accurate.",
    evaluation_steps="""
        1. Identify factual claims in the response
        2. Verify accuracy of each claim
        3. Check for misleading or incomplete information
        4. Assign score based on overall accuracy
    """,
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

# Evaluate a response
result = accuracy_metric.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris."
)
print(f"Score: {result.score}")

Categorical Judge:

python
from rhesis.sdk.metrics import CategoricalJudge

tone_metric = CategoricalJudge(
    name="tone_classifier",
    evaluation_prompt="Classify the tone of the response.",
    categories=["professional", "casual", "technical", "friendly"],
    passing_categories=["professional", "technical"]
)
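
Assuming the categorical judge exposes the same evaluate() interface shown for the numeric judge above (an assumption, since only the numeric example demonstrates it), usage would look like:

python
result = tone_metric.evaluate(
    input="How do I reset my password?",
    output="Go to Settings > Security and select 'Reset password'."
)
print(result.score)  # the judge's verdict; exact result fields may differ for categorical metrics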

Using Pre-built Metrics:

python
from rhesis.sdk.metrics import DeepEvalAnswerRelevancy

metric = DeepEvalAnswerRelevancy(threshold=0.7)
result = metric.evaluate(
    input="What is photosynthesis?",
    output="Photosynthesis is how plants convert light into energy."
)
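
As with custom metrics, the returned result exposes a score you can inspect or assert on:

python
print(f"Relevancy score: {result.score}")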

Best Practices

  • Be specific: Clear criteria lead to consistent evaluations
  • Use examples: Include examples of passing and failing responses
  • Test your metrics: Run them on known good/bad responses to validate (see the sketch after this list)
  • Combine metrics: Use multiple metrics to evaluate different aspects
  • Iterate: Refine prompts based on judge performance
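
For example, validating a metric against responses with known outcomes (a sketch reusing the accuracy_metric defined earlier; the labeled examples are illustrative):

python
# Known-good and known-bad answers to "What is the capital of France?"
labeled_examples = [
    ("The capital of France is Paris.", True),   # should pass
    ("The capital of France is Lyon.", False),   # should fail
]

for output, should_pass in labeled_examples:
    result = accuracy_metric.evaluate(
        input="What is the capital of France?",
        output=output,
    )
    passed = result.score >= 7.0  # the threshold configured on the metric
    print(f"score={result.score}, expected_pass={should_pass}, got_pass={passed}")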
