Metrics
Overview
The Rhesis SDK provides a comprehensive metrics system for evaluating LLM-based systems. The metrics module supports multiple evaluation frameworks and lets you create custom metrics tailored to your specific use cases. It is also integrated with the backend, so you can work with metrics directly from the platform.
Rhesis integrates with the following open-source evaluation frameworks:
- DeepEval (Apache License 2.0) - The LLM Evaluation Framework by Confident AI
- DeepTeam (Apache License 2.0) - The LLM Red Teaming Framework by Confident AI
- Ragas (Apache License 2.0) - Supercharge Your LLM Application Evaluations by Exploding Gradients
These tools are used through their public APIs. The original licenses and copyright notices can be found in their respective repositories. Rhesis is not affiliated with these projects.
Supported metrics
DeepEval metrics
| Metric | Description | Requires Context | Requires Ground Truth |
|---|---|---|---|
| DeepEvalAnswerRelevancy | Measures answer relevance to the question | No | No |
| DeepEvalFaithfulness | Checks if answer is grounded in context | Yes | No |
| DeepEvalContextualRelevancy | Evaluates context relevance to question | Yes | No |
| DeepEvalContextualPrecision | Measures precision of retrieved context | Yes | Yes |
| DeepEvalContextualRecall | Measures recall of retrieved context | Yes | Yes |
| DeepEvalBias | Detects biased content | No | No |
| DeepEvalToxicity | Detects toxic content | No | No |
| DeepEvalPIILeakage | Detects personally identifiable information | No | No |
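As the table shows, several of these metrics need the retrieved context at evaluation time, passed via the context argument. The snippet below is a minimal sketch for DeepEvalFaithfulness, assuming the same import path, threshold argument, and evaluate() signature as the Quick Start examples further down; the exact defaults may differ:
from rhesis.sdk.metrics import DeepEvalFaithfulness
# Faithfulness checks whether the answer is grounded in the supplied context
metric = DeepEvalFaithfulness(threshold=0.7)
result = metric.evaluate(
    input="Where is the Eiffel Tower located?",
    output="The Eiffel Tower is in Paris.",
    context=["The Eiffel Tower is a wrought-iron tower in Paris, France."]
)
print(f"Score: {result.score}")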
Ragas metrics
| Metric | Description | Requires Context | Requires Ground Truth |
|---|---|---|---|
| RagasAnswerAccuracy | Measures answer accuracy | No | Yes |
| RagasContextRelevance | Evaluates context relevance | Yes | No |
| RagasFaithfulness | Checks answer groundedness | Yes | No |
| RagasAspectCritic | Custom aspect-based evaluation | No | No |
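Metrics that require ground truth are given a reference answer through the expected_output argument. The following is a minimal sketch for RagasAnswerAccuracy, assuming the same import path, constructor, and evaluate() signature used elsewhere in this guide:
from rhesis.sdk.metrics import RagasAnswerAccuracy
# Accuracy compares the model's answer against a reference (ground truth) answer
metric = RagasAnswerAccuracy(threshold=0.7)
result = metric.evaluate(
    input="When was the Eiffel Tower completed?",
    output="The Eiffel Tower was completed in 1889.",
    expected_output="It was completed in 1889."
)
print(f"Score: {result.score}")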
Rhesis custom metrics
| Metric | Description | Configuration |
|---|---|---|
| NumericJudge | LLM-based numeric scoring | Min/max score, threshold, custom prompts |
| CategoricalJudge | LLM-based categorical classification | Categories, passing categories, custom prompts |
If any metrics are missing from this list, or if you would like to use a different provider, please let us know by creating an issue on GitHub.
Quick Start
Using DeepEval metrics
from rhesis.sdk.metrics import DeepEvalAnswerRelevancy
# Initialize metric
metric = DeepEvalAnswerRelevancy(threshold=0.7)
# Evaluate
result = metric.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris."
)
print(f"Score: {result.score}")
print(f"Passed: {result.details['is_successful']}")
Using Ragas metrics
from rhesis.sdk.metrics import RagasFaithfulness
# Initialize metric
metric = RagasFaithfulness(threshold=0.8)
# Evaluate with context
result = metric.evaluate(
    input="What is photosynthesis?",
    output="Photosynthesis is the process by which plants convert light "
           "into energy.",
    context=["Photosynthesis occurs in chloroplasts...",
             "Plants use sunlight..."]
)
print(f"Score: {result.score}")
Creating custom metrics
You can create custom metrics using the NumericJudge and CategoricalJudge classes.
NumericJudge returns the score as a number (for example from 0 to 1), while CategoricalJudge returns the score as a category (for example “good”, “fair”, “poor”).
Each of these requires an evaluation_prompt. For better results, you can also specify evaluation_steps, reasoning, and evaluation_examples.
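For example, the optional fields can be supplied alongside the required ones. The sketch below assumes that reasoning and evaluation_examples, like evaluation_steps, accept free-form strings; the exact format may differ:
from rhesis.sdk.metrics import CategoricalJudge
# Hypothetical illustration of the optional guidance fields
metric = CategoricalJudge(
    name="politeness_check",
    evaluation_prompt="Decide whether the response is polite.",
    evaluation_steps="1. Look for greetings\n2. Check for dismissive language",
    reasoning="Polite responses improve user trust.",  # assumed to be a free-form string
    evaluation_examples="Output: 'Happy to help!' -> polite",  # assumed format
    categories=["polite", "impolite"],
    passing_categories=["polite"]
)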
Numeric judge
NumericJudge requires four parameters to be defined: min_score, max_score, threshold, and threshold_operator.
from rhesis.sdk.metrics import NumericJudge
# Define custom numeric metric
metric = NumericJudge(
    name="response_clarity",
    evaluation_prompt="Rate how clear and understandable the response is.",
    evaluation_steps="1. Check sentence structure\n"
                     "2. Evaluate word choice\n"
                     "3. Assess overall clarity",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)
# Evaluate
result = metric.evaluate(
    input="Explain quantum computing",
    output="Quantum computers use qubits to process information...",
    expected_output="A quantum computer uses quantum mechanics..."
)
Categorical judge
CategoricalJudge requires you to specify categories and passing_categories.
from rhesis.sdk.metrics import CategoricalJudge
# Define custom categorical metric
metric = CategoricalJudge(
    name="tone_classifier",
    evaluation_prompt="Classify the tone of the response.",
    categories=["professional", "casual", "technical", "friendly"],
    passing_categories=["professional", "technical"]
)
# Evaluate
result = metric.evaluate(
    input="Describe machine learning",
    output="Machine learning is a subset of AI...",
    expected_output="ML enables systems to learn from data..."
)
print(f"Category: {result.score}")
print(f"Passed: {result.details['is_successful']}")
Understanding results
All metrics return a MetricResult object:
result = metric.evaluate(input="...", output="...")
# Access score
# Numeric score or categorical value
print(result.score)
# Access details
print(result.details)
# {
# 'score': 0.85,
# 'reason': 'The response is highly relevant...',
# 'is_successful': True,
# 'threshold': 0.7,
# 'score_type': 'numeric'
# }
Configuring models
All metrics require an LLM to perform the evaluation. If no model is specified, the default model is used. You can set the model via the model argument, passing either a model object or a model name.
For more information about models, see the Models Documentation.
from rhesis.sdk.metrics import DeepEvalAnswerRelevancy
from rhesis.sdk.models import get_model
# Use specific model
model = get_model("gemini")
metric = DeepEvalAnswerRelevancy(threshold=0.7, model=model)
# Or pass model name directly
metric = DeepEvalAnswerRelevancy(threshold=0.7, model="gpt-4")
Advanced configuration
Serialization
Custom metrics can be serialized and deserialized using the to_config and from_config methods, which operate on a config dataclass. You can also use the to_dict and from_dict methods.
metric = NumericJudge(
    name="response_clarity",
    evaluation_prompt="Rate how clear and understandable the response is.",
    evaluation_steps="1. Check sentence structure\n"
                     "2. Evaluate word choice\n"
                     "3. Assess overall clarity",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)
config = metric.to_config()
metric = NumericJudge.from_config(config)
Platform Integration
Metrics can be managed both in the platform and in the SDK. To facilitate this workflow, the SDK provides push and pull methods to synchronize metrics with the platform.
Pushing Metrics
To push a metric to the platform, you need to specify the following attributes:
metric = NumericJudge(
    name="response_clarity",
    description="Rate how clear and understandable the response is.",
    metric_type="classification",
    requires_ground_truth=True,
    requires_context=False,
    evaluation_prompt="Rate how clear and understandable the response is.",
    evaluation_steps="1. Check sentence structure\n"
                     "2. Evaluate word choice\n"
                     "3. Assess overall clarity",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)
metric.push()
Pulling Metrics
To pull metrics from the platform, use the pull method and specify the metric name. If the name is not unique, you must also specify the metric ID.
metric = NumericJudge.pull(name="response_clarity")
See Also
- Models Documentation - Configure LLM models for evaluation
- Installation - Setup instructions
- GitHub Repository - Source code and examples