Overview
The Rhesis SDK provides a comprehensive metrics system for evaluating LLM-based systems. The metrics module supports multiple evaluation frameworks, lets you create custom metrics tailored to your specific use cases, and integrates with the backend so you can work with metrics directly from the platform.
Metric Types
Rhesis SDK supports two types of metrics:
Single-Turn Metrics
Single-turn metrics evaluate individual exchanges between user input and system output. These metrics are ideal for assessing:
- RAG Systems: Context relevance, faithfulness, and answer accuracy
- Response Quality: Clarity, relevance, and accuracy of individual responses
- Safety & Compliance: Bias, toxicity, PII leakage, and other safety concerns
- Custom Evaluations: Domain-specific quality assessments
View Single-Turn Metrics Documentation →
Conversational Metrics
Conversational metrics (multi-turn metrics) evaluate the quality of interactions across multiple conversation turns. These metrics are ideal for assessing:
- Conversation Flow: Turn relevancy and coherence across dialogue
- Goal Achievement: Whether objectives are met throughout the conversation
- Role Adherence: Consistency in maintaining assigned roles
- Knowledge Retention: Ability to recall and reference earlier conversation context
- Tool Usage: Appropriate selection and utilization of available tools
- Conversation Completeness: Whether conversations reach satisfactory conclusions
View Conversational Metrics Documentation →
Metric Scopes
Every metric has a metric_scope that controls where and when it runs. The three scope values are:
- Single-Turn: Runs during single-turn test evaluation and per-turn trace evaluation
- Multi-Turn: Runs during multi-turn test evaluation and conversation-level trace evaluation
- Trace: Enables automatic evaluation against live traces (see below)
A metric can have any combination of these scopes. Scopes are additive: including more values makes the metric eligible in more contexts.
Default Scopes by Metric Class
| Metric Class | Default Scope | Notes |
|---|---|---|
| NumericJudge | Single-Turn, Multi-Turn | See note below on multi-turn behavior |
| CategoricalJudge | Single-Turn, Multi-Turn | See note below on multi-turn behavior |
| ConversationalJudge | Single-Turn, Multi-Turn | Receives structured ConversationHistory |
| GoalAchievementJudge | Single-Turn, Multi-Turn | Receives structured ConversationHistory |
| GarakDetectorMetric | Single-Turn only | Operates on individual prompt/response pairs |
How single-turn metrics work in multi-turn tests: When a NumericJudge or CategoricalJudge is used in a multi-turn evaluation, the full conversation is serialized to plain text and passed as the output parameter. The metric does not receive a structured ConversationHistory object — it evaluates the conversation as a single text blob. This means the evaluation quality depends entirely on the evaluation_prompt you write. For turn-aware evaluation (e.g., analyzing coherence between specific turns), use a ConversationalJudge instead, which receives the full structured conversation with individual turns.
You can override the default scope when creating a metric:
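A minimal sketch of an override follows; the class name, the metric_scope field, and the scope values come from this page, while the import path and the remaining constructor arguments are assumptions rather than the confirmed SDK signature:

```python
# Sketch only: the import path and the keyword arguments other than
# metric_scope are assumptions, not the confirmed SDK signature.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

# A numeric judge made eligible for test execution and trace evaluation.
helpfulness = NumericJudge(
    name="helpfulness",
    evaluation_prompt="Rate how helpful the response is on a 0-10 scale.",
    metric_scope=["Single-Turn", "Multi-Turn", "Trace"],  # overrides the default
)
```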
Trace Scope
The Trace scope enables a metric for automatic evaluation against live production traces. Unlike Single-Turn and Multi-Turn (which apply during test execution), Trace activates the background evaluation pipeline that processes traces after ingestion.
Combine Trace with Single-Turn or Multi-Turn to control which evaluation phase applies the metric:
| Scope combination | When it runs | Use case |
|---|---|---|
["Trace", "Single-Turn"] | Immediately after each turn | Per-turn guardrails: safety, toxicity, response quality |
["Trace", "Multi-Turn"] | After conversation inactivity timeout | Full-conversation analysis: coherence, goal achievement |
["Trace"] alone | Per-turn on single-turn traces; per-conversation on multi-turn | General-purpose metrics that adapt to the trace type |
Order within the list does not matter. Adding Trace to a metric that already has Single-Turn and Multi-Turn makes it eligible for both test execution and trace evaluation.
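For instance, a per-turn safety guardrail combines Trace with Single-Turn. As with the earlier sketch, only the class name, metric_scope, and the scope values are confirmed by this page; everything else is an assumption:

```python
# Sketch only: constructor arguments other than metric_scope are assumptions.
from rhesis.sdk.metrics import CategoricalJudge  # assumed import path

toxicity_guardrail = CategoricalJudge(
    name="toxicity-guardrail",
    evaluation_prompt="Classify the response as 'safe' or 'toxic'.",
    categories=["safe", "toxic"],            # assumed parameter name
    metric_scope=["Trace", "Single-Turn"],   # evaluated right after each turn
)
```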
For full details on how trace metrics evaluation works — including the two-phase pipeline, debounce timing, project configuration, and first-turn handling — see the Trace Metrics documentation.
Framework Integration
Rhesis integrates with the following open-source evaluation frameworks:
- DeepEval (Apache License 2.0) - The LLM Evaluation Framework by Confident AI
- DeepTeam (Apache License 2.0) - The LLM Red Teaming Framework by Confident AI
- Ragas (Apache License 2.0) - Supercharge Your LLM Application Evaluations by Exploding Gradients
- Garak (Apache License 2.0) - LLM Vulnerability Scanner by NVIDIA
These tools are used through their public APIs. The original licenses and copyright notices can be found in their respective repositories. Rhesis is not affiliated with these projects.
Quick Example
API Key Required: All examples require a valid Rhesis API key. Set your API key using:
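For example, assuming the SDK reads the key from a RHESIS_API_KEY environment variable (the variable name is an assumption):

```python
import os

# Assumed environment variable name; check the Installation & Setup guide
# for the authoritative configuration steps.
os.environ["RHESIS_API_KEY"] = "your-api-key"
```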
For more information, see the Installation & Setup guide.
Single-Turn Evaluation
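A minimal single-turn sketch using the NumericJudge builder described under Custom Metrics below; the evaluate() method and its parameter names are assumptions (only output and evaluation_prompt appear elsewhere on this page):

```python
# Sketch only: the evaluate() call and its parameter names are assumptions.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

answer_quality = NumericJudge(
    name="answer-quality",
    evaluation_prompt="Rate the response from 0 to 10 for relevance and factual accuracy.",
)

result = answer_quality.evaluate(
    input="What is the capital of France?",    # assumed parameter name
    output="The capital of France is Paris.",  # 'output' is referenced on this page
)
print(result)  # e.g. a score plus the judge's reasoning
```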
Conversational Evaluation
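A conversational sketch using ConversationalJudge and ConversationHistory (both class names appear on this page; the turn structure and the evaluate() signature are assumptions):

```python
# Sketch only: ConversationHistory construction and evaluate() are assumed shapes.
from rhesis.sdk.metrics import ConversationalJudge, ConversationHistory  # assumed path

coherence = ConversationalJudge(
    name="conversation-coherence",
    evaluation_prompt="Judge whether the assistant stays coherent and on-topic across turns.",
)

history = ConversationHistory(
    turns=[  # assumed structure: a list of role/content turns
        {"role": "user", "content": "I'd like to book a flight to Berlin."},
        {"role": "assistant", "content": "Sure, what dates are you travelling?"},
        {"role": "user", "content": "Next Friday, returning Sunday."},
        {"role": "assistant", "content": "Got it, I found three options for those dates."},
    ]
)

result = coherence.evaluate(conversation=history)  # assumed parameter name
print(result)
```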
Custom Metrics
In addition to framework-provided metrics, Rhesis offers custom metric builders:
For Single-Turn Evaluation
- NumericJudge: Create custom numeric scoring metrics (e.g., 0-10 scale)
- CategoricalJudge: Create custom categorical classification metrics
For Conversational Evaluation
- ConversationalJudge: Create custom conversational quality metrics
- GoalAchievementJudge: Evaluate goal achievement with custom criteria
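As a rough illustration of the categorical and goal-achievement builders (the class names come from this page; every constructor argument shown is an assumption):

```python
# Sketch only: every constructor argument shown here is an assumption.
from rhesis.sdk.metrics import CategoricalJudge, GoalAchievementJudge  # assumed path

tone = CategoricalJudge(
    name="tone",
    evaluation_prompt="Classify the tone of the response.",
    categories=["formal", "neutral", "casual"],  # assumed parameter name
)

booking_goal = GoalAchievementJudge(
    name="booking-goal",
    evaluation_prompt="Did the assistant complete the flight booking the user asked for?",
)
```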
Platform Integration
Metrics can be managed both in the platform and in the SDK. The SDK provides push and pull methods to synchronize metrics with the platform.
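A hedged sketch of that flow; push and pull are named above, while the exact call sites and signatures are assumptions:

```python
# Sketch only: where push()/pull() live and what they accept is assumed.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

# Push a locally defined metric so it appears in the platform.
helpfulness = NumericJudge(
    name="helpfulness",
    evaluation_prompt="Rate how helpful the response is on a 0-10 scale.",
)
helpfulness.push()

# Pull a metric that was created or edited in the platform.
platform_metric = NumericJudge.pull(name="helpfulness")  # assumed signature
```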
Next Steps
- Single-Turn Metrics - Learn about all available single-turn metrics
- Conversational Metrics - Learn about all available conversational metrics
- Trace Metrics - Automatic evaluation on live production traces
- Models Documentation - Configure LLM models for evaluation
- Installation & Setup - Setup instructions
Need Help?
If any metrics are missing from the list, or you would like to use a different provider, please let us know by creating an issue on GitHub.