Single-Turn Metrics
Overview
Single-turn metrics evaluate individual exchanges between user input and system output. These metrics are ideal for assessing the quality of standalone responses, RAG systems, and classification tasks.
API Key Required: All examples in this documentation require a valid Rhesis API key. Set your API key using:
For more information, see the Installation & Setup guide.
Rhesis integrates with the following open-source evaluation frameworks:
- DeepEval - Apache License 2.0 The LLM Evaluation Framework by Confident AI
- DeepTeam - Apache License 2.0 The LLM Red Teaming Framework by Confident AI
- Ragas - Apache License 2.0 Supercharge Your LLM Application Evaluations by Exploding Gradients
- Garak - Apache License 2.0 LLM Vulnerability Scanner by NVIDIA
These tools are used through their public APIs. The original licenses and copyright notices can be found in their respective repositories. Rhesis is not affiliated with these projects.
Supported Metrics
DeepEval Metrics
| Metric | Description | Requires Context | Requires Ground Truth | Reference |
|---|---|---|---|---|
DeepEvalAnswerRelevancy | Measures answer relevance to the question | No | No | Docs |
DeepEvalFaithfulness | Checks if answer is grounded in context | Yes | No | Docs |
DeepEvalContextualRelevancy | Evaluates context relevance to question | Yes | No | Docs |
DeepEvalContextualPrecision | Measures precision of retrieved context | Yes | Yes | Docs |
DeepEvalContextualRecall | Measures recall of retrieved context | Yes | Yes | Docs |
DeepEvalBias | Detects biased content in responses | No | No | Docs |
DeepEvalToxicity | Detects toxic content in responses | No | No | Docs |
DeepEvalPIILeakage | Detects personally identifiable information | No | No | Docs |
DeepEvalRoleViolation | Detects when assistant violates assigned role | No | No | Docs |
DeepEvalMisuse | Detects potential misuse of the system | No | No | Docs |
DeepEvalNonAdvice | Ensures assistant doesn’t give restricted advice | No | No | Docs |
DeepTeam Metrics
| Metric | Description | Requires Context | Requires Ground Truth | Reference |
|---|---|---|---|---|
DeepTeamSafety | Detects safety violations | No | No | Docs |
DeepTeamIllegal | Detects illegal content or requests | No | No | Docs |
Ragas Metrics
| Metric | Description | Requires Context | Requires Ground Truth | Reference |
|---|---|---|---|---|
RagasContextRelevance | Evaluates context relevance to question | Yes | No | Docs |
RagasAnswerAccuracy | Measures answer accuracy against ground truth | No | Yes | Docs |
RagasFaithfulness | Checks if answer is grounded in context | Yes | No | Docs |
RagasAspectCritic | Custom aspect-based evaluation | No | No | Docs |
Garak Metrics
Optional Dependency: Garak is an optional dependency. To use Garak metrics, install the SDK with the garak extra:
pip install rhesis-sdk[garak]Or if using uv:
uv add rhesis-sdk[garak]Garak is an LLM vulnerability scanner that detects security weaknesses and jailbreak vulnerabilities. The metrics return numeric scores from 0.0 to 1.0, where lower scores indicate safer responses.
The table below reflects the canonical metric names as of Garak v0.14. Previous names
(e.g. MisleadingClaim, XSSDetector, LeakReplayDetector) are still accepted as legacy
aliases and continue to work without any code changes.
| Metric | Description | Detection Type | Reference |
|---|---|---|---|
MitigationBypass | Detects jailbreak attempts and guardrail bypasses | Pattern matching | Docs |
Continuation | Detects continuation of harmful content after partial unsafe prompts | Pattern matching | Docs |
MustRefuteClaimKW | Detects misleading or false claims | Pattern matching | Docs |
PerspectiveToxicity | Detects toxic content using Google Perspective API | API-based | Docs |
XSS | Detects cross-site scripting payload generation | Pattern matching | Docs |
StartsWith | Detects training data leakage through replay | Pattern matching | Docs |
DontStartYes | Detects snowball factual errors and hallucination propagation | Pattern matching | Docs |
PythonPypi | Detects hallucinated Python package names | Pattern matching | Docs |
Anthro | Detects Language Model Risk Cards (LMRC) violations | Pattern matching | Docs |
AnyCode | Detects functional malicious code generation | Pattern matching | Docs |
EICAR | Detects known malicious content signatures | Pattern matching | Docs |
Escaped | Detects ANSI escape sequence injection in outputs | Pattern matching | Docs |
ApiKey | Detects API key and credential leakage | Pattern matching | Docs |
RepeatDiverges | Detects training data memorization via repetitive divergence | Pattern matching | Docs |
PythonCodeExecution | Detects exploit code generation | Pattern matching | Docs |
FileIsExecutable | Detects malicious file format content generation | Pattern matching | Docs |
Do Not Answer detection: Garak’s ToxicCommentModel is a local model that cannot be
run ad hoc. Both the Rhesis SDK and the platform automatically substitute
PerspectiveToxicity (Google Perspective API) for do-not-answer and toxicity detection.
Rhesis Cloud provides this key automatically. Self-hosted deployments require a
PERSPECTIVE_API_KEY environment variable.
Rhesis Custom Metrics
| Metric | Description | Configuration |
|---|---|---|
NumericJudge | LLM-based numeric scoring (e.g., 0-10 scale) | Min/max score, threshold, custom prompts |
CategoricalJudge | LLM-based categorical classification | Categories, passing categories, custom prompts |
If any metrics are missing from the list, or you would like to use a different provider, please let us know by creating an issue on GitHub .
Quick Start
Using DeepEval Metrics
Using Ragas Metrics
Using Garak Metrics
You can also use the GarakDetectorMetric class directly for more control:
Creating Custom Metrics
You can create custom metrics using the NumericJudge and CategoricalJudge classes.
Numeric Judge
NumericJudge returns a numeric score (e.g., from 0 to 10) and requires four specific parameters: min_score, max_score, threshold, and threshold_operator.
Using Metadata in Evaluation
Both NumericJudge and CategoricalJudge accept an optional metadata parameter. When provided, the metadata JSON is included in the evaluation prompt as “Response Metadata”, allowing the judge model to reason about structured data alongside the text response.
This is useful for evaluating aspects like token efficiency, latency, confidence scores, or any structured data returned by your endpoint.
When running tests through the platform, metadata is extracted automatically from your endpoint’s response using the metadata field in the response mapping. The metadata is then passed to each metric during evaluation.
Categorical Judge
CategoricalJudge returns a categorical value and requires you to specify categories and passing_categories.
Understanding Results
All metrics return a MetricResult object:
Configuring Models
All metrics require an LLM model to perform the evaluation. If no model is specified, the default model will be used. You can specify the model using the model argument.
For more information about models, see the Models Documentation.
Advanced Configuration
Serialization
Custom metrics can be serialized and deserialized using the from_config/to_config or from_dict/to_dict methods.
Platform Integration
Metrics can be managed both in the platform and in the SDK. The SDK provides push and pull methods to synchronize metrics with the platform.
Pushing Metrics
To push a metric to the platform:
Pulling Metrics
To pull metrics from the platform, use the pull method and specify the metric name. If the name is not unique, you must also specify the metric ID.
See Also
- Conversational Metrics - Multi-turn conversation evaluation
- Models Documentation - Configure LLM models for evaluation
- Installation & Setup - Setup instructions
- GitHub Repository - Source code and examples