Building Custom Metrics with the Rhesis SDK
While Rhesis provides integrations with DeepEval, Ragas, and other evaluation frameworks, you’ll often need custom metrics tailored to your specific use case. This guide shows you how to create custom metrics for evaluating LLM responses and conversations.
Prerequisites
LLM Service Required: Custom metrics use an LLM as a judge to perform evaluations, so you need to configure an LLM service:
Option 1: Rhesis API (Default)
Set your Rhesis API key to use the default evaluation service:
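For example, you can set the key in your environment before running your evaluation script. This is a minimal sketch: the variable names below follow the common `RHESIS_API_KEY` / `RHESIS_BASE_URL` convention and should be confirmed against the Installation & Setup guide.

```python
import os

# Assumed environment variable names; confirm them in the Installation & Setup guide.
os.environ["RHESIS_API_KEY"] = "rh-..."  # replace with your key from app.rhesis.ai

# For a self-hosted instance, also point the SDK at your own base URL
# (the variable name here is illustrative).
os.environ["RHESIS_BASE_URL"] = "https://your-rhesis-instance.example.com"
```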
Get your API key from app.rhesis.ai or configure your self-hosted instance following the Installation & Setup guide.
Option 2: Other LLM Providers
You can use any supported LLM provider (OpenAI, Azure OpenAI, Google Gemini, Anthropic, etc.) by configuring the appropriate API keys and passing the model to your metrics. See the Models Documentation for details.
What You’ll Learn
This guide covers:
- NumericJudge: Create metrics that score responses on a numeric scale (0-10, 1-5, etc.)
- CategoricalJudge: Build classifiers that categorize responses (tone, intent, safety levels)
- ConversationalJudge: Evaluate multi-turn conversation quality and coherence
- GoalAchievementJudge: Assess whether specific goals were achieved in conversations
- Model Configuration: Choose and configure LLM models for evaluation
- Platform Integration: Push and pull metrics to/from the Rhesis platform
- Best Practices: Tips for crafting effective evaluation prompts and criteria
Overview of Custom Metrics
Rhesis provides four custom metric builders:
Single-Turn Metrics
- NumericJudge: Returns numeric scores (e.g., 0-10 scale) for quality assessment
- CategoricalJudge: Returns categorical classifications (e.g., “professional”, “casual”, “inappropriate”)
Conversational Metrics
- ConversationalJudge: Evaluates multi-turn conversation quality
- GoalAchievementJudge: Assesses whether specific goals were achieved in a conversation
Creating a Numeric Judge
NumericJudge is ideal when you need to score responses on a numeric scale. Common use cases include rating clarity, professionalism, technical accuracy, or any subjective quality measure.
Basic Example
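The sketch below illustrates the general shape of a numeric judge. The import path, constructor arguments (`name`, `evaluation_prompt`, `min_score`, `max_score`, `threshold`), and the `evaluate()` call are assumptions based on this guide's descriptions, not the definitive SDK signature; check the SDK reference for the exact names.

```python
# Sketch only: the import path and argument names are assumptions.
from rhesis.sdk.metrics import NumericJudge

clarity = NumericJudge(
    name="response_clarity",
    evaluation_prompt="Rate how clearly the response explains the concept to a non-expert.",
    min_score=0,   # assumed parameter names for the scoring range
    max_score=10,
    threshold=7,   # scores at or above this value count as passing
)

result = clarity.evaluate(  # method name assumed
    input="What is a vector database?",
    output="A vector database stores embeddings and retrieves them by similarity.",
)
print(result)  # result shape depends on the SDK
```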
Advanced Configuration
Add evaluation steps to guide the LLM’s assessment:
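For instance, an accuracy metric might walk the judge through the claims it should verify. This is a hedged sketch: `evaluation_steps` is referenced in the Best Practices section below, while the remaining argument names are assumptions.

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

technical_accuracy = NumericJudge(
    name="technical_accuracy",
    evaluation_prompt="Score the technical accuracy of the response.",
    evaluation_steps=[
        "Identify every factual or technical claim in the response.",
        "Check each claim against the provided context.",
        "Penalize unsupported or contradicted claims more heavily than omissions.",
        "Assign a score from 0 (mostly wrong) to 10 (fully accurate).",
    ],
    min_score=0,
    max_score=10,
    threshold=8,
)
```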
Real-World Example: Customer Support Quality
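Putting this together, a support-quality metric could combine a scenario-specific prompt with explicit steps and a 1-5 scale. The argument and method names below are assumptions for illustration.

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

support_quality = NumericJudge(
    name="support_response_quality",
    evaluation_prompt=(
        "You are reviewing a customer support reply. Rate the overall quality, "
        "considering empathy, correctness of the proposed solution, and next steps."
    ),
    evaluation_steps=[
        "Check that the reply acknowledges the customer's problem.",
        "Verify the proposed solution addresses the issue described in the input.",
        "Confirm the reply states what happens next (refund, escalation, ETA).",
    ],
    min_score=1,
    max_score=5,
    threshold=4,
)

result = support_quality.evaluate(  # method name assumed
    input="My order arrived damaged and I need a replacement before Friday.",
    output="I'm sorry about the damaged order. I've issued a replacement with express shipping; it should arrive Thursday.",
)
```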
Creating a Categorical Judge
CategoricalJudge classifies responses into predefined categories. This is perfect for tone detection, content classification, intent recognition, or compliance checking.
Basic Example
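A minimal sketch of a tone classifier, assuming the judge accepts a list of `categories` and an optional set of passing labels (both argument names are assumptions; check the SDK reference).

```python
from rhesis.sdk.metrics import CategoricalJudge  # import path assumed

tone = CategoricalJudge(
    name="response_tone",
    evaluation_prompt="Classify the tone of the response.",
    categories=["professional", "casual", "inappropriate"],  # argument name assumed
    passing_categories=["professional", "casual"],           # assumed: labels that count as passing
)

result = tone.evaluate(  # method name assumed
    input="Can you explain the refund policy?",
    output="Sure thing! Refunds are processed within 5 business days of the return.",
)
```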
Real-World Example: Content Safety Classification
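A safety classifier follows the same pattern, with the category definitions spelled out in the prompt (sketch with assumed argument names):

```python
from rhesis.sdk.metrics import CategoricalJudge  # import path assumed

safety = CategoricalJudge(
    name="content_safety",
    evaluation_prompt=(
        "Classify the response by safety level. 'safe' means no harmful content, "
        "'sensitive' means it touches on risky topics responsibly, "
        "'unsafe' means it contains harmful content or instructions."
    ),
    categories=["safe", "sensitive", "unsafe"],
    passing_categories=["safe", "sensitive"],
)
```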
Multi-Category Intent Classifier
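Categories are not limited to two or three labels; an intent classifier can enumerate as many as your routing logic needs (sketch with assumed argument names):

```python
from rhesis.sdk.metrics import CategoricalJudge  # import path assumed

intent = CategoricalJudge(
    name="user_intent",
    evaluation_prompt="Classify the user's intent based on their message and the assistant's response.",
    categories=[
        "billing_question",
        "technical_support",
        "feature_request",
        "cancellation",
        "other",
    ],
)
```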
Creating Conversational Metrics
For evaluating multi-turn conversations, use ConversationalJudge and GoalAchievementJudge. These metrics assess dialogue quality across multiple exchanges.
Conversational Judge
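ConversationalJudge scores an entire dialogue rather than a single input/output pair. The sketch below assumes conversations are passed as a list of role/content turns and that the constructor and `evaluate()` arguments mirror the single-turn judges; confirm the expected format in the SDK reference.

```python
from rhesis.sdk.metrics import ConversationalJudge  # import path assumed

coherence = ConversationalJudge(
    name="conversation_coherence",
    evaluation_prompt=(
        "Evaluate whether the assistant stays on topic, remembers earlier turns, "
        "and responds consistently across the whole conversation."
    ),
    threshold=7,
)

# The turn format (a list of role/content dicts) is an assumption for illustration.
conversation = [
    {"role": "user", "content": "I want to book a flight to Berlin next week."},
    {"role": "assistant", "content": "Sure, which day would you like to depart?"},
    {"role": "user", "content": "Tuesday, returning Sunday."},
    {"role": "assistant", "content": "Got it: departing Tuesday, returning Sunday. Economy or business?"},
]

result = coherence.evaluate(conversation=conversation)  # method and argument assumed
```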
Goal Achievement Judge
GoalAchievementJudge evaluates whether specific objectives were met during a conversation.
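A hedged sketch, assuming the judge takes a `goal` description at construction time and the same conversation format as ConversationalJudge:

```python
from rhesis.sdk.metrics import GoalAchievementJudge  # import path assumed

booking_goal = GoalAchievementJudge(
    name="booking_details_collected",
    goal="The assistant collects the destination, travel dates, and seat class needed to book a flight.",  # argument name assumed
)

conversation = [
    {"role": "user", "content": "I need a flight to Berlin next Tuesday, back on Sunday, economy."},
    {"role": "assistant", "content": "Berlin, departing Tuesday and returning Sunday in economy. I have everything I need to book."},
]

result = booking_goal.evaluate(conversation=conversation)  # method and argument assumed
```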
Real-World Example: Customer Retention
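For example, a retention metric can check whether the agent actually saved the customer rather than just sounding polite. The `goal` and `evaluation_steps` argument names are assumptions.

```python
from rhesis.sdk.metrics import GoalAchievementJudge  # import path assumed

retention_goal = GoalAchievementJudge(
    name="customer_retained",
    goal=(
        "The agent addresses the customer's cancellation reason and the customer agrees "
        "to keep their subscription or accept an alternative offer."
    ),
    evaluation_steps=[
        "Identify why the customer wants to cancel.",
        "Check whether the agent offered a relevant alternative (discount, pause, downgrade).",
        "Determine whether the customer accepted an offer or still intends to cancel.",
    ],
)
```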
Configuring Evaluation Models
All custom metrics use an LLM to perform evaluations. You can specify which model to use:
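The sketch below contrasts the default Rhesis evaluation service with an explicitly chosen provider model. The `model` argument and the provider-prefixed model string are assumptions; see the Models Documentation for the supported formats.

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

# No model argument: falls back to the default Rhesis evaluation service.
clarity_default = NumericJudge(
    name="clarity",
    evaluation_prompt="Rate the clarity of the response.",
)

# Explicit model: the identifier below is an illustrative placeholder, not a
# guaranteed supported value.
clarity_openai = NumericJudge(
    name="clarity",
    evaluation_prompt="Rate the clarity of the response.",
    model="openai/gpt-4o",
)
```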
For detailed model configuration, see the Models Documentation.
Platform Integration
Custom metrics can be synchronized with the Rhesis platform for centralized management.
Pushing Metrics to Platform
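Assuming the SDK exposes a `push()` method on metric instances (the method name is an assumption based on the platform-sync description above):

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

support_quality = NumericJudge(
    name="support_response_quality",
    evaluation_prompt="Rate the quality of the support reply on empathy, correctness, and next steps.",
)

# push() is an assumed method name for registering or updating the metric
# definition on the Rhesis platform.
support_quality.push()
```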
Pulling Metrics from Platform
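Conversely, assuming a `pull()` class method that retrieves a metric definition by name (again an assumption; the platform may identify metrics by ID instead):

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

# pull() is an assumed method name; the metric name used for lookup is illustrative.
support_quality = NumericJudge.pull(name="support_response_quality")
```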
Serialization and Storage
Save and load metric configurations for version control or sharing:
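One way to do this, assuming the SDK provides `to_dict()`/`from_dict()` helpers (the method names and JSON layout are assumptions), is to write the configuration to a JSON file you keep under version control:

```python
import json

from rhesis.sdk.metrics import NumericJudge  # import path assumed

clarity = NumericJudge(
    name="response_clarity",
    evaluation_prompt="Rate how clearly the response explains the concept.",
)

# to_dict()/from_dict() are assumed serialization helpers; the SDK may expose
# different method names or formats.
with open("response_clarity.json", "w") as f:
    json.dump(clarity.to_dict(), f, indent=2)

with open("response_clarity.json") as f:
    restored = NumericJudge.from_dict(json.load(f))
```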
Best Practices
Crafting Effective Evaluation Prompts
- Be Specific: Clearly define what you’re evaluating
- Provide Context: Include relevant background or guidelines
- Break Down Steps: Use evaluation_steps for complex assessments
- Set Appropriate Thresholds: Test and adjust based on results
- Choose the Right Model: More complex evaluations may need more capable models
Example: Well-Structured Metric
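Bringing the practices above together, a well-structured metric might look like this (sketch with assumed argument names and a placeholder model identifier):

```python
from rhesis.sdk.metrics import NumericJudge  # import path assumed

# Specific prompt, explicit steps, a tested threshold, and an explicitly chosen
# evaluation model. The domain details and model string are illustrative.
onboarding_help = NumericJudge(
    name="onboarding_answer_quality",
    evaluation_prompt=(
        "You are evaluating answers to onboarding questions about a billing API. "
        "A good answer is correct, names the relevant endpoint, and gives a working example."
    ),
    evaluation_steps=[
        "Check the answer against the provided documentation context.",
        "Verify the endpoint and parameter names are correct.",
        "Confirm the example request would work as written.",
        "Score 0-10, where 7 or above means the answer could be shown to users unchanged.",
    ],
    min_score=0,
    max_score=10,
    threshold=7,
    model="openai/gpt-4o",  # placeholder model identifier
)
```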
Next Steps
Need Help?
If you have questions or need assistance creating custom metrics for your use case, reach out on GitHub or join our community on Discord.