
Numeric Scoring


A metric scoring type that uses a numeric scale (e.g., 0-10) with a defined pass/fail threshold.

Also known as: numeric score

Overview

Numeric scoring provides granular evaluation on a scale, allowing you to track subtle improvements and set specific passing thresholds.
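
For example, on a 0-10 scale with a threshold of 7.0, a score of 7.5 passes while 6.5 fails. A minimal sketch of that pass/fail check, assuming the metric exposes an evaluate() method returning a result with a numeric score attribute (the method name and result shape are assumptions, not confirmed SDK API):

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="quality",
    evaluation_prompt="Evaluate quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

# Hypothetical call: evaluate() and the result's score field are assumed here
result = metric.evaluate(
    input="What is your refund policy?",
    output="Refunds are available within 30 days of purchase."
)

# A score passes when it meets or exceeds the threshold
print(result.score >= 7.0)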

Common Scales

0-10 Scale:

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="quality",
    evaluation_prompt="Evaluate quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

Good for: General quality assessment.

1-5 Scale:

python
metric = NumericJudge(
    name="rating",
    evaluation_prompt="Rate the response",
    min_score=1.0,
    max_score=5.0,
    threshold=4.0
)

Good for: Quick evaluations, star ratings.

0-100 Scale:

python
metric = NumericJudge(
    name="percentage",
    evaluation_prompt="Score as percentage",
    min_score=0.0,
    max_score=100.0,
    threshold=70.0
)

Good for: Percentage-style scoring, fine-grained evaluation.

Setting Thresholds

  • Strictness: A higher threshold means a stricter pass criterion
  • Use case: Critical features (e.g., safety) warrant higher thresholds
  • Baseline: Set the initial threshold based on current performance, then raise it (see the sketch after the examples)

Examples:

python
from rhesis.sdk.metrics import NumericJudge

# Safety: Very strict
safety_metric = NumericJudge(
    name="harm_refusal",
    evaluation_prompt="Evaluate harm refusal",
    min_score=0.0,
    max_score=10.0,
    threshold=9.0  # 90% required
)

# Helpfulness: Moderate
quality_metric = NumericJudge(
    name="response_quality",
    evaluation_prompt="Evaluate response quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0  # 70% required
)

# Experimental: Lenient
creative_metric = NumericJudge(
    name="creative_writing",
    evaluation_prompt="Evaluate creativity",
    min_score=0.0,
    max_score=10.0,
    threshold=5.0  # 50% required
)
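
One way to set a baseline from current performance is to anchor the threshold at the median score of a recent evaluation run and raise it as quality improves. A hedged sketch in plain Python (the scores below are illustrative, not from a real run):

python
import statistics

from rhesis.sdk.metrics import NumericJudge

# Hypothetical scores from a recent evaluation run
recent_scores = [6.8, 7.2, 7.5, 6.9, 8.1, 7.4, 7.0]

# Anchoring at the median means roughly half of today's outputs pass
baseline = statistics.median(recent_scores)  # 7.2

metric = NumericJudge(
    name="response_quality",
    evaluation_prompt="Evaluate response quality",
    min_score=0.0,
    max_score=10.0,
    threshold=baseline
)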

Benefits

Numeric scoring provides granularity that lets you see small improvements over time. It offers flexibility to adjust thresholds as your system's quality improves. Scores are easily comparable across different tests, and you can track average scores over time to identify trends in performance.
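
Comparability across scales is easiest after normalization: a 4.0 on a 1-5 scale and a 75.0 on a 0-100 scale both sit at the same relative point in their ranges. A small plain-Python sketch (not SDK functionality):

python
def normalize(score: float, min_score: float, max_score: float) -> float:
    """Map a raw score onto 0-1 so scores from different scales compare."""
    return (score - min_score) / (max_score - min_score)

print(normalize(4.0, 1.0, 5.0))     # 0.75
print(normalize(75.0, 0.0, 100.0))  # 0.75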

Best Practices

  • Anchor scores: Define what each score level means (see the example below)
  • Avoid extremes: Reserve 0 and 10 for cases that truly warrant them
  • Review distributions: Check whether scores cluster or spread across the scale
  • Adjust thresholds: Raise the bar as quality improves
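
An anchored evaluation prompt spells out what each score band means, which keeps the judge's scoring consistent. An illustrative example (the rubric wording is an assumption, not taken from the SDK docs):

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="answer_quality",
    evaluation_prompt=(
        "Score the response from 0 to 10:\n"
        "0-2: Incorrect or off-topic\n"
        "3-5: Partially correct, missing key information\n"
        "6-8: Correct and mostly complete\n"
        "9-10: Correct, complete, and clearly written"
    ),
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)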
