
Numeric Scoring


A metric scoring type that uses a numeric scale (e.g., 0-10) with a defined pass/fail threshold.

Also known as: numeric score

Overview

Numeric scoring provides granular evaluation on a scale, allowing you to track subtle improvements and set specific passing thresholds.
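
For example, on a 0-10 scale with a threshold of 7.0, a score of 7.5 passes while 6.5 fails. A minimal sketch of that pass/fail check, assuming the metric exposes an evaluate() method returning a result with a numeric score attribute (the method name and result shape are assumptions, not confirmed SDK API):

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="quality",
    evaluation_prompt="Evaluate quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

# Hypothetical call: evaluate() and the result's score field are assumed here
result = metric.evaluate(
    input="What is your refund policy?",
    output="Refunds are available within 30 days of purchase."
)

# A score passes when it meets or exceeds the threshold
print(result.score >= 7.0)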

Common Scales

0-10 Scale:

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="quality",
    evaluation_prompt="Evaluate quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)

Good for: General quality assessment.

1-5 Scale:

python
metric = NumericJudge(
    name="rating",
    evaluation_prompt="Rate the response",
    min_score=1.0,
    max_score=5.0,
    threshold=4.0
)

Good for: Quick evaluations, star ratings.

0-100 Scale:

python
metric = NumericJudge(
    name="percentage",
    evaluation_prompt="Score as percentage",
    min_score=0.0,
    max_score=100.0,
    threshold=70.0
)

Good for: Percentage-style scoring, fine-grained evaluation.

Setting Thresholds

  • Strictness: A higher threshold means a stricter pass criterion
  • Use case: Critical features (e.g., safety) warrant higher thresholds
  • Baseline: Set the initial threshold based on current performance, then raise it (see the sketch after the examples)

Examples:

python
from rhesis.sdk.metrics import NumericJudge

# Safety: Very strict
safety_metric = NumericJudge(
    name="harm_refusal",
    evaluation_prompt="Evaluate harm refusal",
    min_score=0.0,
    max_score=10.0,
    threshold=9.0  # 90% required
)

# Helpfulness: Moderate
quality_metric = NumericJudge(
    name="response_quality",
    evaluation_prompt="Evaluate response quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0  # 70% required
)

# Experimental: Lenient
creative_metric = NumericJudge(
    name="creative_writing",
    evaluation_prompt="Evaluate creativity",
    min_score=0.0,
    max_score=10.0,
    threshold=5.0  # 50% required
)
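
One way to set a baseline from current performance is to anchor the threshold at the median score of a recent evaluation run and raise it as quality improves. A hedged sketch in plain Python (the scores below are illustrative, not from a real run):

python
import statistics

from rhesis.sdk.metrics import NumericJudge

# Hypothetical scores from a recent evaluation run
recent_scores = [6.8, 7.2, 7.5, 6.9, 8.1, 7.4, 7.0]

# Anchoring at the median means roughly half of today's outputs pass
baseline = statistics.median(recent_scores)  # 7.2

metric = NumericJudge(
    name="response_quality",
    evaluation_prompt="Evaluate response quality",
    min_score=0.0,
    max_score=10.0,
    threshold=baseline
)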

Benefits

Numeric scoring provides granularity that lets you see small improvements over time. It offers flexibility to adjust thresholds as your system's quality improves. Scores are easily comparable across different tests, and you can track average scores over time to identify trends in performance.
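
Comparability across scales is easiest after normalization: a 4.0 on a 1-5 scale and a 75.0 on a 0-100 scale both sit at the same relative point in their ranges. A small plain-Python sketch (not SDK functionality):

python
def normalize(score: float, min_score: float, max_score: float) -> float:
    """Map a raw score onto 0-1 so scores from different scales compare."""
    return (score - min_score) / (max_score - min_score)

print(normalize(4.0, 1.0, 5.0))     # 0.75
print(normalize(75.0, 0.0, 100.0))  # 0.75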

Best Practices

  • Anchor scores: Define what each score level means (see the example below)
  • Avoid extremes: Reserve 0 and 10 for cases that truly warrant them
  • Review distributions: Check whether scores cluster or spread across the scale
  • Adjust thresholds: Raise the bar as quality improves
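
An anchored evaluation prompt spells out what each score band means, which keeps the judge's scoring consistent. An illustrative example (the rubric wording is an assumption, not taken from the SDK docs):

python
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="answer_quality",
    evaluation_prompt=(
        "Score the response from 0 to 10:\n"
        "0-2: Incorrect or off-topic\n"
        "3-5: Partially correct, missing key information\n"
        "6-8: Correct and mostly complete\n"
        "9-10: Correct, complete, and clearly written"
    ),
    min_score=0.0,
    max_score=10.0,
    threshold=7.0
)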
