Categorical Scoring

A metric scoring type that classifies responses into predefined categories such as excellent, good, fair, or poor.

Also known as: categorical score

Overview

Categorical scoring classifies responses into predefined categories, making evaluation results easy to interpret and act upon.

Common Category Sets

Quality Levels:

python
from rhesis.sdk.metrics import CategoricalJudge

# Responses labeled "excellent" or "good" count as passing.
metric = CategoricalJudge(
    name="quality_classifier",
    evaluation_prompt="Classify response quality",
    categories=["excellent", "good", "fair", "poor"],
    passing_categories=["excellent", "good"],
)

Safety Classifications:

python
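from rhesis.sdk.metrics import CategoricalJudge

# Only "safe" passes; "caution" and "unsafe" both count as failures.
metric = CategoricalJudge(
    name="safety_classifier",
    evaluation_prompt="Classify safety level",
    categories=["safe", "caution", "unsafe"],
    passing_categories=["safe"],
)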

Accuracy Tiers:

python
from rhesis.sdk.metrics import CategoricalJudge

# The top two tiers pass; partial or absent accuracy fails.
metric = CategoricalJudge(
    name="accuracy_classifier",
    evaluation_prompt="Classify accuracy level",
    categories=["accurate", "mostly_accurate", "partially_accurate", "inaccurate"],
    passing_categories=["accurate", "mostly_accurate"],
)

Using Categories

Categories should be clear, mutually exclusive, and cover all possible outcomes.

python
from rhesis.sdk.metrics import CategoricalJudge

# The prompt defines each category using the same names listed in `categories`.
metric = CategoricalJudge(
    name="tone_classifier",
    evaluation_prompt="""
    Classify the tone of the response:
    - professional: Formal, business-appropriate
    - casual: Friendly, conversational
    - technical: Precise, uses technical terms
    - inappropriate: Unprofessional or unsuitable
    """,
    categories=["professional", "casual", "technical", "inappropriate"],
    passing_categories=["professional", "technical"],
)
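
How passing_categories relates to a pass/fail outcome can be sketched in plain Python. This is an illustration, not the Rhesis implementation: it assumes a response passes when the label assigned by the judge appears in passing_categories, and the helper name is_passing is hypothetical. Normalizing case and whitespace is a defensive assumption in case a judge returns a label with different capitalization.

```python
def is_passing(label: str, passing_categories: list[str]) -> bool:
    """Return True when the normalized label is one of the passing categories."""
    normalized = label.strip().lower()
    return normalized in {c.strip().lower() for c in passing_categories}


passing = ["professional", "technical"]
print(is_passing("Professional", passing))
print(is_passing("casual", passing))
```

The first call prints True and the second prints False, matching the tone classifier's configuration above.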

Benefits

Categorical scoring provides interpretability through clear, meaningful classifications that anyone can understand. It is action-oriented: a label such as "unsafe" or "inaccurate" points directly at what needs fixing. Non-technical stakeholders can grasp categorical results more easily than numeric scores. The categories also give a natural way to segment results for grouping and analysis.
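
Segmentation by category can be illustrated with a short sketch. The labels below are made-up sample data, and summarize is a hypothetical helper that uses only the standard library; it does not reflect how the Rhesis platform aggregates results.

```python
from collections import Counter

# Hypothetical judge outputs: one category label per evaluated response.
labels = ["excellent", "good", "poor", "good", "fair", "excellent", "good"]
passing_categories = {"excellent", "good"}


def summarize(labels, passing_categories):
    """Count responses per category and compute the share that passed."""
    counts = Counter(labels)
    total = len(labels)
    passed = sum(n for cat, n in counts.items() if cat in passing_categories)
    pass_rate = passed / total if total else 0.0
    return counts, pass_rate


counts, pass_rate = summarize(labels, passing_categories)
for category, n in counts.most_common():
    print(f"{category}: {n}")
print(f"pass rate: {pass_rate:.0%}")
```

Grouping by label in this way shows where failures concentrate, which a single averaged numeric score tends to hide.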

Best Practices

Mutually exclusive: Each response fits exactly one category.
Exhaustive: The categories cover all possible response types.
Clear definitions: Document what each category means, ideally in the evaluation prompt itself.
Reasonable count: 3 to 5 categories is usually enough.
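
Some of these practices can be checked mechanically before a metric is created. The sketch below is a hypothetical helper (check_category_set is not part of the Rhesis SDK) that looks for duplicate names, an out-of-range category count, and passing categories missing from the category list. It cannot tell whether the categories are semantically exclusive or exhaustive; that still needs human review.

```python
def check_category_set(categories, passing_categories, min_count=3, max_count=5):
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    if len(set(categories)) != len(categories):
        problems.append("duplicate category names")
    if not min_count <= len(categories) <= max_count:
        problems.append(
            f"expected {min_count}-{max_count} categories, got {len(categories)}"
        )
    unknown = [c for c in passing_categories if c not in categories]
    if unknown:
        problems.append(f"passing categories not in category list: {unknown}")
    return problems


print(check_category_set(["safe", "caution", "unsafe"], ["safe"]))
print(check_category_set(["good", "good"], ["great"]))
```

The first call returns an empty list; the second reports all three kinds of problem.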

Documentation

/platform/metrics

Related Terms

Metric Score Configuration