Rescoring
Re-evaluating stored test outputs against metrics without re-invoking the endpoint, enabling cost-efficient experimentation with metric configurations.
Overview
Normally, evaluating a test requires two steps: (1) invoking your endpoint to generate a response, and (2) running metrics on that response. Rescoring skips the first step—it reuses the output that was already captured in a previous test run and runs only the metric evaluation.
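The two paths can be sketched in a few lines. This is a minimal, hypothetical illustration, not the Rhesis API: `invoke_endpoint` and `run_metric` are stand-ins for an expensive LLM call and a metric evaluation.

```python
def invoke_endpoint(prompt: str) -> str:
    """Stand-in for an expensive LLM endpoint call (step 1)."""
    return f"response to: {prompt}"

def run_metric(output: str) -> bool:
    """Stand-in metric (step 2): passes if the output is non-empty."""
    return len(output) > 0

def full_run(prompt: str):
    output = invoke_endpoint(prompt)   # step 1: generate a fresh response
    return output, run_metric(output)  # step 2: evaluate it

def rescore(stored_output: str) -> bool:
    # Rescoring skips step 1 entirely and evaluates the stored output.
    return run_metric(stored_output)

output, passed = full_run("Is the sky blue?")
assert rescore(output) == passed  # same verdict, no new endpoint call
```

The key point is that `rescore` never touches the endpoint, so changing or adding metrics costs only metric computation.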
When to Use Rescoring
Adding new metrics: You have an existing test run and want to evaluate it against a new metric without regenerating responses.
Changing metric configuration: You want to see how adjusting a metric's threshold or configuration affects pass/fail outcomes for outputs you already have.
Metric experimentation: You are exploring different evaluation strategies (different prompts, different judges, different scoring models) and want to compare results on the same set of outputs.
Cost optimization: LLM endpoint calls can be expensive. Rescoring lets you experiment with evaluation logic without paying for new generations.
How It Works
When you rescore a test run, Rhesis:
- Retrieves the stored outputs from the previous execution
- Runs the selected metrics against those outputs
- Produces new scores and a new pass/fail determination
- Stores the results alongside the original run for comparison
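The steps above can be sketched as a simple loop over stored results. All names here are illustrative assumptions, not Rhesis internals: `TestResult` stands in for a stored execution record, and metrics are plain callables.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    output: str  # captured during the original execution

def rescore_run(results, metrics):
    """Re-evaluate stored outputs against a set of metrics (hypothetical sketch)."""
    rescored = []
    for r in results:
        # Run each selected metric against the stored output.
        scores = {name: fn(r.output) for name, fn in metrics.items()}
        rescored.append({
            "test_id": r.test_id,
            "scores": scores,
            "passed": all(scores.values()),  # new pass/fail determination
        })
    # Returned results would be stored alongside the original run.
    return rescored

results = [TestResult("t1", "Paris is the capital of France.")]
metrics = {
    "non_empty": lambda o: len(o) > 0,
    "mentions_paris": lambda o: "Paris" in o,
}
print(rescore_run(results, metrics))
```

Because the original outputs are never modified, the same `results` can be rescored repeatedly under different metric sets and the outcomes compared side by side.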
Scoring Target
The "Scoring Target" option in the test execution UI controls whether Rhesis:
- Invokes the endpoint: Sends inputs to your endpoint to generate fresh outputs, then evaluates them
- Uses stored outputs: Skips endpoint invocation and evaluates the outputs from a previous run
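Conceptually, the Scoring Target acts as a switch between the two modes. The sketch below uses invented names (`execute`, `call_endpoint`, `evaluate`) to mirror that choice; it is not the actual Rhesis interface.

```python
def call_endpoint(prompt: str) -> str:
    """Stand-in for a live endpoint invocation."""
    return f"fresh answer to {prompt}"

def evaluate(output: str) -> bool:
    """Stand-in metric evaluation."""
    return len(output) > 0

def execute(tests, scoring_target="invoke_endpoint", stored_outputs=None):
    # Hypothetical mirror of the "Scoring Target" option.
    if scoring_target == "use_stored_outputs":
        outputs = {t: stored_outputs[t] for t in tests}  # reuse prior outputs
    else:
        outputs = {t: call_endpoint(t) for t in tests}   # generate fresh outputs
    return {t: evaluate(o) for t, o in outputs.items()}

fresh = execute(["q1"])  # invokes the (stand-in) endpoint
rescored = execute(["q1"], scoring_target="use_stored_outputs",
                   stored_outputs={"q1": "previous answer"})
```

In both modes the evaluation step is identical; only the source of the outputs differs.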
Limitations
Rescoring applies only to metrics that evaluate the output itself (correctness, tone, safety, etc.). Metrics that require fresh endpoint data, such as latency, cannot be meaningfully rescored from stored outputs.
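One way to enforce this distinction is to flag each metric as rescorable so a rescore can skip metrics that need a live invocation. The flag name and structure below are assumptions for illustration, not a Rhesis configuration format.

```python
# Hypothetical metric registry: latency needs a fresh endpoint call,
# so it is marked as not rescorable.
METRICS = {
    "correctness": {"rescorable": True},
    "tone":        {"rescorable": True},
    "latency_ms":  {"rescorable": False},  # requires live endpoint data
}

def rescorable_metrics(metrics):
    """Return the metric names that can run against stored outputs."""
    return [name for name, meta in metrics.items() if meta["rescorable"]]

print(rescorable_metrics(METRICS))  # -> ['correctness', 'tone']
```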
Best Practices
- Use rescoring to evaluate the impact of metric threshold changes before applying them to all future runs
- Keep original test runs intact when rescoring so you can compare results across different metric configurations
- Avoid rescoring latency-based metrics, as meaningful latency data requires fresh endpoint invocations
- Use rescoring to onboard new metrics incrementally without re-running expensive endpoint calls