Rescoring
Re-evaluating stored test outputs against metrics without re-invoking the endpoint, enabling cost-efficient experimentation with metric configurations.
Overview
Normally, evaluating a test requires two steps: (1) invoking your endpoint to generate a response, and (2) running metrics on that response. Rescoring skips the first step—it reuses the output that was already captured in a previous test run and runs only the metric evaluation.
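The two paths can be sketched in a few lines. This is a minimal, hypothetical illustration, not the Rhesis API: `invoke_endpoint` and `run_metric` are stand-ins for an expensive LLM call and a metric evaluation.

```python
def invoke_endpoint(prompt: str) -> str:
    """Stand-in for an expensive LLM endpoint call (step 1)."""
    return f"response to: {prompt}"

def run_metric(output: str) -> bool:
    """Stand-in metric (step 2): passes if the output is non-empty."""
    return len(output) > 0

def full_run(prompt: str):
    output = invoke_endpoint(prompt)   # step 1: generate a fresh response
    return output, run_metric(output)  # step 2: evaluate it

def rescore(stored_output: str) -> bool:
    # Rescoring skips step 1 entirely and evaluates the stored output.
    return run_metric(stored_output)

output, passed = full_run("Is the sky blue?")
assert rescore(output) == passed  # same verdict, no new endpoint call
```

The key point is that `rescore` never touches the endpoint, so changing or adding metrics costs only metric computation.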
When to Use Rescoring
Adding new metrics: You have an existing test run and want to evaluate it against a new metric without regenerating responses.
Changing metric configuration: You want to see how adjusting a metric's threshold or configuration affects pass/fail outcomes for outputs you already have.
Metric experimentation: You are exploring different evaluation strategies (different prompts, different judges, different scoring models) and want to compare results on the same set of outputs.
Cost optimization: LLM endpoint calls can be expensive. Rescoring lets you experiment with evaluation logic without paying for new generations.
How It Works
When you rescore a test run, Rhesis:
- Retrieves the stored outputs from the previous execution
- Runs the selected metrics against those outputs
- Produces new scores and a new pass/fail determination
- Stores the results alongside the original run for comparison
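The steps above can be sketched as a simple loop over stored results. All names here are illustrative assumptions, not Rhesis internals: `TestResult` stands in for a stored execution record, and metrics are plain callables.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    output: str  # captured during the original execution

def rescore_run(results, metrics):
    """Re-evaluate stored outputs against a set of metrics (hypothetical sketch)."""
    rescored = []
    for r in results:
        # Run each selected metric against the stored output.
        scores = {name: fn(r.output) for name, fn in metrics.items()}
        rescored.append({
            "test_id": r.test_id,
            "scores": scores,
            "passed": all(scores.values()),  # new pass/fail determination
        })
    # Returned results would be stored alongside the original run.
    return rescored

results = [TestResult("t1", "Paris is the capital of France.")]
metrics = {
    "non_empty": lambda o: len(o) > 0,
    "mentions_paris": lambda o: "Paris" in o,
}
print(rescore_run(results, metrics))
```

Because the original outputs are never modified, the same `results` can be rescored repeatedly under different metric sets and the outcomes compared side by side.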
Scoring Target
The "Scoring Target" option in the test execution UI controls whether Rhesis:
- Invokes the endpoint: Sends inputs to your endpoint to generate fresh outputs, then evaluates them
- Uses stored outputs: Skips endpoint invocation and evaluates the outputs from a previous run
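Conceptually, the Scoring Target acts as a switch between the two modes. The sketch below uses invented names (`execute`, `call_endpoint`, `evaluate`) to mirror that choice; it is not the actual Rhesis interface.

```python
def call_endpoint(prompt: str) -> str:
    """Stand-in for a live endpoint invocation."""
    return f"fresh answer to {prompt}"

def evaluate(output: str) -> bool:
    """Stand-in metric evaluation."""
    return len(output) > 0

def execute(tests, scoring_target="invoke_endpoint", stored_outputs=None):
    # Hypothetical mirror of the "Scoring Target" option.
    if scoring_target == "use_stored_outputs":
        outputs = {t: stored_outputs[t] for t in tests}  # reuse prior outputs
    else:
        outputs = {t: call_endpoint(t) for t in tests}   # generate fresh outputs
    return {t: evaluate(o) for t, o in outputs.items()}

fresh = execute(["q1"])  # invokes the (stand-in) endpoint
rescored = execute(["q1"], scoring_target="use_stored_outputs",
                   stored_outputs={"q1": "previous answer"})
```

In both modes the evaluation step is identical; only the source of the outputs differs.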
Limitations
Rescoring applies only to metrics that evaluate the output itself (correctness, tone, safety, etc.). Metrics that require fresh endpoint data, such as latency, cannot be meaningfully rescored from stored outputs.
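One way to enforce this distinction is to flag each metric as rescorable so a rescore can skip metrics that need a live invocation. The flag name and structure below are assumptions for illustration, not a Rhesis configuration format.

```python
# Hypothetical metric registry: latency needs a fresh endpoint call,
# so it is marked as not rescorable.
METRICS = {
    "correctness": {"rescorable": True},
    "tone":        {"rescorable": True},
    "latency_ms":  {"rescorable": False},  # requires live endpoint data
}

def rescorable_metrics(metrics):
    """Return the metric names that can run against stored outputs."""
    return [name for name, meta in metrics.items() if meta["rescorable"]]

print(rescorable_metrics(METRICS))  # -> ['correctness', 'tone']
```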
Best Practices
- Use rescoring to evaluate the impact of metric threshold changes before applying them to all future runs
- Keep original test runs intact when rescoring so you can compare results across different metric configurations
- Avoid rescoring latency-based metrics, as meaningful latency data requires fresh endpoint invocations
- Use rescoring to onboard new metrics incrementally without re-running expensive endpoint calls