Test Runs
View execution results when you run a test set against an endpoint. Each test run contains all individual test results with detailed metrics, conversation history, and performance data.
What are Test Runs?
A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.
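Conceptually, a test run bundles per-test results with run-level execution metadata. As a rough sketch (the field names below are illustrative, not the actual Rhesis schema):

```python
# Illustrative shape of a test run; field names are hypothetical,
# not the actual Rhesis schema.
test_run = {
    "id": "run-001",
    "test_set": "checkout-flow",
    "endpoint": "prod-chatbot",
    "executed_at": "2025-01-15T10:30:00Z",
    "results": [
        {
            "test": "refund-policy-question",
            "status": "pass",                         # overall decision
            "metrics": {"Answer Relevancy": "pass"},  # evaluation metrics
            "conversation": [{"role": "user", "content": "..."}],
        },
    ],
}
```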

Clicking on a test run opens an overview of how the run went. From here you can:
- Review: Manually validate or override the automated evaluation
- Compare: Compare against a baseline test run
- Re-run: Execute the test set again with the same configuration
- Download: Export test run data as CSV
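Because Download produces a plain CSV, exported runs are easy to post-process. For example, a quick pass/fail summary using only the standard library (the status column name is an assumption; check your export's header row):

```python
import csv
from collections import Counter

# Summarize pass/fail counts from an exported test run CSV.
# "status" is an assumed column name; adjust to your export's header.
with open("test_run_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["status"] for row in rows)
print(f"{counts['pass']} passed, {counts['fail']} failed of {len(rows)} tests")
```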

Test Run Detail View
The test run detail page includes an expanded Overview and Conversation workflow for debugging failures faster:
- Go to Test: Open the original test definition directly from a result detail
- Context panel: Expand/collapse captured context items when available
- Metadata panel: Expand/collapse structured metadata for the result
- Files / Output Files panels: Inspect and download both input attachments and generated output files
- JSON-aware rendering: Response content that is valid JSON is pretty-printed in the overview tab
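The JSON-aware rendering amounts to "pretty-print if it parses, otherwise show the raw text". A minimal sketch of that behavior:

```python
import json

def render_response(content: str) -> str:
    """Pretty-print content if it is valid JSON; otherwise return it as-is."""
    try:
        return json.dumps(json.loads(content), indent=2)
    except (json.JSONDecodeError, TypeError):
        return content

print(render_response('{"answer": "42", "sources": []}'))
```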
For multi-turn tests, the Conversation tab supports trace-driven debugging:
- Click assistant responses to open the trace drawer
- Navigate turn-by-turn while keeping test result context visible
Multi-turn Result Details (v0.6.7+)
For multi-turn tests, the run detail includes conversation-specific execution data:
- Turns used: How many turns were consumed during execution
- Configured max turns: The max turn budget from test configuration
- Goal achieved flag: Whether the multi-turn goal was met
- Conversation-linked trace access: Open the trace drawer to inspect tree, sequence, graph, and conversation views for the related trace
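As a mental model, the conversation-specific fields amount to something like the following (field names are illustrative, not the actual schema):

```python
# Illustrative multi-turn execution data; field names are hypothetical.
multi_turn_result = {
    "turns_used": 4,          # turns consumed during execution
    "max_turns": 10,          # configured turn budget
    "goal_achieved": True,    # whether the multi-turn goal was met
    "trace_id": "trace-abc",  # links to tree/sequence/graph/conversation views
}

# A simple derived check: did the run finish within its turn budget?
within_budget = multi_turn_result["turns_used"] <= multi_turn_result["max_turns"]
```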
Review Test Runs
Reviews allow human evaluators to validate or override automated test evaluations.
- Select a Test Run from the Test Runs overview page
- Click on the Reviews tab, then Add Your Review
- Select a Review Status:
  - Pass: The test passes the metric
  - Fail: The test fails the metric
- Add a comment explaining your review decision
Review targets and override behavior (v0.6.9+)
Reviews can now target different parts of a test result, not only the overall result.
| Target type | What it reviews | Reference format |
|---|---|---|
| test_result | Overall pass/fail decision for the test result | null |
| metric | A specific metric result in test_metrics | Metric name |
| turn | A specific turn in multi-turn conversation output | Turn label such as Turn 3 |
When a review targets a metric or turn, Rhesis stores an override marker on the affected object and recomputes effective status in the test run views.
In review comments, type @ to mention metrics or turns. The first metric or turn mention determines the review target automatically.
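Putting the table and the @-mention behavior together, review records for each target type might look like this sketch (the shapes are illustrative, not the actual API):

```python
# Illustrative review records for each target type; shapes are hypothetical.
reviews = [
    {"target": "test_result", "reference": None, "status": "fail",
     "comment": "Response contradicts the policy document."},
    {"target": "metric", "reference": "relevance", "status": "pass",
     "comment": "@relevance is fine here; the metric was too strict."},
    {"target": "turn", "reference": "Turn 3", "status": "fail",
     "comment": "@Turn 3 leaks internal instructions."},
]

# Because the first @ mention determines the target, the "metric" and
# "turn" reviews above could be created from the comment text alone.
```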
Compare Test Runs
Compare test runs to identify regressions and improvements between executions.
How to Compare:
- Click the Compare button (top right, next to Download)
- Select a baseline test run to compare against
- View the test-by-test comparison of the test set
Comparison Filters:
Use filters to focus on specific changes:
- All Tests: Show all tests from both runs
- Improved: Tests that now pass but failed in baseline
- Regressed: Tests that now fail but passed in baseline
- Unchanged: Tests with the same pass/fail status
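The filters correspond to a simple classification over the two runs' pass/fail outcomes, along the lines of:

```python
def classify(baseline: dict[str, bool], current: dict[str, bool]) -> dict[str, str]:
    """Classify each test as improved, regressed, or unchanged.

    Both arguments map test names to True (pass) / False (fail).
    """
    categories = {}
    for test in baseline.keys() & current.keys():
        before, after = baseline[test], current[test]
        if after and not before:
            categories[test] = "improved"
        elif before and not after:
            categories[test] = "regressed"
        else:
            categories[test] = "unchanged"
    return categories

print(classify({"t1": False, "t2": True}, {"t1": True, "t2": False}))
# e.g. {'t1': 'improved', 't2': 'regressed'}
```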
Test Run Metrics
When a test run executes, Rhesis evaluates each test against a set of metrics. The metrics used follow a priority hierarchy that allows flexibility at different levels.
Metrics Resolution Hierarchy
Rhesis resolves which metrics to use based on the following priority order:
| Priority | Source | Description |
|---|---|---|
| 1 (Highest) | Execution-time metrics | Metrics specified when triggering the test run. These completely override all other metric configurations. |
| 2 | Test set metrics | Metrics configured on the test set entity. Override behavior-level defaults. |
| 3 (Lowest) | Behavior metrics | Default metrics defined at the behavior level for each test. |
Execution-time metrics completely override other levels. There is no merging of metrics between levels.
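Because the hierarchy is a pure override with no merging, resolution reduces to "first non-empty level wins":

```python
def resolve_metrics(execution_time, test_set, behavior):
    """Return the metrics to use: first non-empty level wins, no merging."""
    for level in (execution_time, test_set, behavior):
        if level:  # an empty/None level falls through to the next
            return level
    return []

# Execution-time metrics completely override the other levels:
print(resolve_metrics(["Answer Relevancy"], ["Detector"], ["Faithfulness"]))
# ['Answer Relevancy']
```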
Specifying Execution-time Metrics
When executing a test set from the UI:
- Open the Execute Test Set drawer from a test set detail page
- In the Test Run Metrics section, select “Define metrics for this execution”
- Click Add Metric to select which metrics to use
- Only metrics applicable to your test set type (Single-Turn or Multi-Turn) will be shown
Execution-time metrics only apply to that specific test run. They are not saved to the test set configuration.
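If you trigger runs programmatically, the same override can be expressed in the trigger payload. The endpoint and field names below are purely hypothetical, shown only to illustrate where execution-time metrics would fit; consult the Rhesis API reference for the real interface:

```python
import requests

# Hypothetical trigger payload; the endpoint and field names are NOT the
# real Rhesis API, they only illustrate where execution-time metrics fit.
payload = {
    "test_set": "checkout-flow",
    "endpoint": "prod-chatbot",
    "metrics": ["Answer Relevancy", "Faithfulness"],  # execution-time override
}
response = requests.post("https://rhesis.example.com/api/test-runs", json=payload)
response.raise_for_status()
```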
When to Use Each Level
- Behavior metrics: Set default evaluation criteria for all tests with a specific behavior. This is the most common configuration.
- Test set metrics: Override defaults for specialized test sets, such as Garak security tests that use detector metrics.
- Execution-time metrics: Quick validation with specific metrics without modifying test set configuration. Useful for one-off experiments or A/B testing different evaluation criteria.
Next Steps
- Review failed tests to understand issues
- Compare against baseline runs to detect regressions
- Add human reviews to test results
- Export results for reporting or analysis