Test Runs
View execution results when you run a test set against an endpoint. Each test run contains all individual test results with detailed metrics, conversation history, and performance data.
What are Test Runs?
A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.
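Conceptually, a test run bundles per-test results with run-level execution metadata. As a rough sketch (the field names below are illustrative, not the actual Rhesis schema):

```python
# Illustrative shape of a test run; field names are hypothetical,
# not the actual Rhesis schema.
test_run = {
    "id": "run-001",
    "test_set": "checkout-flow",
    "endpoint": "prod-chatbot",
    "executed_at": "2025-01-15T10:30:00Z",
    "results": [
        {
            "test": "refund-policy-question",
            "status": "pass",                         # overall decision
            "metrics": {"Answer Relevancy": "pass"},  # evaluation metrics
            "conversation": [{"role": "user", "content": "..."}],
        },
    ],
}
```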

Clicking on a test run opens an overview of how the run went. From here you can:
- Review: Manually validate or override the automated evaluation
- Compare: Compare against a baseline test run
- Re-run: Execute the test set again with the same configuration
- Download: Export test run data as CSV
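Because Download produces a plain CSV, exported runs are easy to post-process. For example, a quick pass/fail summary using only the standard library (the status column name is an assumption; check your export's header row):

```python
import csv
from collections import Counter

# Summarize pass/fail counts from an exported test run CSV.
# "status" is an assumed column name; adjust to your export's header.
with open("test_run_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["status"] for row in rows)
print(f"{counts['pass']} passed, {counts['fail']} failed of {len(rows)} tests")
```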

Test Run Detail View
The test run detail page includes an expanded Overview and Conversation workflow for debugging failures faster:
- Go to Test: Open the original test definition directly from a result detail
- Context panel: Expand/collapse captured context items when available
- Metadata panel: Expand/collapse structured metadata for the result
- Files / Output Files panels: Inspect and download both input attachments and generated output files
- JSON-aware rendering: Response content that is valid JSON is pretty-printed in the overview tab
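The JSON-aware rendering amounts to "pretty-print if it parses, otherwise show the raw text". A minimal sketch of that behavior:

```python
import json

def render_response(content: str) -> str:
    """Pretty-print content if it is valid JSON; otherwise return it as-is."""
    try:
        return json.dumps(json.loads(content), indent=2)
    except (json.JSONDecodeError, TypeError):
        return content

print(render_response('{"answer": "42", "sources": []}'))
```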
For multi-turn tests, the Conversation tab supports trace-driven debugging:
- Click assistant responses to open the trace drawer
- Navigate turn-by-turn while keeping test result context visible
Multi-turn Result Details (v0.6.7+)
For multi-turn tests, the run detail includes conversation-specific execution data:
- Turns used: How many turns were consumed during execution
- Configured max turns: The max turn budget from test configuration
- Goal achieved flag: Whether the multi-turn goal was met
- Conversation-linked trace access: Open the trace drawer to inspect tree, sequence, graph, and conversation views for the related trace
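As a mental model, the conversation-specific fields amount to something like the following (field names are illustrative, not the actual schema):

```python
# Illustrative multi-turn execution data; field names are hypothetical.
multi_turn_result = {
    "turns_used": 4,          # turns consumed during execution
    "max_turns": 10,          # configured turn budget
    "goal_achieved": True,    # whether the multi-turn goal was met
    "trace_id": "trace-abc",  # links to tree/sequence/graph/conversation views
}

# A simple derived check: did the run finish within its turn budget?
within_budget = multi_turn_result["turns_used"] <= multi_turn_result["max_turns"]
```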
Review Test Runs
Reviews allow human evaluators to validate or override automated test evaluations.
- Select a Test Run from the Test Runs overview page
- Click on the Reviews tab, then Add Your Review
- Select a Review Status:
  - Pass: The test passes the metric
  - Fail: The test fails the metric
- Add a comment explaining your review decision
Review targets and override behavior (v0.6.9+)
Reviews can now target different parts of a test result, not only the overall result.
| Target type | What it reviews | Reference format |
|---|---|---|
| test_result | Overall pass/fail decision for the test result | null |
| metric | A specific metric result in test_metrics | Metric name |
| turn | A specific turn in multi-turn conversation output | Turn label such as Turn 3 |
When a review targets a metric or turn, Rhesis stores an override marker on the affected object and recomputes effective status in the test run views.
In review comments, type @ to mention metrics or turns. The first metric or turn mention determines the review target automatically.
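Putting the table and the @-mention behavior together, review records for each target type might look like this sketch (the shapes are illustrative, not the actual API):

```python
# Illustrative review records for each target type; shapes are hypothetical.
reviews = [
    {"target": "test_result", "reference": None, "status": "fail",
     "comment": "Response contradicts the policy document."},
    {"target": "metric", "reference": "relevance", "status": "pass",
     "comment": "@relevance is fine here; the metric was too strict."},
    {"target": "turn", "reference": "Turn 3", "status": "fail",
     "comment": "@Turn 3 leaks internal instructions."},
]

# Because the first @ mention determines the target, the "metric" and
# "turn" reviews above could be created from the comment text alone.
```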
Compare Test Runs
Compare test runs to identify regressions and improvements between executions.
How to Compare:
- Click the Compare button (top right, next to Download)
- Select a baseline test run to compare against
- View the test-by-test comparison of the test set
Comparison Filters:
Use filters to focus on specific changes:
- All Tests: Show all tests from both runs
- Improved: Tests that now pass but failed in baseline
- Regressed: Tests that now fail but passed in baseline
- Unchanged: Tests with the same pass/fail status
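The filters correspond to a simple classification over the two runs' pass/fail outcomes, along the lines of:

```python
def classify(baseline: dict[str, bool], current: dict[str, bool]) -> dict[str, str]:
    """Classify each test as improved, regressed, or unchanged.

    Both arguments map test names to True (pass) / False (fail).
    """
    categories = {}
    for test in baseline.keys() & current.keys():
        before, after = baseline[test], current[test]
        if after and not before:
            categories[test] = "improved"
        elif before and not after:
            categories[test] = "regressed"
        else:
            categories[test] = "unchanged"
    return categories

print(classify({"t1": False, "t2": True}, {"t1": True, "t2": False}))
# e.g. {'t1': 'improved', 't2': 'regressed'}
```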
Test Run Metrics
When a test run executes, Rhesis evaluates each test against a set of metrics. The metrics used follow a priority hierarchy that allows flexibility at different levels.
Metrics Resolution Hierarchy
Rhesis resolves which metrics to use based on the following priority order:
| Priority | Source | Description |
|---|---|---|
| 1 (Highest) | Execution-time metrics | Metrics specified when triggering the test run. These completely override all other metric configurations. |
| 2 | Test set metrics | Metrics configured on the test set entity. Override behavior-level defaults. |
| 3 (Lowest) | Behavior metrics | Default metrics defined at the behavior level for each test. |
Execution-time metrics completely override other levels. There is no merging of metrics between levels.
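Because the hierarchy is a pure override with no merging, resolution reduces to "first non-empty level wins":

```python
def resolve_metrics(execution_time, test_set, behavior):
    """Return the metrics to use: first non-empty level wins, no merging."""
    for level in (execution_time, test_set, behavior):
        if level:  # an empty/None level falls through to the next
            return level
    return []

# Execution-time metrics completely override the other levels:
print(resolve_metrics(["Answer Relevancy"], ["Detector"], ["Faithfulness"]))
# ['Answer Relevancy']
```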
Specifying Execution-time Metrics
When executing a test set from the UI:
- Open the Execute Test Set drawer from a test set detail page
- In the Test Run Metrics section, select “Define metrics for this execution”
- Click Add Metric to select which metrics to use
- Only metrics applicable to your test set type (Single-Turn or Multi-Turn) will be shown
Execution-time metrics only apply to that specific test run. They are not saved to the test set configuration.
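If you trigger runs programmatically, the same override can be expressed in the trigger payload. The endpoint and field names below are purely hypothetical, shown only to illustrate where execution-time metrics would fit; consult the Rhesis API reference for the real interface:

```python
import requests

# Hypothetical trigger payload; the endpoint and field names are NOT the
# real Rhesis API, they only illustrate where execution-time metrics fit.
payload = {
    "test_set": "checkout-flow",
    "endpoint": "prod-chatbot",
    "metrics": ["Answer Relevancy", "Faithfulness"],  # execution-time override
}
response = requests.post("https://rhesis.example.com/api/test-runs", json=payload)
response.raise_for_status()
```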
When to Use Each Level
- Behavior metrics: Set default evaluation criteria for all tests with a specific behavior. This is the most common configuration.
- Test set metrics: Override defaults for specialized test sets, such as Garak security tests that use detector metrics.
- Execution-time metrics: Quick validation with specific metrics without modifying test set configuration. Useful for one-off experiments or A/B testing different evaluation criteria.
Next Steps
- Review failed tests to understand issues
- Compare against baseline runs to detect regressions
- Add human reviews to test results
- Export results for reporting or analysis