Test Runs & Results

Test runs track the execution of tests against endpoints. Each run produces test results containing the endpoint’s response, evaluation metrics, and review status.

TestRun

A TestRun represents a batch execution of tests. When you execute a test set against an endpoint, a test run is created to track the execution.

Properties

| Property | Type | Description |
| --- | --- | --- |
| id | str | Unique identifier |
| name | str | Display name |
| test_configuration_id | str | Associated test configuration |
| status_id | str | Current execution status |
| user_id | str | User who initiated the run |
| organization_id | str | Organization ID |
| owner_id | str | Owner of the test run |
| assignee_id | str | Assigned reviewer |
| attributes | dict | Custom attributes |
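
These properties are exposed as attributes on a fetched run, as the examples below show for name and status_id. A minimal sketch that reads a few of the others (the ID is a placeholder, and the extra fields are assumed to follow the same attribute-access pattern):

inspect_run.py
from rhesis.sdk.entities import TestRuns

run = TestRuns.pull(id="run-123")  # placeholder ID

# Core identifiers and ownership
print(f"Run: {run.name} ({run.id})")
print(f"Status: {run.status_id}")
print(f"Owner: {run.owner_id}, Assignee: {run.assignee_id}")

# Custom attributes attached to the run, if any
if run.attributes:
    for key, value in run.attributes.items():
        print(f"  {key}: {value}")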

Fetching Test Runs

fetch_test_runs.py
from rhesis.sdk.entities import TestRuns

# List all test runs
for run in TestRuns.all():
    print(f"{run.name}: {run.status_id}")

# Get by ID
run = TestRuns.pull(id="run-123")

# Filter by status
completed_runs = TestRuns.all(filter="status_id eq 'completed'")
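
Any property from the table above can appear in a filter. For example, to list the runs a particular user started (the ID is a placeholder):

filter_by_user.py
from rhesis.sdk.entities import TestRuns

# Runs started by a specific user
my_runs = TestRuns.all(filter="user_id eq 'user-123'")

for run in my_runs:
    print(f"{run.name}: {run.status_id}")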

Getting Test Results

Retrieve all results for a test run:

get_results.py
from rhesis.sdk.entities import TestRuns

run = TestRuns.pull(id="run-123")
results = run.get_test_results()

for result in results:
    print(f"Test: {result['test_id']}")
    print(f"Output: {result['test_output']}")
    print(f"Metrics: {result['test_metrics']}")

TestResult

A TestResult contains the output and evaluation for a single test execution.

Properties

| Property | Type | Description |
| --- | --- | --- |
| id | str | Unique identifier |
| test_run_id | str | Parent test run |
| test_id | str | Executed test |
| prompt_id | str | Test prompt |
| status_id | str | Result status |
| status | Status | Status object with name/description |
| test_output | dict | Endpoint response data |
| test_metrics | dict | Evaluation metric scores |
| test_reviews | dict | Human review data |
| test_configuration_id | str | Test configuration used |

Fetching Test Results

fetch_results.py
from rhesis.sdk.entities import TestResults

# Get all results
all_results = TestResults.all()

# Get by ID
result = TestResults.pull(id="result-123")

# Filter by test run
run_results = TestResults.all(filter="test_run_id eq 'run-123'")

# Filter by status
failed_results = TestResults.all(filter="status_id eq 'failed'")
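
The two filters above can be combined to review only the failures from a single run. The combined clause below assumes the OData-style syntax supports joining conditions with and:

review_failures.py
from rhesis.sdk.entities import TestResults

# Failed results from one run (assumes 'and' can join filter clauses)
failed_in_run = TestResults.all(
    filter="test_run_id eq 'run-123' and status_id eq 'failed'"
)

for result in failed_in_run:
    print(f"Test {result.test_id} failed")
    if result.test_output:
        print(f"  Output: {result.test_output.get('output')}")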

Working with Results

analyze_results.py
from rhesis.sdk.entities import TestResults

result = TestResults.pull(id="result-123")

# Access output
if result.test_output:
    print(f"Response: {result.test_output.get('output')}")
    print(f"Session: {result.test_output.get('session_id')}")

# Access metrics
if result.test_metrics:
    for metric_name, score in result.test_metrics.items():
        print(f"{metric_name}: {score}")

# Check status
if result.status:
    print(f"Status: {result.status.name}")

TestConfiguration

A TestConfiguration defines the settings for test execution, linking test sets to endpoints with specific parameters.

Properties

| Property | Type | Description |
| --- | --- | --- |
| id | str | Unique identifier |
| endpoint_id | str | Target endpoint (required) |
| test_set_id | str | Test set to execute |
| category_id | str | Filter by category |
| topic_id | str | Filter by topic |
| prompt_id | str | Specific prompt |
| use_case_id | str | Associated use case |
| status_id | str | Configuration status |
| attributes | dict | Custom settings |

Creating Test Configurations

create_config.py
from rhesis.sdk.entities import TestConfiguration

config = TestConfiguration(
    endpoint_id="endpoint-123",
    test_set_id="test-set-456",
    attributes={
        "timeout": 30,
        "retry_count": 3,
    }
)

config.push()
print(f"Created configuration: {config.id}")

Getting Test Runs for a Configuration

config_runs.py
from rhesis.sdk.entities import TestConfigurations

config = TestConfigurations.pull(id="config-123")
runs = config.get_test_runs()

for run in runs:
    print(f"Run: {run['id']} - Status: {run['status_id']}")

Complete Workflow Example

complete_workflow.py
from rhesis.sdk.entities import TestSets, Endpoints, TestRuns, TestResults

# 1. Get test set and endpoint
test_set = TestSets.pull(name="Safety Evaluation")
endpoint = Endpoints.pull(name="Production Chatbot")

# 2. Execute tests (creates a test run)
execution = test_set.execute(endpoint)
print(f"Execution started: {execution}")

# 3. Later: fetch the test run
# (In practice, you'd get the run ID from the execution response or webhook)
runs = TestRuns.all(filter="contains(name, 'Safety')")
latest_run = runs[0] if runs else None

if latest_run:
    # 4. Get results
    results = latest_run.get_test_results()
    
    # 5. Analyze results
    passed = sum(1 for r in results if r.get('status_id') == 'passed')
    failed = sum(1 for r in results if r.get('status_id') == 'failed')
    
    print(f"Passed: {passed}, Failed: {failed}")

Next Steps

Create Test Sets for evaluation.