Test Execution

The SDK provides methods on TestSet for executing tests against endpoints, re-scoring existing outputs with different metrics, and managing which metrics are assigned to a test set.

For background on execution concepts (metrics hierarchy, execution modes, output reuse), see the Platform Test Execution guide.

Executing Test Sets

Use execute() to run every test in a set against an endpoint. The endpoint is called for each test and the responses are scored against the configured metrics.

execute_basic.py
from rhesis.sdk.entities import TestSets, Endpoints

test_set = TestSets.pull(name="Safety Evaluation")
endpoint = Endpoints.pull(name="Production Chatbot")

# Execute with default settings (parallel mode)
result = test_set.execute(endpoint)
print(f"Execution submitted: {result}")

Execution Mode

Control whether tests are sent to the endpoint in parallel or one at a time. Pass either an ExecutionMode enum member or one of the strings "parallel" / "sequential".

execute_mode.py
from rhesis.sdk import ExecutionMode

# Parallel (default) — fast, best for stateless endpoints
result = test_set.execute(endpoint)

# Sequential — one test at a time, useful for rate-limited APIs
result = test_set.execute(endpoint, mode=ExecutionMode.SEQUENTIAL)
result = test_set.execute(endpoint, mode="sequential")  # string also valid

  • ExecutionMode.PARALLEL (default): Tests are dispatched concurrently for maximum throughput.
  • ExecutionMode.SEQUENTIAL: Tests run one after another. Use this for endpoints with strict rate limits or when order matters.

Invalid mode values raise ValueError.
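
To illustrate the validation rule above, here is a minimal sketch of how a mode argument could be normalized before use. The local ExecutionMode enum and the normalize_mode helper are stand-ins written for this example, not the SDK's implementation. Only the two mode values, the parallel default, and the ValueError behavior come from this page.

```python
from enum import Enum
from typing import Union


class ExecutionMode(str, Enum):
    """Local stand-in for rhesis.sdk.ExecutionMode (assumed shape)."""

    PARALLEL = "parallel"
    SEQUENTIAL = "sequential"


def normalize_mode(mode: Union[ExecutionMode, str, None] = None) -> ExecutionMode:
    """Hypothetical helper: return a valid ExecutionMode, defaulting to PARALLEL."""
    if mode is None:
        return ExecutionMode.PARALLEL
    if isinstance(mode, ExecutionMode):
        return mode
    try:
        # Enum lookup by value accepts "parallel" and "sequential" only.
        return ExecutionMode(mode)
    except ValueError:
        allowed = ", ".join(m.value for m in ExecutionMode)
        raise ValueError(f"Invalid mode {mode!r}; expected one of: {allowed}") from None
```

Validating the mode up front like this can be useful when the value comes from a config file or CLI flag, so that a typo is reported before any test is dispatched.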

Experiment Parameters

Pass an Experiment object to execute() to run the test set with that experiment's parameter values. You can also provide inline parameters, which are committed as a new experiment version before the run starts:

execute_experiment.py
from rhesis.sdk.entities import Experiment, TestSets, Endpoints

test_set = TestSets.pull(name="Safety Evaluation")
endpoint = Endpoints.pull(name="Production Chatbot")
# pid is assumed to hold the ID of an existing project
exp = Experiment.publish(
    name="tuning-v3", project_id=pid,
    values={"model": "gpt-4o", "temperature": 0.7},
)

# Execute with the experiment's latest version
result = test_set.execute(endpoint, experiment=exp)

# Inline parameters — auto-commits, then executes
result = test_set.execute(
    endpoint, experiment=exp, parameters={"temperature": 0.9}
)

# Or pass a raw experiment_id (resolves to the latest version automatically)
result = test_set.execute(endpoint, experiment_id="<uuid>")

Calling exp.run(test_set, endpoint) gives the same result from the experiment side.

Custom Metrics

Pass a metrics list to override the test set-level and behavior-level metrics for a single execution. Each item can be a dict (with at least an "id" key) or a metric name string (resolved automatically via the /metrics API).

execute_metrics.py
from rhesis.sdk import ExecutionMode

# Override metrics for this run only
result = test_set.execute(endpoint, metrics=[
    {"id": "abc-123", "name": "Accuracy"},
    "Toxicity",  # resolved by name
])

# Combine with sequential mode
result = test_set.execute(
    endpoint,
    mode=ExecutionMode.SEQUENTIAL,
    metrics=["Accuracy", "Toxicity"],
)

Execution-time metrics take the highest priority. When provided, they replace both the test set-level and behavior-level metrics for that run.
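
The precedence rule above can be sketched as a small selection function. This is an illustration only: effective_metrics is a hypothetical helper, not part of the SDK, and the ordering between test set-level and behavior-level metrics is an assumption made here, since this page only states that execution-time metrics replace both.

```python
from typing import List, Optional, Union

# A metric reference is either a name string or a dict with an "id" key.
MetricRef = Union[str, dict]


def effective_metrics(
    execution: Optional[List[MetricRef]],
    test_set: Optional[List[MetricRef]],
    behavior: Optional[List[MetricRef]],
) -> List[MetricRef]:
    """Hypothetical helper: pick the metric list that applies to one run."""
    # Execution-time metrics win outright and replace the other levels.
    if execution:
        return list(execution)
    # Assumption: test set-level metrics take precedence over behavior-level ones.
    if test_set:
        return list(test_set)
    return list(behavior or [])
```

Note that the result is a replacement, not a merge: when execution-time metrics are given, nothing from the lower levels is carried over into that run.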

Re-scoring Existing Outputs

rescore() re-evaluates metrics on outputs from a previous test run without calling the endpoint again. This is useful when you want to apply new or different metrics to an existing set of responses.

rescore_basic.py
# Re-score the latest completed run (one-liner)
result = test_set.rescore(endpoint)

# Re-score with different metrics
result = test_set.rescore(endpoint, metrics=["Accuracy", "Toxicity"])

Specifying Which Run to Re-score

By default, rescore() uses the latest completed run for the test set / endpoint combination. You can also pass a specific run:

rescore_run.py
from rhesis.sdk.entities import TestRuns

# By name
result = test_set.rescore(endpoint, run="Safety - Run 42")

# By UUID string
result = test_set.rescore(endpoint, run="a1b2c3d4-e5f6-7890-abcd-ef1234567890")

# By TestRun object
run = TestRuns.pull(name="Safety - Run 42")
result = test_set.rescore(endpoint, run=run)

If no completed run exists for the combination, rescore() raises a ValueError.
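
Because rescore() raises ValueError when there is nothing to re-score, a caller may want to fall back to a fresh execution. The following is a minimal sketch of that pattern, assuming only the rescore() and execute() calls shown on this page; rescore_or_execute is a hypothetical helper name, and it treats any ValueError from rescore() as "no completed run", which may be too broad if other validation errors are possible.

```python
def rescore_or_execute(test_set, endpoint, **rescore_kwargs):
    """Hypothetical helper: re-score an existing run, or run fresh if none exists.

    Returns a (action, result) tuple where action is "rescore" or "execute".
    """
    try:
        return "rescore", test_set.rescore(endpoint, **rescore_kwargs)
    except ValueError:
        # No completed run to re-score: fall back to a fresh execution,
        # carrying over the metrics override if one was given.
        metrics = rescore_kwargs.get("metrics")
        if metrics is not None:
            return "execute", test_set.execute(endpoint, metrics=metrics)
        return "execute", test_set.execute(endpoint)
```

A call such as rescore_or_execute(test_set, endpoint, metrics=["Accuracy"]) then avoids calling the endpoint whenever a completed run is already available.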

Last Completed Run

last_run() returns a summary of the most recent completed test run for a given test set and endpoint. It returns None if no completed run exists.

last_run.py
last = test_set.last_run(endpoint)
if last:
    print(f"Run: {last['name']}")
    print(f"Pass rate: {last['pass_rate']}")
    print(f"Tests: {last['test_count']}")
else:
    print("No completed runs yet")

Combine with rescore() for an inspect-then-rescore workflow:

inspect_rescore.py
last = test_set.last_run(endpoint)
if last and last["pass_rate"] < 0.9:
    # Re-score with stricter metrics
    result = test_set.rescore(endpoint, run=last["id"], metrics=["Strictness"])
    print(f"Re-score submitted: {result}")
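
The decision in the snippet above can be factored into a small pure function, which makes the threshold explicit and easy to unit-test. This sketch assumes the summary is a dict with a "pass_rate" key holding a fraction between 0 and 1, as the comparison with 0.9 suggests; should_rescore is a hypothetical helper, not an SDK function.

```python
from typing import Optional


def should_rescore(last: Optional[dict], threshold: float = 0.9) -> bool:
    """Hypothetical helper: True when a completed run exists and falls below threshold."""
    # last_run() returns None when there is no completed run.
    if not last:
        return False
    pass_rate = last.get("pass_rate")
    # Treat a missing pass rate as "nothing to decide on" rather than a failure.
    if pass_rate is None:
        return False
    return pass_rate < threshold
```

With this helper, the workflow reads as: if should_rescore(test_set.last_run(endpoint)), then call rescore() with the stricter metrics.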

Managing Test Set Metrics

The methods below control which metrics are assigned to a test set. Assigned metrics are used by default whenever the test set is executed without explicit per-execution metrics.

Get Current Metrics

get_metrics.py
metrics = test_set.get_metrics()
for m in metrics:
    print(f"  {m['name']} (id: {m['id']})")

Add Metrics

Add a single metric or a list. Each item can be a dict with an "id" key, a UUID string, or a metric name string.

add_metrics.py
# Single metric by name
test_set.add_metric("Accuracy")

# Single metric by UUID
test_set.add_metric("a1b2c3d4-e5f6-7890-abcd-ef1234567890")

# Single metric by dict
test_set.add_metric({"id": "abc-123"})

# Multiple metrics at once
test_set.add_metrics(["Accuracy", "Toxicity", "Relevance"])
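
To make the three accepted forms concrete, here is a rough sketch of how a metric reference could be classified as an ID or a name. It is not the SDK's resolution logic: classify_metric_ref is a hypothetical helper, and using Python's uuid module to tell UUID strings apart from names is an assumption made for this example.

```python
import uuid
from typing import Tuple, Union


def classify_metric_ref(item: Union[str, dict]) -> Tuple[str, str]:
    """Hypothetical helper: return ("id", value) or ("name", value)."""
    if isinstance(item, dict):
        if "id" not in item:
            raise ValueError('Metric dict must contain an "id" key')
        return "id", str(item["id"])
    if isinstance(item, str):
        try:
            # Strings that parse as a UUID are treated as IDs.
            return "id", str(uuid.UUID(item))
        except ValueError:
            # Anything else is treated as a metric name to look up.
            return "name", item
    raise TypeError(f"Unsupported metric reference: {item!r}")
```

One consequence of this approach is that a metric whose name happens to parse as a UUID would be treated as an ID, so names and IDs should stay visually distinct.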

Remove Metrics

remove_metrics.py
# Single metric
test_set.remove_metric("Accuracy")

# Multiple metrics
test_set.remove_metrics(["Toxicity", "Relevance"])

Complete Workflow

complete_execution.py
from rhesis.sdk.entities import TestSets, Endpoints

# Setup
test_set = TestSets.pull(name="Safety Evaluation")
endpoint = Endpoints.pull(name="Production Chatbot")

# 1. Assign metrics to the test set
test_set.add_metrics(["Accuracy", "Toxicity", "Relevance"])

# 2. Execute fresh
result = test_set.execute(endpoint)
print(f"Execution submitted: {result}")

# 3. Check the last run
last = test_set.last_run(endpoint)
if last:
    print(f"Last run: {last['name']} — pass rate: {last['pass_rate']}")

    # 4. Re-score with a new metric (without calling the endpoint again)
    result = test_set.rescore(endpoint, metrics=["Strictness"])
    print(f"Re-score submitted: {result}")

# 5. Clean up metrics
test_set.remove_metrics(["Accuracy", "Toxicity", "Relevance"])