Test Result Status

This guide explains how individual test result statuses are determined and what they represent in the Rhesis backend.

Overview

A test result’s status reflects whether the test passed all its metric evaluations. A test is only considered “Pass” if ALL its metrics are successful.

Status Types

Pass

Definition: All metrics evaluated successfully

When assigned:

  • Every metric in test_metrics.metrics has is_successful: true
  • At least one metric exists

Fail

Definition: One or more metrics failed

When assigned:

  • At least one metric in test_metrics.metrics has is_successful: false
  • Metrics exist but not all passed

Status Determination Logic

Initial Status Assignment (During Execution)

The status is determined once by analyzing the test_metrics field when creating or updating a test result:

# Check if ALL metrics passed
all_metrics_passed = all(
    metric_data.get('is_successful', False)
    for metric_data in metrics.values()
    if isinstance(metric_data, dict)
)
status = "Pass" if all_metrics_passed else "Fail"

Key principle: A single failed metric causes the entire test to fail.
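
For illustration, here is a minimal, self-contained sketch of this rule. The helper name determine_status is hypothetical and not part of the codebase; it mirrors the snippet above and the execution-error handling described later in this guide.

def determine_status(metrics: dict) -> str | None:
    """Return "Pass" only if every metric reports is_successful, else "Fail".

    Returns None when no metrics exist, which is treated as an execution error
    rather than a Pass/Fail result (see "Execution Errors" below).
    """
    if not metrics:
        return None
    all_metrics_passed = all(
        metric_data.get("is_successful", False)
        for metric_data in metrics.values()
        if isinstance(metric_data, dict)
    )
    return "Pass" if all_metrics_passed else "Fail"

determine_status({"Answer Relevancy": {"is_successful": True}})    # "Pass"
determine_status({"Answer Relevancy": {"is_successful": True},
                  "Contextual Recall": {"is_successful": False}})  # "Fail"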

Status as Source of Truth

Important architectural decision: Once the status is determined and stored, it becomes the single source of truth for all subsequent operations:

  • Statistics calculations use the stored status field
  • Email notifications use the stored status field
  • UI displays use the stored status field
  • API responses use the stored status field

Do NOT recalculate status from metrics in these contexts. The metrics remain available for:

  • Per-metric statistics and breakdowns
  • Detailed analysis of which specific metrics failed
  • Debugging and audit purposes

This design ensures:

  • Consistency: All parts of the system agree on test status
  • Performance: Avoid re-parsing JSONB metrics repeatedly
  • Historical accuracy: Status reflects the evaluation logic at execution time
  • Simplicity: Single source of truth for pass/fail decisions
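
As a rough sketch of the intended consumption pattern (the surrounding function is illustrative; the constants and helper are the ones referenced later in this guide):

from rhesis.backend.app.constants import (
    categorize_test_result_status,
    STATUS_CATEGORY_PASSED,
)

# Downstream code reads the stored status that was set at execution time...
def test_result_passed(result) -> bool:
    return categorize_test_result_status(result.status.name) == STATUS_CATEGORY_PASSED

# ...and does NOT re-derive the overall Pass/Fail from result.test_metrics.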

Test Metrics Structure

Test results contain a test_metrics field with the following structure:

{ "execution_time": 1.23, "metrics": { "Answer Relevancy": { "is_successful": true, "score": 0.85, "threshold": 0.7, "reason": "Answer is relevant to the question" }, "Contextual Recall": { "is_successful": false, "score": 0.65, "threshold": 0.7, "reason": "Failed to recall sufficient context" }, "Answer Fluency": { "is_successful": true, "score": 0.92, "threshold": 0.7 } } }

In this example:

  • 2 metrics passed (is_successful: true)
  • 1 metric failed (is_successful: false)
  • Result: Test status = Fail
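
As a rough illustration, the structure above can be inspected like this (a sketch only; the variable names are not taken from the codebase):

test_metrics = {
    "metrics": {
        "Answer Relevancy": {"is_successful": True},
        "Contextual Recall": {"is_successful": False},
        "Answer Fluency": {"is_successful": True},
    }
}  # abbreviated form of the structure shown above

metrics = test_metrics.get("metrics", {})
passed = [name for name, m in metrics.items()
          if isinstance(m, dict) and m.get("is_successful", False)]
failed = [name for name, m in metrics.items()
          if isinstance(m, dict) and not m.get("is_successful", False)]

print(f"{len(passed)} passed, {len(failed)} failed")  # "2 passed, 1 failed"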

Examples

Example 1: All Metrics Pass

{ "test_metrics": { "metrics": { "Answer Relevancy": { "is_successful": true }, "Contextual Recall": { "is_successful": true }, "Answer Fluency": { "is_successful": true } } } }

Status: Pass
Reason: All 3 metrics passed

Example 2: One Metric Fails

{ "test_metrics": { "metrics": { "Answer Relevancy": { "is_successful": true }, "Contextual Recall": { "is_successful": false }, "Answer Fluency": { "is_successful": true } } } }

Status: Fail
Reason: 1 out of 3 metrics failed

Example 3: All Metrics Fail

{ "test_metrics": { "metrics": { "Answer Relevancy": { "is_successful": false }, "Contextual Recall": { "is_successful": false }, "Answer Fluency": { "is_successful": false } } } }

Status: Fail
Reason: All metrics failed

Example 4: Single Metric Test

{ "test_metrics": { "metrics": { "Refusal Detection": { "is_successful": true } } } }

Status: Pass
Reason: The only metric passed


Automatic Status Setting

The test result status is automatically set in three scenarios:

1. Automated Test Execution

When tests run automatically via the worker:

# apps/backend/src/rhesis/backend/tasks/execution/test_execution.py

def create_test_result_record(..., metrics_results, ...):
    # Determine status based on whether all metrics passed
    all_metrics_passed = all(
        metric_data.get('is_successful', False)
        for metric_data in metrics_results.values()
        if isinstance(metric_data, dict)
    )
    status_value = "Pass" if all_metrics_passed else "Fail"

2. API POST /test_results

When creating a test result via API:

# apps/backend/src/rhesis/backend/app/routers/test_result.py

@router.post("/")
def create_test_result(test_result: schemas.TestResultCreate, ...):
    # Auto-set status if not provided
    if not test_result.status_id and test_result.test_metrics:
        metrics = test_result.test_metrics.get('metrics', {})
        if metrics:
            all_metrics_passed = all(...)
            status = "Pass" if all_metrics_passed else "Fail"

3. API PUT /test_results/{id}

When updating test metrics via API:

# apps/backend/src/rhesis/backend/app/routers/test_result.py

@router.put("/{test_result_id}")
def update_test_result(test_result: schemas.TestResultUpdate, ...):
    # Auto-update status if metrics changed but status not provided
    if test_result.test_metrics and not test_result.status_id:
        metrics = test_result.test_metrics.get('metrics', {})
        all_metrics_passed = all(...)
        status = "Pass" if all_metrics_passed else "Fail"

Note: You can manually override the status by explicitly providing status_id in API calls.
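
For example, a request along these lines would override the automatic status (a hedged sketch: the payload fields beyond status_id and test_metrics, the host, and the auth header are assumptions, not confirmed by this guide):

import requests

payload = {
    "test_metrics": {"metrics": {"Answer Relevancy": {"is_successful": False}}},
    "status_id": "<status-uuid>",  # explicitly provided, so no auto-setting occurs
}
requests.post(
    "https://<backend-host>/test_results/",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
)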


Status in Statistics

All statistics and reporting use the stored status field as the source of truth:

Email Notifications

Counts tests using the stored status:

# apps/backend/src/rhesis/backend/tasks/execution/result_processor.py

from rhesis.backend.app.constants import (
    categorize_test_result_status,
    STATUS_CATEGORY_PASSED,
    STATUS_CATEGORY_FAILED,
)

for result in test_results:
    status_category = categorize_test_result_status(result.status.name)
    if status_category == STATUS_CATEGORY_PASSED:
        tests_passed += 1
    elif status_category == STATUS_CATEGORY_FAILED:
        tests_failed += 1

Test Result Stats API

Returns pass/fail counts based on stored status:

# apps/backend/src/rhesis/backend/app/services/stats/test_result.py

from rhesis.backend.app.constants import (
    categorize_test_result_status,
    STATUS_CATEGORY_PASSED,
)

# Uses stored test status as source of truth
status_category = categorize_test_result_status(result.status.name)
test_passed_overall = status_category == STATUS_CATEGORY_PASSED

GET /test_results/stats

{
  "overall_pass_rates": {
    "total": 100,
    "passed": 75,    // Tests with status "Pass"
    "failed": 25,    // Tests with status "Fail"
    "pass_rate": 75.0
  }
}

Note: Per-metric statistics still analyze the test_metrics field to show which specific metrics passed or failed.
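
A sketch of the kind of per-metric breakdown this enables (the aggregation shape is illustrative and not the actual stats service):

from collections import Counter

def per_metric_breakdown(test_results):
    """Count, per metric name, how many results passed vs failed that metric."""
    passed_by_metric, failed_by_metric = Counter(), Counter()
    for result in test_results:
        metrics = (result.test_metrics or {}).get("metrics", {})
        for name, m in metrics.items():
            if not isinstance(m, dict):
                continue
            if m.get("is_successful", False):
                passed_by_metric[name] += 1
            else:
                failed_by_metric[name] += 1
    return passed_by_metric, failed_by_metric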

Frontend Display

The frontend also analyzes test_metrics to determine pass/fail:

// apps/frontend/src/app/(protected)/test-runs/[identifier]/components/TestsTableView.tsx

const originalPassed = passedMetrics === totalMetrics;

Execution Errors

If a test has no metrics or empty metrics, it’s counted as an execution error (not Pass or Fail):

{ "test_metrics": null // or missing entirely }

Treatment:

  • In test run stats: Counted as execution_errors
  • In email: May show as “Execution Errors” if any exist
  • Run status impact: May cause test run to be “Partial” or “Failed”
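
A minimal sketch of how such results can be told apart from Pass/Fail results (the helper name is illustrative, not part of the codebase):

def is_execution_error(result) -> bool:
    """True when a test result has no usable metrics (test_metrics missing or empty)."""
    test_metrics = result.test_metrics or {}
    return not test_metrics.get("metrics")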

Key Distinctions

Test Result Status vs Test Run Status

Aspect     | Test Result Status  | Test Run Status
Scope      | Individual test     | Entire test run
Question   | Did this test pass? | Did tests execute?
Values     | Pass, Fail          | COMPLETED, PARTIAL, FAILED
Based on   | Metric success      | Execution completion

Pass/Fail vs Execution Success

Scenario           | Test Result Status | Execution Status
All metrics pass   | Pass               | Executed successfully
Some metrics fail  | Fail               | Executed successfully
No metrics (error) | (none)             | Execution error

A test that failed (some metrics didn’t pass) still executed successfully (it ran and returned results).


Source of Truth

The Stored Status Field

The status field in TestResult is the single source of truth for test pass/fail status after execution.

  • During execution: Status is computed from test_metrics.metrics[].is_successful
  • After execution: The stored status field is used for all operations
  • Statistics: Use status field via categorize_test_result_status()
  • Displays: Use status field for consistent reporting

Why This Matters

Using the stored status field ensures:

  1. Consistency: All parts of the system report the same status
  2. Performance: No need to re-parse JSONB metrics repeatedly
  3. Reliability: Status is determined once at execution time
  4. Historical accuracy: Reflects the evaluation logic that was active when the test ran

When to Use Metrics Directly

The test_metrics field should still be accessed for:

  • Per-metric statistics: Breaking down which specific metrics passed/failed
  • Detailed analysis: Understanding why a test failed
  • Debugging: Investigating metric evaluation logic

Do not recalculate overall test pass/fail from metrics in statistics or display logic.



Implementation Files

File                                | Purpose
tasks/execution/test_execution.py   | Automated test execution & status setting
app/routers/test_result.py          | API endpoints with auto-status setting
app/constants.py                    | Status mapping constants
tasks/execution/result_processor.py | Statistics calculation from metrics