
Human Reviews for Test Results

Overview

Human reviews complement automated metrics by allowing human evaluators to review, adjust, or override automated test outcomes. Each test result can include one or more reviews, representing human judgments with structured metadata.

This schema separates automated metrics (test_metrics) from human-provided evaluations (test_reviews), enabling clearer data management, traceability, and aggregation.

Status: ✅ Fully Implemented and Tested


Design Goals

  • Separation of concerns: Keep human feedback separate from machine metrics.
  • Support multiple reviewers and rounds: Allow several humans to evaluate the same test.
  • Granular scope: Reviews can target specific metrics or the overall test.
  • Traceability: Include timestamps, reviewer identity, and status references.
  • Efficient access: Include top-level metadata for quick lookups and summaries.
  • Full CRUD operations: Create, read, update, and delete reviews via REST API.
  • Automatic metadata management: Metadata updates automatically on all operations.

Top-Level Structure

The test_reviews field is stored as a JSON object with two parts:

  • metadata: Summary information about the reviews collection.
  • reviews: List of individual review objects.

JSON Schema

    {
      "metadata": {
        "last_updated_at": "2025-10-10T14:15:00Z",
        "last_updated_by": {
          "user_id": "a1e2d3c4-5678-90ab-cdef-1234567890ab",
          "name": "Alice"
        },
        "total_reviews": 3,
        "latest_status": {
          "status_id": "b6f1a2e3-9c4d-4f12-8b3f-123456789abc",
          "name": "success"
        },
        "summary": "The latest review marks the test as successful after human validation."
      },
      "reviews": [
        {
          "review_id": "e9b5a2c7-13f2-4a91-94d0-4df8c1c5f0a1",
          "status": {
            "status_id": "b6f1a2e3-9c4d-4f12-8b3f-123456789abc",
            "name": "success"
          },
          "user": {
            "user_id": "a1e2d3c4-5678-90ab-cdef-1234567890ab",
            "name": "Alice"
          },
          "comments": "LLM refusal was appropriate for this case. Marking as successful.",
          "created_at": "2025-10-10T14:10:00Z",
          "updated_at": "2025-10-10T14:15:00Z",
          "target": {
            "type": "metric",
            "reference": "Refusal Detection"
          }
        }
      ]
    }

Field Reference

metadata

| Field | Type | Description |
| --- | --- | --- |
| last_updated_at | string (ISO 8601) | Timestamp when any review was last modified. |
| last_updated_by | object | Contains the user ID and name of the last editor. |
| total_reviews | integer | Total number of reviews in the list. |
| latest_status | object | Status of the most recent review (for quick summaries). |
| summary | string | Optional short description or aggregated summary. |

reviews[]

| Field | Type | Description |
| --- | --- | --- |
| review_id | UUID | Unique ID for this review entry. |
| status | object | Contains a status_id (UUID) and name (string). |
| user | object | Contains a user_id (UUID) and name (string). |
| comments | string | Free-text explanation from the reviewer. |
| created_at | string (ISO 8601) | Timestamp when the review was first created. |
| updated_at | string (ISO 8601) | Timestamp when the review was last modified. |
| target | object | Defines whether the review applies to a metric or the whole test. |

Example target values

    { "type": "test", "reference": null }

    { "type": "metric", "reference": "Refusal Detection" }
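The two target shapes above could be checked before a review is stored. The following is a minimal sketch, not the backend's actual validation: `validate_target` and `ALLOWED_TARGET_TYPES` are hypothetical names, and the rules that a metric target must name its metric and that a test target carries a null reference are assumptions inferred from the examples.

```python
from typing import Any, Dict

# The two target types shown in the examples above.
ALLOWED_TARGET_TYPES = {"test", "metric"}


def validate_target(target: Dict[str, Any]) -> Dict[str, Any]:
    """Return a normalized target dict or raise ValueError (hypothetical helper)."""
    target_type = target.get("type")
    reference = target.get("reference")

    if target_type not in ALLOWED_TARGET_TYPES:
        raise ValueError(f"Unknown target type: {target_type!r}")
    if target_type == "metric" and not reference:
        # Assumption: a metric review must name the metric it refers to.
        raise ValueError("A metric target needs a metric name in 'reference'")
    if target_type == "test":
        # Assumption: whole-test reviews carry no reference.
        reference = None

    return {"type": target_type, "reference": reference}
```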

Implementation Details

Database

Column: test_reviews (JSONB, nullable)
Table: test_result

The column stores the complete review structure as JSON, providing flexibility for complex review scenarios without additional tables.

Backend Models

File: apps/backend/src/rhesis/backend/app/models/test_result.py

    class TestResult(Base, ...):
        test_reviews = Column(JSONB)

        @property
        def last_review(self) -> Optional[Dict[str, Any]]:
            """Returns the most recent review based on updated_at timestamp"""
            # Returns latest review or None

        @property
        def matches_review(self) -> bool:
            """Check if test result status_id matches latest review status_id"""
            # Returns True/False

Derived Properties:

  • last_review: Returns the most recent review by updated_at timestamp
  • matches_review: Boolean indicating if the test result’s status_id matches the latest review’s status_id
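The logic behind these two properties can be sketched as plain functions over the test_reviews dictionary. This is an illustration under stated assumptions, not the code in test_result.py: the function signatures are hypothetical, timestamps are assumed to share one ISO 8601 format so that string comparison matches chronological order, and a missing status_id is assumed to never match.

```python
from typing import Any, Dict, Optional


def last_review(test_reviews: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the review with the greatest updated_at, or None if there are no reviews."""
    reviews = (test_reviews or {}).get("reviews") or []
    if not reviews:
        return None
    # Assumption: all timestamps use the same ISO 8601 format, so they sort as text.
    return max(reviews, key=lambda review: review.get("updated_at", ""))


def matches_review(status_id: Optional[str], test_reviews: Optional[Dict[str, Any]]) -> bool:
    """Return True when the test result's status_id equals the latest review's status_id."""
    latest = last_review(test_reviews)
    if latest is None or status_id is None:
        return False
    latest_status_id = latest.get("status", {}).get("status_id")
    # Compare as strings so UUID objects and their string forms are treated alike.
    return str(status_id) == str(latest_status_id)
```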

Pydantic Schemas

File: apps/backend/src/rhesis/backend/app/schemas/test_result.py

    # Review schemas
    class ReviewTargetCreate(Base):
        type: str  # "test" or "metric"
        reference: Optional[str] = None  # metric name or null

    class ReviewCreate(Base):
        status_id: UUID4
        comments: str
        target: ReviewTargetCreate

    class ReviewUpdate(Base):
        status_id: Optional[UUID4] = None
        comments: Optional[str] = None
        target: Optional[ReviewTargetCreate] = None

    class ReviewResponse(Base):
        review_id: str
        status: Dict[str, Any]
        user: Dict[str, Any]
        comments: str
        created_at: str
        updated_at: str
        target: Dict[str, Any]

API Endpoints

All endpoints follow REST conventions and automatically manage metadata.

1. Create Review

Endpoint: POST /test_results/{test_result_id}/reviews

Request Body:

    {
      "status_id": "735acfa0-cca2-48a1-bb90-ba10b16f1cdb",
      "comments": "Test review - looks good after manual inspection",
      "target": {
        "type": "test",
        "reference": null
      }
    }

Response (201 Created):

    {
      "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
      "status": {
        "status_id": "735acfa0-cca2-48a1-bb90-ba10b16f1cdb",
        "name": "Pass"
      },
      "user": {
        "user_id": "d7834188-a9aa-410c-a63d-89d6f487aed8",
        "name": "Harry Cruz"
      },
      "comments": "Test review - looks good after manual inspection",
      "created_at": "2025-10-10T13:18:16.310822",
      "updated_at": "2025-10-10T13:18:16.310822",
      "target": {
        "type": "test",
        "reference": null
      }
    }
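From the client side, the create request could be assembled with the standard library as in the sketch below. The helper name `build_create_review_request` and the base URL are hypothetical, and authentication headers are left out because this page does not describe the auth scheme. The function only builds the request object; it does not send it.

```python
import json
from typing import Optional
from urllib.request import Request


def build_create_review_request(
    base_url: str,
    test_result_id: str,
    status_id: str,
    comments: str,
    target_type: str = "test",
    reference: Optional[str] = None,
) -> Request:
    """Build (but do not send) a POST request for the create-review endpoint."""
    body = {
        "status_id": status_id,
        "comments": comments,
        "target": {"type": target_type, "reference": reference},
    }
    return Request(
        url=f"{base_url.rstrip('/')}/test_results/{test_result_id}/reviews",
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the API accepts JSON bodies; auth headers would be added here.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a 201 response is expected to carry the review object shown above.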

Features:

  • Auto-generates unique review_id
  • Sets both created_at and updated_at to current time
  • Auto-populates user info from authenticated user
  • Fetches and embeds status details from Status model
  • Updates test_reviews metadata automatically
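The steps in this list can be sketched as a single function that assembles a new review entry. This is an illustration only: `build_review` is a hypothetical name, the status and user dictionaries are assumed to have been looked up already, and UTC-aware timestamps are an assumption (the sample responses on this page show timestamps without a timezone suffix).

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_review(
    status: Dict[str, Any],
    user: Dict[str, Any],
    comments: str,
    target: Dict[str, Any],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble a review entry with a fresh ID and identical created/updated timestamps."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "review_id": str(uuid.uuid4()),  # auto-generated unique ID
        # Embed only the status fields the schema lists.
        "status": {"status_id": str(status["status_id"]), "name": status["name"]},
        # Assumption: user info comes from the authenticated user.
        "user": {"user_id": str(user["user_id"]), "name": user["name"]},
        "comments": comments,
        "created_at": timestamp,
        "updated_at": timestamp,
        "target": {"type": target["type"], "reference": target.get("reference")},
    }
```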

2. Update Review

Endpoint: PUT /test_results/{test_result_id}/reviews/{review_id}

Request Body (all fields optional):

    {
      "comments": "UPDATED: After further review, this test passes all criteria"
    }

Response (200 OK):

    {
      "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
      "status": {
        "status_id": "735acfa0-cca2-48a1-bb90-ba10b16f1cdb",
        "name": "Pass"
      },
      "user": {
        "user_id": "d7834188-a9aa-410c-a63d-89d6f487aed8",
        "name": "Harry Cruz"
      },
      "comments": "UPDATED: After further review, this test passes all criteria",
      "created_at": "2025-10-10T13:18:16.310822",
      "updated_at": "2025-10-10T13:18:41.168149",
      "target": {
        "type": "test",
        "reference": null
      }
    }

Features:

  • Preserves created_at timestamp
  • Updates updated_at to current time
  • Updates only provided fields
  • Updates metadata automatically
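The partial-update behavior listed above can be sketched as follows. `apply_review_update` and `UPDATABLE_FIELDS` are hypothetical names, and the sketch assumes that a status_id in the request has already been resolved to a full status object before this function is called.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Fields a reviewer may change; everything else is left untouched.
UPDATABLE_FIELDS = ("status", "comments", "target")


def apply_review_update(
    review: Dict[str, Any],
    changes: Dict[str, Any],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Return a copy of the review with only the provided fields replaced."""
    updated = dict(review)  # shallow copy; the original entry is not mutated
    for field in UPDATABLE_FIELDS:
        value = changes.get(field)
        if value is not None:  # omitted or null fields keep their old value
            updated[field] = value
    # created_at is carried over unchanged; only updated_at moves forward.
    updated["updated_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return updated
```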

3. Delete Review

Endpoint: DELETE /test_results/{test_result_id}/reviews/{review_id}

Response (200 OK):

    {
      "message": "Review deleted successfully",
      "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
      "deleted_review": {
        "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
        "status": {...},
        "user": {...},
        "comments": "UPDATED: After further review, this test passes all criteria",
        "created_at": "2025-10-10T13:18:16.310822",
        "updated_at": "2025-10-10T13:18:41.168149",
        "target": {...}
      }
    }

Features:

  • Removes review from reviews array
  • Updates metadata automatically
  • Handles the empty state (when the last review is deleted, latest_status is set to null)
  • Returns deleted review data
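Removing an entry and building the response payload could look like the sketch below. `remove_review` is a hypothetical helper, and raising KeyError for an unknown ID is an assumption; this page does not say how the endpoint reports a missing review.

```python
from typing import Any, Dict, List, Tuple


def remove_review(
    reviews: List[Dict[str, Any]], review_id: str
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Return (remaining reviews, response payload) after removing one review."""
    for index, review in enumerate(reviews):
        if review["review_id"] == review_id:
            # Build a new list so the caller's original list is left intact.
            remaining = reviews[:index] + reviews[index + 1:]
            payload = {
                "message": "Review deleted successfully",
                "review_id": review_id,
                "deleted_review": review,
            }
            return remaining, payload
    # Assumption: an unknown ID is treated as an error.
    raise KeyError(f"Review {review_id} not found")
```

An empty `remaining` list corresponds to the empty state, in which latest_status is set to null.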

4. Get Test Result with Reviews

Endpoint: GET /test_results/{test_result_id}

Response includes:

    {
      "id": "fe647ace-5364-4158-b634-9bc7c53d7905",
      "status_id": "735acfa0-cca2-48a1-bb90-ba10b16f1cdb",
      "test_reviews": {
        "metadata": {
          "last_updated_at": "2025-10-10T13:18:16.310860",
          "last_updated_by": {
            "user_id": "d7834188-a9aa-410c-a63d-89d6f487aed8",
            "name": "Harry Cruz"
          },
          "total_reviews": 1,
          "latest_status": {
            "status_id": "735acfa0-cca2-48a1-bb90-ba10b16f1cdb",
            "name": "Pass"
          },
          "summary": "Last updated by Harry Cruz"
        },
        "reviews": [
          {
            "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
            "status": {...},
            "user": {...},
            "comments": "...",
            "created_at": "2025-10-10T13:18:16.310822",
            "updated_at": "2025-10-10T13:18:16.310822",
            "target": {...}
          }
        ]
      },
      "last_review": {
        "review_id": "c370e2d7-f526-41c4-94f0-a6f25735a9b9",
        "status": {...},
        "user": {...},
        "comments": "...",
        "created_at": "2025-10-10T13:18:16.310822",
        "updated_at": "2025-10-10T13:18:16.310822",
        "target": {...}
      },
      "matches_review": true
    }

Derived Properties:

  • last_review: Most recent review by updated_at timestamp
  • matches_review: Boolean indicating if test result status matches latest review status

Use Cases

1. Override Automated Metrics

A human reviewer disagrees with an automated pass/fail and adds a review with a different status.

2. Multi-Reviewer Workflow

Multiple team members review the same test result, each adding their perspective.

3. Metric-Specific Reviews

Review specific metrics (e.g., “Answer Relevancy”) separately from the overall test.
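Because each review carries a target, metric-level and test-level reviews can be separated with a simple filter. The helpers below are hypothetical and operate on the test_reviews dictionary as returned by the API.

```python
from typing import Any, Dict, List


def reviews_for_metric(test_reviews: Dict[str, Any], metric_name: str) -> List[Dict[str, Any]]:
    """Return the reviews whose target is the named metric."""
    return [
        review
        for review in test_reviews.get("reviews", [])
        if review["target"]["type"] == "metric"
        and review["target"]["reference"] == metric_name
    ]


def reviews_for_test(test_reviews: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the reviews that apply to the test as a whole."""
    return [
        review
        for review in test_reviews.get("reviews", [])
        if review["target"]["type"] == "test"
    ]
```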

4. Audit Trail

Track who reviewed what and when: each review records its reviewer, its creation time, and the time of its most recent edit.

5. Status Conflict Detection

Use matches_review to identify cases where the human review disagrees with the automated result.
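A conflict report could be built from a list of test result payloads, as in this sketch. `find_status_conflicts` is a hypothetical helper. Since matches_review is also false when a result has no reviews at all, the filter additionally requires last_review to be present, so that unreviewed results are not reported as conflicts.

```python
from typing import Any, Dict, List


def find_status_conflicts(test_results: List[Dict[str, Any]]) -> List[str]:
    """Return IDs of test results whose latest human review disagrees with their status."""
    return [
        result["id"]
        for result in test_results
        # Skip results with no reviews: matches_review is false for them too.
        if result.get("last_review") is not None
        and not result.get("matches_review", False)
    ]
```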


Implementation Notes

Metadata Management

The metadata section is automatically updated on every review operation:

  • Create: Initializes metadata with first review’s information
  • Update: Updates last_updated_at, last_updated_by, and latest_status
  • Delete: Updates metadata or clears latest_status if no reviews remain
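One way to keep the metadata consistent across all three operations is to recompute it from the reviews list after every change. The sketch below is an assumption about how that could be done, not the backend's actual code: `rebuild_metadata` and its `actor` parameter are hypothetical, and the two summary strings are the ones that appear in the examples on this page.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def rebuild_metadata(
    reviews: List[Dict[str, Any]],
    actor: Dict[str, Any],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Recompute the metadata block from the current reviews list."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    # The most recent review is the one with the greatest updated_at.
    latest = max(reviews, key=lambda review: review["updated_at"]) if reviews else None
    return {
        "last_updated_at": timestamp,
        # Assumption: the user who performed the operation is recorded here.
        "last_updated_by": {"user_id": actor["user_id"], "name": actor["name"]},
        "total_reviews": len(reviews),
        "latest_status": dict(latest["status"]) if latest else None,
        "summary": f"Last updated by {actor['name']}" if reviews else "All reviews removed",
    }
```

Recomputing from the list avoids separate code paths for create, update, and delete, at the cost of scanning all reviews on each change.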

JSONB Column Updates

When modifying reviews, use SQLAlchemy’s flag_modified to ensure changes are detected:

    from sqlalchemy.orm.attributes import flag_modified

    # After modifying test_reviews
    flag_modified(test_result, "test_reviews")
    db.commit()

Empty State Handling

When the last review is deleted:

  • reviews array becomes empty
  • total_reviews = 0
  • latest_status = null
  • summary = “All reviews removed”
  • last_review derived property returns None
  • matches_review returns False

Testing

All functionality has been tested and verified:

  ✅ Create review with auto-generated ID and timestamps
  ✅ Update review with preserved created_at
  ✅ Delete review with metadata updates
  ✅ Multiple reviews support
  ✅ Both “test” and “metric” target types
  ✅ Empty state handling
  ✅ Derived properties (last_review, matches_review)
  ✅ Automatic metadata synchronization


Future Extensions

Potential enhancements:

  • confidence: Reviewer confidence score (0–1)
  • attachments: File attachments for evidence
  • review_type: Categorize reviews (approval, rejection, follow-up)
  • tags: Categorize reviews with tags
  • Review workflows and approval chains