Test Sets
What are Test Sets? Test Sets are groups of tests that can be executed together.
Test sets organize related tests into collections for batch execution. When you generate tests, all created tests are automatically grouped into a single test set. You can also manually assign tests to test sets or remove them as needed.
Test sets inherit shared types, behaviors, categories, topics and sources from their tests.
Test Sets Page
The Test Sets page is the central place to manage all your test sets. It displays a summary of test set activity in charts at the top, followed by a searchable, filterable grid of all test sets in your organization.
From this page you can:
- Create a new empty test set
- Import tests from a file or from Garak (see below)
- Execute one or more test sets against an endpoint
- Delete test sets you no longer need
Click any test set name to open its detail page, where you can inspect individual tests, view metrics, and manage tags.
Importing Test Sets
Rhesis provides two ways to bring existing tests into a test set without generating them from scratch.
From a File
Import tests from a CSV, Excel, JSON, or JSONL file. Rhesis analyses your file’s structure, suggests a column mapping to Rhesis test fields, and lets you review and adjust before committing. Both Single-Turn and Multi-Turn tests are supported.
From Garak
Garak is an open-source LLM vulnerability scanner. Rhesis integrates its full probe library directly into the platform — select the probes you want, and Rhesis creates one test set per probe, pre-populated with prompts. Garak detectors are automatically mapped as Rhesis metrics, so imported test sets are ready to evaluate immediately.
Probes come in two types: static (prompts bundled with the probe) and dynamic (prompts generated at runtime by an LLM). Both are supported.
Test Set Types
Every test set has a type that determines how its tests are executed:
| Type | Description |
|---|---|
| Single-Turn | Tests that evaluate individual prompt/response exchanges. Each test sends a single input and evaluates the response. Ideal for RAG systems, classification tasks, and standalone response quality. |
| Multi-Turn | Tests that evaluate conversational interactions across multiple turns. Each test defines a goal and the system conducts an automated multi-turn conversation to assess the endpoint’s behavior. Ideal for chatbots, agents, and dialogue systems. |
The test set type is set when the test set is created and determines which metrics can be applied during evaluation. When generating tests or importing from files, the type is inferred automatically from the tests: if any test is multi-turn, the test set is classified as Multi-Turn.
Executing Test Sets
Executing a test set runs all its tests against your AI application endpoint to see how your application responds. This creates a Test Run that captures all results.
To execute a test set, select it from the Test Sets page and configure:
Execution Target
- Project: The current project
- Endpoint: The endpoint within the project to execute tests against
Execution Mode
- Parallel (default): Tests run simultaneously for faster execution
- Sequential: Tests run one after another, better for rate-limited endpoints
Tags: Optional tags to categorize and find this test run
Next Steps
- Import from File to create a test set from CSV, Excel, JSON, or JSONL
- Import from Garak to import Garak vulnerability probes
- Generate tests to create test sets with AI
- View execution progress in Results Overview
- Track historical performance in Test Runs