
Scenarios

Common requests and what to expect from each one.


In the examples below, U is the user and Architect is the assistant.

Full workflow: design and run a test suite

The most common entry point. Describe your endpoint and what you care about.

Example

U: Test my travel booking chatbot. Focus on accuracy and safety.

Architect: I’ll explore the endpoint first. Would you prefer a Quick scan (domain only) or Comprehensive (full capability and boundary mapping)?

U: Quick is fine.

Architect: Done. The chatbot handles flight search, booking, and cancellations. It refuses travel advice outside its scope. I noticed some uncertainty around partial cancellations — should I include refund edge cases in the test suite?

U: Yes, include those.

Architect: Here's my plan:
  • 3 behaviors: Handles Flight Search, Refuses Off-Topic, Handles Cancellations
  • 2 test sets (30 tests total)
  • 2 metrics: Accuracy, Refusal Correctness
  Shall I proceed?

U: Go ahead.

→ See Workflow for what happens after approval.

Direct operations

You can skip the full workflow and ask Architect to do specific things with existing entities.

List test sets:

U: Show me my test sets.

Architect: You have 4 test sets:
  • Flight Search Core — 32 tests
  • Refusal Suite — 15 tests
  • Edge Cases — 8 tests
  • Cancellation Flow — 12 tests

Update a metric:

U: Make the Accuracy metric stricter — raise the threshold to 0.85.

Architect: Updated Accuracy — threshold is now 0.85.
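Raising a metric's threshold simply raises the bar a score must clear, so tests that passed at the old threshold may now fail. A minimal sketch of the idea — illustrative only, not Architect's implementation; the 0.85 value is taken from the example above:

```python
# Illustrative only: how a metric threshold gates pass/fail.
# Nothing here is Architect's API; 0.85 comes from the example above.

def passes(score: float, threshold: float = 0.85) -> bool:
    """A test passes the metric only if its score meets the threshold."""
    return score >= threshold

print(passes(0.90))  # clears the stricter 0.85 bar
print(passes(0.80))  # would have passed a 0.75 threshold, fails at 0.85
```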

Link a behavior to a metric:

U: Link Handles Cancellations to the Refusal Correctness metric.

Architect: Done — Handles Cancellations is now evaluated by Refusal Correctness.

Describe a behavior:

U: What does the Refuses Off-Topic behavior cover?

Architect: Refuses Off-Topic checks that the chatbot declines requests outside travel booking — such as weather, restaurant advice, or general knowledge questions.

Testing with a knowledge source

Ground test generation in your own documentation. Reference the source by name.

U: Use our product FAQ to generate tests for the support chatbot.

Architect: Found “Product FAQ” in your knowledge library. I’ll use it to generate grounded single-turn tests. Shall I plan the full suite?

Knowledge sources only apply to single-turn test generation. Multi-turn tests are not grounded.

Attaching a spec or requirements document

Upload a document and let Architect read it before planning. Works well for OpenAPI specs, requirements docs, or policy PDFs.

U: [attaches openapi-spec.yaml] Build a test suite based on this API spec.

Architect: I've read the spec. It defines 8 endpoints across 3 domains. Here's what I'd focus on for testing...

See Chat Features — File attachments for supported formats and size limits.

Comparing two test runs

U: Compare the last two runs.

Architect: Run 14 vs Run 13 — overall: 78% → 82% (improved)
  • Handles Flight Search: 90% → 90% (unchanged)
  • Refuses Off-Topic: 60% → 75% (improved)
  • Handles Cancellations: 83% → 80% (slight regression)
  The Refusal Correctness metric drove most of the Off-Topic improvement. Cancellation handling regressed on Accuracy — two tests that were passing now fail.
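The comparison logic is easy to reason about: for each behavior, take the delta between the two runs and classify it as improved, regressed, or unchanged. A hedged sketch using the scores from the transcript above — the data structures are illustrative, not Architect's output format:

```python
# Illustrative sketch of a per-behavior run comparison.
# Scores come from the example transcript; this is not Architect's API.

run_13 = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.60,
    "Handles Cancellations": 0.83,
}
run_14 = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.75,
    "Handles Cancellations": 0.80,
}

def compare(old: dict, new: dict) -> list:
    """Return (behavior, old score, new score, verdict) rows."""
    rows = []
    for behavior, before in old.items():
        after = new[behavior]
        if after > before:
            verdict = "improved"
        elif after < before:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        rows.append((behavior, before, after, verdict))
    return rows

for behavior, before, after, verdict in compare(run_13, run_14):
    print(f"{behavior}: {before:.0%} → {after:.0%} ({verdict})")
```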

Analyzing a past run

U: Analyze the last test run.

Architect: Run 14 — 78% overall. By behavior:
  • Handles Flight Search: 90%
  • Refuses Off-Topic: 60% ← below threshold
  • Handles Cancellations: 83%
  Off-Topic refusals are failing primarily on Refusal Correctness. The evaluator notes that the chatbot gives partial answers rather than clean refusals on 6 of the 10 failing tests. Tightening the system prompt for out-of-scope topics should address this.
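Flagging a behavior as "below threshold" amounts to filtering the per-behavior pass rates. A sketch with the numbers from this example, assuming an illustrative 0.75 threshold — in practice the bar is whatever each metric is configured with:

```python
# Illustrative: surface behaviors whose pass rate falls below a threshold.
# Scores are from the example run; the 0.75 threshold is an assumption.

THRESHOLD = 0.75

scores = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.60,
    "Handles Cancellations": 0.83,
}

below = sorted(b for b, s in scores.items() if s < THRESHOLD)
print(below)  # only the off-topic refusal behavior falls short
```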

Auto-approve session

Turn on auto-approve when you want to move fast through a well-understood workflow.

  1. Toggle Auto-approve in the chat header.
  2. Ask Architect to build and run a test suite.
  3. Architect creates the entities and executes the run without pausing for confirmation at each step.

Auto-approve skips all confirmations for the session — including creation and execution. See Chat Features — Auto-approve.