
Scenarios

Common requests and what to expect from each one.


In the examples below, U is the user and Architect is the assistant.

Full workflow: design and run a test suite

The most common entry point. Describe your endpoint and what you care about.

Example

U: Test my travel booking chatbot. Focus on accuracy and safety.

Architect: I’ll explore the endpoint first. Would you prefer a Quick scan (domain only) or Comprehensive (full capability and boundary mapping)?

U: Quick is fine.

Architect: Done. The chatbot handles flight search, booking, and cancellations. It refuses travel advice outside its scope. I noticed some uncertainty around partial cancellations — should I include refund edge cases in the test suite?

U: Yes, include those.

Architect: Here's my plan:
  • 3 behaviors: Handles Flight Search, Refuses Off-Topic, Handles Cancellations
  • 2 test sets (30 tests total)
  • 2 metrics: Accuracy, Refusal Correctness
  Shall I proceed?

U: Go ahead.

→ See Workflow for what happens after approval.

Direct operations

You can skip the full workflow and ask Architect to do specific things with existing entities.

List test sets:

U: Show me my test sets.

Architect: You have 4 test sets:
  • Flight Search Core — 32 tests
  • Refusal Suite — 15 tests
  • Edge Cases — 8 tests
  • Cancellation Flow — 12 tests

Update a metric:

U: Make the Accuracy metric stricter — raise the threshold to 0.85.

Architect: Updated Accuracy — threshold is now 0.85.
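Raising a metric's threshold simply raises the bar a score must clear, so tests that passed at the old threshold may now fail. A minimal sketch of the idea — illustrative only, not Architect's implementation; the 0.85 value is taken from the example above:

```python
# Illustrative only: how a metric threshold gates pass/fail.
# Nothing here is Architect's API; 0.85 comes from the example above.

def passes(score: float, threshold: float = 0.85) -> bool:
    """A test passes the metric only if its score meets the threshold."""
    return score >= threshold

print(passes(0.90))  # clears the stricter 0.85 bar
print(passes(0.80))  # would have passed a 0.75 threshold, fails at 0.85
```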

Link a behavior to a metric:

U: Link Handles Cancellations to the Refusal Correctness metric.

Architect: Done — Handles Cancellations is now evaluated by Refusal Correctness.

Describe a behavior:

U: What does the Refuses Off-Topic behavior cover?

Architect: Refuses Off-Topic checks that the chatbot declines requests outside travel booking — such as weather, restaurant advice, or general knowledge questions.

Testing with a knowledge source

Ground test generation in your own documentation. Reference the source by name.

U: Use our product FAQ to generate tests for the support chatbot.

Architect: Found “Product FAQ” in your knowledge library. I’ll use it to generate grounded single-turn tests. Shall I plan the full suite?

Knowledge sources only apply to single-turn test generation. Multi-turn tests are not grounded.

Attaching a spec or requirements document

Upload a document and let Architect read it before planning. Works well for OpenAPI specs, requirements docs, or policy PDFs.

U: [attaches openapi-spec.yaml] Build a test suite based on this API spec.

Architect: I've read the spec. It defines 8 endpoints across 3 domains. Here's what I'd focus on for testing...

See Chat Features — File attachments for supported formats and size limits.

Comparing two test runs

U: Compare the last two runs.

Architect: Run 14 vs Run 13 — overall: 78% → 82% (improved)
  • Handles Flight Search: 90% → 90% (unchanged)
  • Refuses Off-Topic: 60% → 75% (improved)
  • Handles Cancellations: 83% → 80% (slight regression)
  The Refusal Correctness metric drove most of the Off-Topic improvement. Cancellation handling regressed on Accuracy — two tests that were passing now fail.
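The comparison logic is easy to reason about: for each behavior, take the delta between the two runs and classify it as improved, regressed, or unchanged. A hedged sketch using the scores from the transcript above — the data structures are illustrative, not Architect's output format:

```python
# Illustrative sketch of a per-behavior run comparison.
# Scores come from the example transcript; this is not Architect's API.

run_13 = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.60,
    "Handles Cancellations": 0.83,
}
run_14 = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.75,
    "Handles Cancellations": 0.80,
}

def compare(old: dict, new: dict) -> list:
    """Return (behavior, old score, new score, verdict) rows."""
    rows = []
    for behavior, before in old.items():
        after = new[behavior]
        if after > before:
            verdict = "improved"
        elif after < before:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        rows.append((behavior, before, after, verdict))
    return rows

for behavior, before, after, verdict in compare(run_13, run_14):
    print(f"{behavior}: {before:.0%} → {after:.0%} ({verdict})")
```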

Analyzing a past run

U: Analyze the last test run.

Architect: Run 14 — 78% overall. By behavior:
  • Handles Flight Search: 90%
  • Refuses Off-Topic: 60% ← below threshold
  • Handles Cancellations: 83%
  Off-Topic refusals are failing primarily on Refusal Correctness. The evaluator notes that the chatbot gives partial answers rather than clean refusals on 6 of the 10 failing tests. Tightening the system prompt for out-of-scope topics should address this.
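Flagging a behavior as "below threshold" amounts to filtering the per-behavior pass rates. A sketch with the numbers from this example, assuming an illustrative 0.75 threshold — in practice the bar is whatever each metric is configured with:

```python
# Illustrative: surface behaviors whose pass rate falls below a threshold.
# Scores are from the example run; the 0.75 threshold is an assumption.

THRESHOLD = 0.75

scores = {
    "Handles Flight Search": 0.90,
    "Refuses Off-Topic": 0.60,
    "Handles Cancellations": 0.83,
}

below = sorted(b for b, s in scores.items() if s < THRESHOLD)
print(below)  # only the off-topic refusal behavior falls short
```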

Auto-approve session

Turn on auto-approve when you want to move fast through a well-understood workflow.

  1. Toggle Auto-approve in the chat header.
  2. Ask Architect to build and run a test suite.
  3. Architect creates the entities and executes the run without pausing for confirmation at each step.

Auto-approve skips all confirmations for the session — including creation and execution. See Chat Features — Auto-approve.