# Running and Analyzing
After you approve the plan, Architect runs the tests and presents results in a structured format. You can also ask it to analyze existing runs or compare two runs directly.
## Running tests
When creation is complete, Architect offers to run the tests. Confirm, and it submits the test set to your endpoint. The mode chip switches to executing while it waits.
You don’t need to create a test configuration manually. Architect handles the setup and monitors the run automatically — there is no polling or waiting on your part.
When the run completes, Architect presents the results.
## Results format
Results come in three layers, each more specific than the last, plus a list of notable failures:
- Overall — a single pass/fail percentage for the full run. Sets the context before diving in.
- Per-behavior — the pass rate for each behavior you tested. Behaviors with a rate below threshold are flagged immediately.
- Per-metric — within each behavior, how each metric scored. Lets you see whether a behavior failed on one specific criterion (e.g., tone) while passing others (e.g., accuracy).
- Notable failures — the worst-performing tests, with the evaluator's stated reason for marking them as failing. This tells you why something failed, not just that it did.
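To make the three layers concrete, here is a minimal sketch of how such a summary could be computed. The record shape (`behavior`, `metric`, `passed`, `reason` fields) is a hypothetical stand-in; Architect's actual result schema is internal to the product.

```python
from collections import defaultdict

# Hypothetical test-result records; the real schema is Architect-internal.
results = [
    {"behavior": "refunds", "metric": "tone",     "passed": True,  "reason": ""},
    {"behavior": "refunds", "metric": "accuracy", "passed": False, "reason": "wrong policy cited"},
    {"behavior": "billing", "metric": "tone",     "passed": True,  "reason": ""},
    {"behavior": "billing", "metric": "accuracy", "passed": True,  "reason": ""},
]

def summarize(results):
    """Build the three layers: overall, per-behavior, per-metric, plus failures."""
    overall = sum(r["passed"] for r in results) / len(results)
    by_behavior = defaultdict(list)
    by_metric = defaultdict(list)
    for r in results:
        by_behavior[r["behavior"]].append(r["passed"])
        by_metric[(r["behavior"], r["metric"])].append(r["passed"])
    behavior_rates = {b: sum(v) / len(v) for b, v in by_behavior.items()}
    metric_rates = {k: sum(v) / len(v) for k, v in by_metric.items()}
    failures = [r for r in results if not r["passed"]]
    return overall, behavior_rates, metric_rates, failures

overall, behavior_rates, metric_rates, failures = summarize(results)
print(f"Overall: {overall:.0%}")  # Overall: 75%
```

Note how the per-metric layer keys on the (behavior, metric) pair: that is what lets a behavior fail on tone while passing on accuracy.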
## Interpreting failure patterns
Architect categorizes failure patterns and surfaces likely causes:
| Pattern | What it means | What to do |
|---|---|---|
| All tests fail (0%) | Endpoint unreachable, response format changed, or auth issue | Check connectivity and endpoint response format |
| Most tests fail in one behavior | That specific behavior needs attention — the endpoint may be inconsistent in this area | Tighten the behavior description or review the endpoint’s handling |
| Single metric fails across behaviors | The evaluation criteria may be miscalibrated, or there’s a genuine weakness | Review the metric threshold; check if failures cluster on a topic |
| Tests fail on a narrow topic cluster | The endpoint has a knowledge gap | Consider targeted improvements or additional training data |
| Borderline pass rates (50–70%) | The endpoint is inconsistent — working sometimes but not reliably | Re-run with more tests to confirm; look for input patterns that trigger the inconsistency |
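The patterns in the table can be expressed as simple checks over the pass rates. The sketch below is illustrative only: the thresholds and messages are assumptions, not Architect's actual diagnostic rules.

```python
def diagnose(overall, behavior_rates):
    """Map pass rates onto the failure patterns above (illustrative thresholds)."""
    if overall == 0.0:
        # Everything failing usually means infrastructure, not quality:
        return "All tests fail: check connectivity, auth, and response format."
    weak = [b for b, rate in behavior_rates.items() if rate < 0.5]
    if weak:
        return "Behaviors needing attention: " + ", ".join(sorted(weak))
    if 0.5 <= overall <= 0.7:
        return "Borderline pass rate: re-run with more tests to confirm."
    return "No dominant failure pattern detected."

print(diagnose(0.9, {"refunds": 0.3, "billing": 0.95}))
```

Single-metric and topic-cluster patterns would need per-metric and per-topic breakdowns as well; the same thresholding idea applies.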
## Comparing two runs
Ask Architect to compare any two runs to see what changed:
"How does the latest run compare to the previous one?"
"Compare run A to run B."

Architect retrieves both run results and shows:
- Overall trend — improved, regressed, or unchanged
- Per-behavior changes — which behaviors got better or worse
- Per-metric changes — which specific metrics drove the change
- Notable regressions — tests that were passing and now fail
- Notable improvements — tests that were failing and now pass
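The comparison above boils down to diffing two runs test by test. Here is a minimal sketch, assuming each run is a mapping from test ID to pass/fail; the shape is hypothetical, not Architect's actual data model.

```python
def compare(prev, curr):
    """Diff two runs keyed by test ID; values are pass/fail booleans."""
    shared = prev.keys() & curr.keys()  # only compare tests present in both runs
    regressions = sorted(t for t in shared if prev[t] and not curr[t])
    improvements = sorted(t for t in shared if not prev[t] and curr[t])
    prev_rate = sum(prev[t] for t in shared) / len(shared)
    curr_rate = sum(curr[t] for t in shared) / len(shared)
    if curr_rate > prev_rate:
        trend = "improved"
    elif curr_rate < prev_rate:
        trend = "regressed"
    else:
        trend = "unchanged"
    return trend, regressions, improvements

prev = {"t1": True, "t2": False, "t3": True}
curr = {"t1": True, "t2": True, "t3": False}
trend, regs, imps = compare(prev, curr)
# trend == "unchanged", regs == ["t3"], imps == ["t2"]
```

The last example shows why the overall trend alone can mislead: the pass rate is unchanged, yet one test regressed and another improved, which is exactly what the per-test regression and improvement lists surface.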
## Analyzing an existing run
You don’t need to run a test yourself to get an analysis. Ask Architect about any past run:
"Analyze run 42."
"What do the results of the last test run look like?"

Architect fetches the results and presents the same three-layer summary — overall, behavior breakdown, metric breakdown — plus actionable suggestions based on what it finds.
Architect refers to runs by name or index. You don’t need to supply a run ID.
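Resolving a run from a name, an index, or "latest" could look like the sketch below. This is purely illustrative; the run records and lookup rules are assumptions, not Architect's implementation.

```python
runs = [
    {"name": "baseline", "index": 1},   # hypothetical run records
    {"name": "after-fix", "index": 2},
]

def resolve_run(ref, runs):
    """Find a run by name, 1-based index, or the keyword 'latest'."""
    if ref == "latest":
        return runs[-1]
    if ref.isdigit():
        return next((r for r in runs if r["index"] == int(ref)), None)
    return next((r for r in runs if r["name"] == ref), None)
```

A lookup like `resolve_run("latest", runs)` returns the most recent run, so a request such as "analyze run 42" or "the last test run" never requires an internal run ID.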