Goal Achievement
A metric used by Penelope to evaluate whether a multi-turn conversation accomplished its stated goal, producing a conversation-level pass/fail outcome.
Overview
Goal Achievement is the primary evaluation signal for multi-turn tests. Unlike single-turn metrics that evaluate one response in isolation, Goal Achievement evaluates the conversation as a whole: did the agent (your application) ultimately help the user accomplish what they set out to do?
How It Works
Goal Achievement uses an LLM-as-a-judge approach. After Penelope completes the multi-turn conversation, the evaluator receives:
- The stated goal from the test definition
- The full conversation transcript
- Any context or criteria specified in the test
The judge then assesses whether the goal was achieved and returns a pass/fail determination with an explanation.
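Penelope's actual judge prompt and response format are not shown here, but the three inputs and the pass/fail-with-explanation output can be sketched. In this hypothetical Python sketch, `JudgeInput`, `build_judge_prompt`, and `parse_verdict` are illustrative names, not Penelope's API; a real run would send the assembled prompt to an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeInput:
    goal: str                                  # the stated goal from the test definition
    transcript: list[str]                      # the full conversation transcript
    criteria: list[str] = field(default_factory=list)  # any extra criteria from the test

def build_judge_prompt(inp: JudgeInput) -> str:
    """Assemble the evaluator prompt from the three inputs listed above."""
    lines = [
        "You are evaluating whether a conversation achieved its stated goal.",
        f"GOAL: {inp.goal}",
        "TRANSCRIPT:",
        *inp.transcript,
    ]
    if inp.criteria:
        lines.append("ADDITIONAL CRITERIA:")
        lines.extend(f"- {c}" for c in inp.criteria)
    lines.append('Answer with "PASS: <reason>" or "FAIL: <reason>".')
    return "\n".join(lines)

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Turn the judge's raw reply into a pass/fail flag plus explanation."""
    label, _, reason = raw.partition(":")
    return label.strip().upper() == "PASS", reason.strip()
```

The parsing step reflects the shape described above: a binary determination accompanied by an explanation, rather than a bare boolean.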
Role in the Three-Tier Metrics Model
Goal Achievement is one metric among many in the three-tier model (behavior > test set > execution). While it is the default and most important metric for multi-turn tests, all other metrics defined at the behavior, test set, or execution level also contribute to the overall pass/fail determination.
Defining a Clear Goal
The quality of Goal Achievement evaluation depends directly on how well the goal is specified in the test. A good goal is:
- Specific: Describes exactly what the user needs to accomplish
- Verifiable: An evaluator can determine from the transcript whether it was achieved
- Outcome-focused: States the desired end state, not the process
Example of a clear goal: "The user needs to get a full refund for order #12345 without being asked to call customer service."
Example of an unclear goal: "The chatbot should be helpful."
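The specific/verifiable/outcome-focused properties above can be approximated with a simple pre-flight check before a goal is added to a test set. This is a heuristic sketch, not a Penelope feature; the vague-term list and length threshold are illustrative assumptions.

```python
# Illustrative list of words that signal an unverifiable goal.
VAGUE_TERMS = {"helpful", "good", "nice", "appropriate", "properly"}

def lint_goal(goal: str) -> list[str]:
    """Return warnings for goals likely to produce unreliable judgments."""
    warnings = []
    words = {w.strip('.,"').lower() for w in goal.split()}
    if words & VAGUE_TERMS:
        warnings.append("goal uses vague, unverifiable language")
    if len(goal.split()) < 5:
        warnings.append("goal may be too short to be specific")
    return warnings
```

Applied to the examples above, the clear refund goal passes cleanly while "The chatbot should be helpful." is flagged for vague language.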
Relationship to Penelope
Goal Achievement is evaluated by Penelope at the end of each multi-turn conversation. Penelope adapts its conversational strategy throughout the test based on the goal, then determines whether its interaction with your application achieved what was intended.
Best Practices
- Write goals that describe the desired outcome, not the expected process (e.g. "user gets refund" not "agent follows refund steps")
- Pilot new goal definitions in the Playground before adding them to a test set to validate they are evaluable
- Combine Goal Achievement with behavior-level metrics for comprehensive multi-turn evaluation coverage
- Keep goal statements concise and verifiable; ambiguous goals produce unreliable pass/fail determinations