Goal Achievement
A metric used by Penelope to evaluate whether a multi-turn conversation accomplished its stated goal, producing a conversation-level pass/fail outcome.
Overview
Goal Achievement is the primary evaluation signal for multi-turn tests. Unlike single-turn metrics that evaluate one response in isolation, Goal Achievement evaluates the conversation as a whole: did the agent (your application) ultimately help the user accomplish what they set out to do?
How It Works
Goal Achievement uses an LLM-as-a-judge approach. After Penelope completes the multi-turn conversation, the evaluator receives:
- The stated goal from the test definition
- The full conversation transcript
- Any context or criteria specified in the test
The judge then assesses whether the goal was achieved and returns a pass/fail determination with an explanation.
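Penelope's actual judge prompt and response format are not shown here, but the three inputs and the pass/fail-with-explanation output can be sketched. In this hypothetical Python sketch, `JudgeInput`, `build_judge_prompt`, and `parse_verdict` are illustrative names, not Penelope's API; a real run would send the assembled prompt to an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeInput:
    goal: str                                  # the stated goal from the test definition
    transcript: list[str]                      # the full conversation transcript
    criteria: list[str] = field(default_factory=list)  # any extra criteria from the test

def build_judge_prompt(inp: JudgeInput) -> str:
    """Assemble the evaluator prompt from the three inputs listed above."""
    lines = [
        "You are evaluating whether a conversation achieved its stated goal.",
        f"GOAL: {inp.goal}",
        "TRANSCRIPT:",
        *inp.transcript,
    ]
    if inp.criteria:
        lines.append("ADDITIONAL CRITERIA:")
        lines.extend(f"- {c}" for c in inp.criteria)
    lines.append('Answer with "PASS: <reason>" or "FAIL: <reason>".')
    return "\n".join(lines)

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Turn the judge's raw reply into a pass/fail flag plus explanation."""
    label, _, reason = raw.partition(":")
    return label.strip().upper() == "PASS", reason.strip()
```

The parsing step reflects the shape described above: a binary determination accompanied by an explanation, rather than a bare boolean.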
Role in the Three-Tier Metrics Model
Goal Achievement is one metric among many in the three-tier model (behavior > test set > execution). While it is the default and most important metric for multi-turn tests, all other metrics defined at the behavior, test set, or execution level also contribute to the overall pass/fail determination.
Defining a Clear Goal
The quality of Goal Achievement evaluation depends directly on how well the goal is specified in the test. A good goal is:
- Specific: Describes exactly what the user needs to accomplish
- Verifiable: An evaluator can determine from the transcript whether it was achieved
- Outcome-focused: States the desired end state, not the process
Example of a clear goal: "The user needs to get a full refund for order #12345 without being asked to call customer service."
Example of an unclear goal: "The chatbot should be helpful."
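The specific/verifiable/outcome-focused properties above can be approximated with a simple pre-flight check before a goal is added to a test set. This is a heuristic sketch, not a Penelope feature; the vague-term list and length threshold are illustrative assumptions.

```python
# Illustrative list of words that signal an unverifiable goal.
VAGUE_TERMS = {"helpful", "good", "nice", "appropriate", "properly"}

def lint_goal(goal: str) -> list[str]:
    """Return warnings for goals likely to produce unreliable judgments."""
    warnings = []
    words = {w.strip('.,"').lower() for w in goal.split()}
    if words & VAGUE_TERMS:
        warnings.append("goal uses vague, unverifiable language")
    if len(goal.split()) < 5:
        warnings.append("goal may be too short to be specific")
    return warnings
```

Applied to the examples above, the clear refund goal passes cleanly while "The chatbot should be helpful." is flagged for vague language.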
Relationship to Penelope
Goal Achievement is evaluated by Penelope at the end of each multi-turn conversation. Penelope adapts its conversational strategy throughout the test based on the goal, then determines whether its interaction with your application achieved what was intended.
Best Practices
- Write goals that describe the desired outcome, not the expected process (e.g. "user gets refund" not "agent follows refund steps")
- Pilot new goal definitions in the Playground before adding them to a test set to validate they are evaluable
- Combine Goal Achievement with behavior-level metrics for comprehensive multi-turn evaluation coverage
- Keep goal statements concise and verifiable; ambiguous goals produce unreliable pass/fail determinations