
Goal Achievement

Testing Fundamentals

A metric used by Penelope to evaluate whether a multi-turn conversation accomplished its stated goal, producing a conversation-level pass/fail outcome.

Also known as: goal achievement metric, goal achievement judge

Overview

Goal Achievement is the primary evaluation signal for multi-turn tests. Unlike single-turn metrics that evaluate one response in isolation, Goal Achievement evaluates the entire conversation as a whole: did the agent (your application) ultimately help the user accomplish what they set out to do?

How It Works

Goal Achievement uses an LLM-as-a-judge approach. After Penelope completes the multi-turn conversation, the evaluator receives:

  • The stated goal from the test definition
  • The full conversation transcript
  • Any context or criteria specified in the test

The judge then assesses whether the goal was achieved and returns a pass/fail determination with an explanation.
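The exact judge prompt and model call are internal to Penelope, but the mechanics can be sketched roughly as follows. Everything here (the `judge_goal_achievement` function, the prompt wording, the `call_llm` callable) is hypothetical, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class GoalJudgment:
    passed: bool
    explanation: str

def judge_goal_achievement(goal, transcript, criteria, call_llm):
    """Hypothetical sketch: assemble the judge's inputs into a prompt,
    ask an LLM, and map its verdict to a pass/fail outcome."""
    prompt = (
        "You are evaluating a completed conversation.\n"
        f"Goal: {goal}\n"
        f"Additional criteria: {criteria or 'none'}\n"
        "Transcript:\n"
        + "\n".join(f"{turn['role']}: {turn['content']}" for turn in transcript)
        + "\nWas the goal achieved? Answer PASS or FAIL, then explain."
    )
    verdict = call_llm(prompt)
    passed = verdict.strip().upper().startswith("PASS")
    return GoalJudgment(passed=passed, explanation=verdict)
```

The key point the sketch captures is that the verdict is conversation-level: the judge sees the whole transcript at once, not any single response.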

Role in the Three-Tier Metrics Model

Goal Achievement is one metric among many in the three-tier model (behavior > test set > execution). While it is the default and most important metric for multi-turn tests, all other metrics defined at the behavior, test set, or execution level also contribute to the overall pass/fail determination.
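As a minimal sketch of how metrics from the three tiers roll up, assume the overall result passes only when every individual metric passes (a conjunctive aggregation; the metric names below are hypothetical):

```python
def overall_result(metric_results: dict[str, bool]) -> bool:
    """Combine metric outcomes from all tiers into one pass/fail.
    Assumption: a single failing metric fails the whole test."""
    return all(metric_results.values())

# Hypothetical metric outcomes for one multi-turn test run:
results = {
    "goal_achievement": True,        # conversation-level judge
    "behavior:tone": True,           # behavior-level metric
    "test_set:no_pii_leak": False,   # test-set-level metric
}
# The run fails overall because one metric failed.
```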

Defining a Clear Goal

The quality of Goal Achievement evaluation depends directly on how well the goal is specified in the test. A good goal is:

  • Specific: Describes exactly what the user needs to accomplish
  • Verifiable: An evaluator can determine from the transcript whether it was achieved
  • Outcome-focused: States the desired end state, not the process

Example of a clear goal: "The user needs to get a full refund for order #12345 without being asked to call customer service."

Example of an unclear goal: "The chatbot should be helpful."
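For illustration, a test carrying the clear goal above might be defined like this (the field names are hypothetical, not Penelope's actual schema, and the `is_verifiable` heuristic is a deliberately crude stand-in for human review):

```python
# Hypothetical test definition with a specific, verifiable,
# outcome-focused goal. Field names are illustrative only.
test_definition = {
    "name": "refund-without-phone-escalation",
    "goal": (
        "The user needs to get a full refund for order #12345 "
        "without being asked to call customer service."
    ),
    "context": "User purchased a defective item three days ago.",
    "metrics": ["goal_achievement"],
}

def is_verifiable(goal: str) -> bool:
    """Crude heuristic sketch: a verifiable goal names a concrete
    outcome rather than a vague quality like 'helpful'."""
    vague_terms = ("helpful", "good", "nice", "friendly")
    return not any(term in goal.lower() for term in vague_terms)
```

Under this heuristic the clear goal passes and "The chatbot should be helpful." does not, mirroring the two examples above.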

Relationship to Penelope

Goal Achievement is evaluated by Penelope at the end of each multi-turn conversation. Penelope adapts its conversational strategy throughout the test based on the goal, then determines whether its interaction with your application achieved what was intended.
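The drive-then-judge flow can be sketched as a simple loop. The `next_user_turn` and `judge` callables below stand in for Penelope's internal conversational strategy and evaluator; both names and the loop structure are assumptions for illustration:

```python
def run_multi_turn_test(goal, agent, next_user_turn, judge, max_turns=10):
    """Hypothetical sketch: drive a conversation toward the goal,
    then judge Goal Achievement once over the full transcript."""
    transcript = []
    for _ in range(max_turns):
        user_msg = next_user_turn(goal, transcript)
        if user_msg is None:  # strategy decides the conversation is done
            break
        transcript.append({"role": "user", "content": user_msg})
        transcript.append({"role": "assistant", "content": agent(user_msg)})
    # Goal Achievement is evaluated once, at the end, over everything.
    return judge(goal, transcript)
```

Note that the judgment happens after the loop, which is what makes Goal Achievement a conversation-level outcome rather than a per-turn one.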

Best Practices

  • Write goals that describe the desired outcome, not the expected process (e.g. "user gets refund" not "agent follows refund steps")
  • Pilot new goal definitions in the Playground before adding them to a test set to validate they are evaluable
  • Combine Goal Achievement with behavior-level metrics for comprehensive multi-turn evaluation coverage
  • Keep goal statements concise and verifiable—ambiguous goals produce unreliable pass/fail determinations
