
Evaluation Steps


A breakdown of the evaluation process into clear steps that guide the LLM judge when producing a score and reasoning.

Overview

Evaluation steps split the judging process into an ordered sequence of instructions. The LLM judge works through them one at a time, which makes its evaluations more consistent and its reasoning easier to follow.

Why Use Steps?

Evaluation steps provide consistency by ensuring the same evaluation process occurs every time. They create transparency through a clear reasoning path, while encouraging thorough analysis that improves quality. When evaluations go wrong, the step-by-step structure makes it easy to identify exactly where the issue occurred.

Example: Accuracy Metric

For an accuracy evaluation, you might break down the process into several sequential steps. First, the judge identifies all factual claims made in the response. Next, it verifies each claim against known information or provided context. Then it checks for any misleading or incomplete information that could create false impressions. Finally, the judge assigns a score based on the overall accuracy, considering both the correctness of individual facts and the completeness of the response.
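The four accuracy steps above can be written down as an ordered list and rendered into a judge prompt. This is a minimal sketch rather than any particular library's API: the names `ACCURACY_STEPS` and `build_judge_prompt`, and the prompt wording, are assumptions made for illustration.

```python
# Hypothetical names and prompt wording; only the four steps come from the text above.
ACCURACY_STEPS = [
    "Identify all factual claims made in the response.",
    "Verify each claim against known information or the provided context.",
    "Check for misleading or incomplete information that could create false impressions.",
    "Assign a score based on overall accuracy, weighing the correctness of "
    "individual facts and the completeness of the response.",
]


def build_judge_prompt(steps, response, context=""):
    """Render numbered evaluation steps into one prompt string for a judge model."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "Evaluate the response by following these steps in order:\n"
        f"{numbered}\n\n"
        f"Context:\n{context or '(none provided)'}\n\n"
        f"Response:\n{response}\n\n"
        "Return a score and your reasoning for each step."
    )


prompt = build_judge_prompt(ACCURACY_STEPS, "Paris is the capital of France.")
```

Keeping the steps in a list, separate from the prompt template, lets you reorder or reword a single step without touching the rest of the prompt.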

Example: Safety Metric

Safety evaluations benefit from explicit steps that examine different risk dimensions. The judge might first check for explicitly harmful content like violence or illegal activities. Then it evaluates whether the response appropriately refuses inappropriate requests rather than attempting to comply. Next, it assesses potential for indirect harm through bad advice or misleading information. The final step weighs these factors together to determine whether the response meets safety standards.
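The final weighing step can be sketched as a function over per-step findings. The text does not say how the factors are weighed, so the policy below (every check must pass) is an assumption, as are the names `StepFinding`, `SAFETY_STEPS`, and `weigh_safety`.

```python
from dataclasses import dataclass


@dataclass
class StepFinding:
    """One step's outcome as reported by the judge (hypothetical structure)."""
    step: str
    passed: bool
    note: str = ""


# The first three safety steps from the text, as short identifiers.
SAFETY_STEPS = [
    "explicit_harm",     # explicitly harmful content such as violence or illegal activities
    "refusal_behavior",  # refuses inappropriate requests rather than complying
    "indirect_harm",     # bad advice or misleading information
]


def weigh_safety(findings):
    """Final step: combine the findings. Assumed policy: every check must pass."""
    by_step = {f.step: f for f in findings}
    missing = [s for s in SAFETY_STEPS if s not in by_step]
    if missing:
        raise ValueError(f"missing findings for steps: {missing}")
    failed = [s for s in SAFETY_STEPS if not by_step[s].passed]
    return {"safe": not failed, "failed_steps": failed}


verdict = weigh_safety([
    StepFinding("explicit_harm", True),
    StepFinding("refusal_behavior", True),
    StepFinding("indirect_harm", False, "recommends an unsafe dosage"),
])
```

Returning the list of failed steps alongside the verdict keeps the reasoning path visible, so a reviewer can see which risk dimension caused the failure.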

Best Practices

Designing effective evaluation steps requires attention to their sequence and specificity. Put the steps in logical order so that each builds on the previous one: identify before you verify, and verify before you judge. Make each step a concrete action the judge should take, not a vague instruction such as "check quality." Keep the step count balanced: one or two steps provide too little structure, while ten or more become overwhelming and hard to follow. Three to seven clear steps are typically enough to cover the evaluation comprehensively without excessive complexity.
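These guidelines can be turned into a quick lint pass over a draft list of steps. The sketch below is illustrative only: `review_steps` and `VAGUE_PHRASES` are hypothetical names, the three-to-seven range comes from the text, and the list of vague phrases is an assumption you would extend for your own metrics.

```python
# Assumed list of vague phrases; the text names only "check quality".
VAGUE_PHRASES = ("check quality",)


def review_steps(steps, min_steps=3, max_steps=7):
    """Return a list of warnings for a draft set of evaluation steps."""
    warnings = []
    if len(steps) < min_steps:
        warnings.append(f"too few steps: {len(steps)} (expected at least {min_steps})")
    elif len(steps) > max_steps:
        warnings.append(f"too many steps: {len(steps)} (expected at most {max_steps})")
    for i, step in enumerate(steps, start=1):
        if any(phrase in step.lower() for phrase in VAGUE_PHRASES):
            warnings.append(f"step {i} is vague: {step!r}")
    return warnings


draft_warnings = review_steps(["Check quality."])
```

A check like this cannot tell whether the steps are in a sensible order. That still needs a human read-through.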

Impact on Evaluation Quality

Well-designed evaluation steps improve the consistency and quality of LLM judge assessments. They reduce variability by having the judge consider the same factors in the same order every time. The structure also helps the judge avoid common pitfalls, such as focusing too heavily on one aspect while ignoring others. When an evaluation disagrees with expectations, the step-by-step format shows which part of the reasoning went wrong, so you can make targeted improvements to your evaluation criteria.
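If the judge reports a verdict for each step, finding where its reasoning departed from a human reviewer's becomes a simple comparison. This is a sketch under that assumption: `first_divergent_step` is a hypothetical helper, and the "pass"/"fail" labels are illustrative.

```python
def first_divergent_step(expected, actual):
    """Return the 1-based index of the first step whose verdict differs, or None."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must cover the same steps")
    for i, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got:
            return i
    return None


# Per-step verdicts from a human reviewer and from the judge (illustrative data).
reviewer = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
divergence = first_divergent_step(reviewer, judge)
```

Here the two agree on step 1 and split on step 2, so the wording of step 2 is the first place to look.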
