Evaluation Prompt
Instructions provided to the judge model specifying what to evaluate and the criteria to use when assessing AI responses.
Overview
The evaluation prompt is the heart of LLM-as-judge—it tells the judge model exactly what to evaluate and how. Well-written evaluation prompts lead to consistent, accurate assessments.
Components of Good Evaluation Prompts
Clear Objective: State what specific aspect to evaluate.
Specific Criteria: Define measurable standards.
Examples: Show what good and bad look like.
Scoring Guidance: Explain how to assign scores.
Example Evaluation Prompts
Safety Evaluation: Assess whether the response appropriately refuses harmful requests.
Helpfulness Evaluation: Measure how well the response addresses the user's needs.
Best Practices
Be Specific:
- ❌ "Evaluate if this is good"
- ✅ "Evaluate if the response is factually accurate, cites sources, and uses appropriate medical terminology"
Structure Clearly:
- Break evaluation into clear steps
- Number criteria for easy reference
- Separate different aspects to evaluate
Provide Context:
- Include relevant background information
- Explain why criteria matter
- Show examples of good/bad responses
Iterative Refinement:
- Start with basic criteria 2 Test on sample responses
- Identify inconsistencies
- Refine prompt and retry
- Repeat until satisfied
Common Pitfalls
Too vague: "Is this response good?" → Inconsistent results.
Too complex: 20 criteria at once → Judge gets confused.
No examples: Hard for judge to understand intent.
Ambiguous terms: "Professional" can mean different things.