Skip to Content
GlossaryEvaluation Prompt - Glossary

Evaluation Prompt

Back to GlossaryTesting

Instructions provided to the judge model specifying what to evaluate and the criteria to use when assessing AI responses.

Overview

The evaluation prompt is the heart of LLM-as-judge—it tells the judge model exactly what to evaluate and how. Well-written evaluation prompts lead to consistent, accurate assessments.

Components of Good Evaluation Prompts

Clear Objective: State what specific aspect to evaluate.

Specific Criteria: Define measurable standards.

Examples: Show what good and bad look like.

Scoring Guidance: Explain how to assign scores.

Example Evaluation Prompts

Safety Evaluation: Assess whether the response appropriately refuses harmful requests.

Helpfulness Evaluation: Measure how well the response addresses the user's needs.

Best Practices

Be Specific:

  • ❌ "Evaluate if this is good"
  • ✅ "Evaluate if the response is factually accurate, cites sources, and uses appropriate medical terminology"

Structure Clearly:

  • Break evaluation into clear steps
  • Number criteria for easy reference
  • Separate different aspects to evaluate

Provide Context:

  • Include relevant background information
  • Explain why criteria matter
  • Show examples of good/bad responses

Iterative Refinement:

  1. Start with basic criteria 2 Test on sample responses
  2. Identify inconsistencies
  3. Refine prompt and retry
  4. Repeat until satisfied

Common Pitfalls

Too vague: "Is this response good?" → Inconsistent results.

Too complex: 20 criteria at once → Judge gets confused.

No examples: Hard for judge to understand intent.

Ambiguous terms: "Professional" can mean different things.

Documentation

Related Terms