False Positive
When a test incorrectly passes, or a metric marks an output as acceptable, even though it should have failed.
Overview
False positives and false negatives are evaluation errors in AI testing. A false positive occurs when your testing system marks something as passing or correct when it should fail, while a false negative marks something as failing when it should pass. Both types of errors can undermine the effectiveness of your testing strategy.
False Positives vs. False Negatives
A false positive means your test says "pass" but reality indicates it should fail. For example, a safety metric might incorrectly mark a harmful response as safe. The impact of false positives is serious: dangerous or poor-quality outputs reach production, you develop a false sense of security about your system's safety, users are exposed to harm or poor experiences, and you miss opportunities to fix real issues because they were never flagged.
A false negative means your test says "fail" but reality indicates it should pass. For example, a quality metric might incorrectly mark a good response as inadequate. The impact of false negatives includes blocking valid functionality, wasting time investigating non-issues, eroding the team's confidence in the tests, and slowing development velocity as you chase phantom problems.
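To make the distinction concrete, here is a minimal sketch that maps a test verdict and a human-supplied ground-truth label onto the four outcome types. The function name and labels are illustrative, not from any particular library.

```python
def classify_outcome(test_passed: bool, truly_acceptable: bool) -> str:
    """Label a single evaluation result as one of the four outcome types."""
    if test_passed and truly_acceptable:
        return "true positive"    # test passed, and it should have
    if not test_passed and not truly_acceptable:
        return "true negative"    # test failed, and it should have
    if test_passed and not truly_acceptable:
        return "false positive"   # test passed, but it should have failed
    return "false negative"       # test failed, but it should have passed

# A safety metric marking a harmful response as safe is a false positive:
print(classify_outcome(test_passed=True, truly_acceptable=False))
```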
Causes of False Positives/Negatives
Threshold issues are a common cause: when pass/fail thresholds are set too strictly or too loosely, good responses get rejected or bad ones get accepted. Vague evaluation criteria create problems when your evaluation prompts use unclear or ambiguous language, leaving the judge uncertain about what to look for. Judge model limitations mean that even well-designed evaluations can fail because the LLM judge itself makes errors, especially on subtle or complex cases.
Detecting False Positives/Negatives
Manual spot checks involve regularly reviewing a sample of test results to verify that passes are truly passes and failures are truly failures. This helps you catch systematic issues in your evaluation logic. A/B testing thresholds means trying different pass/fail threshold values and comparing the false positive and false negative rates to find optimal settings. Human validation brings in people to review borderline cases and build ground truth datasets that reveal where your automated metrics go wrong.
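As a rough sketch of A/B testing thresholds, the snippet below sweeps a few candidate thresholds over a small human-labeled sample and reports the resulting false positive and false negative rates. The sample data, variable names, and threshold values are hypothetical.

```python
# Each pair is (judge_score, human_says_pass) for one reviewed example.
scored_examples = [
    (9.1, True), (8.4, True), (7.2, False), (6.8, True),
    (6.1, False), (5.5, False), (8.9, True), (4.2, False),
]

for threshold in (6.0, 7.0, 8.0):
    # False positive: score clears the threshold but humans say it should fail.
    fp = sum(1 for score, ok in scored_examples if score >= threshold and not ok)
    # False negative: score misses the threshold but humans say it should pass.
    fn = sum(1 for score, ok in scored_examples if score < threshold and ok)
    bad = sum(1 for _, ok in scored_examples if not ok)   # truly unacceptable
    good = sum(1 for _, ok in scored_examples if ok)      # truly acceptable
    print(f"threshold {threshold}: "
          f"FP rate {fp / bad:.2f}, FN rate {fn / good:.2f}")
```

Raising the threshold trades false positives for false negatives, which is exactly the trade-off the comparison is meant to surface.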
Reducing False Positives/Negatives
Improving evaluation prompts often has the biggest impact. Make criteria specific and unambiguous, provide clear examples of what should pass and fail, and break complex evaluations into multiple steps that the judge can follow systematically. Using multiple metrics provides redundancy: requiring several independent metrics to agree before a test passes reduces false positives, while requiring only one metric to pass reduces false negatives. Tuning thresholds based on cost means setting higher thresholds when false positives are expensive (like safety issues) and lower thresholds when false negatives are expensive (like blocking legitimate features).
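A minimal sketch of the multiple-metrics idea, assuming each metric already returns a boolean verdict; the metric names are illustrative.

```python
metric_verdicts = {"relevance": True, "safety": True, "groundedness": False}

# Requiring every metric to agree lowers the false positive rate;
# accepting any single passing metric lowers the false negative rate.
strict_pass = all(metric_verdicts.values())   # fewer false positives
lenient_pass = any(metric_verdicts.values())  # fewer false negatives

print(strict_pass, lenient_pass)  # False True
```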
Best Practices
For safety-critical applications, favor false negatives over false positives: it is better to block good content than to let harmful content reach users. Set high thresholds between 8.0 and 9.0 out of 10 to catch potential issues aggressively. Use multiple judges and require several metrics to agree before a safety check passes. Implement human review for edge cases where automated judgments might miss subtle safety concerns.
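One possible shape for such a safety gate, assuming several judge scores on a 10-point scale; the threshold, the review margin, and the human-review routing are assumptions for illustration.

```python
SAFETY_THRESHOLD = 8.5
REVIEW_MARGIN = 0.5  # scores this close to the threshold go to a human

def safety_gate(judge_scores: dict[str, float]) -> str:
    """Return 'pass', 'fail', or 'human_review' for a candidate response."""
    # Any judge clearly below the threshold fails the response outright.
    if any(s < SAFETY_THRESHOLD - REVIEW_MARGIN for s in judge_scores.values()):
        return "fail"
    # Only pass automatically when every judge is clearly above the threshold.
    if all(s >= SAFETY_THRESHOLD + REVIEW_MARGIN for s in judge_scores.values()):
        return "pass"
    return "human_review"  # borderline: do not trust an automated verdict alone

print(safety_gate({"judge_a": 9.2, "judge_b": 8.6}))  # human_review (8.6 is borderline)
```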
For feature development contexts, you might favor false positives to enable rapid iteration. It's better to allow experimentation and catch issues through other means than to block progress unnecessarily. Use moderate thresholds between 6.0 and 7.0 out of 10 to avoid flagging every minor imperfection. Iterate quickly by adjusting based on results rather than trying to perfect thresholds upfront. Consider progressive tightening where you start lenient and increase rigor over time as your system matures.
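A small sketch of progressive tightening, assuming a hand-maintained schedule of thresholds keyed by weeks since launch; the schedule values are illustrative.

```python
TIGHTENING_SCHEDULE = [
    (0, 6.0),   # weeks 0+: lenient, favor iteration speed
    (4, 6.5),   # weeks 4+: moderate
    (8, 7.0),   # weeks 8+: approaching production rigor
]

def current_threshold(weeks_since_launch: int) -> float:
    """Pick the strictest threshold whose start week has been reached."""
    return max(t for week, t in TIGHTENING_SCHEDULE if weeks_since_launch >= week)

print(current_threshold(5))  # 6.5
```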
Some practices apply regardless of context. Conduct regular validation by periodically checking results with humans to verify your automated metrics remain accurate. Write clear criteria using specific, unambiguous evaluation prompts that leave little room for interpretation. Document your rationale for why specific thresholds were chosen, making it easier to adjust them appropriately later. Track false positive and false negative rates over time to identify trends and degradation. Adjust as needed by updating thresholds based on findings from validation and production monitoring.
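One way to track these rates over time is sketched below, comparing each validation run against a baseline; the data structure, baseline values, and tolerance are assumptions rather than the output of any specific tool.

```python
from datetime import date

# One record per human-validation run, with the measured error rates.
history = [
    {"run_date": date(2024, 5, 1), "fp_rate": 0.04, "fn_rate": 0.08},
    {"run_date": date(2024, 6, 1), "fp_rate": 0.05, "fn_rate": 0.09},
    {"run_date": date(2024, 7, 1), "fp_rate": 0.11, "fn_rate": 0.10},
]

BASELINE_FP, BASELINE_FN = 0.05, 0.10
TOLERANCE = 0.03  # allowed drift before flagging

for run in history:
    drifted = (run["fp_rate"] > BASELINE_FP + TOLERANCE
               or run["fn_rate"] > BASELINE_FN + TOLERANCE)
    if drifted:
        print(f"{run['run_date']}: FP/FN rates drifted, revisit thresholds")
```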