Pass/Fail Threshold
The minimum score required for a test to be considered passing, defined in the metric configuration.
Overview
Thresholds determine the line between passing and failing tests. Setting appropriate thresholds is crucial for catching real issues without creating false alarms.
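For orientation, here is a minimal sketch of how a threshold turns a metric score into a pass/fail outcome. The field names and structure below are illustrative assumptions, not the actual Rhesis metric configuration schema:

```python
# Illustrative only: field names and structure are assumptions,
# not the actual Rhesis metric configuration schema.
metric_config = {
    "name": "answer_relevance",   # hypothetical metric name
    "scale": (0.0, 10.0),         # scores range from 0 to 10
    "threshold": 7.0,             # minimum score required to pass
}

def is_passing(score: float, config: dict) -> bool:
    """A test passes when its score meets or exceeds the threshold."""
    return score >= config["threshold"]

print(is_passing(7.4, metric_config))  # True
print(is_passing(6.2, metric_config))  # False
```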
Setting Thresholds
When setting thresholds, consider criticality first. Safety-critical features like harmful content detection or PII handling should have high thresholds between 8.0 and 9.0 out of 10, ensuring you catch almost all potential issues even if it means some false alarms. Core functionality that users rely on should use moderate-to-high thresholds around 7.0 to 8.0, balancing quality with development velocity. Experimental or non-critical features can use lower thresholds around 5.0 to 6.0, allowing for faster iteration while still catching major issues.
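As a sketch of the criticality tiers above, you might keep per-metric thresholds in one explicit place. The metric names and exact cutoffs here are illustrative assumptions; choose values that match your own risk tolerance:

```python
# Hypothetical per-metric thresholds grouped by criticality tier.
# Metric names are placeholders, not metrics defined by Rhesis.
THRESHOLDS = {
    # Safety-critical: catch almost everything, accept some false alarms.
    "harmful_content_detection": 8.5,
    "pii_handling": 8.5,
    # Core functionality: balance quality with development velocity.
    "answer_relevance": 7.5,
    "faithfulness": 7.0,
    # Experimental features: room to iterate while still catching major issues.
    "tone_alignment": 5.5,
}
```

Keeping the tiers explicit in one place also makes later adjustments, and the rationale behind them, easier to review.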
Baseline performance provides another approach to threshold setting. Analyze your current test results to understand typical performance levels, then set thresholds just below that baseline so normal variance still passes while genuine regressions fail. Start with moderate thresholds and adjust based on the performance patterns you observe; if you want the threshold to drive improvement rather than guard against regression, place it slightly above the baseline instead. For example, if your system consistently scores around 7.5 on a metric, setting the threshold at 7.0 creates reasonable headroom while still catching degradation.
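A minimal sketch of deriving a threshold from a baseline, assuming you can export historical scores for a metric; the scores and the 0.5 margin below are invented for illustration:

```python
import statistics

# Hypothetical historical scores for one metric, e.g. pulled from past test runs.
historical_scores = [7.6, 7.3, 7.8, 7.4, 7.5, 7.1, 7.7, 7.5]

baseline = statistics.mean(historical_scores)   # ~7.49 in this example
threshold = round(baseline - 0.5, 1)            # leave headroom below the baseline (margin is arbitrary)

print(f"baseline={baseline:.2f}, threshold={threshold}")
# baseline=7.49, threshold=7.0 -> normal variance passes, clear degradation fails
```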
Threshold Strategies
Strict thresholds set high bars that catch more potential issues but come with tradeoffs. They generate more false positives, flagging responses that might actually be acceptable. This works well for critical features where missing an issue is costly, but may block valid changes and slow development velocity. Use strict thresholds when the cost of letting problems through significantly exceeds the cost of investigating false alarms.
Lenient thresholds create fewer false alarms and allow faster iteration by not flagging minor imperfections. However, they may miss some genuine issues that fall between the threshold and perfect quality. Lenient thresholds suit experimental features where you're still exploring approaches and need room to iterate quickly. They also work when manual review processes catch issues that automated testing misses, providing a safety net beyond the threshold.
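To make the cost tradeoff between the two strategies concrete, a rough sketch follows. The scores, human acceptability labels, and cost figures are invented for illustration; the idea is simply to weigh false alarms against missed issues at each candidate threshold:

```python
# Hypothetical scored responses with a human judgment of whether each was actually acceptable.
results = [
    {"score": 8.9, "acceptable": True},
    {"score": 7.2, "acceptable": True},
    {"score": 6.8, "acceptable": True},   # becomes a false alarm under a strict threshold
    {"score": 6.1, "acceptable": False},
    {"score": 5.4, "acceptable": False},
]

COST_FALSE_ALARM = 1.0    # time spent investigating an acceptable response that failed
COST_MISSED_ISSUE = 10.0  # cost of an unacceptable response that passed

def expected_cost(threshold: float) -> float:
    false_alarms = sum(1 for r in results if r["score"] < threshold and r["acceptable"])
    missed_issues = sum(1 for r in results if r["score"] >= threshold and not r["acceptable"])
    return false_alarms * COST_FALSE_ALARM + missed_issues * COST_MISSED_ISSUE

for t in (6.0, 7.0, 8.0):
    print(f"threshold={t}: expected cost={expected_cost(t)}")
# threshold=6.0: expected cost=10.0   (one missed issue)
# threshold=7.0: expected cost=1.0    (one false alarm)
# threshold=8.0: expected cost=2.0    (two false alarms)
```

When missed issues are far more expensive than investigations, the stricter thresholds win; when they are not, leniency pays off.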
Monitoring Thresholds
Monitor your metrics' performance in the Rhesis platform to understand if thresholds are set appropriately. Track pass rates over time to see if they're stable, improving, or degrading. A pass rate consistently near 100% might indicate thresholds are too lenient, while rates consistently below 50% suggest thresholds are unrealistically strict. Identify patterns in failures by examining what types of inputs trigger threshold violations. Compare performance across different test sets to ensure thresholds work well for various scenarios, not just your main test suite. Adjust thresholds based on these findings, treating them as dynamic parameters that evolve with your system.
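If you export test results for offline analysis, a short sketch like the following can surface pass-rate trends over time and differences across test sets. The record structure is an assumption, not a Rhesis export format:

```python
from collections import defaultdict

# Hypothetical exported test results: each record has a run date, test set, and pass flag.
results = [
    {"run": "2024-05-01", "test_set": "core", "passed": True},
    {"run": "2024-05-01", "test_set": "core", "passed": False},
    {"run": "2024-05-01", "test_set": "edge_cases", "passed": False},
    {"run": "2024-05-08", "test_set": "core", "passed": True},
    {"run": "2024-05-08", "test_set": "edge_cases", "passed": True},
]

def pass_rates(records, key):
    """Group records by the given field and compute the pass rate for each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["passed"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

print(pass_rates(results, "run"))       # pass rate per run date -> trend over time
print(pass_rates(results, "test_set"))  # pass rate per test set -> consistency across scenarios
```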
Adjusting Thresholds
Raise thresholds when your AI quality has demonstrably improved and can meet higher standards. If too many genuine issues are slipping through to production, your thresholds may be too lenient and need tightening. As features mature from experimental to core functionality, gradually raise thresholds to match their increased criticality.
Lower thresholds when experiencing too many false alarms that waste time investigating non-issues. If valid functionality is being blocked by overly aggressive thresholds, reducing them enables development progress while still catching major problems. Sometimes initial thresholds prove too aggressive once you see real-world performance data, requiring adjustment downward to realistic levels.
Best Practices
Start with moderate thresholds rather than trying to be perfect immediately—you can always adjust later based on actual results. Review thresholds regularly since they should evolve as your AI system improves and requirements change. Use different thresholds per metric rather than one-size-fits-all, as different aspects of quality have different acceptable levels. Document your rationale for why each threshold was chosen, making future adjustments more informed. A/B test threshold changes by comparing results at different levels before committing to changes, ensuring adjustments actually improve your testing effectiveness.
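For the last point, a simple way to compare threshold levels before committing is to re-evaluate a recent run's raw scores offline. The scores and candidate values below are invented for illustration:

```python
# Hypothetical raw metric scores from a recent run; re-evaluate them offline
# against two candidate thresholds before changing the configuration.
scores = [8.1, 7.6, 7.2, 6.9, 6.4, 8.8, 7.0, 5.9, 7.4, 8.2]

def pass_rate(values, threshold):
    return sum(s >= threshold for s in values) / len(values)

current, candidate = 7.0, 7.5
print(f"pass rate at {current}: {pass_rate(scores, current):.0%}")    # 70%
print(f"pass rate at {candidate}: {pass_rate(scores, candidate):.0%}")  # 40%
```

Looking at which specific responses flip from pass to fail at the stricter level tells you whether the change would flag real problems or just add noise.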