Precision and Recall
Complementary evaluation metrics: precision shows how many positive predictions were correct, while recall shows how many actual positives were found.
Overview
Precision and recall are fundamental evaluation metrics that capture different aspects of a system's correctness. Although they originate in classification tasks, the concepts apply equally to evaluating LLM-based systems and to testing the evaluation metrics themselves.
Definitions
Precision answers the question: "Of the items we identified as positive, how many were actually positive?" In LLM testing, this means: of the responses your metric marked as passing, what percentage actually should pass? High precision means when your metric says something passes, it's usually right. Low precision means many things marked as passing shouldn't have been.
Recall answers: "Of all the actual positive items, how many did we identify?" For LLM testing: of all the responses that should pass, what percentage did your metric actually mark as passing? High recall means your metric catches most things that should pass. Low recall means many valid responses are incorrectly rejected.
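To make the definitions concrete, here is a minimal Python sketch, not tied to any particular testing framework, that computes both metrics from paired verdicts, treating "passing" as the positive class; the function name and boolean label format are illustrative.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compute precision and recall, treating True ("passing") as the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)        # correctly marked passing
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)    # marked passing, should fail
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)    # marked failing, should pass

    # Precision: of everything marked positive, how much was actually positive?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: metric verdicts vs. ground truth for five responses.
print(precision_recall([True, True, False, True, False], [True, False, False, True, True]))
# -> (0.666..., 0.666...)
```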
Understanding the Trade-Off
Conservative systems with high precision and low recall mark items as positive only when very confident. They produce few false alarms, since most flagged items are genuine issues, but some real issues slip through unmarked. This approach suits workflows where false positives are costly and reviewer time is scarce: you'd rather investigate a few real problems than wade through many false alarms.
Aggressive systems with low precision and high recall flag many items as positive, catching most issues and rarely missing real problems. However, they generate many false alarms by flagging things that aren't actually issues. This works for screening systems where missing issues is very costly and you have capacity to investigate false alarms.
Moderate systems balance both metrics, accepting some false alarms and some missed issues as a reasonable trade-off. This describes most production systems where perfect detection isn't feasible and you're optimizing for practical effectiveness.
Applying to LLM Testing
When evaluating your metrics, precision and recall reveal how well they perform. Calculate them by comparing your metric's judgments against ground-truth data. If your safety metric marks 100 responses as passing but 20 of those should have failed, your precision is 80/100 = 80%. If 120 responses actually should pass and your metric correctly marked only 80 of them as passing, your recall is 80/120, about 67%.
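As a quick check on the arithmetic above, the counts can be plugged in directly; the numbers are the hypothetical ones from this example, not real measurements.

```python
# Hypothetical counts from the example above.
marked_passing = 100      # responses the metric marked as passing
false_passes = 20         # of those, responses that actually should have failed
should_pass_total = 120   # responses that genuinely should pass

true_passes = marked_passing - false_passes    # 80 correct "pass" verdicts
precision = true_passes / marked_passing       # 80 / 100 = 0.80
recall = true_passes / should_pass_total       # 80 / 120 ≈ 0.67

print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=80%, recall=67%
```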
Threshold adjustments directly impact the precision-recall trade-off. Raising thresholds increases precision but decreases recall—you'll have fewer false positives but more false negatives. Lowering thresholds does the opposite, catching more valid cases but also accepting more invalid ones. Understanding this relationship helps you tune thresholds appropriately for your use case.
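The sketch below illustrates this effect by sweeping a pass threshold over scored responses; the scores, labels, and threshold values are made up for illustration.

```python
def metrics_at_threshold(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Treat a response as 'passing' when its score meets the threshold, then measure the verdicts."""
    tp = sum(1 for score, should_pass in scored if score >= threshold and should_pass)
    fp = sum(1 for score, should_pass in scored if score >= threshold and not should_pass)
    fn = sum(1 for score, should_pass in scored if score < threshold and should_pass)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up (score, should_pass) pairs: higher scores lean toward genuinely passing responses.
scored = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
          (0.70, True), (0.65, False), (0.55, False), (0.40, True)]

for threshold in (0.5, 0.75, 0.9):
    p, r = metrics_at_threshold(scored, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# As the threshold rises, precision climbs (0.57 -> 0.75 -> 1.00) while recall falls (0.80 -> 0.60 -> 0.40).
```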
Examples
Consider a safety metric evaluating whether responses are harmful. You test it on 200 responses: 150 are actually safe, and 50 are actually harmful. Your metric flags 60 responses as harmful. Of those 60, only 40 are actually harmful (the other 20 are false positives). This gives you a precision of 40/60 = 67%—when your metric says something is harmful, it's right two-thirds of the time. For recall, of the 50 actually harmful responses, you caught 40, giving recall of 40/50 = 80%. You're catching most harmful content but generating false alarms that waste review time.
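Tallying the same confusion-matrix counts in code, with "harmful" as the positive class and the hypothetical numbers from this example:

```python
# Hypothetical confusion-matrix counts from the 200-response example.
flagged_harmful = 60        # responses the metric flagged as harmful
true_positives = 40         # flagged responses that really are harmful
actually_harmful = 50       # all genuinely harmful responses in the test set

precision = true_positives / flagged_harmful      # 40 / 60 ≈ 0.67
recall = true_positives / actually_harmful        # 40 / 50 = 0.80

print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=67%, recall=80%
```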
Optimizing for Your Use Case
Optimize for high precision when manual review is expensive, with each flagged item requiring significant human attention. This matters when false alarms hurt trust, causing users to lose confidence in your system. High volume amplifies this—if you can't manually review everything flagged, you need high precision so flagged items are worth the attention.
Optimize for high recall when missing issues is costly and you can't afford to let problems through. This applies when manual review is feasible and you can handle false alarms without overwhelming your team. Safety-critical applications must catch all potential issues, making recall the priority even if it means investigating many false positives.
Best Practices
For measurement, use validation sets to test metrics on data with known ground truth rather than guessing at accuracy. Track both precision and recall rather than focusing on just one metric, as either alone gives an incomplete picture. Plot precision-recall curves to visualize the trade-off across different thresholds, helping identify optimal settings. Consider the cost of different error types, weighting precision versus recall based on business impact rather than treating them equally.
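For the precision-recall curve specifically, one common option is scikit-learn's `precision_recall_curve`, assuming your metric emits a continuous score and you have ground-truth labels; the arrays below are placeholders, not real evaluation data.

```python
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Placeholder data: ground-truth labels (1 = should pass) and the metric's raw scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_score = [0.92, 0.85, 0.78, 0.70, 0.64, 0.61, 0.45, 0.38, 0.33, 0.20]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall trade-off across thresholds")
plt.show()
```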
For improvement, adjust thresholds to tune the precision-recall balance for your specific needs. Improve evaluation prompts since better, clearer prompts can improve both metrics simultaneously. Use better judge models as more capable LLMs may understand nuances that improve both precision and recall. Refine evaluation criteria to be more specific and unambiguous, leading to better overall performance.
For reporting and communication, show both metrics rather than cherry-picking whichever looks better. Provide context explaining what the numbers mean in your specific application. Include F1 score for a balanced view of performance that accounts for both metrics. Track changes over time to monitor whether improvements actually help and to catch degradation early.
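F1 is the harmonic mean of precision and recall; a minimal helper is sketched below, applied to the precision and recall from the harmful-content example above.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision ≈ 0.67 and recall = 0.80 from the harmful-content example.
print(round(f1(0.67, 0.80), 2))  # 0.73
```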