F1 Score
The harmonic mean of precision and recall, providing a single balanced metric that considers both false positives and false negatives.
Overview
The F1 score combines precision and recall into a single metric, useful when you need to balance both concerns and want one number to track. It's particularly valuable when you have an uneven class distribution or when both types of errors matter equally.
Formula
The F1 score is calculated as: F1 = 2 × (Precision × Recall) / (Precision + Recall). This can also be expressed in terms of true positives, false positives, and false negatives: F1 = 2TP / (2TP + FP + FN), where TP represents True Positives, FP represents False Positives, and FN represents False Negatives.
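As a quick sanity check, here is a minimal Python sketch (the counts are invented purely for illustration) showing that the precision/recall form and the count form of the formula agree:

```python
# Minimal sketch: compute F1 from precision/recall and from raw counts.
# The counts below are made up for illustration only.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)          # 0.8
recall = tp / (tp + fn)             # ~0.667
f1_from_pr = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1_from_pr, 4), round(f1_from_counts, 4))  # both ~0.7273
```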
Why Harmonic Mean?
The F1 score uses the harmonic mean rather than the arithmetic mean because it penalizes extreme values. If either precision or recall is very low, the F1 score will also be low, even if the other metric is high. This ensures you can't game the metric by optimizing only one component while ignoring the other. For example, a system with 100% precision but 10% recall would have an arithmetic mean of 55% but an F1 score of only 18%, better reflecting its poor overall performance.
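A short sketch of that comparison, using the 100% precision / 10% recall example above:

```python
# Compare the arithmetic mean with the harmonic mean (F1) for a system
# with perfect precision but very low recall.
precision, recall = 1.0, 0.10

arithmetic_mean = (precision + recall) / 2             # 0.55
f1 = 2 * precision * recall / (precision + recall)     # ~0.18

print(f"arithmetic mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")
```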
Calculating F1 Score
To calculate F1 score for your LLM evaluations, first determine precision and recall from your test results. Count your true positives (real issues your metric correctly flagged), false positives (false alarms, where your metric flagged an issue that wasn't there), and false negatives (real issues your metric missed). Calculate precision as TP/(TP+FP) and recall as TP/(TP+FN). Then apply the F1 formula to get a balanced score between 0 and 1, where 1 is perfect and 0 is complete failure.
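A minimal sketch of this calculation, assuming your labeled results are available as (metric flagged an issue, issue actually present) pairs; the example data is invented:

```python
# Sketch of the calculation described above. Each pair is
# (metric_flagged_issue, issue_actually_present); in practice these come
# from your labeled validation set.
results = [
    (True, True), (True, False), (False, True),
    (True, True), (False, False), (False, True),
]

tp = sum(pred and actual for pred, actual in results)
fp = sum(pred and not actual for pred, actual in results)
fn = sum(not pred and actual for pred, actual in results)

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```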
Using F1 for Metric Evaluation
When evaluating your LLM judge metrics, F1 score helps you understand how well they balance catching problems (recall) with avoiding false alarms (precision). Track F1 scores across different test runs to see if metric refinements are improving overall accuracy. Compare F1 scores between different evaluation approaches or configurations to identify which performs best overall.
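One way to run such a comparison is sketched below, assuming scikit-learn is available and that you have ground-truth labels plus judge outputs for each configuration; the configuration names and labels are invented for illustration:

```python
# Compare precision, recall, and F1 across two judge configurations
# evaluated on the same labeled validation set.
from sklearn.metrics import f1_score, precision_score, recall_score

ground_truth = [1, 0, 1, 1, 0, 1, 0, 0]          # 1 = example truly has an issue
judge_outputs = {
    "baseline_prompt": [1, 0, 0, 1, 0, 1, 1, 0],
    "refined_prompt":  [1, 0, 1, 1, 0, 1, 0, 0],
}

for name, preds in judge_outputs.items():
    print(name,
          f"P={precision_score(ground_truth, preds):.2f}",
          f"R={recall_score(ground_truth, preds):.2f}",
          f"F1={f1_score(ground_truth, preds):.2f}")
```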
F1 Score Patterns
High F1 scores above 0.85 indicate that both precision and recall are performing well, meaning your metric catches most issues while avoiding false alarms. Moderate F1 scores between 0.60 and 0.75 suggest that one or both metrics need improvement—examine whether you're missing too many problems or flagging too many false positives. Low F1 scores below 0.50 reveal significant issues with precision, recall, or both, indicating your evaluation criteria need substantial refinement.
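If it helps to make these bands explicit, here is an illustrative helper that maps a score to the rough interpretations above; the cutoffs mirror this section's guidance and are not a universal standard:

```python
# Illustrative helper mapping an F1 score to the rough bands described above.
def interpret_f1(f1: float) -> str:
    if f1 > 0.85:
        return "high: both precision and recall are performing well"
    if 0.60 <= f1 <= 0.75:
        return "moderate: one or both components need improvement"
    if f1 < 0.50:
        return "low: evaluation criteria likely need substantial refinement"
    return "borderline: between the bands above; inspect precision and recall directly"

print(interpret_f1(0.72))  # moderate: one or both components need improvement
```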
When to Use F1 Score
F1 score is most appropriate when both false positives and false negatives matter equally to your use case. It's ideal when you need a single number to track overall metric quality, making it easy to compare different configurations or approaches. F1 is particularly useful with unbalanced data where positive and negative classes have very different frequencies, because it is computed only from true positives, false positives, and false negatives, so it is not inflated by a large number of easy true negatives the way accuracy can be.
However, avoid relying solely on F1 when false positives and false negatives have different costs in your application. If missing a critical safety issue is far worse than a false alarm, focus more heavily on recall rather than the balanced F1. Similarly, if you primarily care about one metric, optimize for that directly rather than the composite score.
Improving F1 Score
To improve F1 score, first identify which component is the bottleneck. If precision is low, you're generating too many false positives—tighten your evaluation criteria or raise thresholds. If recall is low, you're missing too many problems—broaden your evaluation criteria or lower thresholds. Often you'll need to balance these competing concerns by adjusting threshold values to find the optimal F1 point. Consider using F-beta score variants if improving one metric is more important than the other.
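A sketch of such a threshold sweep, using invented confidence scores and labels; candidate thresholds are taken from the observed scores to avoid floating-point edge cases:

```python
# Sweep the decision threshold applied to a judge's confidence scores and
# pick the threshold that maximizes F1. Scores and labels are invented.
def f1_at_threshold(scores, labels, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [0.9, 0.8, 0.65, 0.6, 0.4, 0.3, 0.2]            # judge confidence per example
labels = [True, True, False, True, False, True, False]   # ground truth: issue present?

# Use the observed scores themselves as candidate thresholds.
best_f1, best_threshold = max(
    (f1_at_threshold(scores, labels, t), t) for t in sorted(set(scores))
)
print(f"best F1={best_f1:.2f} at threshold={best_threshold}")  # best F1=0.80 at threshold=0.3
```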
F1 Variants
The F-beta score generalizes F1 by allowing you to weight precision and recall differently: Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall). With beta less than 1, you favor precision over recall; with beta greater than 1, you favor recall over precision. The F2 score, for example, weights recall twice as heavily as precision, which is useful when missing problems is more costly than false alarms. The F0.5 score weights precision more heavily, appropriate when false alarms are particularly expensive.
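A small sketch of the general formula, with illustrative precision and recall values:

```python
# General F-beta: beta = 1 recovers F1, beta > 1 weights recall more
# heavily, beta < 1 weights precision more heavily.
def f_beta(precision: float, recall: float, beta: float) -> float:
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.6  # illustrative values: strong precision, weaker recall
print(f"F1={f_beta(p, r, 1):.2f}  F2={f_beta(p, r, 2):.2f}  F0.5={f_beta(p, r, 0.5):.2f}")
# F2 is pulled down by the weak recall; F0.5 is pulled up by the strong precision.
```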
Best Practices
Always calculate F1 score on a validation set with known ground truth rather than your training data. Report all three components—F1, precision, and recall—to provide full context about performance. Track these metrics over time to monitor whether changes improve overall quality. Compare F1 scores across different configurations and thresholds to find optimal settings. When reporting F1 to stakeholders, explain what it means in your specific use case context, show the precision and recall breakdown separately, compare current scores to baseline performance, and set clear target thresholds for acceptable quality.
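A minimal reporting sketch along these lines, with invented run names and numbers:

```python
# Surface precision and recall alongside F1 and compare the current run
# against a baseline. Run names and metric values are invented.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

runs = {
    "baseline": {"precision": 0.78, "recall": 0.64},
    "current":  {"precision": 0.82, "recall": 0.74},
}

for name, m in runs.items():
    print(f"{name:>8}: P={m['precision']:.2f} R={m['recall']:.2f} "
          f"F1={f1(m['precision'], m['recall']):.2f}")
```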