
Confusion Matrix


A table showing the performance of a classification system, displaying true positives, false positives, true negatives, and false negatives.

Also known as: error matrix, contingency table

Overview

A confusion matrix provides a complete picture of a classification system's performance, showing not just accuracy but the types of errors being made. In LLM testing, confusion matrices help evaluate metrics, understand failure patterns, and tune thresholds.

Structure

A confusion matrix contains four key values that describe all possible outcomes:

True Positives (TP): cases correctly identified as passing. These represent accurate positive predictions.
False Positives (FP): cases incorrectly identified as passing when they should fail. These are type I errors, also called false alarms.
False Negatives (FN): cases incorrectly identified as failing when they should pass. These are type II errors, also called missed detections.
True Negatives (TN): cases correctly identified as failing. These represent accurate negative predictions.

Creating Confusion Matrices

To build a confusion matrix, you need ground truth labels and your system's predictions for the same test set. Compare each prediction against its true label and count how many fall into each of the four categories. The matrix visualization arranges these counts in a grid that makes patterns immediately visible.
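The counting step described above can be sketched in a few lines of plain Python. The `"pass"`/`"fail"` labels and the example data here are illustrative, not taken from any particular test framework:

```python
def confusion_counts(y_true, y_pred, positive="pass"):
    """Count TP, FP, FN, TN by comparing each prediction to its true label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly identified as passing
        elif pred == positive and truth != positive:
            fp += 1  # flagged as passing, should have failed
        elif pred != positive and truth == positive:
            fn += 1  # flagged as failing, should have passed
        else:
            tn += 1  # correctly identified as failing
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Six test cases: ground truth vs. the metric's predictions
truth = ["pass", "pass", "fail", "fail", "pass", "fail"]
preds = ["pass", "fail", "pass", "fail", "pass", "fail"]
print(confusion_counts(truth, preds))
# {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
```

The four counts always sum to the size of the test set, which is a useful sanity check when assembling the grid.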

Analyzing Confusion Matrices

A well-calibrated metric shows high numbers along the diagonal (true positives and true negatives) with low off-diagonal numbers (false positives and false negatives). This indicates the system correctly classifies most cases.

If you see many false positives with few false negatives, your metric is too lenient—accepting cases it should reject. This means you're letting poor quality slip through. Conversely, many false negatives with few false positives indicates a metric that's too strict, rejecting valid cases and potentially blocking good functionality.
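One way to quantify "too lenient" versus "too strict" is to derive precision and recall from the four counts. A minimal sketch, with made-up counts chosen to illustrate each failure mode:

```python
def precision_recall(tp, fp, fn):
    """Precision falls when false positives pile up (too lenient);
    recall falls when false negatives pile up (too strict)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A lenient metric: many false positives, few false negatives
print(precision_recall(tp=80, fp=40, fn=5))   # low precision, high recall

# A strict metric: few false positives, many false negatives
print(precision_recall(tp=50, fp=3, fn=35))   # high precision, low recall
```

Comparing these two derived numbers side by side makes the lenient/strict diagnosis explicit rather than eyeballed.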

Multi-Class Confusion Matrices

For categorical metrics with more than two classes, confusion matrices expand into larger grids. Each row represents actual classes while columns represent predicted classes. The diagonal still shows correct classifications, while off-diagonal cells reveal which specific classes get confused with each other. This helps identify systematic misclassification patterns.
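The larger grid can be built the same way as the binary case, by counting (actual, predicted) pairs. A sketch using a nested list, where rows are actual classes and columns are predicted classes; the three class names are invented for illustration:

```python
from collections import Counter

def multiclass_matrix(y_true, y_pred, classes):
    """Rows = actual class, columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, pred)] for pred in classes] for actual in classes]

classes = ["helpful", "harmful", "refusal"]  # illustrative label set
truth = ["helpful", "helpful", "harmful", "refusal", "harmful"]
preds = ["helpful", "refusal", "harmful", "refusal", "helpful"]

for actual, row in zip(classes, multiclass_matrix(truth, preds, classes)):
    print(f"{actual:>8}: {row}")
```

Scanning any off-diagonal cell tells you exactly which pair of classes is being confused, which is the systematic pattern the prose above refers to.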

Using Confusion Matrices for Improvement

Confusion matrices reveal specific issues to address. If certain types of errors dominate, you can focus improvements on those cases. Are false positives concentrated in specific categories? Target those with more careful criteria. Are false negatives occurring in edge cases? Add more nuanced evaluation steps.

Threshold optimization becomes data-driven when you can see how different thresholds affect the error distribution. Raising thresholds reduces false positives but increases false negatives. Lowering thresholds does the opposite. The confusion matrix shows you exactly what trade-off you're making.
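The trade-off can be made concrete by sweeping candidate thresholds over scored predictions and recounting errors at each setting. The scores and labels below are fabricated for illustration:

```python
# Each test case: (metric score, ground-truth label)
scored = [(0.95, "pass"), (0.80, "pass"), (0.70, "fail"),
          (0.60, "pass"), (0.40, "fail"), (0.20, "fail")]

def errors_at(threshold):
    """Count false positives and false negatives for a given cutoff."""
    fp = sum(1 for s, label in scored if s >= threshold and label == "fail")
    fn = sum(1 for s, label in scored if s < threshold and label == "pass")
    return fp, fn

for t in (0.3, 0.5, 0.75, 0.9):
    fp, fn = errors_at(t)
    print(f"threshold={t}: FP={fp}, FN={fn}")
```

Running the sweep shows the expected pattern: raising the threshold drives false positives down and false negatives up, and the printout tells you the exact counts at each setting.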

Best Practices

Building reliable confusion matrices requires validation data with ground truth labels. You need sufficient sample size—at least 100 examples, preferably more—to get stable results. If your classes are imbalanced (many more positives than negatives, for example), account for this in your analysis.

When analyzing results, examine all four values rather than just overall accuracy. Identify patterns in which errors occur most frequently. Consider whether some errors are more costly than others in your application. Compare confusion matrices at different threshold settings to find optimal trade-offs.

For improvement, address the most problematic error type first. Tune thresholds based on what the matrix reveals about error patterns. Refine evaluation criteria to reduce systematic errors. Monitor how the confusion matrix changes over time as you make improvements, ensuring changes actually help rather than just shifting which types of errors occur.
