
A/B Testing


Comparing two versions of a system by running identical tests against both to determine which performs better.

Also known as: split testing, comparison testing

Overview

A/B testing in LLM evaluation involves running the same tests against two different configurations to compare performance. This could be different models, prompts, endpoints, or evaluation approaches.

What to A/B Test

Model comparison tests different LLM versions or providers to see which performs better for your use case. You might compare GPT-4 versus Claude, or different parameter sizes of the same model family.

Prompt variations test different instruction phrasings, system messages, or few-shot examples. Small prompt changes can significantly impact output quality, making A/B testing valuable for optimization.

Configuration changes evaluate different temperature settings, max token limits, or other parameters. Testing helps you find the sweet spot between creativity and consistency for your specific application.

Statistical Significance

Not every performance difference is meaningful. Statistical testing helps determine whether observed differences are real effects or just random variation. Run enough tests (typically 100+) to achieve statistical power. Consider using techniques like t-tests or bootstrap sampling to assess significance.
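One of the resampling approaches mentioned above, a permutation test, can be sketched with only the standard library. This is an illustrative implementation, not a prescribed one; the function name and defaults are hypothetical, and each input is assumed to be a list of per-test scalar scores for one configuration.

```python
import random
import statistics

def permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test: how often does a random relabeling of
    the pooled scores produce a mean difference at least as large as the
    one actually observed between configurations A and B?"""
    rng = random.Random(seed)
    observed = statistics.mean(scores_a) - statistics.mean(scores_b)
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_resamples
```

A small p-value (conventionally below 0.05) suggests the observed difference is unlikely to be random variation; identical score lists yield a p-value of 1.0.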

A/B Testing Methodology

Fair comparison requires running both versions under identical conditions at the same time with the same load. Use the exact same test set for both configurations—any differences in inputs invalidate the comparison. Random ordering prevents bias from always running version A before version B.
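The random-ordering idea can be sketched as a helper that builds an execution schedule: every test case runs against both configurations, with the A/B order flipped at random per case. The function name and tuple shape are illustrative assumptions, not an established API.

```python
import random

def interleaved_run_order(test_cases, seed=42):
    """Build a randomized execution order in which each test case is run
    against both configurations, with the A-before-B order shuffled per
    case so neither version is systematically executed first."""
    rng = random.Random(seed)
    order = []
    for case in test_cases:
        pair = [(case, "A"), (case, "B")]
        rng.shuffle(pair)
        order.extend(pair)
    return order
```

Because both versions see exactly the same cases, the comparison stays paired even though the ordering is randomized.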

Segment analysis breaks down results by category, topic, or other dimensions. Sometimes one version performs better overall but worse on specific important segments. Understanding these patterns helps make informed decisions.
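A minimal sketch of segment analysis, assuming results are recorded as dicts with a segment label and a score for each configuration (the key names here are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

def segment_scores(results):
    """Group per-test scores by segment and average each group, so that
    a version's overall win can be checked against its per-segment
    performance. Each result is assumed to look like
    {"segment": "math", "a": 0.8, "b": 0.6}."""
    by_segment = defaultdict(lambda: {"a": [], "b": []})
    for r in results:
        by_segment[r["segment"]]["a"].append(r["a"])
        by_segment[r["segment"]]["b"].append(r["b"])
    return {
        seg: {"a": mean(v["a"]), "b": mean(v["b"])}
        for seg, v in by_segment.items()
    }
```

Scanning the per-segment averages makes it easy to spot a version that wins overall but loses on a segment you care about.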

Decision Framework

Establish success criteria before testing begins. Define what metrics matter (accuracy, latency, cost) and what improvements justify switching. Pre-defining criteria prevents cherry-picking favorable metrics after seeing results.

Consider both primary metrics (your main quality measure) and secondary metrics (cost, latency, user satisfaction). The best system overall might not win on every dimension. Understand the trade-offs you're making.
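Pre-registered criteria can be made concrete as data plus a small decision function. This is a sketch under the simplifying assumption that every metric is a higher-is-better score; the criteria schema and names are hypothetical.

```python
def decide(results_a, results_b, criteria):
    """Apply pre-registered success criteria: switch to B only if it beats
    A on the primary metric by the required margin without regressing any
    guardrail metric by more than its allowed amount. All metrics are
    assumed to be higher-is-better scores."""
    primary = criteria["primary"]
    if results_b[primary] - results_a[primary] < criteria["min_improvement"]:
        return "keep A"
    for metric, max_regression in criteria["guardrails"].items():
        if results_a[metric] - results_b[metric] > max_regression:
            return "keep A"
    return "switch to B"
```

Writing the criteria down as a literal before the test runs makes cherry-picking harder: the same dict that was committed up front is the one the decision function consumes.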

Common Pitfalls

Insufficient sample size leads to unreliable conclusions. What looks like a clear winner with 20 tests might disappear with 100 tests. Ensure your sample is large enough for statistical confidence.

Not accounting for randomness in LLM outputs means a single test run can show misleading differences. Run multiple iterations of each configuration to get stable averages that account for variability.
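Averaging over iterations can be sketched as a wrapper around any single-run evaluation, assuming the evaluation is exposed as a zero-argument callable that returns a scalar score (an assumption made for illustration):

```python
from statistics import mean, stdev

def stable_score(run_once, n_iterations=5):
    """Run a nondeterministic evaluation several times and report the
    mean score together with its spread, rather than trusting one run."""
    scores = [run_once() for _ in range(n_iterations)]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }
```

Reporting the standard deviation alongside the mean also gives a quick sanity check: if the spread within one configuration is as large as the gap between configurations, the observed difference is probably noise.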

Cherry-picking results by choosing metrics or test subsets that favor your preferred option invalidates the comparison. Pre-define all evaluation criteria and commit to them regardless of outcomes.

Best Practices

For experimental design, control variables by changing only one thing between versions A and B. Use sufficient sample sizes—at least 100 tests for statistical power. Run multiple iterations to account for randomness in outputs. Pre-define success criteria and commit to them before seeing any results.

Ensuring fairness means running both versions simultaneously under the same conditions with identical test sets. Use random ordering rather than always testing in the same sequence. Test under realistic conditions including expected load and usage patterns.

For analysis and decision-making, apply statistical tests to verify that observed differences are significant. Break down results by segment to understand where differences occur. Examine full distributions of scores, not just averages. Consider trade-offs when one version wins on some metrics but loses on others. Document your decision rationale including what you tested and why you chose the winner.
