Temperature
A model parameter controlling randomness and creativity in LLM outputs, with lower values producing more deterministic responses and higher values producing more varied outputs.
Overview
Temperature is a sampling parameter that affects LLM output randomness. Understanding temperature is important for test design, metric consistency, and system configuration.
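Mechanically, temperature rescales the model's raw token scores (logits) before they are converted to probabilities: dividing by a small temperature sharpens the distribution, dividing by a large one flattens it. A minimal sketch of that computation, independent of any particular model API:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to sampling probabilities at a given temperature.

    Dividing logits by the temperature before the softmax sharpens the
    distribution when temperature < 1 and flattens it when temperature > 1.
    """
    if temperature <= 0:
        raise ValueError("use greedy decoding (argmax) for temperature 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.5)
print(cold[0] > hot[0])  # the top token's probability grows as temperature drops
```

At temperature 0.2 the most likely token captures nearly all of the probability mass; at 1.5 the alternatives remain live options, which is where the extra variety comes from.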
Temperature Values
Low temperature settings between 0.0 and 0.3 produce more deterministic, consistent, and focused outputs. The model concentrates probability on the most likely next tokens (at 0.0, sampling effectively reduces to greedy decoding, though minor run-to-run variation can still occur in practice), resulting in predictable responses that vary little between runs. This setting works well for factual question answering where consistency matters, structured data extraction that requires reliable formatting, consistent evaluation by judge models, and mathematical calculations where near-deterministic behavior is essential.
Medium temperature settings between 0.4 and 0.7 provide balanced creativity and consistency. Outputs remain generally on-topic and reliable while introducing some natural variation that can make responses feel more human and less robotic. This range suits most production applications including conversational AI, general chatbots, customer support systems, and typical user-facing applications where some variety improves user experience without sacrificing reliability.
High temperature settings between 0.8 and 1.0 or higher generate more creative, diverse, and unpredictable outputs. The model considers less probable tokens, leading to responses that can be surprising or genuinely novel. Use high temperatures for creative writing tasks, brainstorming sessions, test generation when you want diverse scenarios, and exploring possibilities where novelty matters more than consistency.
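The practical effect of these ranges shows up when you sample repeatedly from the same distribution. The sketch below uses two precomputed next-token distributions (illustrative values, not from a real model) standing in for the same logits at a low and a high temperature, and counts how often each setting repeats the top token:

```python
import random

def sample_token(probs, rng):
    """Draw one token index from a probability distribution."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Next-token distributions for the same logits at two temperatures
# (values precomputed for illustration: low temperature concentrates
# probability mass on token 0, high temperature spreads it out).
low_temp_probs = [0.98, 0.015, 0.005]
high_temp_probs = [0.5, 0.3, 0.2]

rng = random.Random(0)  # seeded so the sketch is reproducible
low_zeros = sum(sample_token(low_temp_probs, rng) == 0 for _ in range(500))
high_zeros = sum(sample_token(high_temp_probs, rng) == 0 for _ in range(500))
print(low_zeros > high_zeros)  # low temperature repeats the top token far more often
```

The low-temperature run picks the top token almost every time, while the high-temperature run regularly explores the alternatives, which is exactly the trade-off the ranges above describe.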
Temperature in Testing
When using LLMs for consistent evaluation, keep temperature very low (0.0-0.2) to ensure judge models make the same decisions on repeated evaluations. This reduces variability in metric scores and makes your testing more reliable. For diverse test generation, use higher temperatures (0.7-0.9) to create varied scenarios exploring different phrasings and edge cases. When testing temperature impact on your system, systematically vary the parameter to understand how it affects output quality, consistency, and user experience.
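One way to check that a judge configuration is consistent enough is to repeat the same evaluation several times and measure how often the modal score recurs. A small sketch, using a stubbed judge in place of a real model call (the stub and its fixed score are illustrative assumptions):

```python
from collections import Counter

def agreement_rate(scores):
    """Fraction of runs that returned the modal score: 1.0 means the
    judge was perfectly consistent across repeats."""
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / len(scores)

# Hypothetical judge stub; in practice, call your judge model at a
# fixed low temperature and repeat the same input N times.
def judge_stub(response, run):
    return 4  # a consistent judge returns the same score every run

scores = [judge_stub("answer text", run) for run in range(5)]
print(agreement_rate(scores))
```

An agreement rate well below 1.0 at a supposedly low temperature is a signal to check the judge's configuration before trusting its scores.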
Temperature Selection Guidelines
For judge models in evaluation contexts, use temperatures between 0.0 and 0.2 to prioritize consistency. You want judges to make the same assessment repeatedly for the same input, so minimize randomness. For test generation, use temperatures between 0.7 and 0.9 to maximize diversity, exploring different ways users might express requests and discovering edge cases through varied generation. For production chatbots, choose temperatures between 0.5 and 0.7 for a balance—consistent enough to be reliable, varied enough to avoid feeling robotic.
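These guidelines can be captured as named presets so that every component in a testing pipeline pulls its temperature from one place. The preset names and values below are a hypothetical starting point, not a standard:

```python
# Hypothetical temperature presets reflecting the guidelines above;
# tune the exact values for your own use case.
TEMPERATURE_PRESETS = {
    "judge": 0.1,            # 0.0-0.2: consistent evaluation
    "test_generation": 0.8,  # 0.7-0.9: diverse scenarios
    "production_chat": 0.6,  # 0.5-0.7: balanced reliability and variety
}

def temperature_for(task, default=0.7):
    """Look up a starting temperature for a task type."""
    return TEMPERATURE_PRESETS.get(task, default)

print(temperature_for("judge"))
```

Centralizing the presets also makes it easy to record which setting produced which results, which matters for the comparison practices below.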
Testing Considerations
Higher temperatures introduce run-to-run randomness, so account for that variability: run multiple test iterations with the same inputs to understand the range of possible outputs. When comparing performance at different temperatures, use the same temperature consistently across all comparison runs to isolate other variables. For temperature in baseline comparisons, document the temperature setting used for your baseline so future comparisons use matching configurations.
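Repeated runs are easiest to reason about when summarized. A minimal sketch that reports how many distinct outputs appeared across repeats of the same input (the outputs here are synthetic examples for illustration):

```python
import statistics

def summarize_runs(outputs):
    """Summarize repeated runs of the same input: how many distinct
    outputs appeared, and the average output length."""
    lengths = [len(o) for o in outputs]
    return {
        "runs": len(outputs),
        "distinct_outputs": len(set(outputs)),
        "mean_length": statistics.mean(lengths),
    }

# Outputs gathered from repeated calls at the same temperature
# (synthetic examples for illustration).
outputs = ["Paris.", "Paris.", "The capital is Paris."]
print(summarize_runs(outputs))
```

If `distinct_outputs` approaches the number of runs at a temperature you expected to be near-deterministic, that is worth investigating before drawing conclusions from any single run.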
Common Pitfalls
Using inconsistent temperatures for judge models creates unreliable evaluations where the same response gets different scores across runs. This undermines trust in your metrics and makes it hard to know if changes actually improved performance. Using too-low temperatures for test generation produces boring, repetitive test cases that cluster around the most obvious scenarios, missing edge cases and varied phrasings that real users would provide.
Best Practices
For configuration selection, set judge models to 0.0-0.2 for consistency in evaluation. Configure test generation to 0.7-0.9 for diversity in scenarios. Set production chatbots to 0.5-0.7 for balanced performance. Use 0.0-0.3 for factual question-answering where correctness is paramount. Apply 0.8-1.0 for creative tasks where novelty and variety are valued.
For documentation and comparison, always record temperature settings with test results so you can interpret results correctly. Use the same temperature when comparing systems or configurations to ensure differences reflect actual changes rather than random variation. When using higher temperatures, run tests multiple times to account for variability and understand the distribution of possible outputs. Match temperature to task requirements rather than using one default for everything.
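Recording the temperature alongside each result can be as simple as bundling it into the result record itself. A sketch, with hypothetical field names:

```python
import json
from datetime import datetime, timezone

def record_result(test_name, temperature, score):
    """Bundle a test result with the sampling configuration that
    produced it, so later comparisons can match the setup."""
    return {
        "test": test_name,
        "temperature": temperature,
        "score": score,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

entry = record_result("faq_accuracy", 0.2, 0.91)
print(json.dumps(entry))
```

With the temperature stored in every record, a later run can refuse to compare against a baseline whose configuration does not match.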
For monitoring and optimization, track temperature in metadata alongside test results. Conduct A/B tests comparing performance at different temperature settings to find optimal values for your use case. Gather user feedback on whether higher temperature outputs feel more natural or if users prefer the consistency of lower temperatures. Balance consistency needs against the naturalness that some randomness provides.
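An A/B comparison across temperature settings reduces to grouping scores by temperature and comparing the group means. A sketch over synthetic results (the scores below are illustrative, not measurements):

```python
from collections import defaultdict
from statistics import mean

def best_temperature(results):
    """Group (temperature, score) pairs by temperature and return the
    temperature with the highest mean score."""
    by_temp = defaultdict(list)
    for temperature, score in results:
        by_temp[temperature].append(score)
    return max(by_temp, key=lambda t: mean(by_temp[t]))

# Synthetic A/B results for illustration only.
results = [(0.3, 0.82), (0.3, 0.80), (0.7, 0.75), (0.7, 0.78)]
print(best_temperature(results))  # 0.3
```

In practice you would also want enough runs per setting for the mean difference to be meaningful, given the extra variance that higher temperatures introduce.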