Model

An AI model configuration used for test generation, evaluation, or as a judge in metric assessments.

Also known as: AI model, LLM

Overview

Models in Rhesis serve multiple purposes: generating tests, evaluating responses as judges, and powering multi-turn test conversations. Configure models once and use them across different contexts.

Model Roles

Judge Models: Evaluate AI responses against metrics:

  • GPT-4 for nuanced evaluation
  • Claude for detailed reasoning
  • Gemini for multimodal judging

Test Generation Models: Create test cases from prompts and knowledge:

  • Generate diverse scenarios
  • Create edge cases
  • Produce realistic prompts

Multi-Turn Test Models: Power Penelope for conversational tests:

  • Adaptive dialogue
  • Goal-oriented conversations
  • Context-aware responses

Supported Providers

  • OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5
  • Anthropic: Claude 3 Opus, Sonnet, Haiku
  • Google: Gemini Pro, Gemini Flash
  • Ollama: Local model execution
  • Hugging Face: Open-source models
  • Rhesis: Models served by Rhesis
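The `get_model` helper shown in the next section selects among these providers using short identifier strings. As a minimal sketch, assuming providers other than Gemini follow the same `provider/model` pattern (the non-Gemini identifier strings below are assumptions, not confirmed values):

```python
from rhesis.sdk.models import get_model

# Default Rhesis-hosted model
model = get_model()

# Provider-specific models; only the Gemini string is confirmed in the
# SDK examples below. The others are assumptions -- check the provider
# list in your Rhesis workspace for the exact identifiers.
openai_model = get_model("openai/gpt-4-turbo")
anthropic_model = get_model("anthropic/claude-3-opus")
local_model = get_model("ollama/llama3")
```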

Using Models with the SDK

```python
from rhesis.sdk.models import get_model

# Use default Rhesis model
model = get_model()

# Use specific provider default
model = get_model("gemini")

# Use specific model
model = get_model("gemini/gemini-2.0-flash")
# Or equivalently:
model = get_model(provider="gemini", model_name="gemini-2.0-flash")

# Use with synthesizers
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a chatbot",
    model=model
)

# Use with metrics
from rhesis.sdk.metrics import NumericJudge

metric = NumericJudge(
    name="answer_quality",
    evaluation_prompt="Evaluate answer quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0,
    model="gemini"  # Can pass model name or instance
)
```

Choosing Models

For Evaluation:

  • Accuracy: Use the most capable models (GPT-4, Claude Opus)
  • Speed: Balance with GPT-4 Turbo or Gemini Flash
  • Cost: Use GPT-3.5 or local models for simple checks

For Test Generation:

  • Diversity: Higher temperature models
  • Speed: Fast models like Gemini Flash
  • Scale: Efficient models for bulk generation
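As a rough sketch of this split, you might pair a more capable judge model with a faster generation model. Both constructors come from the SDK examples above; the Claude identifier string is an assumption, while the Gemini one is confirmed:

```python
from rhesis.sdk.models import get_model
from rhesis.sdk.metrics import NumericJudge
from rhesis.sdk.synthesizers import PromptSynthesizer

# Capable model for nuanced judging (identifier string is an assumption)
judge_model = get_model("anthropic/claude-3-opus")

# Fast model for bulk test generation (identifier shown in the SDK section)
generation_model = get_model("gemini/gemini-2.0-flash")

metric = NumericJudge(
    name="answer_quality",
    evaluation_prompt="Evaluate answer quality",
    min_score=0.0,
    max_score=10.0,
    threshold=7.0,
    model=judge_model,
)

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a chatbot",
    model=generation_model,
)
```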

Best Practices

  • Model selection: Match model capabilities to task complexity
  • Cost monitoring: Track usage and optimize model choice
  • Benchmark: Compare model performance on your use cases
  • Defaults: Call get_model() without arguments for a sensible default
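For the benchmarking point, a minimal sketch might run the same judge prompt through two candidate models and compare their scores on a known example. The `evaluate(...)` call and its arguments are hypothetical here and illustrate the idea rather than the exact metric API; the second model identifier is also an assumption:

```python
from rhesis.sdk.models import get_model
from rhesis.sdk.metrics import NumericJudge

# First identifier is confirmed in the SDK section; the second is an assumption.
candidates = ["gemini/gemini-2.0-flash", "gemini/gemini-1.5-pro"]

sample_input = "What is your refund policy?"
sample_output = "Refunds are available within 30 days of purchase."

for model_name in candidates:
    metric = NumericJudge(
        name="answer_quality",
        evaluation_prompt="Evaluate answer quality",
        min_score=0.0,
        max_score=10.0,
        threshold=7.0,
        model=get_model(model_name),
    )
    # Hypothetical evaluation call; consult the metrics documentation for
    # the actual method name and arguments.
    result = metric.evaluate(input=sample_input, output=sample_output)
    print(model_name, result)
```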
