
Penelope


An autonomous testing agent that powers multi-turn tests, adapting its strategy based on AI responses to evaluate conversational workflows.

Overview

Penelope is Rhesis's autonomous testing agent that conducts goal-oriented conversations with your AI system. Unlike scripted tests, Penelope adapts her strategy based on your AI's responses, testing realistic conversational scenarios.

How Penelope Works

Adaptive Testing (see the sketch after these lists):

  1. Goal Understanding: Penelope knows what to achieve
  2. Dynamic Strategy: Adjusts approach based on responses
  3. Natural Conversation: Conducts realistic dialogue
  4. Goal Assessment: Evaluates if objective was met

Intelligent Behaviors:

  • Clarification: Asks for missing information
  • Verification: Confirms understanding
  • Edge Testing: Tries boundary cases
  • Recovery: Handles errors gracefully
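
Conceptually, the adaptive loop looks like the pseudocode below. This is an illustrative sketch only: the method names (plan_next_message, assess_goal, send) are hypothetical and not part of the Rhesis API.

python
# Illustrative sketch of Penelope's adaptive loop.
# NOTE: plan_next_message, send, and assess_goal are hypothetical
# names used for explanation; they are not actual Rhesis internals.
def run_goal_oriented_test(agent, target, goal, max_iterations):
    history = []
    for _ in range(max_iterations):
        message = agent.plan_next_message(goal, history)  # dynamic strategy
        reply = target.send(message)                      # talk to the AI under test
        history.append((message, reply))
        if agent.assess_goal(goal, history):              # goal assessment
            return True, history                          # objective met
    return False, history                                 # turn budget exhausted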

Using Penelope

Basic Test Execution:

python
from rhesis.penelope import PenelopeAgent, EndpointTarget

# Initialize Penelope
agent = PenelopeAgent()

# Create target (your AI endpoint)
target = EndpointTarget(endpoint_id="my-chatbot-prod")

# Execute a test
result = agent.execute_test(
    target=target,
    goal="Book a round-trip flight from NYC to Tokyo",
    max_iterations=15,
)

print(f"Goal achieved: {result.goal_achieved}")
print(f"Turns used: {result.turns_used}")

With Instructions and Restrictions:

python
result = agent.execute_test(
    target=target,
    goal="Verify insurance chatbot stays within policy boundaries",
    instructions="Ask about coverage, competitors, and medical conditions",
    restrictions="""
        Must not mention competitor brands or products.
        Must not provide specific medical diagnoses.
        Must not guarantee coverage without policy review.
    """,
    max_iterations=10,
)
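
In an automated suite you can gate on the result object. The sketch below assumes a pytest-style test and uses only the goal_achieved and turns_used fields shown in the examples above.

python
# Minimal pytest-style gate on a Penelope run (sketch; `agent` and
# `target` are the objects created in the earlier snippets).
def test_insurance_bot_respects_boundaries():
    result = agent.execute_test(
        target=target,
        goal="Verify insurance chatbot stays within policy boundaries",
        restrictions="Must not mention competitor brands or products.",
        max_iterations=10,
    )
    assert result.goal_achieved, f"Failed after {result.turns_used} turns"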

What Penelope Tests

Conversational Capabilities:

  • Information gathering: Does the AI ask the right questions?
  • Context retention: Does it remember previous turns?
  • Clarification handling: How does it handle ambiguity?
  • Task completion: Can it achieve the goal?

Edge Cases:

  • Missing information: How does the AI handle gaps?
  • Contradictions: Can it recover from conflicts?
  • Complexity: Does it manage multi-step workflows?
  • User changes: How does it adapt to new requirements?
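
Each edge case can be expressed as its own goal. For example, a mid-conversation requirement change might look like the sketch below; the goal and instructions are illustrative values, not prescribed ones.

python
# Sketch: probing how the AI adapts when the user changes requirements
# mid-conversation (goal and instructions are illustrative examples).
result = agent.execute_test(
    target=target,
    goal="Book a hotel in Paris, then switch the destination to Rome mid-booking",
    instructions="Provide the Paris details first, then change the "
                 "destination to Rome after the AI confirms availability",
    max_iterations=12,
)
print(f"Adapted to new requirements: {result.goal_achieved}")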

Penelope vs. Scripted Tests

Aspect          Scripted          Penelope
Conversation    Fixed script      Adaptive
Realism         Predictable       Natural
Coverage        Limited paths     Explores variations
Maintenance     Update scripts    Update goals

Target Options

Rhesis Endpoints:

python
from rhesis.penelope import EndpointTarget

target = EndpointTarget(endpoint_id="my-endpoint")

LangChain Chains:

python
from rhesis.penelope import LangChainTarget
from langchain.chains import LLMChain

chain = LLMChain(...)  # your existing chain (LLM + prompt)
target = LangChainTarget(chain=chain)

LangGraph Graphs:

python
from rhesis.penelope import LangGraphTarget

graph = compiled_graph  # e.g., the result of StateGraph(...).compile()
target = LangGraphTarget(graph=graph)
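
Because every target exposes the same interface to Penelope, one test definition can run against multiple implementations. A minimal sketch, assuming the endpoint, chain, and graph targets built above are stored in separate variables rather than all reusing `target`:

python
# Sketch: the same goal run against different target types
# (endpoint_target, langchain_target, langgraph_target are assumed
# to hold the three targets constructed above).
for t in (endpoint_target, langchain_target, langgraph_target):
    result = agent.execute_test(
        target=t,
        goal="Book a round-trip flight from NYC to Tokyo",
        max_iterations=15,
    )
    print(f"{type(t).__name__}: goal achieved = {result.goal_achieved}")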

Configuration

python
from rhesis.penelope import PenelopeAgent, PenelopeConfig

# Custom configuration
config = PenelopeConfig(
    model_provider="anthropic",
    model_name="claude-3-opus-20240229",
)

agent = PenelopeAgent(
    config=config,
    max_iterations=20,
)

Best Practices

  • Clear goals: Define specific measurable objectives
  • Reasonable scope: Limit turns to 5-15 for most tests
  • Use restrictions: Define what the AI should NOT do
  • Review traces: Analyze conversation logs for insights
  • Iterate: Refine goals based on test results
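
Put together, a test applying these practices might look like the sketch below; the goal, instructions, and restrictions are illustrative values.

python
# Sketch combining the practices above: a specific, measurable goal,
# explicit restrictions, and a bounded turn budget.
result = agent.execute_test(
    target=target,
    goal="Obtain a quote for home insurance on a 3-bedroom house in Austin",
    instructions="Answer follow-up questions plausibly; do not volunteer "
                 "information until asked",
    restrictions="Must not guarantee coverage without policy review.",
    max_iterations=10,  # reasonable scope: 5-15 turns
)
print(f"Goal achieved: {result.goal_achieved} in {result.turns_used} turns")
# Review the conversation log, then refine the goal for the next run.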
