
Penelope


An autonomous testing agent that powers multi-turn tests, adapting its strategy based on AI responses to evaluate conversational workflows.

Overview

Penelope is Rhesis's autonomous testing agent that conducts goal-oriented conversations with your AI system. Unlike scripted tests, Penelope adapts her strategy based on your AI's responses, testing realistic conversational scenarios.

How Penelope Works

Adaptive Testing (see the sketch after these lists):

  1. Goal Understanding: Penelope knows what to achieve
  2. Dynamic Strategy: Adjusts approach based on responses
  3. Natural Conversation: Conducts realistic dialogue
  4. Goal Assessment: Evaluates if objective was met

Intelligent Behaviors:

  • Clarification: Asks for missing information
  • Verification: Confirms understanding
  • Edge Testing: Tries boundary cases
  • Recovery: Handles errors gracefully
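
Conceptually, the adaptive loop looks like the pseudocode below. This is an illustrative sketch only: the method names (plan_next_message, assess_goal, send) are hypothetical and not part of the Rhesis API.

python
# Illustrative sketch of Penelope's adaptive loop.
# NOTE: plan_next_message, send, and assess_goal are hypothetical
# names used for explanation; they are not actual Rhesis internals.
def run_goal_oriented_test(agent, target, goal, max_iterations):
    history = []
    for _ in range(max_iterations):
        message = agent.plan_next_message(goal, history)  # dynamic strategy
        reply = target.send(message)                      # talk to the AI under test
        history.append((message, reply))
        if agent.assess_goal(goal, history):              # goal assessment
            return True, history                          # objective met
    return False, history                                 # turn budget exhausted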

Using Penelope

Basic Test Execution:

python
from rhesis.penelope import PenelopeAgent, EndpointTarget

# Initialize Penelope
agent = PenelopeAgent()

# Create target (your AI endpoint)
target = EndpointTarget(endpoint_id="my-chatbot-prod")

# Execute a test
result = agent.execute_test(
    target=target,
    goal="Book a round-trip flight from NYC to Tokyo",
    max_iterations=15,
)

print(f"Goal achieved: {result.goal_achieved}")
print(f"Turns used: {result.turns_used}")

With Instructions and Restrictions:

python
result = agent.execute_test(
    target=target,
    goal="Verify insurance chatbot stays within policy boundaries",
    instructions="Ask about coverage, competitors, and medical conditions",
    restrictions="""
        Must not mention competitor brands or products.
        Must not provide specific medical diagnoses.
        Must not guarantee coverage without policy review.
    """,
    max_iterations=10,
)
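
In an automated suite you can gate on the result object. The sketch below assumes a pytest-style test and uses only the goal_achieved and turns_used fields shown in the examples above.

python
# Minimal pytest-style gate on a Penelope run (sketch; `agent` and
# `target` are the objects created in the earlier snippets).
def test_insurance_bot_respects_boundaries():
    result = agent.execute_test(
        target=target,
        goal="Verify insurance chatbot stays within policy boundaries",
        restrictions="Must not mention competitor brands or products.",
        max_iterations=10,
    )
    assert result.goal_achieved, f"Failed after {result.turns_used} turns"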

What Penelope Tests

Conversational Capabilities:

  • Information gathering: Does the AI ask the right questions?
  • Context retention: Does it remember previous turns?
  • Clarification handling: How does it handle ambiguity?
  • Task completion: Can it achieve the goal?

Edge Cases:

  • Missing information: How does the AI handle gaps?
  • Contradictions: Can it recover from conflicts?
  • Complexity: Does it manage multi-step workflows?
  • User changes: How does it adapt to new requirements?
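
Each edge case can be expressed as its own goal. For example, a mid-conversation requirement change might look like the sketch below; the goal and instructions are illustrative values, not prescribed ones.

python
# Sketch: probing how the AI adapts when the user changes requirements
# mid-conversation (goal and instructions are illustrative examples).
result = agent.execute_test(
    target=target,
    goal="Book a hotel in Paris, then switch the destination to Rome mid-booking",
    instructions="Provide the Paris details first, then change the "
                 "destination to Rome after the AI confirms availability",
    max_iterations=12,
)
print(f"Adapted to new requirements: {result.goal_achieved}")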

Penelope vs. Scripted Tests

Aspect          Scripted          Penelope
Conversation    Fixed script      Adaptive
Realism         Predictable       Natural
Coverage        Limited paths     Explores variations
Maintenance     Update scripts    Update goals

Target Options

Rhesis Endpoints:

python
from rhesis.penelope import EndpointTarget

target = EndpointTarget(endpoint_id="my-endpoint")

LangChain Chains:

python
from rhesis.penelope import LangChainTarget
from langchain.chains import LLMChain

chain = LLMChain(...)  # your existing chain (LLM + prompt)
target = LangChainTarget(chain=chain)

LangGraph Graphs:

python
from rhesis.penelope import LangGraphTarget

graph = compiled_graph  # e.g., the result of StateGraph(...).compile()
target = LangGraphTarget(graph=graph)
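
Because every target exposes the same interface to Penelope, one test definition can run against multiple implementations. A minimal sketch, assuming the endpoint, chain, and graph targets built above are stored in separate variables rather than all reusing `target`:

python
# Sketch: the same goal run against different target types
# (endpoint_target, langchain_target, langgraph_target are assumed
# to hold the three targets constructed above).
for t in (endpoint_target, langchain_target, langgraph_target):
    result = agent.execute_test(
        target=t,
        goal="Book a round-trip flight from NYC to Tokyo",
        max_iterations=15,
    )
    print(f"{type(t).__name__}: goal achieved = {result.goal_achieved}")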

Configuration

python
from rhesis.penelope import PenelopeAgent, PenelopeConfig

# Custom configuration
config = PenelopeConfig(
    model_provider="anthropic",
    model_name="claude-3-opus-20240229",
)

agent = PenelopeAgent(
    config=config,
    max_iterations=20,
)

Best Practices

  • Clear goals: Define specific measurable objectives
  • Reasonable scope: Limit turns to 5-15 for most tests
  • Use restrictions: Define what the AI should NOT do
  • Review traces: Analyze conversation logs for insights
  • Iterate: Refine goals based on test results
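
Put together, a test applying these practices might look like the sketch below; the goal, instructions, and restrictions are illustrative values.

python
# Sketch combining the practices above: a specific, measurable goal,
# explicit restrictions, and a bounded turn budget.
result = agent.execute_test(
    target=target,
    goal="Obtain a quote for home insurance on a 3-bedroom house in Austin",
    instructions="Answer follow-up questions plausibly; do not volunteer "
                 "information until asked",
    restrictions="Must not guarantee coverage without policy review.",
    max_iterations=10,  # reasonable scope: 5-15 turns
)
print(f"Goal achieved: {result.goal_achieved} in {result.turns_used} turns")
# Review the conversation log, then refine the goal for the next run.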
