Extending Penelope

Understand Penelope’s architecture and learn how to extend it with custom tools for specialized testing needs.

Architecture Overview

Penelope follows a clean, modular architecture designed for extensibility and reliability.

Architecture
┌─────────────────────────────────────────┐
│         PenelopeAgent 🦸‍♀️               │
│  Orchestrates multi-turn testing        │
└─────────────────────────────────────────┘
         │
         ├── Test Configuration
         │   ├── Goal (what to achieve)
         │   ├── Instructions (how to test)
         │   └── Context (resources)
         │
         ├── Target Abstraction
         │   └── EndpointTarget (Rhesis)
         │
         ├── Tool System
         │   ├── TargetInteractionTool
         │   ├── AnalysisTool
         │   └── Custom Tools
         │
         └── Evaluation & Stopping
             ├── LLM-based goal checking
             ├── Max iterations
             └── Timeout

Core Components

PenelopeAgent

Main orchestrator coordinating test execution:

agent_flow.py
agent = PenelopeAgent(model=model, max_iterations=10)

result = agent.execute_test(
  target=target,
  goal="Test goal",
  instructions="Optional instructions"
)

# Result contains full execution history
print(result.status)         # success, failure, error, timeout
print(result.goal_achieved)  # True/False
print(result.history)        # Full conversation

TurnExecutor

Handles individual turn execution: reasoning about the current state, selecting a tool, and processing its result.

GoalEvaluator

LLM-based evaluation of goal achievement using structured output.
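
To make this concrete, here is a minimal sketch of what a structured evaluation result might look like. The `GoalEvaluation` fields and the `evaluate_goal` helper are illustrative assumptions, not Penelope's actual API; in the real evaluator the verdict comes from an LLM returning structured output, not the keyword check used here.

```python
from dataclasses import dataclass

# Hypothetical structured result; field names are illustrative,
# not GoalEvaluator's real schema.
@dataclass
class GoalEvaluation:
    goal_achieved: bool
    confidence: float  # 0.0 - 1.0
    reasoning: str

def evaluate_goal(goal: str, conversation: list) -> GoalEvaluation:
    """Stand-in for the LLM call: naive keyword check for demonstration."""
    achieved = any(goal.lower() in turn.lower() for turn in conversation)
    return GoalEvaluation(
        goal_achieved=achieved,
        confidence=0.9 if achieved else 0.2,
        reasoning="Goal text observed" if achieved else "Goal text not observed",
    )

evaluation = evaluate_goal("refund", ["User: I want a refund", "Bot: Done"])
print(evaluation.goal_achieved)  # True
```

Returning a typed object rather than free text is what lets the agent branch reliably on `goal_achieved` when deciding whether to stop.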

Targets

Abstraction for systems under test:

targets.py
from rhesis.penelope import EndpointTarget

# Rhesis endpoints
target = EndpointTarget(endpoint_id="your-endpoint-id")

# Future: LangChain, CrewAI, custom targets
# target = LangChainTarget(chain=my_chain)
# target = CrewAITarget(agent=my_agent)

Built-in Tools

Penelope includes three core tools:

  1. Send Message to Target - Interacts with the system under test
  2. Analyze Response - Evaluates target responses for goal criteria
  3. Extract Information - Pulls specific data from responses

Execution Flow

  1. Initialize - Agent receives goal, instructions, and context
  2. Turn Loop - For each turn up to max_iterations:
    • Agent reasons about current state
    • Selects and executes tool
    • Processes result
    • Evaluates goal achievement
    • Checks stopping conditions
  3. Completion - Returns TestResult with full history
execution_detail.py
# Each turn produces structured output
for turn in result.history:
  print(f"Turn {turn.turn_number}")
  print(f"Reasoning: {turn.reasoning}")
  print(f"Action: {turn.action}")
  print(f"Output: {turn.action_output}")
  print(f"Goal Progress: {turn.goal_progress}")

Stopping Conditions

Tests stop when any condition is met:

stopping.py
agent = PenelopeAgent(
  max_iterations=20,      # Stop after 20 turns
  timeout_seconds=300     # Stop after 5 minutes
)

result = agent.execute_test(target=target, goal="...")

# Check why it stopped
if result.status == "success" and result.goal_achieved:
  print("Goal achieved!")
elif result.status == "failure":
  print("Max iterations reached")
elif result.status == "timeout":
  print("Time limit exceeded")

Custom Tools

Extend Penelope’s capabilities by creating custom tools for specialized testing needs.

Tool Interface

All tools implement the Tool abstract base class:

tool_interface.py
from abc import ABC, abstractmethod
from rhesis.penelope.tools.base import ToolResult

class Tool(ABC):
  @property
  @abstractmethod
  def name(self) -> str:
      """Unique identifier for the tool"""
      pass

  @property
  @abstractmethod
  def description(self) -> str:
      """Detailed description with usage guidance"""
      pass

  @abstractmethod
  def execute(self, **kwargs) -> ToolResult:
      """Execute the tool with validated parameters"""
      pass

Parameter Validation: Tool parameters are automatically validated via Pydantic schemas. Your execute method receives validated inputs.
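
The sketch below hand-rolls a version of that check to show what validation buys you; the `SCHEMA` dict and `validate_params` helper are illustrative stand-ins, not Penelope's internal Pydantic machinery.

```python
# Illustrative stand-in for the Pydantic validation applied before
# a tool's execute() is called; this schema format is an assumption.
SCHEMA = {
    "table_name": str,
    "record_id": str,
}

def validate_params(params: dict) -> dict:
    """Reject missing or wrongly typed parameters before the tool runs."""
    errors = {}
    for field, expected in SCHEMA.items():
        if field not in params:
            errors[field] = "missing"
        elif not isinstance(params[field], expected):
            errors[field] = f"expected {expected.__name__}"
    if errors:
        raise ValueError(f"Invalid tool parameters: {errors}")
    return params

validate_params({"table_name": "users", "record_id": "user123"})  # passes
```

Because invalid calls fail before `execute` runs, tool implementations can assume well-typed inputs.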

Creating a Custom Tool

Example: Database verification tool for testing data persistence.

database_tool.py
from rhesis.penelope.tools.base import Tool, ToolResult
import sqlite3

class DatabaseVerificationTool(Tool):
  def __init__(self, db_path: str):
      self.db_path = db_path

  @property
  def name(self) -> str:
      return "verify_database_state"

  @property
  def description(self) -> str:
      return """Verify backend database state during testing.

WHEN TO USE:
✓ Check if data was saved correctly
✓ Validate database state changes
✓ Verify data consistency

PARAMETERS:

- table_name: Database table to query
- record_id: Specific record ID to verify

EXAMPLE:
verify_database_state(
table_name="users",
record_id="user123"
)

Returns record data or error if not found."""

  def execute(self, table_name: str = "", record_id: str = "", **kwargs) -> ToolResult:
      if not table_name or not record_id:
          return ToolResult(
              success=False,
              output={"error": "table_name and record_id required"}
          )

      try:
          conn = sqlite3.connect(self.db_path)
          conn.row_factory = sqlite3.Row  # enables dict(row) conversion below
          cursor = conn.cursor()
          cursor.execute(
              f"SELECT * FROM {table_name} WHERE id = ?",
              (record_id,)
          )
          result = cursor.fetchone()
          conn.close()

          if result:
              return ToolResult(
                  success=True,
                  output={"found": True, "record": dict(result)}
              )
          else:
              return ToolResult(
                  success=True,
                  output={"found": False, "message": "No record found"}
              )
      except Exception as e:
          return ToolResult(success=False, output={"error": str(e)})

Using Custom Tools

use_custom_tool.py
from rhesis.penelope import PenelopeAgent, EndpointTarget

# Create tool instance
db_tool = DatabaseVerificationTool(db_path="test.db")

# Initialize agent with custom tool
agent = PenelopeAgent(
  tools=[db_tool],
  enable_transparency=True
)

# Execute test - Penelope can now use the database tool
result = agent.execute_test(
  target=EndpointTarget(endpoint_id="your-endpoint-id"),
  goal="Verify chatbot correctly saves user preferences to database",
  instructions="""
  1. Ask chatbot to save a preference
  2. Use verify_database_state to check if it was saved
  3. Verify the saved data matches what was requested
  """
)

Writing Quality Tool Descriptions

Good descriptions help Penelope understand when and how to use your tool. Include:

  1. Purpose - What the tool does
  2. When to Use - Scenarios for using this tool
  3. When NOT to Use - Scenarios to avoid
  4. Parameters - Expected inputs with types
  5. Examples - Real usage examples
  6. Important Notes - Caveats and limitations
good_description.py
@property
def description(self) -> str:
  return """Check API endpoint health and response times.

WHEN TO USE:
✓ Verify system is responding
✓ Check performance degradation
✓ Validate API availability

WHEN NOT TO USE:
✗ Don't use for data retrieval
✗ Don't use for authentication checks

PARAMETERS:

- endpoint_url: Full URL to check (string, required)
- timeout_seconds: Request timeout (int, default: 5)

EXAMPLE:
check_api_health(
endpoint_url="https://api.example.com/health",
timeout_seconds=10
)

Returns: {"status": "ok", "response_time_ms": 145}

IMPORTANT:

- Only checks public endpoints
- Does not include authentication headers"""

Multiple Custom Tools

Add multiple tools for comprehensive testing:

multiple_tools.py
db_tool = DatabaseVerificationTool(db_path="test.db")
api_tool = APIMonitoringTool(base_url="https://api.example.com")
security_tool = SecurityScannerTool()

# Agent can use all tools
agent = PenelopeAgent(
  tools=[db_tool, api_tool, security_tool],
  enable_transparency=True,
  max_iterations=20
)

result = agent.execute_test(
  target=target,
  goal="Comprehensive system validation",
  instructions="""
  1. Verify API is responding (use check_api_health)
  2. Test chatbot functionality
  3. Check database state (use verify_database_state)
  4. Run security scan (use run_security_scan)
  """
)

Best Practices

Clear Naming

naming.py
# Good: descriptive, action-oriented
"verify_database_state"
"check_api_health"
"validate_user_permissions"

# Bad: vague, unclear
"db_tool"
"api"
"check"

Handle Errors Gracefully

error_handling.py
def execute(self, **kwargs) -> ToolResult:
  try:
      result = perform_operation()
      return ToolResult(success=True, output=result)
  except ValueError as e:
      return ToolResult(
          success=False,
          output={"error": f"Invalid input: {e}"}
      )
  except Exception as e:
      return ToolResult(
          success=False,
          output={"error": f"Unexpected error: {e}"}
      )

Provide Rich Output

rich_output.py
# Good: structured and informative
return ToolResult(
  success=True,
  output={
      "status": "healthy",
      "response_time_ms": 145,
      "timestamp": "2024-01-15T10:30:00Z",
      "details": {"version": "1.2.3", "uptime": "5d 3h"}
  }
)

# Bad: minimal information
return ToolResult(success=True, output="ok")

Test Your Tools

test_custom_tool.py
import pytest
from my_tools import DatabaseVerificationTool

def test_database_tool_success():
  tool = DatabaseVerificationTool(db_path="test.db")
  result = tool.execute(table_name="users", record_id="123")

  assert result.success is True
  assert result.output["found"] is True

def test_database_tool_missing_params():
  tool = DatabaseVerificationTool(db_path="test.db")
  result = tool.execute(table_name="", record_id="")

  assert result.success is False
  assert "error" in result.output

Design Principles

  1. Modularity - Clear separation of concerns (agent, executor, evaluator, tools)
  2. Extensibility - Easy to add custom tools and targets
  3. Observability - Full transparency into reasoning and execution
  4. Type Safety - Pydantic validation throughout
  5. Provider Agnostic - Works with any LLM provider

Real-World Examples

See complete implementations in the examples directory:

  • custom_tools.py - Database verification, API monitoring, security scanning
  • batch_testing.py - Batch test runner tool
  • platform_integration.py - TestSet loader tool

Next: Check out Examples to see custom tools in action, or learn about Configuration options.