Extending Penelope

Understand Penelope’s architecture and learn how to extend it with custom tools for specialized testing needs.

Architecture Overview

Penelope follows a clean, modular architecture designed for extensibility and reliability.

Architecture
┌─────────────────────────────────────────┐
│         PenelopeAgent 🦸‍♀️               │
│   Orchestrates multi-turn testing       │
└─────────────────────────────────────────┘
         │
         ├── Test Configuration
         │   ├── Goal (what to achieve)
         │   ├── Instructions (how to test)
         │   └── Context (resources)
         │
         ├── Target Abstraction
         │   └── EndpointTarget (Rhesis)
         │
         ├── Tool System
         │   ├── TargetInteractionTool
         │   ├── AnalysisTool
         │   └── Custom Tools
         │
         └── Evaluation & Stopping
             ├── LLM-based goal checking
             ├── Max iterations
             └── Timeout

Core Components

PenelopeAgent

Main orchestrator coordinating test execution:

agent_flow.py
from rhesis.penelope import PenelopeAgent

# `model` is the LLM driving the agent; `target` is defined under Targets below
agent = PenelopeAgent(model=model, max_iterations=10)

result = agent.execute_test(
    target=target,
    goal="Test goal",
    instructions="Optional instructions",
)

# Result contains full execution history
print(result.status)  # success, failure, error, timeout
print(result.goal_achieved)  # True/False
print(result.history)  # Full conversation

TurnExecutor

Handles each individual turn: reasoning, tool selection, and tool execution.
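
Each turn surfaces in result.history as a structured record. As a rough sketch of that record's shape, inferred from the fields used elsewhere on this page (the actual class lives inside Penelope and may differ):

turn_record_sketch.py
from dataclasses import dataclass
from typing import Any

@dataclass
class Turn:
    """Illustrative per-turn record; fields mirror the result.history usage below."""
    turn_number: int
    reasoning: str       # why the agent chose this action
    action: str          # the tool selected for this turn
    action_output: Any   # the result returned by that tool
    goal_progress: str   # the evaluator's assessment after the turn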

GoalEvaluator

LLM-based evaluation of goal achievement using structured output.
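
Here, structured output means the evaluator LLM returns a typed object rather than free text. A minimal sketch of what such a schema could look like (the model name and fields below are illustrative, not Penelope's actual internals):

goal_evaluation_sketch.py
from pydantic import BaseModel, Field

class GoalEvaluation(BaseModel):
    """Hypothetical structured-output schema for goal evaluation."""
    goal_achieved: bool = Field(description="Whether the test goal was met")
    confidence: float = Field(ge=0.0, le=1.0, description="Evaluator confidence")
    rationale: str = Field(description="Short explanation of the verdict")

# Typed fields let the agent branch deterministically instead of parsing prose:
evaluation = GoalEvaluation(goal_achieved=True, confidence=0.9, rationale="...")
if evaluation.goal_achieved:
    print("Stopping: goal achieved")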

Targets

Abstraction for systems under test:

targets.py
from rhesis.penelope import EndpointTarget

# Rhesis endpoints
target = EndpointTarget(endpoint_id="your-endpoint-id")

# Future: LangChain, CrewAI, custom targets
# target = LangChainTarget(chain=my_chain)
# target = CrewAITarget(agent=my_agent)

Built-in Tools

Penelope includes three core tools:

  1. Send Message to Target - Interacts with the system under test
  2. Analyze Response - Evaluates target responses for goal criteria
  3. Extract Information - Pulls specific data from responses

Execution Flow

  1. Initialize - Agent receives goal, instructions, and context
  2. Turn Loop - For each turn up to max_iterations:
    • Agent reasons about current state
    • Selects and executes tool
    • Processes result
    • Evaluates goal achievement
    • Checks stopping conditions
  3. Completion - Returns TestResult with full history

execution_detail.py
# Each turn produces structured output
for turn in result.history:
    print(f"Turn {turn.turn_number}")
    print(f"Reasoning: {turn.reasoning}")
    print(f"Action: {turn.action}")
    print(f"Output: {turn.action_output}")
    print(f"Goal Progress: {turn.goal_progress}")

Stopping Conditions

Tests stop when any condition is met:

stopping.py
agent = PenelopeAgent(
    max_iterations=20,  # Stop after 20 turns
    timeout_seconds=300,  # Stop after 5 minutes
)

result = agent.execute_test(target=target, goal="...")

# Check why it stopped
if result.status == "success" and result.goal_achieved:
    print("Goal achieved!")
elif result.status == "failure":
    print("Max iterations reached")
elif result.status == "timeout":
    print("Time limit exceeded")

Custom Tools

Extend Penelope’s capabilities by creating custom tools for specialized testing needs.

Tool Interface

All tools implement the Tool abstract base class:

tool_interface.py
from abc import ABC, abstractmethod
from rhesis.penelope.tools.base import ToolResult

# The interface, as defined in rhesis.penelope.tools.base:
class Tool(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Unique identifier for the tool"""
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        """Detailed description with usage guidance"""
        pass

    @abstractmethod
    def execute(self, **kwargs) -> ToolResult:
        """Execute the tool with validated parameters"""
        pass

Parameter Validation: Tool parameters are automatically validated via Pydantic schemas. Your execute method receives validated inputs.
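
As an illustration, a minimal sketch of what a parameter schema for a tool might look like (the model name, fields, and constraints here are hypothetical, not Penelope's actual internals):

params_schema_sketch.py
from pydantic import BaseModel, Field, ValidationError

class CheckApiHealthParams(BaseModel):
    """Hypothetical parameter schema for an API health-check tool."""
    endpoint_url: str = Field(min_length=1, description="Full URL to check")
    timeout_seconds: int = Field(default=5, gt=0, description="Request timeout")

# Invalid inputs raise before execute() is ever called:
try:
    CheckApiHealthParams(endpoint_url="", timeout_seconds=-1)
except ValidationError as e:
    print(e)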

Creating a Custom Tool

Example: Database verification tool for testing data persistence.

database_tool.py
from rhesis.penelope.tools.base import Tool, ToolResult
import sqlite3

class DatabaseVerificationTool(Tool):
    def __init__(self, db_path: str):
        self.db_path = db_path

    @property
    def name(self) -> str:
        return "verify_database_state"

    @property
    def description(self) -> str:
        return """Verify backend database state during testing.

WHEN TO USE:
✓ Check if data was saved correctly
✓ Validate database state changes
✓ Verify data consistency

PARAMETERS:
- table_name: Database table to query
- record_id: Specific record ID to verify

EXAMPLE:
verify_database_state(
    table_name="users",
    record_id="user123"
)

Returns record data or error if not found."""

    def execute(self, table_name: str = "", record_id: str = "", **kwargs) -> ToolResult:
        if not table_name or not record_id:
            return ToolResult(
                success=False,
                output={"error": "table_name and record_id required"},
            )

        try:
            conn = sqlite3.connect(self.db_path)
            conn.row_factory = sqlite3.Row  # so fetched rows convert cleanly to dicts
            cursor = conn.cursor()
            # Table names can't be bound parameters; in real use, validate
            # table_name against an allowlist before interpolating it.
            cursor.execute(
                f"SELECT * FROM {table_name} WHERE id = ?",
                (record_id,),
            )
            result = cursor.fetchone()
            conn.close()

            if result:
                return ToolResult(
                    success=True,
                    output={"found": True, "record": dict(result)},
                )
            else:
                return ToolResult(
                    success=True,
                    output={"found": False, "message": f"No record found"},
                )
        except Exception as e:
            return ToolResult(success=False, output={"error": str(e)})

Using Custom Tools

use_custom_tool.py
from rhesis.penelope import PenelopeAgent, EndpointTarget

# Create tool instance
db_tool = DatabaseVerificationTool(db_path="test.db")

# Initialize agent with custom tool
agent = PenelopeAgent(
    tools=[db_tool],
    enable_transparency=True,
)

# Execute test - Penelope can now use the database tool
result = agent.execute_test(
    target=EndpointTarget(endpoint_id="your-endpoint-id"),
    goal="Verify chatbot correctly saves user preferences to database",
    instructions="""
    1. Ask chatbot to save a preference
    2. Use verify_database_state to check if it was saved
    3. Verify the saved data matches what was requested
    """,
)

Writing Quality Tool Descriptions

Good descriptions help Penelope understand when and how to use your tool. Include:

  1. Purpose - What the tool does
  2. When to Use - Scenarios for using this tool
  3. When NOT to Use - Scenarios to avoid
  4. Parameters - Expected inputs with types
  5. Examples - Real usage examples
  6. Important Notes - Caveats and limitations

good_description.py
@property
def description(self) -> str:
    return """Check API endpoint health and response times.

WHEN TO USE:
✓ Verify system is responding
✓ Check performance degradation
✓ Validate API availability

WHEN NOT TO USE:
✗ Don't use for data retrieval
✗ Don't use for authentication checks

PARAMETERS:
- endpoint_url: Full URL to check (string, required)
- timeout_seconds: Request timeout (int, default: 5)

EXAMPLE:
check_api_health(
    endpoint_url="https://api.example.com/health",
    timeout_seconds=10
)

Returns: {"status": "ok", "response_time_ms": 145}

IMPORTANT:
- Only checks public endpoints
- Does not include authentication headers"""

Multiple Custom Tools

Add multiple tools for comprehensive testing:

multiple_tools.py
db_tool = DatabaseVerificationTool(db_path="test.db")
api_tool = APIMonitoringTool(base_url="https://api.example.com")
security_tool = SecurityScannerTool()

# Agent can use all tools
agent = PenelopeAgent(
    tools=[db_tool, api_tool, security_tool],
    enable_transparency=True,
    max_iterations=20,
)

result = agent.execute_test(
    target=target,
    goal="Comprehensive system validation",
    instructions="""
    1. Verify API is responding (use check_api_health)
    2. Test chatbot functionality
    3. Check database state (use verify_database_state)
    4. Run security scan (use run_security_scan)
    """,
)

Best Practices

Clear Naming

naming.py
# Good: descriptive, action-oriented
"verify_database_state"
"check_api_health"
"validate_user_permissions"

# Bad: vague, unclear
"db_tool"
"api"
"check"

Handle Errors Gracefully

error_handling.py
def execute(self, **kwargs) -> ToolResult:
    try:
        result = perform_operation()  # placeholder for your tool's core logic
        return ToolResult(success=True, output=result)
    except ValueError as e:
        return ToolResult(
            success=False,
            output={"error": f"Invalid input: {e}"},
        )
    except Exception as e:
        return ToolResult(
            success=False,
            output={"error": f"Unexpected error: {e}"},
        )

Provide Rich Output

rich_output.py
# Good: structured and informative
return ToolResult(
    success=True,
    output={
        "status": "healthy",
        "response_time_ms": 145,
        "timestamp": "2024-01-15T10:30:00Z",
        "details": {"version": "1.2.3", "uptime": "5d 3h"},
    },
)

# Bad: minimal information
return ToolResult(success=True, output="ok")

Test Your Tools

test_custom_tool.py
import pytest
from my_tools import DatabaseVerificationTool

def test_database_tool_success():
    # assumes test.db contains a users record with id "123"
    tool = DatabaseVerificationTool(db_path="test.db")
    result = tool.execute(table_name="users", record_id="123")

    assert result.success is True
    assert result.output["found"] is True

def test_database_tool_missing_params():
    tool = DatabaseVerificationTool(db_path="test.db")
    result = tool.execute(table_name="", record_id="")

    assert result.success is False
    assert "error" in result.output

Design Principles

  1. Modularity - Clear separation of concerns (agent, executor, evaluator, tools)
  2. Extensibility - Easy to add custom tools and targets
  3. Observability - Full transparency into reasoning and execution
  4. Type Safety - Pydantic validation throughout
  5. Provider Agnostic - Works with any LLM provider

Real-World Examples

See complete implementations in the examples directory:

  • custom_tools.py - Database verification, API monitoring, security scanning
  • batch_testing.py - Batch test runner tool
  • platform_integration.py - TestSet loader tool

Next: Check out Examples to see custom tools in action, or learn about Configuration options.