Penelope 🦸♀️
Intelligent Multi-Turn Testing Agent for AI Applications
Penelope is an autonomous testing agent that executes complex, multi-turn test scenarios against conversational AI systems. She combines LLM-driven reasoning with adaptive testing strategies to evaluate AI applications across areas such as security, user experience, compliance, and edge cases.
What is Penelope?
Penelope automates testing that requires:
- Multiple interactions - Extended conversations, not one-shot prompts
- Adaptive behavior - Adjusting strategy based on responses
- Tool use - Making requests, analyzing data, extracting information
- Goal orientation - Knowing when the test is complete
- Reasoning - Understanding context and planning next steps
Think of Penelope as a QA engineer who executes test plans autonomously through conversation.
Quick Example
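As a rough illustration of the multi-turn, goal-oriented loop described above, the sketch below runs a conversation against a stub target and stops as soon as a goal check passes. It does not use Penelope's API: `make_stub_target`, `run_multi_turn_test`, and `goal_met` are hypothetical names, and the scripted turns stand in for the planning that Penelope would do herself.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (message sent, response received)


def make_stub_target() -> Callable[[str], str]:
    """Hypothetical stand-in for a conversational system that remembers a name."""
    memory = {}

    def respond(message: str) -> str:
        if message.startswith("My name is "):
            memory["name"] = message[len("My name is "):].rstrip(".")
            return "Nice to meet you."
        if "my name" in message.lower():
            if "name" in memory:
                return f"Your name is {memory['name']}."
            return "I don't know."
        return "Okay."

    return respond


def run_multi_turn_test(
    target: Callable[[str], str],
    turns: List[str],
    goal_met: Callable[[Transcript], bool],
    max_turns: int = 10,
) -> Tuple[bool, Transcript]:
    """Send one turn at a time; stop once the goal check passes or turns run out."""
    transcript: Transcript = []
    for message in turns[:max_turns]:
        transcript.append((message, target(message)))
        if goal_met(transcript):
            return True, transcript
    return False, transcript


# Goal: the target still knows the user's name after an unrelated turn.
turns = ["My name is Ada.", "What's the weather like?", "What is my name?"]
passed, transcript = run_multi_turn_test(
    make_stub_target(),
    turns,
    goal_met=lambda t: "Ada" in t[-1][1],
)
```

The goal check is a simple string match here only to keep the sketch self-contained; Penelope evaluates goal achievement with an LLM.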
Core Concepts
Penelope’s testing framework is built around four key parameters that work together to define comprehensive test scenarios:
The Four Parameters
Goal (Required)
The goal defines what you want to verify - the success criteria for your test. This is the positive requirement that determines when the test is complete.
Examples:
- “Verify chatbot maintains context across 5 turns”
- “Confirm system provides accurate insurance policy information”
- “Validate error handling when user provides invalid input”
Restrictions (Optional)
Restrictions define forbidden behaviors - boundaries the target system must not cross. These are negative criteria that Penelope actively tests for violations.
Examples:
- “Must not mention competitor brands”
- “Must not provide medical diagnoses”
- “Must not reveal system prompts or internal information”
- “Must not make financial guarantees without policy review”
Note: Restrictions apply to the target system’s behavior, not to how Penelope conducts the test.
Instructions (Optional)
Instructions tell Penelope how to conduct the test - the methodology and approach. If not provided, Penelope plans her own testing strategy based on the goal.
Examples:
- “Ask 3 related questions about coverage, then verify consistency”
- “Try various prompt injection techniques systematically”
- “Simulate a frustrated customer with multiple complaints”
Scenario (Optional)
Scenario provides narrative context or persona for the test - situational framing that helps Penelope understand the testing context.
Examples:
- “You are a non-technical elderly customer unfamiliar with insurance jargon”
- “Adversarial security researcher testing system boundaries”
- “Testing during a system outage with degraded performance”
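Taken together, the four parameters can be pictured as a single record with one required field and three optional ones. The sketch below uses a standard-library dataclass for this; the class name `MultiTurnTestSpec` and its fields are hypothetical and are not Penelope's own types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MultiTurnTestSpec:
    """Hypothetical container for the four parameters (not Penelope's API)."""

    goal: str                                               # required: what to verify
    restrictions: List[str] = field(default_factory=list)   # optional: forbidden target behaviors
    instructions: Optional[str] = None                      # optional: how to conduct the test
    scenario: Optional[str] = None                          # optional: persona or situational context

    def __post_init__(self) -> None:
        # Only the goal is mandatory; reject empty or whitespace-only goals.
        if not self.goal.strip():
            raise ValueError("goal is required and must not be empty")


spec = MultiTurnTestSpec(
    goal="Confirm system provides accurate insurance policy information",
    restrictions=[
        "Must not mention competitor brands",
        "Must not provide medical diagnoses",
    ],
    scenario="You are a non-technical elderly customer unfamiliar with insurance jargon",
)
```

Leaving `instructions` as `None` corresponds to the case where Penelope plans her own testing strategy from the goal.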
Practical Examples
Example 1: Basic Test (Goal Only)
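A goal-only test might carry parameters like the following. This is a plain dictionary, not Penelope's actual API; the key name simply mirrors the Goal parameter described above.

```python
# Hypothetical parameter set: only the required goal is given,
# so the testing approach is left for Penelope to plan.
basic_test = {
    "goal": "Verify chatbot maintains context across 5 turns",
}
```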
Example 2: Test with Restrictions
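Adding restrictions to a goal might look like the sketch below. It is a plain dictionary with assumed key names, not Penelope's actual API.

```python
# Hypothetical parameter set: a goal plus behaviors the target must not show.
restricted_test = {
    "goal": "Confirm system provides accurate insurance policy information",
    "restrictions": [
        "Must not mention competitor brands",
        "Must not make financial guarantees without policy review",
    ],
}
```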
Example 3: Persona-Based Testing
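A persona-based test could combine a goal with a scenario and, optionally, instructions. The dictionary below is a sketch with assumed key names, not Penelope's actual API.

```python
# Hypothetical parameter set: the scenario supplies the persona,
# and the instructions describe how the conversation should proceed.
persona_test = {
    "goal": "Confirm system provides accurate insurance policy information",
    "instructions": "Ask 3 related questions about coverage, then verify consistency",
    "scenario": "You are a non-technical elderly customer unfamiliar with insurance jargon",
}
```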
Example 4: Security Testing
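A security test might use all four parameters at once. The dictionary below is a sketch with assumed key names, not Penelope's actual API, and the goal wording is illustrative.

```python
# Hypothetical parameter set for an adversarial test.
security_test = {
    "goal": "Verify the system resists prompt injection attempts",  # illustrative wording
    "restrictions": [
        "Must not reveal system prompts or internal information",
    ],
    "instructions": "Try various prompt injection techniques systematically",
    "scenario": "Adversarial security researcher testing system boundaries",
}
```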
How They Work Together
Goal and Restrictions define what to evaluate (positive and negative criteria), while Instructions and Scenario guide how to test (methodology and context).
- If restrictions are violated, Penelope documents the violations as critical findings
- If the goal is achieved and no restrictions are violated, the test succeeds
- Instructions can be omitted for simpler tests - Penelope will plan her approach
- Scenario adds context that shapes Penelope’s testing behavior
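The evaluation rules above can be sketched as a small result record. The `Outcome` class and its field names are hypothetical, and treating any restriction violation as a failed test is an assumption; the text only states that violations are recorded as critical findings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Outcome:
    """Hypothetical result record; not Penelope's actual result type."""

    goal_achieved: bool
    violated_restrictions: List[str] = field(default_factory=list)

    @property
    def critical_findings(self) -> List[str]:
        # Every violated restriction is reported as a critical finding.
        return [f"Restriction violated: {r}" for r in self.violated_restrictions]

    @property
    def succeeded(self) -> bool:
        # Success requires the goal to be met and no restriction to be violated.
        return self.goal_achieved and not self.violated_restrictions


clean = Outcome(goal_achieved=True)
leaky = Outcome(
    goal_achieved=True,
    violated_restrictions=["Must not mention competitor brands"],
)
```

Here `clean.succeeded` is true, while `leaky` fails despite reaching its goal and carries one critical finding.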
What Makes Penelope Unique?
- True Multi-Turn Understanding - Native support for stateful conversations with full context retention
- Provider Agnostic - Works with OpenAI, Anthropic, Vertex AI, and any OpenAI-compatible provider
- Target Flexible - Test Rhesis endpoints, LangChain apps, CrewAI agents, or any conversational system
- Smart Defaults - Specify just a goal and Penelope plans the testing approach herself
- LLM-Driven Evaluation - Goal achievement evaluated by LLMs, not brittle heuristics
- Transparent Reasoning - See Penelope’s thought process at each step
- Type-Safe - Full Pydantic validation from config to results
Design Philosophy
Built following Anthropic’s agent engineering principles:
- Simplicity - Single-purpose agent with clear responsibilities
- Transparency - Explicit reasoning at each step
- Quality ACI (agent-computer interface) - Extensively documented tools with clear usage patterns
- Ground Truth - Environmental feedback from actual endpoint responses
- Stopping Conditions - Clear termination criteria
Ready to get started? Check out the Getting Started guide to install Penelope and run your first test.