
Execution Trace

Penelope captures comprehensive execution traces that provide complete visibility into multi-turn test runs. These traces are structured, machine-readable, and designed for analysis, debugging, and integration with metrics systems.

Overview

Every test execution produces a TestResult object that contains:

  • Test outcomes - Status, goal achievement, findings
  • Complete conversation history - Every turn with full message context
  • Easy-to-read conversation summary - Simplified turn-by-turn flow with clear roles
  • Structured evaluation data - Complete goal evaluation with detailed criteria analysis
  • Standardized metrics - SDK-compatible metric summaries (no duplication)
  • Test configuration - Full reproducibility information
  • Performance statistics - Timing, tool usage, token consumption

Schema Structure

test_result.json
{
  "test_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "success",
  "goal_achieved": true,
  "turns_used": 3,
  "findings": ["All success criteria met"],

  "goal_evaluation": {
    "score": 1.0,
    "all_criteria_met": true,
    "confidence": 1.0,
    "reason": "All criteria satisfied",
    "criteria_evaluations": [
      {
        "criterion": "Information accuracy",
        "met": true,
        "evidence": "Turn 2: User confirmed coverage details",
        "relevant_turns": [2]
      }
    ]
  },

  "history": [
    {
      "turn_number": 1,
      "reasoning": "Need to gather information",
      "assistant_message": {...},
      "tool_message": {...}
    }
  ],

  "conversation_summary": [
    {
      "turn": 1,
      "timestamp": "2024-01-15T10:30:00Z",
      "penelope_reasoning": "Need to gather information about coverage",
      "penelope_message": "What insurance coverage do you offer?",
      "target_response": "We offer auto, home, life, and health insurance...",
      "session_id": "abc123",
      "success": true
    }
  ],

  "metrics": {
    "Penelope Goal Evaluation": {
      "score": 1.0,
      "is_successful": true,
      "confidence": 1.0,
      "criteria_met": 1,
      "criteria_total": 1,
      "reason": "All criteria met"
    }
  },

  "config": {
    "goal": "Test goal",
    "max_turns": 10,
    "model_name": "gemini-2.0-flash"
  },

  "stats": {
    "total_turns": 3,
    "tools_used": 2,
    "total_tokens": 1250,
    "execution_time_seconds": 5.2
  }
}

Key Fields

Data Structure Overview

Penelope’s execution trace is structured to avoid data duplication: detailed data lives in exactly one field, with lightweight summaries elsewhere (see the sketch after this list):

  • goal_evaluation: Complete goal evaluation data with detailed criteria_evaluations, evidence, and turn references
  • metrics: Summary-only data for goal achievement metrics (score, criteria counts) - detailed data is in goal_evaluation
  • conversation_summary: Easy-to-read turn-by-turn flow for UI display
  • history: Complete technical conversation history with full message context
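The split matters in practice: detail lives only in goal_evaluation, while metrics repeats just the counters. A minimal sketch, assuming result.metrics mirrors the JSON mapping shown above:

trace_fields.py
# Detailed criteria live only in goal_evaluation.
for ce in result.goal_evaluation.criteria_evaluations:
    print(f"{ce.criterion}: met={ce.met}, turns={ce.relevant_turns}")

# metrics carries only the summary counters for the same evaluation.
summary = result.metrics["Penelope Goal Evaluation"]
print(f"{summary['criteria_met']}/{summary['criteria_total']} criteria met")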

Status & Outcome

  • status: "success", "failure", "timeout", or "error"
  • goal_achieved: Boolean indicating if test objective was met
  • turns_used: Number of conversation turns executed
  • findings: List of key observations from the test
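For example, branching on these outcome fields might look like this (a sketch; accessing the raw string via result.status.value follows the summary example later on this page):

status_check.py
# Branch on the four documented statuses.
status = result.status.value  # "success" | "failure" | "timeout" | "error"

if status == "success" and result.goal_achieved:
    print(f"Goal achieved in {result.turns_used} turns")
elif status == "timeout":
    print("Ran out of turns before the goal was achieved")
else:
    print(f"Status: {status}; findings: {result.findings}")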

Goal Evaluation

Complete goal evaluation with detailed criteria analysis. This field contains the full evaluation data, while metrics contains only summary information to avoid duplication:

goal_evaluation.json
{
  "score": 1.0,
  "all_criteria_met": true,
  "confidence": 1.0,
  "reason": "Detailed explanation of evaluation",
  "criteria_evaluations": [
    {
      "criterion": "Information accuracy",
      "met": true,
      "evidence": "Evidence for this criterion",
      "relevant_turns": [1, 2]
    }
  ],
  "criteria_met": 1,
  "criteria_total": 1,
  "is_successful": true
}
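The summary counters are derived from the detailed list, so one follows from the other. A quick consistency check, assuming the Python attributes mirror the JSON fields above:

evaluation_check.py
evaluation = result.goal_evaluation

# criteria_met should equal the number of criteria marked met=true.
met = sum(1 for ce in evaluation.criteria_evaluations if ce.met)
assert met == evaluation.criteria_met

# all_criteria_met is true exactly when every criterion is met.
assert evaluation.all_criteria_met == (met == evaluation.criteria_total)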

Conversation Summary

For easy reading and UI display, each test includes a simplified conversation summary with clear role names:

conversation_summary.json
[
  {
    "turn": 1,
    "timestamp": "2024-01-15T10:30:00Z",
    "penelope_reasoning": "Need to gather information about coverage options",
    "penelope_message": "What insurance coverage do you offer?",
    "target_response": "We offer auto, home, life, and health insurance with various coverage levels...",
    "session_id": "abc123-def456",
    "success": true
  },
  {
    "turn": 2,
    "timestamp": "2024-01-15T10:30:15Z",
    "penelope_reasoning": "Follow up on auto insurance specifics",
    "penelope_message": "Can you tell me more about auto insurance coverage?",
    "target_response": "Auto insurance includes liability, collision, comprehensive, and uninsured motorist coverage...",
    "session_id": "abc123-def456",
    "success": true
  }
]

Key Benefits:

  • Clear roles: “penelope” for the agent, “target” for the endpoint
  • Easy tracking: Turn-by-turn conversation flow
  • UI-friendly: Perfect for frontend display and analysis
  • Complements history: Simplified view while the detailed history remains available (see the rendering sketch below)
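A minimal rendering sketch: printing the summary as a transcript from a trace exported to JSON (see the JSON Export section below for how the file is written):

render_summary.py
import json

# Load an exported trace and print the summary as a readable transcript.
with open("execution_trace.json") as f:
    trace = json.load(f)

for turn in trace["conversation_summary"]:
    print(f"[{turn['timestamp']}] Turn {turn['turn']}")
    print(f"  penelope: {turn['penelope_message']}")
    print(f"  target:   {turn['target_response']}")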

Conversation History

Each turn in the detailed history contains:

turn.json
{
  "turn_number": 1,
  "reasoning": "Why this action was taken",
  "assistant_message": {
    "role": "assistant",
    "content": "Reasoning about next step",
    "tool_calls": [
      {
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "send_message_to_target",
          "arguments": "{\"message\": \"Hello\"}"
        }
      }
    ]
  },
  "tool_message": {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "Tool response"
  }
}
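Note that the outbound message is embedded in the tool call's JSON-encoded arguments string, so recovering it takes one extra parse. A sketch, assuming the turn objects expose the same shape as the JSON above:

extract_messages.py
import json

# Collect every message Penelope sent via send_message_to_target.
for turn in result.history:
    for call in turn.assistant_message.tool_calls or []:
        if call.function.name == "send_message_to_target":
            args = json.loads(call.function.arguments)  # arguments is a JSON string
            print(f"Turn {turn.turn_number}: {args['message']}")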

Metrics

SDK-compatible format for integration with the Rhesis platform. For goal achievement metrics, this field contains summary data only (the detailed criteria live in goal_evaluation):

metrics.json
{
  "Penelope Goal Evaluation": {
    "score": 1.0,
    "is_successful": true,
    "confidence": 1.0,
    "criteria_met": 3,
    "criteria_total": 3,
    "reason": "All success criteria met"
  }
}
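Because metrics is a flat mapping from metric name to summary fields, it converts readily into tabular form for reporting. A sketch (the CSV layout is illustrative, not a Penelope format; it assumes result.metrics mirrors the JSON above):

metrics_to_csv.py
import csv

# Flatten the metrics mapping into one row per metric.
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "score", "is_successful", "criteria_met", "criteria_total"])
    for name, m in result.metrics.items():
        writer.writerow([name, m["score"], m["is_successful"],
                         m.get("criteria_met"), m.get("criteria_total")])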

Accessing Traces

Python API

access_trace.py
from rhesis.penelope import PenelopeAgent

agent = PenelopeAgent()
result = agent.execute_test(target=target, goal="Test goal")

# Access trace data
print(f"Status: {result.status}")
print(f"Goal achieved: {result.goal_achieved}")
print(f"Turns used: {result.turns_used}")

# Iterate through history
for turn in result.history:
    print(f"Turn {turn.turn_number}: {turn.reasoning}")

# Access conversation summary (easy-to-read format)
for turn in result.conversation_summary:
    print(f"Turn {turn.turn}: {turn.penelope_message}")
    print(f"  → {turn.target_response}")

# Export to JSON
trace_json = result.dict()

# Export to file
result.to_json("trace.json")

JSON Export

export.py
import json

# Save trace
with open("execution_trace.json", "w") as f:
    json.dump(result.dict(), f, indent=2)

# Load trace
with open("execution_trace.json", "r") as f:
    trace_data = json.load(f)

# Reconstruct TestResult
from rhesis.penelope.context import TestResult
result = TestResult.from_dict(trace_data)

Integration with Rhesis Platform

Penelope traces integrate seamlessly with the Rhesis platform:

platform_integration.py
from rhesis.penelope import PenelopeAgent
from rhesis.sdk import Rhesis

# Execute with Penelope
agent = PenelopeAgent()
result = agent.execute_test(target=target, goal="Test goal")

# Submit to Rhesis platform
client = Rhesis()
client.create_test_result(
    test_id="test-123",
    metrics=result.metrics,
    output=result.dict(),
    execution_time=result.stats["execution_time_seconds"],
)

Analysis & Debugging

Quick Summary

summary.py
# Print concise summary
print(f"""
Test ID: {result.test_id}
Status: {result.status.value}
Goal Achieved: {result.goal_achieved}
Turns: {result.turns_used}/{result.config.max_turns}
Findings: {', '.join(result.findings)}
""")

# Check for issues
if not result.goal_achieved:
    print("Failure reasons:")
    for criterion in result.goal_evaluation.criteria_evaluations:
        if not criterion.met:
            print(f"  ❌ {criterion.criterion}: {criterion.evidence}")

Debugging Failed Tests

debug.py
# Analyze conversation flow
for turn in result.history:
    print(f"\nTurn {turn.turn_number}:")
    print(f"  Reasoning: {turn.reasoning}")
    if turn.assistant_message.tool_calls:
        for call in turn.assistant_message.tool_calls:
            print(f"  Tool: {call.function.name}")
    if turn.tool_message:
        print(f"  Response: {turn.tool_message.content[:100]}...")

# Check token usage
stats = result.stats
if stats["total_tokens"] > 50000:
    print(f"⚠️ High token usage: {stats['total_tokens']}")

Complete Schema: See the Penelope GitHub repository for the complete TestResult schema definition and examples.