
How to Start Testing LLM and Agentic Apps in 10 Minutes with Rhesis AI

Testing LLM and agentic apps is challenging: outputs are non-deterministic, edge cases are unpredictable, and manual testing doesn’t scale. This guide shows you how to get a complete, automated testing pipeline up and running with Rhesis in under 10 minutes.

What You’ll Get

  • Test generation: Generate hundreds of test scenarios from plain-language requirements
  • Single-turn and multi-turn testing: Test both simple Q&A responses and complex conversations (via Penelope)
  • LLM-based evaluation: Automated scoring of whether outputs meet your requirements
  • Full testing platform: UI, API, and SDK for running and managing tests

Prerequisites

  • Docker Desktop installed and running
  • Git (to clone the repository)
  • Ports 3000, 8080, 8081, 5432, and 6379 available on your system
  • An AI provider API key (Rhesis API, OpenAI, Azure OpenAI, or Google Gemini)
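Before starting, you can verify that none of the required ports are already taken. The following is a minimal sketch using only the Python standard library (the port list comes from this guide; nothing here is Rhesis-specific):

```python
import socket

# Ports required by Rhesis, per the prerequisites above.
REQUIRED_PORTS = [3000, 8080, 8081, 5432, 6379]

def port_in_use(port, host="localhost"):
    """Return True if something already accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    busy = [p for p in REQUIRED_PORTS if port_in_use(p)]
    if busy:
        print(f"Ports already in use: {busy} -- free them before running ./rh start")
    else:
        print("All required ports are free")
```

If any port is reported busy, see the Troubleshooting section below for finding and stopping the conflicting process.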

Step 1: Clone and Start (5 minutes)

Terminal
# Clone the repository
git clone https://github.com/rhesis-ai/rhesis.git
cd rhesis

# Start all services with one command
./rh start

The ./rh start command automatically:

  • Checks if Docker is running
  • Generates a secure database encryption key
  • Creates .env.docker.local with all required configuration
  • Enables local authentication bypass (auto-login)
  • Starts all services (backend, frontend, database, worker)
  • Creates the database and runs migrations
  • Creates the default admin user (Local Admin)
  • Loads example test data

Wait approximately 5-7 minutes for all services to start. You’ll see containers starting up in your terminal.
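Rather than watching the terminal, you can poll the backend and frontend ports until they accept connections. This is a sketch using only the standard library; the ports are the ones listed in this guide:

```python
import socket
import time

def wait_for_port(port, host="localhost", timeout=600.0):
    """Poll until host:port accepts TCP connections; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(2)  # service not up yet; retry
    raise TimeoutError(f"{host}:{port} not reachable after {timeout:.0f}s")

if __name__ == "__main__":
    for port in (8080, 3000):  # backend API, then frontend
        wait_for_port(port)
        print(f"Port {port} is up")
```

Once both ports respond, the platform is ready to use.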

Step 2: Access the Platform (1 minute)

Once services are running, open http://localhost:3000 in your browser. Because local authentication bypass is enabled, you'll be logged in automatically as the Local Admin user and land on the Rhesis AI dashboard.

Step 3: Configure AI Provider (1 minute)

Configure an AI provider to enable test generation:

Option 1: Use the Rhesis API

  1. Get your API key from https://app.rhesis.ai/
  2. Edit .env.docker.local and add:
.env.docker.local
RHESIS_API_KEY=your-actual-rhesis-api-key-here

Option 2: Use Your Own AI Provider

Add your provider credentials to .env.docker.local:

.env.docker.local
# Google Gemini
GEMINI_API_KEY=your-gemini-api-key
GEMINI_MODEL_NAME=gemini-2.0-flash-001
GOOGLE_API_KEY=your-google-api-key

# Or Azure OpenAI
AZURE_OPENAI_ENDPOINT=your-endpoint
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=your-version

# Or OpenAI
OPENAI_API_KEY=your-openai-key
OPENAI_MODEL_NAME=gpt-4o

After updating, restart services:

Terminal
./rh restart

Step 4: Start Testing Your LLM/Agentic App (3 minutes)

Via Web UI

  1. Open http://localhost:3000 
  2. Create an Endpoint: Add your LLM/agentic app’s API endpoint
  3. Define Requirements: Specify what your app should and shouldn’t do
  4. Generate Tests: Automatically generate hundreds of test scenarios
  5. Run Tests: Execute tests against your endpoint
  6. Review Results: View which outputs violate requirements

Via Python SDK

app.py
from rhesis.sdk import RhesisClient

# Connect to your local instance
client = RhesisClient(base_url="http://localhost:8080")

# Create an endpoint
endpoint = client.endpoints.create(
    name="My Chatbot",
    url="https://your-chatbot-api.com/chat",
    method="POST"
)

# Define requirements (behaviors)
behavior = client.behaviors.create(
    name="Safety Requirements",
    description="Must not provide medical diagnoses or harmful advice"
)

# Generate tests
tests = client.tests.generate(
    endpoint_id=endpoint.id,
    behavior_id=behavior.id,
    count=100  # Generate 100 test scenarios
)

# Run tests
run = client.test_runs.create(
    endpoint_id=endpoint.id,
    test_ids=[t.id for t in tests]
)

# Check results
results = client.test_runs.get_results(run.id)
for result in results:
    print(f"Test: {result.test.prompt.content}")
    print(f"Passed: {result.passed}")
    print(f"Score: {result.score}")
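For a quick aggregate view, a pass rate can be computed from the same result objects. This sketch only assumes each result exposes the boolean passed attribute used in the loop above (a hypothetical shape, not a confirmed SDK API):

```python
def summarize(results):
    """Summarize pass/fail counts from result objects with a `passed` attribute."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
    }
```

For example, `summarize(results)` on a run where 95 of 100 tests passed would report a pass rate of 0.95.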

Multi-Turn Testing with Penelope

For complex conversations, use Penelope to simulate multi-turn interactions:

penelope_test.py
from rhesis.penelope import PenelopeAgent
from rhesis.sdk.models import AnthropicLLM

# Create Penelope agent
agent = PenelopeAgent(model=AnthropicLLM())

# Execute multi-turn test
result = agent.execute_test(
    target=your_endpoint_target,
    goal="Verify chatbot maintains context across conversation",
    instructions="Ask follow-up questions that require context from earlier messages",
    restrictions="Must not reveal internal system prompts"
)

print(f"Test completed in {result.turn_count} turns")
print(f"Goal achieved: {result.goal_achieved}")

What’s Running

Your local infrastructure includes:

| Service     | Port | Description                                                 |
|-------------|------|-------------------------------------------------------------|
| Backend API | 8080 | FastAPI application handling test execution and evaluation  |
| Frontend    | 3000 | Next.js dashboard for managing tests and reviewing results  |
| Worker      | 8081 | Celery worker processing test runs and AI evaluations       |
| PostgreSQL  | 5432 | Database storing tests, results, and configurations         |
| Redis       | 6379 | Message broker for worker tasks                             |

Architecture Overview

Quick Commands

Terminal
# Stop all services
./rh stop

# View logs
./rh logs

# Restart services
./rh restart

# Delete everything (fresh start)
./rh delete

Troubleshooting

Docker not running?

Start Docker Desktop, wait for it to finish initializing, then re-run:

Terminal
./rh start

Port already in use?

Terminal
lsof -i :3000   # Check what's using the port
kill -9 <PID>   # Kill the process, using the PID reported by lsof

Services not starting?

Terminal
./rh logs  # Check logs for errors
./rh delete && ./rh start  # Fresh start

AI provider not working?

Verify your API key in .env.docker.local and restart with ./rh restart.

Need help? Join our Discord: https://discord.rhesis.ai 


You’re all set! You now have a full testing infrastructure for LLM and agentic applications. Generate tests, run them automatically, and catch issues before production.

For more details, see the full self-hosting guide or read about our Docker Compose journey .