Worker Troubleshooting Guide
This document covers common issues you may encounter with the Rhesis worker system and how to resolve them.
Test runs use the async batch engine (one Celery task per run with internal asyncio fan-out). For how that works and how to debug it, see Test Execution, Execution Modes, and Background Tasks.
Dealing with Stuck Tasks
Tasks can appear stuck when the worker is overloaded, the broker connection drops, a long-running batch is still executing, or a task exhausts its retries or hits a time limit. Typical causes:
- The execution worker crashed or was scaled down while a batch task was running
- Redis or network issues between workers and the broker
- A task exceeded its time limit or is waiting on an external dependency
Inspect and revoke
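For example, from a Python shell you can list what each worker is doing and revoke a stuck task. This is a minimal sketch assuming the Celery app is exposed as `app` by `rhesis.backend.worker`, as the `-A` flag used elsewhere in this guide suggests:

```python
from rhesis.backend.worker import app  # assumed app location, per the -A flag

# List the tasks currently executing on each worker
inspector = app.control.inspect(timeout=5)
for worker, tasks in (inspector.active() or {}).items():
    for task in tasks:
        print(f"{worker}: {task['name']} ({task['id']})")

# Revoke a stuck task by id; terminate=True also kills the running process
app.control.revoke("the-stuck-task-id", terminate=True)
```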
Purging queues (last resort)
⚠️ Use with caution in production — this drops pending work across queues.
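From Python, under the same `app` assumption (the CLI equivalent is `celery -A rhesis.backend.worker purge`):

```python
from rhesis.backend.worker import app  # assumed app location

# Discards every pending message on the default queues -- this is unrecoverable
discarded = app.control.purge()
print(f"Discarded {discarded} pending task(s)")
```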
Celery configuration for brokers, retries, and execution task time limits lives in `apps/backend/src/rhesis/backend/celery/config.py`. For test-run-specific behavior (cancellation watchdog, batch concurrency), see Test Execution.
Tenant Context Issues
If tasks fail with errors related to the tenant context (for example, complaints about a missing `organization_id` or `user_id`), ensure that:
- Your database has the proper configuration parameters set
- The `organization_id` and `user_id` are correctly passed to the task
- The tenant context is explicitly set at the beginning of database operations
The `execute_single_test` task in `tasks/execution/test.py` includes defensive coding to handle such issues.
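A minimal sketch of that pattern (illustrative only, not the actual implementation; `set_tenant_context` is a hypothetical helper):

```python
from celery import Celery

app = Celery("rhesis")  # stand-in; the real app lives in rhesis.backend.worker

def set_tenant_context(organization_id: str, user_id: str) -> None:
    """Hypothetical helper: apply tenant parameters before any database work."""
    print(f"tenant context -> org={organization_id}, user={user_id}")

@app.task(bind=True)
def execute_single_test(self, test_id: str, organization_id: str | None = None,
                        user_id: str | None = None) -> None:
    # Fail fast with a clear error rather than running without a tenant
    if not organization_id or not user_id:
        raise ValueError(
            f"Tenant context missing for test {test_id}: "
            f"organization_id={organization_id!r}, user_id={user_id!r}"
        )
    set_tenant_context(organization_id, user_id)
    # ... proceed with database operations under the tenant context ...
```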
Common Worker Errors
Error: No connection to broker
Symptoms: Worker fails to start or tasks are not being processed
Cause: Connection to the Redis broker is not working
Solution:
- Check that Redis is running and accessible
- Verify the `BROKER_URL` environment variable is correct
- For TLS connections (`rediss://`), ensure the `ssl_cert_reqs=CERT_NONE` parameter is included
- Test Redis connectivity: `redis-cli -u "$BROKER_URL" ping` (see the Python sketch after this list)
- Check firewall rules if running in a cloud environment
- For GKE deployments, see the GKE Troubleshooting Guide
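The same connectivity probe from Python, as a sketch assuming the `redis` package is installed and `BROKER_URL` is set in the environment:

```python
import os

import redis  # pip install redis

broker_url = os.environ["BROKER_URL"]
# Mirror the worker's TLS settings for rediss:// URLs
kwargs = {"ssl_cert_reqs": None} if broker_url.startswith("rediss://") else {}
client = redis.Redis.from_url(broker_url, **kwargs)
print("broker reachable:", client.ping())  # True on success
```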
Error: Missing API Keys for Model Evaluation
Symptoms: Tasks fail with errors like “GEMINI_API_KEY environment variable is required”
Cause: Model evaluation tasks require API keys for external AI services
Solution:
- Ensure the following environment variables are set:
  - `GEMINI_API_KEY`: For Google Gemini models
  - `GEMINI_MODEL_NAME`: Gemini model name (e.g., "gemini-1.5-pro")
  - `AZURE_OPENAI_ENDPOINT`: Azure OpenAI endpoint URL
  - `AZURE_OPENAI_API_KEY`: Azure OpenAI API key
  - `AZURE_OPENAI_DEPLOYMENT_NAME`: Your Azure deployment name
  - `AZURE_OPENAI_API_VERSION`: API version (e.g., "2024-02-01")
- For GKE deployments, add these to your GitHub secrets
- Verify environment variables using the debug endpoint: `curl localhost:8080/debug/env` (a quick Python check is sketched after this list)
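A quick Python check over the same list (a sketch; the variable names mirror the bullets above):

```python
import os

REQUIRED = [
    "GEMINI_API_KEY",
    "GEMINI_MODEL_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All model-evaluation variables are set")
```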
Error: Test runs stuck in “IN_PROGRESS” status
Symptoms: Test configurations start but never complete, remain in progress indefinitely
Cause: Async batch execution did not finish or `collect_results` never ran (worker crash, revoke, broker outage, or time limit).
Solution:
- Check active Celery tasks: `celery -A rhesis.backend.worker inspect active` (a sketch for flagging long-running tasks follows this list)
- Review worker logs for batch runner events and revoke/cancellation messages
- Confirm the test run moves from `Progress` to a terminal status (`Completed`, `Failed`, `Partial`, `Cancelled`)
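To flag tasks that have been executing suspiciously long, a sketch under the same `app` assumption. `time_start` is the start timestamp Celery reports for active tasks (epoch seconds in recent Celery versions), and the 30-minute threshold is an arbitrary example:

```python
import time

from rhesis.backend.worker import app  # assumed app location, per the -A flag

THRESHOLD_SECONDS = 30 * 60  # flag anything running longer than 30 minutes

active = app.control.inspect(timeout=5).active() or {}
now = time.time()
for worker, tasks in active.items():
    for task in tasks:
        age = now - task["time_start"]
        if age > THRESHOLD_SECONDS:
            print(f"{worker}: {task['name']} ({task['id']}) running {age / 60:.0f} min")
```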
Worker Registration and Status Checking
Check Registered Workers
Use this Python script to check if workers are properly registered with the Celery broker:
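The script below is a minimal sketch of such a check using Celery's inspection API (same `app` assumption as above):

```python
"""Check whether Celery workers are registered and responding."""
from rhesis.backend.worker import app  # assumed app location, per the -A flag

def check_workers(timeout: float = 5.0) -> bool:
    replies = app.control.inspect(timeout=timeout).ping() or {}
    if not replies:
        print("No workers responded -- check worker processes and broker connectivity")
        return False
    for worker, reply in sorted(replies.items()):
        print(f"{worker}: {reply.get('ok', reply)}")
    return True

if __name__ == "__main__":
    raise SystemExit(0 if check_workers() else 1)
```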
Usage: save the script (for example as `check_workers.py`, a hypothetical filename) and run it from an environment where the backend package and broker settings are available: `python check_workers.py`. Healthy workers each print a pong reply (e.g., `celery@worker-1: {'ok': 'pong'}`); if no workers respond, the script prints a warning and exits non-zero, which usually means the worker processes are down or cannot reach the broker.
Quick Worker Status Commands
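Equivalent quick checks from Python, under the same `app` assumption (the CLI counterparts are `celery -A rhesis.backend.worker status` and the `celery -A rhesis.backend.worker inspect` subcommands):

```python
from rhesis.backend.worker import app  # assumed app location

insp = app.control.inspect(timeout=5)
print("ping:", insp.ping())              # worker liveness
print("active:", insp.active())          # tasks executing right now
print("reserved:", insp.reserved())      # tasks prefetched but not yet started
print("registered:", insp.registered())  # task types each worker can run
```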
Worker Connection Troubleshooting
If no workers are found:
- Check broker connectivity, e.g. `redis-cli -u "$BROKER_URL" ping` (or the Python probe shown earlier)
- Verify worker processes are running, e.g. `ps aux | grep celery` locally or `kubectl get pods` on GKE
- Check worker startup logs for connection or registration errors, e.g. `kubectl logs <worker-pod>` on GKE; worker statistics can also confirm liveness (see the sketch below)
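If workers respond to ping but you need more detail, worker statistics include the process ID, pool concurrency, and uptime (a sketch, same `app` assumption):

```python
from rhesis.backend.worker import app  # assumed app location

stats = app.control.inspect(timeout=5).stats() or {}
if not stats:
    print("No workers responded")
for worker, info in stats.items():
    pool = info.get("pool", {})
    print(f"{worker}: pid={info.get('pid')} "
          f"concurrency={pool.get('max-concurrency')} uptime={info.get('uptime')}s")
```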
Monitoring and Prevention
Regular monitoring
Periodically sample worker and broker health: `celery -A rhesis.backend.worker inspect active`, worker HTTP debug endpoints where deployed, and Redis latency. Alert on execution queue depth and on test runs stuck in `Progress` beyond an expected SLA.
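Queue depth can be sampled directly from Redis, since Celery's Redis transport stores each queue as a list named after the queue (a sketch; the queue names here are illustrative assumptions):

```python
import os

import redis  # pip install redis

broker_url = os.environ["BROKER_URL"]
kwargs = {"ssl_cert_reqs": None} if broker_url.startswith("rediss://") else {}
client = redis.Redis.from_url(broker_url, **kwargs)
for queue in ("celery", "execution"):  # hypothetical queue names
    print(f"{queue}: {client.llen(queue)} pending task(s)")
```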
Health checks
Include broker reachability and worker liveness (for example HTTP /health or /debug on the worker sidecar, if enabled) in your deployment health checks, plus queue depth or stuck-run alerts if you expose them.
Related Documentation
- GKE Troubleshooting Guide: Debugging workers in Google Kubernetes Engine
- Background Tasks and Processing: General task management information
- Architecture and Dependencies: System integration details