# Worker Troubleshooting Guide
This document covers common issues you may encounter with the Rhesis worker system and how to resolve them.
## Chord-Related Issues

> 📖 For comprehensive chord management, see the Chord Management and Monitoring guide.

The most common issues in the Rhesis worker system involve Celery chords. For detailed information about chord monitoring, troubleshooting, and best practices, refer to the dedicated Chord Management Guide.
### Quick Chord Issue Resolution

If you're experiencing chord issues right now:

- **Immediate Check**: Run `python fix_chords.py` from the backend directory
- **Status Overview**: Run `python -m rhesis.backend.tasks.execution.chord_monitor status`
- **Emergency Cleanup**: See the Emergency Recovery section
## Dealing with Stuck Tasks

Sometimes tasks can get stuck in an infinite retry loop, especially chord tasks (`chord_unlock`) when subtasks fail. This can happen if:
- One or more subtasks in a chord fail permanently
- The broker connection is interrupted during a chord execution
- The worker processes are killed unexpectedly
### Symptoms of Stuck Tasks

The most obvious symptom is thousands of repeated log entries like these:

```
Task celery.chord_unlock[82116cfc-ae23-4526-b7ff-7267f389b367] retry: Retry in 1.0s
MaxRetriesExceededError: Can't retry celery.chord_unlock[task-id] args:(...) kwargs:{...}
```

These messages indicate "zombie" tasks that keep retrying indefinitely.
### Quick Resolution for Stuck Chords

> 💡 See the Chord Management Guide for comprehensive solutions.

```bash
# Check for stuck chords
python -m rhesis.backend.tasks.execution.chord_monitor check --max-hours 1

# Revoke stuck chords (dry run first)
python -m rhesis.backend.tasks.execution.chord_monitor revoke --max-hours 1 --dry-run

# Actually revoke them
python -m rhesis.backend.tasks.execution.chord_monitor revoke --max-hours 1
```
### Configuration to Prevent Stuck Tasks

The `worker.py` file includes configuration to limit chord retries:

```python
app.conf.update(
    # Chord configuration - prevent infinite retry loops
    chord_unlock_max_retries=3,
    chord_unlock_retry_delay=1.0,
    # Improved chord reliability
    result_persistent=True,
    result_expires=3600,
    # Task tracking for monitoring
    task_track_started=True,
    task_send_sent_event=True,
    worker_send_task_events=True,
)
```
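To confirm these settings are actually loaded, you can read them back from the Celery app's configuration. A minimal check, assuming it is run from the backend directory so the app is importable:

```python
# Read back the chord settings via standard Celery app.conf attribute access.
from rhesis.backend.worker import app

print("chord_unlock_max_retries:", app.conf.chord_unlock_max_retries)
print("result_expires:", app.conf.result_expires)
```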
Additionally, the results handling in `tasks/execution/results.py` includes logic to detect and handle failed subtasks:
```python
# Handle different result formats from chord execution
processed_results = []
if results:
    for result in results:
        if result is None:
            processed_results.append(None)
        elif isinstance(result, list) and len(result) == 2:
            # Handle [[task_id, result], error] format from failed chord tasks
            task_result = result[1] if result[1] is not None else None
            processed_results.append(task_result)
        else:
            processed_results.append(result)

# Check for failed tasks and count them
failed_tasks = sum(
    1 for result in processed_results
    if result is None
    or (isinstance(result, dict) and result.get("status") == "failed")
)
```
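As a small worked example (with made-up result values), three representative inputs run through that normalization produce two entries that count as failed:

```python
# Hypothetical inputs illustrating the normalization above.
results = [
    {"status": "completed", "test_id": "t1"},  # healthy subtask result
    None,                                       # subtask returned nothing
    [["some-task-id", "traceback"], None],      # two-element failure format, no usable result
]

processed_results = []
for result in results:
    if result is None:
        processed_results.append(None)
    elif isinstance(result, list) and len(result) == 2:
        processed_results.append(result[1] if result[1] is not None else None)
    else:
        processed_results.append(result)

failed_tasks = sum(
    1 for r in processed_results
    if r is None or (isinstance(r, dict) and r.get("status") == "failed")
)
print(failed_tasks)  # 2: the None result and the failure-format entry
```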
### Purging Stuck Tasks

> ⚠️ Use these commands with caution in production.

For immediate relief from stuck tasks:

```bash
# Emergency: Purge all tasks (see chord-management.md for safer alternatives)
python -m rhesis.backend.tasks.execution.chord_monitor clean --force
```
For more targeted approaches, see the Chord Management Guide.
## Tenant Context Issues

If tasks fail with errors related to the tenant context, such as:

```
unrecognized configuration parameter "app.current_organization"
```

Ensure that:

- Your database has the proper configuration parameters set
- The `organization_id` and `user_id` are correctly passed to the task
- The tenant context is explicitly set at the beginning of database operations (see the sketch below)
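For illustration, here is a minimal sketch of what explicitly setting the tenant context can look like with SQLAlchemy against PostgreSQL session settings. The `set_tenant_context` helper and the `app.current_user` parameter name are assumptions for this example; only `app.current_organization` appears in the error above, and the actual helper in the codebase may differ:

```python
# Minimal sketch, not the actual Rhesis helper. set_config(name, value, is_local)
# with is_local=true scopes the setting to the current transaction.
from sqlalchemy import text
from sqlalchemy.orm import Session


def set_tenant_context(session: Session, organization_id: str, user_id: str) -> None:
    session.execute(
        text("SELECT set_config('app.current_organization', :org, true)"),
        {"org": organization_id},
    )
    # 'app.current_user' is a hypothetical parameter name for this sketch.
    session.execute(
        text("SELECT set_config('app.current_user', :uid, true)"),
        {"uid": user_id},
    )
```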
The `execute_single_test` task in `tasks/execution/test.py` includes defensive coding to handle such issues:
```python
# Access context from task request - task headers take precedence over kwargs
task = self.request
request_user_id = getattr(task, 'user_id', None)
request_org_id = getattr(task, 'organization_id', None)

# Use passed parameters if available, otherwise use request context
user_id = user_id or request_user_id
organization_id = organization_id or request_org_id
```
## Common Worker Errors

### Error: "chord_unlock" task failing repeatedly

**Symptoms**: Repeated logs of `chord_unlock` tasks retrying, `MaxRetriesExceededError`

**Cause**: This typically happens when one or more subtasks in a chord (group of tasks) fail, but the callback still needs to run

**Solution**:

- Use the monitoring script: `python fix_chords.py`
- See the Chord Management Guide for detailed solutions
- Ensure tasks always return valid results (see best practices and the pattern sketched below)
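One way to "always return valid results" is for each chord subtask to catch its own exceptions and return a structured failure record instead of raising, so the callback still receives a countable entry. A sketch of that pattern; the `run_single_test` task name and result shape are illustrative, not the actual Rhesis task:

```python
# Illustrative pattern only; not the real task from tasks/execution/test.py.
from celery import shared_task


@shared_task(bind=True)
def run_single_test(self, test_id: str) -> dict:
    try:
        # ... the actual test execution would happen here ...
        outcome = {"passed": True}  # stand-in payload for this sketch
        return {"status": "completed", "test_id": test_id, "result": outcome}
    except Exception as exc:
        # Returning (rather than raising) lets the chord callback run and
        # count this entry as failed, matching the results.py logic above.
        return {"status": "failed", "test_id": test_id, "error": str(exc)}
```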
### Error: No connection to broker

**Symptoms**: Worker fails to start or tasks are not being processed

**Cause**: Connection to the Redis broker is not working

**Solution**:

- Check that Redis is running and accessible
- Verify the `BROKER_URL` environment variable is correct
- For TLS connections (`rediss://`), ensure the `ssl_cert_reqs=CERT_NONE` parameter is included (see the example below)
- Test Redis connectivity: `redis-cli -u "$BROKER_URL" ping`
- Check firewall rules if running in a cloud environment
- For GKE deployments, see the GKE Troubleshooting Guide
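For reference, a TLS broker URL with the parameter included looks like the following; the host, port, and password are placeholders, and the `ssl_cert_reqs` query parameter is consumed by Celery when it connects, not by `redis-cli`:

```
rediss://:mypassword@redis.example.com:6380/0?ssl_cert_reqs=CERT_NONE
```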
### Error: Missing API Keys for Model Evaluation

**Symptoms**: Tasks fail with errors like "GEMINI_API_KEY environment variable is required"

**Cause**: Model evaluation tasks require API keys for external AI services

**Solution**:

- Ensure the following environment variables are set:
  - `GEMINI_API_KEY`: For Google Gemini models
  - `GEMINI_MODEL_NAME`: Gemini model name (e.g., "gemini-1.5-pro")
  - `AZURE_OPENAI_ENDPOINT`: Azure OpenAI endpoint URL
  - `AZURE_OPENAI_API_KEY`: Azure OpenAI API key
  - `AZURE_OPENAI_DEPLOYMENT_NAME`: Your Azure deployment name
  - `AZURE_OPENAI_API_VERSION`: API version (e.g., "2024-02-01")
- For GKE deployments, add these to your GitHub secrets
- Verify environment variables using the debug endpoint: `curl localhost:8080/debug/env`
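To catch missing keys before tasks start failing, you can validate them at worker startup. A minimal sketch, assuming you want a hard failure when any of the variables listed above is unset:

```python
# Fail fast at startup if any required model-evaluation variable is missing.
import os

REQUIRED_ENV_VARS = [
    "GEMINI_API_KEY",
    "GEMINI_MODEL_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
]

missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
```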
### Error: Test runs stuck in "IN_PROGRESS" status

**Symptoms**: Test configurations start but never complete, remaining in progress indefinitely

**Cause**: Usually chord-related - the callback function (`collect_results`) never executes

**Solution**:

- Check for stuck chords: `python -m rhesis.backend.tasks.execution.chord_monitor status`
- See Chord Never Completing in the Chord Management Guide
- Review individual task results to ensure they're returning valid data (see the snippet below)
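To review individual task results, you can inspect them directly with Celery's standard `AsyncResult` API, given a task ID from your logs (the ID below is the example from earlier in this guide):

```python
# Inspect a single task's state and return value from the result backend.
from celery.result import AsyncResult

from rhesis.backend.worker import app

res = AsyncResult("82116cfc-ae23-4526-b7ff-7267f389b367", app=app)
print(res.state)   # e.g. PENDING, STARTED, SUCCESS, FAILURE
print(res.result)  # the task's return value, or the exception on failure
```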
## Worker Registration and Status Checking

### Check Registered Workers

Use this Python script to check if workers are properly registered with the Celery broker:
```bash
# Create check_workers.py in project root
cat > check_workers.py << 'EOF'
#!/usr/bin/env python3
import sys
from datetime import datetime

sys.path.insert(0, 'apps/backend/src')

try:
    from rhesis.backend.worker import app as celery_app
except ImportError as e:
    print(f"❌ Import error: {e}")
    sys.exit(1)


def check_celery_workers():
    print("🚀 CELERY WORKER CHECKER")
    print("=" * 50)
    print(f"⏰ Timestamp: {datetime.now().isoformat()}")

    try:
        inspect = celery_app.control.inspect()

        # Check active workers
        print("\n📋 Active Workers:")
        active = inspect.active()
        if active:
            for worker_name, tasks in active.items():
                print(f"  ✅ {worker_name}: {len(tasks)} active tasks")
        else:
            print("  ❌ No active workers found")

        # Check registered workers
        print("\n📋 Registered Workers:")
        registered = inspect.registered()
        if registered:
            for worker_name, tasks in registered.items():
                print(f"  ✅ {worker_name}: {len(tasks)} registered tasks")
        else:
            print("  ❌ No registered workers found")

        # Check worker stats
        print("\n📊 Worker Statistics:")
        stats = inspect.stats()
        if stats:
            for worker_name, worker_stats in stats.items():
                print(f"  📈 {worker_name}:")
                print(f"    - Pool: {worker_stats.get('pool', {}).get('max-concurrency', 'unknown')} max concurrency")
                print(f"    - Total tasks: {worker_stats.get('total', 'unknown')}")
        else:
            print("  ❌ No worker statistics available")

        return bool(active or registered)
    except Exception as e:
        print(f"❌ Error checking Celery workers: {e}")
        return False


if __name__ == "__main__":
    check_celery_workers()
EOF
chmod +x check_workers.py
```
**Usage**:

```bash
python check_workers.py
```
**Expected Output (healthy workers)**:

```
🚀 CELERY WORKER CHECKER
==================================================
⏰ Timestamp: 2025-06-14T10:57:41.278363

📋 Active Workers:
  ✅ celery@worker-pod-abc123: 0 active tasks

📋 Registered Workers:
  ✅ celery@worker-pod-abc123: 12 registered tasks

📊 Worker Statistics:
  📈 celery@worker-pod-abc123:
    - Pool: 8 max concurrency
    - Total tasks: 0
```

**Expected Output (no workers)**:

```
📋 Active Workers:
  ❌ No active workers found

📋 Registered Workers:
  ❌ No registered workers found

📊 Worker Statistics:
  ❌ No worker statistics available
```
### Quick Worker Status Commands

```bash
# Check if any workers are running
python -c "from rhesis.backend.worker import app; active = app.control.inspect().active(); print('Workers:', list(active.keys()) if active else 'None')"

# Get worker statistics
python -c "from rhesis.backend.worker import app; import json; print(json.dumps(app.control.inspect().stats(), indent=2))"

# Check registered tasks
python -c "from rhesis.backend.worker import app; registered = app.control.inspect().registered(); print(f'Registered tasks: {sum(len(tasks) for tasks in registered.values()) if registered else 0}')"
```
### Worker Connection Troubleshooting

If no workers are found:

- **Check broker connectivity**:

  ```bash
  python -c "
  import os
  import redis
  from urllib.parse import urlparse

  broker_url = os.getenv('BROKER_URL')
  parsed = urlparse(broker_url)
  r = redis.Redis(host=parsed.hostname, port=parsed.port, password=parsed.password, ssl=(parsed.scheme == 'rediss'))
  print('Redis ping:', r.ping())
  "
  ```

- **Verify worker processes are running**:

  ```bash
  # For local development
  ps aux | grep celery

  # For Docker/Kubernetes
  kubectl get pods -n <namespace>
  kubectl logs <pod-name> -n <namespace>
  ```

- **Check worker startup logs**:

  ```bash
  # Look for successful worker registration
  grep -i "ready" /path/to/worker/logs
  grep -i "connected" /path/to/worker/logs
  ```
## Monitoring and Prevention

### Regular Monitoring

Set up automated monitoring to catch issues early:

```bash
# Add to crontab for periodic monitoring
*/15 * * * * cd /path/to/backend && python fix_chords.py >/dev/null 2>&1
```
### Health Checks

Include chord status in your application health checks:

```python
from rhesis.backend.tasks.execution.chord_monitor import get_active_chord_unlocks, check_stuck_chords


def worker_health_check():
    active_chords = get_active_chord_unlocks()
    stuck_chords = check_stuck_chords(max_runtime_hours=1)
    return {
        "status": "unhealthy" if stuck_chords else "healthy",
        "active_chord_unlocks": len(active_chords),
        "stuck_chords": len(stuck_chords),
    }
```
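The same helpers can also back a standalone probe whose exit code signals health, which is convenient for cron alerts or a Kubernetes liveness command. A minimal sketch using only the `check_stuck_chords` call shown above; the script itself is hypothetical:

```python
#!/usr/bin/env python3
# Standalone worker health probe: exits non-zero when stuck chords are
# detected, so it can drive a cron alert or a container liveness check.
import sys

from rhesis.backend.tasks.execution.chord_monitor import check_stuck_chords

stuck = check_stuck_chords(max_runtime_hours=1)
if stuck:
    print(f"UNHEALTHY: {len(stuck)} stuck chord(s)")
    sys.exit(1)
print("healthy")
```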
## Related Documentation
- Chord Management and Monitoring: Comprehensive guide for chord-specific issues
- GKE Troubleshooting Guide: Debugging workers in Google Kubernetes Engine
- Background Tasks and Processing: General task management information
- Architecture and Dependencies: System integration details