Worker Troubleshooting Guide
This document covers common issues you may encounter with the Rhesis worker system and how to resolve them.
Chord-Related Issues
📖 For comprehensive chord management, see Chord Management and Monitoring
The most common issues in the Rhesis worker system involve Celery chords. For detailed information about chord monitoring, troubleshooting, and best practices, refer to the dedicated Chord Management Guide.
Quick Chord Issue Resolution
If you're experiencing chord issues right now:
- Immediate Check: Run `python fix_chords.py` from the backend directory
- Status Overview: Run `python -m rhesis.backend.tasks.execution.chord_monitor status`
- Emergency Cleanup: See Emergency Recovery section
Dealing with Stuck Tasks
Sometimes tasks can get stuck in an infinite retry loop, especially chord tasks (chord_unlock) when subtasks fail. This can happen if:
- One or more subtasks in a chord fail permanently
- The broker connection is interrupted during a chord execution
- The worker processes are killed unexpectedly
Symptoms of Stuck Tasks
The most obvious symptom is thousands of repeated log entries like these:
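The original log excerpt is not reproduced here; on recent Celery versions the retry spam looks roughly like this (timestamps and task IDs are illustrative):

```
[2024-01-15 10:23:45,101: INFO/MainProcess] Task celery.chord_unlock[3f6a...] retry: Retry in 1.0s
[2024-01-15 10:23:46,103: INFO/MainProcess] Task celery.chord_unlock[3f6a...] retry: Retry in 1.0s
[2024-01-15 10:23:47,105: INFO/MainProcess] Task celery.chord_unlock[3f6a...] retry: Retry in 1.0s
```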
These messages indicate that there are "zombie" tasks that keep retrying indefinitely.
Quick Resolution for Stuck Chords
💡 See Chord Management Guide for comprehensive solutions
Configuration to Prevent Stuck Tasks
The worker.py file includes configuration to limit chord retries:
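The exact settings are not reproduced here; a minimal sketch of the approach (the values are illustrative, and worker.py may structure this differently):

```python
from celery import Celery

app = Celery("rhesis")

app.conf.update(
    # How often the built-in celery.chord_unlock task polls for group results.
    result_chord_retry_interval=1.0,
    # Cap chord_unlock retries so a broken chord eventually errors out
    # instead of retrying forever.
    task_annotations={
        "celery.chord_unlock": {"max_retries": 60},
    },
)
```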
Additionally, the results handling in tasks/execution/results.py includes logic to detect and handle failed subtasks:
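A simplified sketch of that pattern (the real collect_results in tasks/execution/results.py will differ; the result field names here are assumptions):

```python
from celery import shared_task

@shared_task(bind=True)
def collect_results(self, results, test_configuration_id=None):
    """Chord callback: tolerate failed or missing subtask results instead of hanging."""
    results = results or []
    failed = [
        r for r in results
        if r is None or (isinstance(r, dict) and r.get("status") == "failed")
    ]
    # Always return a summary rather than raising, so the chord can complete
    # even when some subtasks failed.
    return {
        "total": len(results),
        "failed": len(failed),
        "status": "completed_with_errors" if failed else "completed",
    }
```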
Purging Stuck Tasks
⚠️ Use these commands with caution in production
For immediate relief from stuck tasks:
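One blunt option is to purge everything that is still queued (this drops all pending tasks, not only the stuck ones); a minimal sketch, assuming the Celery app is importable from rhesis.backend.worker:

```python
from rhesis.backend.worker import app  # assumed import path for the Celery app

# Destructive: discards every queued (not yet started) task on the broker.
purged = app.control.purge()
print(f"Purged {purged} queued messages")
```

The CLI equivalent is `celery -A rhesis.backend.worker purge -f` (again assuming that app path).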
For more targeted approaches, see the Chord Management Guide.
Tenant Context Issues
If tasks fail with errors related to the tenant context (the exact message depends on your setup, but it typically points to a missing organization/user value or an unset database configuration parameter), ensure that:
- Your database has the proper configuration parameters set
- The `organization_id` and `user_id` are correctly passed to the task
- The tenant context is explicitly set at the beginning of database operations
The execute_single_test task in tasks/execution/test.py includes defensive coding to handle such issues:
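A simplified sketch of that defensive pattern (the session factory, helper calls, and configuration parameter names below are assumptions, not the actual Rhesis code):

```python
from celery import shared_task
from sqlalchemy import text

@shared_task(bind=True)
def execute_single_test(self, test_id, organization_id=None, user_id=None, **kwargs):
    # Fail fast with a clear message instead of running without a tenant context.
    if not organization_id or not user_id:
        raise ValueError("organization_id and user_id are required for execute_single_test")

    with SessionLocal() as db:  # assumed SQLAlchemy session factory
        # Explicitly set the tenant context before any tenant-scoped queries
        # (the parameter names are illustrative).
        db.execute(
            text("SELECT set_config('app.current_organization', :org, false)"),
            {"org": str(organization_id)},
        )
        db.execute(
            text("SELECT set_config('app.current_user', :usr, false)"),
            {"usr": str(user_id)},
        )
        ...  # run the actual test under this tenant context
```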
Common Worker Errors
Error: "chord_unlock" task failing repeatedly
Symptoms: Repeated logs of chord_unlock tasks retrying, MaxRetriesExceededError
Cause: This typically happens when one or more subtasks in a chord (group of tasks) fail, but the callback still needs to run
Solution:
- Use the monitoring script: `python fix_chords.py`
- See Chord Management Guide for detailed solutions
- Ensure tasks always return valid results (see best practices)
Error: No connection to broker
Symptoms: Worker fails to start or tasks are not being processed
Cause: Connection to the Redis broker is not working
Solution:
- Check that Redis is running and accessible
- Verify the `BROKER_URL` environment variable is correct
- For TLS connections (`rediss://`), ensure the `ssl_cert_reqs=CERT_NONE` parameter is included
- Test Redis connectivity: `redis-cli -u "$BROKER_URL" ping` (a Python equivalent is sketched after this list)
- Check firewall rules if running in a cloud environment
- For GKE deployments, see the GKE Troubleshooting Guide
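The same connectivity test can be run from Python (equivalent to the `redis-cli` ping above), reading `BROKER_URL` from the environment:

```python
import os

import redis  # redis-py client

url = os.environ["BROKER_URL"]
# For rediss:// URLs, relax certificate verification to match the
# ssl_cert_reqs=CERT_NONE setting mentioned above.
kwargs = {"ssl_cert_reqs": None} if url.startswith("rediss://") else {}
client = redis.Redis.from_url(url, **kwargs)
print(client.ping())  # True if the broker is reachable
```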
Error: Missing API Keys for Model Evaluation
Symptoms: Tasks fail with errors like "GEMINI_API_KEY environment variable is required"
Cause: Model evaluation tasks require API keys for external AI services
Solution:
- Ensure the following environment variables are set:
  - `GEMINI_API_KEY`: For Google Gemini models
  - `GEMINI_MODEL_NAME`: Gemini model name (e.g., "gemini-1.5-pro")
  - `AZURE_OPENAI_ENDPOINT`: Azure OpenAI endpoint URL
  - `AZURE_OPENAI_API_KEY`: Azure OpenAI API key
  - `AZURE_OPENAI_DEPLOYMENT_NAME`: Your Azure deployment name
  - `AZURE_OPENAI_API_VERSION`: API version (e.g., "2024-02-01")
- For GKE deployments, add these to your GitHub secrets
- Verify environment variables using the debug endpoint: `curl localhost:8080/debug/env`
Error: Test runs stuck in "IN_PROGRESS" status
Symptoms: Test configurations start but never complete, remain in progress indefinitely
Cause: Usually chord-related - the callback function (collect_results) never executes
Solution:
- Check for stuck chords: `python -m rhesis.backend.tasks.execution.chord_monitor status`
- See Chord Never Completing in the Chord Management Guide
- Review individual task results to ensure they're returning valid data
Worker Registration and Status Checking
Check Registered Workers
Use this Python script to check if workers are properly registered with the Celery broker:
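The original script is not reproduced here; a minimal sketch using Celery's inspection API (the Celery app import path is an assumption):

```python
# check_workers.py - minimal worker registration check (illustrative).
from rhesis.backend.worker import app  # assumed Celery app import path

def check_workers() -> bool:
    inspector = app.control.inspect(timeout=5)
    replies = inspector.ping() or {}
    if not replies:
        print("No workers responded - check that worker processes are running")
        return False
    for worker_name, reply in sorted(replies.items()):
        print(f"{worker_name}: {reply.get('ok', 'unknown')}")
    return True

if __name__ == "__main__":
    raise SystemExit(0 if check_workers() else 1)
```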
Usage:
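Assuming the sketch above is saved as `check_workers.py` in the backend directory (the filename is illustrative), run it with `python check_workers.py`; it exits non-zero when no workers respond.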
Expected Output (healthy workers):
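With the sketch above, healthy workers respond to the ping (worker names will differ):

```
celery@worker-1: pong
celery@worker-2: pong
```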
Expected Output (no workers):
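And when no workers are registered:

```
No workers responded - check that worker processes are running
```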
Quick Worker Status Commands
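The original command list is not reproduced here; with standard Celery tooling the usual quick checks are (the `-A` app path is an assumption):
- `celery -A rhesis.backend.worker status`: list online worker nodes
- `celery -A rhesis.backend.worker inspect ping`: ping each worker
- `celery -A rhesis.backend.worker inspect active`: show tasks currently executing
- `celery -A rhesis.backend.worker inspect registered`: show the tasks each worker has registered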
Worker Connection Troubleshooting
If no workers are found:
- Check broker connectivity (for example with `redis-cli -u "$BROKER_URL" ping`, as described under Error: No connection to broker above)
- Verify worker processes are running (for example with `celery -A rhesis.backend.worker status`; the app path is an assumption)
- Check worker startup logs for broker connection errors (how depends on your deployment; for GKE, see the GKE Troubleshooting Guide)
Monitoring and Prevention
Regular Monitoring
Set up automated monitoring to catch issues early:
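One lightweight approach is a cron job that runs the chord monitor on a schedule and logs the output (the paths, schedule, and log location below are assumptions):

```
*/15 * * * * cd /path/to/rhesis/backend && python -m rhesis.backend.tasks.execution.chord_monitor status >> /var/log/rhesis/chord_monitor.log 2>&1
```

Alerting on unexpected output (or periodically running `python fix_chords.py`) can then be layered on with whatever monitoring stack you already use.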
Health Checks
Include chord status in your application health checks:
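A minimal sketch of such a check, assuming the Celery app is importable from rhesis.backend.worker; a fuller version would also surface stuck-chord counts by reusing the chord_monitor logic referenced above:

```python
from rhesis.backend.worker import app  # assumed Celery app import path

def worker_health() -> dict:
    """Basic liveness signal for inclusion in an application health check."""
    inspector = app.control.inspect(timeout=2)
    replies = inspector.ping() or {}
    return {
        "workers_online": len(replies),
        "healthy": bool(replies),
    }
```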
Related Documentation
- Chord Management and Monitoring: Comprehensive guide for chord-specific issues
- GKE Troubleshooting Guide: Debugging workers in Google Kubernetes Engine
- Background Tasks and Processing: General task management information
- Architecture and Dependencies: System integration details