Worker Troubleshooting Guide
This document covers common issues you may encounter with the Rhesis worker system and how to resolve them.
Test runs use the async batch engine (one Celery task per run with internal asyncio fan-out). For how that works and how to debug it, see Test Execution, Execution Modes, and Background Tasks.
Dealing with Stuck Tasks
Tasks can appear stuck when the worker is overloaded, the broker connection drops, a long-running batch is still executing, or a task exhausts its retries or hits a time limit. Typical causes:
- The execution worker crashed or was scaled down while a batch task was running
- Redis or network issues between workers and the broker
- A task exceeded its time limit or is waiting on an external dependency
Inspect and revoke
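For example, from a Python shell you can list what each worker is doing and revoke a stuck task. This is a minimal sketch assuming the Celery app is exposed as `app` by `rhesis.backend.worker`, as the `-A` flag used elsewhere in this guide suggests:

```python
from rhesis.backend.worker import app  # assumed app location, per the -A flag

# List the tasks currently executing on each worker
inspector = app.control.inspect(timeout=5)
for worker, tasks in (inspector.active() or {}).items():
    for task in tasks:
        print(f"{worker}: {task['name']} ({task['id']})")

# Revoke a stuck task by id; terminate=True also kills the running process
app.control.revoke("the-stuck-task-id", terminate=True)
```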
Purging queues (last resort)
⚠️ Use with caution in production — this drops pending work across queues.
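From Python, under the same `app` assumption (the CLI equivalent is `celery -A rhesis.backend.worker purge`):

```python
from rhesis.backend.worker import app  # assumed app location

# Discards every pending message on the default queues -- this is unrecoverable
discarded = app.control.purge()
print(f"Discarded {discarded} pending task(s)")
```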
Celery configuration for brokers, retries, and execution task time limits lives in `apps/backend/src/rhesis/backend/celery/config.py`. For test-run-specific behavior (cancellation watchdog, batch concurrency), see Test Execution.
Tenant Context Issues
If tasks fail with errors related to the tenant context (for example, complaints about a missing `organization_id` or `user_id`), ensure that:
- Your database has the proper configuration parameters set
- The `organization_id` and `user_id` are correctly passed to the task
- The tenant context is explicitly set at the beginning of database operations
The `execute_single_test` task in `tasks/execution/test.py` includes defensive coding to handle such issues.
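A minimal sketch of that pattern (illustrative only, not the actual implementation; `set_tenant_context` is a hypothetical helper):

```python
from celery import Celery

app = Celery("rhesis")  # stand-in; the real app lives in rhesis.backend.worker

def set_tenant_context(organization_id: str, user_id: str) -> None:
    """Hypothetical helper: apply tenant parameters before any database work."""
    print(f"tenant context -> org={organization_id}, user={user_id}")

@app.task(bind=True)
def execute_single_test(self, test_id: str, organization_id: str | None = None,
                        user_id: str | None = None) -> None:
    # Fail fast with a clear error rather than running without a tenant
    if not organization_id or not user_id:
        raise ValueError(
            f"Tenant context missing for test {test_id}: "
            f"organization_id={organization_id!r}, user_id={user_id!r}"
        )
    set_tenant_context(organization_id, user_id)
    # ... proceed with database operations under the tenant context ...
```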
Common Worker Errors
Error: No connection to broker
Symptoms: Worker fails to start or tasks are not being processed
Cause: Connection to the Redis broker is not working
Solution:
- Check that Redis is running and accessible
- Verify the `BROKER_URL` environment variable is correct
- For TLS connections (`rediss://`), ensure the `ssl_cert_reqs=CERT_NONE` parameter is included
- Test Redis connectivity: `redis-cli -u "$BROKER_URL" ping` (see the Python sketch after this list)
- Check firewall rules if running in a cloud environment
- For GKE deployments, see the GKE Troubleshooting Guide
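The same connectivity probe from Python, as a sketch assuming the `redis` package is installed and `BROKER_URL` is set in the environment:

```python
import os

import redis  # pip install redis

broker_url = os.environ["BROKER_URL"]
# Mirror the worker's TLS settings for rediss:// URLs
kwargs = {"ssl_cert_reqs": None} if broker_url.startswith("rediss://") else {}
client = redis.Redis.from_url(broker_url, **kwargs)
print("broker reachable:", client.ping())  # True on success
```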
Error: Missing API Keys for Model Evaluation
Symptoms: Tasks fail with errors like “GEMINI_API_KEY environment variable is required”
Cause: Model evaluation tasks require API keys for external AI services
Solution:
- Ensure the following environment variables are set:
  - `GEMINI_API_KEY`: For Google Gemini models
  - `GEMINI_MODEL_NAME`: Gemini model name (e.g., "gemini-1.5-pro")
  - `AZURE_OPENAI_ENDPOINT`: Azure OpenAI endpoint URL
  - `AZURE_OPENAI_API_KEY`: Azure OpenAI API key
  - `AZURE_OPENAI_DEPLOYMENT_NAME`: Your Azure deployment name
  - `AZURE_OPENAI_API_VERSION`: API version (e.g., "2024-02-01")
- For GKE deployments, add these to your GitHub secrets
- Verify environment variables using the debug endpoint: `curl localhost:8080/debug/env` (a quick Python check is sketched after this list)
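A quick Python check over the same list (a sketch; the variable names mirror the bullets above):

```python
import os

REQUIRED = [
    "GEMINI_API_KEY",
    "GEMINI_MODEL_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All model-evaluation variables are set")
```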
Error: Test runs stuck in “IN_PROGRESS” status
Symptoms: Test configurations start but never complete, remain in progress indefinitely
Cause: Async batch execution did not finish or `collect_results` never ran (worker crash, revoke, broker outage, or time limit).
Solution:
- Check active Celery tasks: `celery -A rhesis.backend.worker inspect active` (a sketch for flagging long-running tasks follows this list)
- Review worker logs for batch runner events and revoke/cancellation messages
- Confirm the test run moves from `Progress` to a terminal status (`Completed`, `Failed`, `Partial`, `Cancelled`)
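To flag tasks that have been executing suspiciously long, a sketch under the same `app` assumption. `time_start` is the start timestamp Celery reports for active tasks (epoch seconds in recent Celery versions), and the 30-minute threshold is an arbitrary example:

```python
import time

from rhesis.backend.worker import app  # assumed app location, per the -A flag

THRESHOLD_SECONDS = 30 * 60  # flag anything running longer than 30 minutes

active = app.control.inspect(timeout=5).active() or {}
now = time.time()
for worker, tasks in active.items():
    for task in tasks:
        age = now - task["time_start"]
        if age > THRESHOLD_SECONDS:
            print(f"{worker}: {task['name']} ({task['id']}) running {age / 60:.0f} min")
```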
Worker Registration and Status Checking
Check Registered Workers
Use this Python script to check if workers are properly registered with the Celery broker:
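The script below is a minimal sketch of such a check using Celery's inspection API (same `app` assumption as above):

```python
"""Check whether Celery workers are registered and responding."""
from rhesis.backend.worker import app  # assumed app location, per the -A flag

def check_workers(timeout: float = 5.0) -> bool:
    replies = app.control.inspect(timeout=timeout).ping() or {}
    if not replies:
        print("No workers responded -- check worker processes and broker connectivity")
        return False
    for worker, reply in sorted(replies.items()):
        print(f"{worker}: {reply.get('ok', reply)}")
    return True

if __name__ == "__main__":
    raise SystemExit(0 if check_workers() else 1)
```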
Usage: save the script (for example as `check_workers.py`, a hypothetical filename) and run it from an environment where the backend package and broker settings are available: `python check_workers.py`. Healthy workers each print a pong reply (e.g., `celery@worker-1: {'ok': 'pong'}`); if no workers respond, the script prints a warning and exits non-zero, which usually means the worker processes are down or cannot reach the broker.
Quick Worker Status Commands
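Equivalent quick checks from Python, under the same `app` assumption (the CLI counterparts are `celery -A rhesis.backend.worker status` and the `celery -A rhesis.backend.worker inspect` subcommands):

```python
from rhesis.backend.worker import app  # assumed app location

insp = app.control.inspect(timeout=5)
print("ping:", insp.ping())              # worker liveness
print("active:", insp.active())          # tasks executing right now
print("reserved:", insp.reserved())      # tasks prefetched but not yet started
print("registered:", insp.registered())  # task types each worker can run
```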
Worker Connection Troubleshooting
If no workers are found:
- Check broker connectivity, e.g. `redis-cli -u "$BROKER_URL" ping` (or the Python probe shown earlier)
- Verify worker processes are running, e.g. `ps aux | grep celery` locally or `kubectl get pods` on GKE
- Check worker startup logs for connection or registration errors, e.g. `kubectl logs <worker-pod>` on GKE; worker statistics can also confirm liveness (see the sketch below)
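If workers respond to ping but you need more detail, worker statistics include the process ID, pool concurrency, and uptime (a sketch, same `app` assumption):

```python
from rhesis.backend.worker import app  # assumed app location

stats = app.control.inspect(timeout=5).stats() or {}
if not stats:
    print("No workers responded")
for worker, info in stats.items():
    pool = info.get("pool", {})
    print(f"{worker}: pid={info.get('pid')} "
          f"concurrency={pool.get('max-concurrency')} uptime={info.get('uptime')}s")
```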
Monitoring and Prevention
Regular monitoring
Periodically sample worker and broker health: `celery -A rhesis.backend.worker inspect active`, worker HTTP debug endpoints where deployed, and Redis latency. Alert on execution queue depth and on test runs stuck in `Progress` beyond an expected SLA.
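Queue depth can be sampled directly from Redis, since Celery's Redis transport stores each queue as a list named after the queue (a sketch; the queue names here are illustrative assumptions):

```python
import os

import redis  # pip install redis

broker_url = os.environ["BROKER_URL"]
kwargs = {"ssl_cert_reqs": None} if broker_url.startswith("rediss://") else {}
client = redis.Redis.from_url(broker_url, **kwargs)
for queue in ("celery", "execution"):  # hypothetical queue names
    print(f"{queue}: {client.llen(queue)} pending task(s)")
```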
Health checks
Include broker reachability and worker liveness (for example HTTP /health or /debug on the worker sidecar, if enabled) in your deployment health checks, plus queue depth or stuck-run alerts if you expose them.
Related Documentation
- GKE Troubleshooting Guide: Debugging workers in Google Kubernetes Engine
- Background Tasks and Processing: General task management information
- Architecture and Dependencies: System integration details