Chord Management and Monitoring
This document provides comprehensive information about managing Celery chords in the Rhesis worker system, including monitoring, troubleshooting, and best practices.
What are Chords?
A chord in Celery is a pattern that allows you to execute a group of tasks in parallel and then run a callback function once all tasks in the group have completed. This is particularly useful for scenarios like:
- Running multiple test executions in parallel and then collecting the results
- Processing multiple files concurrently and then aggregating the output
- Performing parallel computations and combining the results
Chord Structure
How Rhesis Uses Chords
In the Rhesis system, chords are primarily used in test execution:
- Parallel Test Execution: Individual tests are executed in parallel using execute_single_test tasks
- Result Collection: Once all tests complete, collect_results is called to aggregate the results
- Status Updates: The test run status is updated based on the aggregated results
Example from orchestration.py:
Common Chord Issues
1. chord_unlock MaxRetriesExceededError
Symptoms:
Causes:
- Individual tasks returning None instead of proper results
- Tasks failing without proper error handling
- Network interruptions during chord execution
- Worker processes being terminated unexpectedly
Solutions:
- Ensure all tasks return valid results (even on failure)
- Configure maximum retries for chord_unlock tasks
- Implement proper error handling in callback functions
- Use monitoring tools to detect and resolve stuck chords
2. Chord Never Completing
Symptoms:
- Callback function (collect_results) never executes
- Test runs remain in “IN_PROGRESS” status indefinitely
- Tasks appear to complete but no final status update
Causes:
- One or more subtasks in the chord failed silently
- Result backend issues preventing result storage
- Incorrect chord setup or callback configuration
Chord Monitoring Tools
Built-in Monitoring Script
The system includes a comprehensive monitoring script at src/rhesis/backend/tasks/execution/chord_monitor.py that provides several utilities:
1. Check Chord Status
2. Show Current Status
3. Revoke Stuck Chords
4. Inspect Specific Chord
5. Clean All Tasks (Emergency)
Quick Fix Script
A simplified monitoring script, fix_chords.py, is available at the root level.
This script:
- Shows current chord status
- Detects stuck chords (>30 minutes)
- Offers to revoke stuck chords interactively
- Auto-revokes very stuck chords (>2 hours)
- Provides recommendations for next steps
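The two thresholds in this list can be sketched as a small classifier. This is an illustration only and not the contents of the quick fix script: the function name, the label strings, and the idea of classifying by a start timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=30)     # "stuck" threshold from the list
AUTO_REVOKE_AFTER = timedelta(hours=2)  # "very stuck" threshold from the list


def classify_chord(started_at, now=None):
    """Return 'ok', 'stuck', or 'auto_revoke' based on the chord's age."""
    now = now or datetime.now(timezone.utc)
    age = now - started_at
    if age > AUTO_REVOKE_AFTER:
        return "auto_revoke"
    if age > STUCK_AFTER:
        return "stuck"
    return "ok"


# Fixed reference time so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ages = [timedelta(minutes=5), timedelta(minutes=45), timedelta(hours=3)]
labels = [classify_chord(now - age, now=now) for age in ages]
```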
Monitoring Best Practices
1. Regular Monitoring
Set up periodic monitoring to catch chord issues early:
2. Automated Cleanup
Use the built-in periodic monitoring function:
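One way to run such a function on a schedule is Celery beat. The sketch below assumes the monitoring function is registered as a Celery task; the task name and the 15-minute interval are hypothetical and should be replaced with the real values.

```python
from datetime import timedelta

# Hypothetical task name (assumption): use the name under which the
# periodic monitoring function is actually registered.
MONITOR_TASK_NAME = "tasks.monitor_stuck_chords"

beat_schedule = {
    "monitor-stuck-chords": {
        "task": MONITOR_TASK_NAME,
        "schedule": timedelta(minutes=15),  # illustrative interval
    },
}

# With a Celery app object in scope, this would be applied as:
#     app.conf.beat_schedule = beat_schedule
```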
3. Logging and Alerting
Monitor your logs for chord-related errors:
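A minimal sketch of a log scan, assuming worker logs are available as plain text lines. The pattern uses the error names mentioned in this document; the sample lines are made up for illustration and are not real Celery output.

```python
import re

# Names taken from this document; extend the pattern as needed.
CHORD_ERROR_PATTERN = re.compile(r"chord_unlock|MaxRetriesExceededError")


def find_chord_errors(lines):
    """Return (line_number, text) pairs for lines that mention chord errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if CHORD_ERROR_PATTERN.search(line)
    ]


# Made-up sample lines (assumption), standing in for a real log file.
sample_log = [
    "INFO worker ready\n",
    "ERROR task celery.chord_unlock raised MaxRetriesExceededError\n",
    "INFO task finished\n",
]
matches = find_chord_errors(sample_log)
```

The same function accepts an open file object, since it only iterates over lines.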
4. Health Checks
Include chord status in your health check endpoints:
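As a sketch, a health check could expose chord counts alongside an overall status. The function name, field names, and the healthy/degraded rule are assumptions, not the Rhesis endpoint's actual schema.

```python
def chord_health(active_chords, stuck_chords):
    """Build a health-check fragment from chord counts (hypothetical schema)."""
    status = "healthy" if stuck_chords == 0 else "degraded"
    return {
        "status": status,
        "active_chords": active_chords,
        "stuck_chords": stuck_chords,
    }


payload = chord_health(active_chords=3, stuck_chords=1)
```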
Configuration for Chord Stability
Worker Configuration
In worker.py, ensure proper chord configuration:
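The settings below are a sketch of chord-related options, not the contents of worker.py. The setting names are standard Celery configuration keys; the values are illustrative assumptions, and the chord_unlock annotation relies on that built-in task honoring its max_retries attribute, which is worth verifying against the Celery version in use.

```python
chord_settings = {
    # Expire stored results after one hour (illustrative value).
    "result_expires": 3600,
    # Seconds between chord_unlock polls of the result backend.
    "result_chord_retry_interval": 1.0,
    # Seconds to wait when joining the header results.
    "result_chord_join_timeout": 3.0,
    # Acknowledge after execution and re-queue if the worker process dies.
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
    # Cap chord_unlock retries instead of retrying without limit
    # (assumed value; tune to the longest expected test run).
    "task_annotations": {"celery.chord_unlock": {"max_retries": 120}},
}

# With a Celery app object in scope, this would be applied as:
#     app.conf.update(chord_settings)
```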
Task Implementation Best Practices
Always Return Valid Results
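A minimal sketch of the idea in plain Python, assuming the wrapped logic would live inside a chord header task. The function name execute_safely and the result keys are hypothetical.

```python
def execute_safely(test_id, run):
    """Run one test and always return a dict, even when `run` raises."""
    try:
        outcome = run(test_id)
        return {"test_id": test_id, "status": "success", "result": outcome}
    except Exception as exc:
        # Report the failure as data so the callback never receives None.
        return {"test_id": test_id, "status": "failed", "error": str(exc)}


def flaky(test_id):
    """Stand-in for a test that fails (hypothetical)."""
    raise RuntimeError("endpoint unavailable")


ok = execute_safely("t1", lambda test_id: 42)
failed = execute_safely("t2", flaky)
```

Returning a structured failure keeps the chord's header complete, so the callback can still run and record which tests failed.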
Handle Malformed Results in Callbacks
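A sketch of a defensive aggregation step, assuming header tasks return dicts with a status key. The function name and summary fields are hypothetical and do not reproduce collect_results.

```python
def summarize_results(results):
    """Aggregate chord results, counting None or non-dict entries as failures."""
    summary = {"total": 0, "passed": 0, "failed": 0, "malformed": 0}
    for item in results or []:
        summary["total"] += 1
        if not isinstance(item, dict) or "status" not in item:
            # Malformed entries are tracked separately and counted as failed.
            summary["malformed"] += 1
            summary["failed"] += 1
        elif item["status"] == "success":
            summary["passed"] += 1
        else:
            summary["failed"] += 1
    return summary


summary = summarize_results(
    [{"status": "success"}, None, {"status": "failed"}, "oops"]
)
```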
Troubleshooting Workflows
When You Encounter Chord Issues
- Immediate Assessment
- Check Active Tasks
- Look for Stuck Chords
- Review Logs
- Clean Up if Necessary
Emergency Recovery
If the system is completely stuck with many chord_unlock tasks:
- Stop All Workers
- Purge All Tasks (use with caution)
- Restart Workers
- Monitor Recovery
Monitoring Script Reference
Command Line Options
| Command | Description | Example |
|---|---|---|
| status | Show current chord status | python -m ...chord_monitor status |
| check | Check for stuck chords | python -m ...chord_monitor check --max-hours 2 |
| revoke | Revoke stuck chords | python -m ...chord_monitor revoke --max-hours 1 |
| inspect | Inspect specific chord | python -m ...chord_monitor inspect <chord-id> |
| clean | Purge all tasks | python -m ...chord_monitor clean --force |
Common Options
- --max-hours N: Consider chords stuck after N hours
- --dry-run: Show what would be done without executing
- --json: Output results in JSON format
- --verbose: Show detailed information
- --force: Required for destructive operations
Return Codes
- 0: Success, no issues found
- 1: Issues found or errors occurred
- 130: Operation cancelled by user
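A wrapper script or CI job could translate these codes into messages. The sketch below only encodes the three documented codes; the helper name and the fallback text for other codes are assumptions.

```python
EXIT_MEANINGS = {
    0: "success, no issues found",
    1: "issues found or errors occurred",
    130: "operation cancelled by user",
}


def describe_exit(returncode):
    """Map a monitor exit code to its documented meaning."""
    return EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")


messages = [describe_exit(code) for code in (0, 1, 130, 2)]
```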
Prevention Strategies
- Proper Task Design: Always return valid results, handle exceptions gracefully
- Configuration: Set appropriate timeouts and retry limits
- Monitoring: Regular checks for stuck chords
- Testing: Test chord behavior in development with various failure scenarios
- Logging: Comprehensive logging to diagnose issues quickly
- Documentation: Keep this documentation updated with new patterns and solutions
Related Files
- src/rhesis/backend/worker.py - Celery configuration
- src/rhesis/backend/tasks/execution/orchestration.py - Chord implementation
- src/rhesis/backend/tasks/execution/test.py - Individual task implementation
- src/rhesis/backend/tasks/execution/results.py - Chord callback implementation
- src/rhesis/backend/tasks/execution/chord_monitor.py - Monitoring utilities
- fix_chords.py - Quick monitoring script