# Worker Logging Guide
This guide covers everything about logging in the Rhesis worker system, from configuration to analysis and troubleshooting.
## Overview

The worker system generates logs from multiple sources:

- **Celery Worker**: Task execution, queue processing, worker lifecycle
- **Health Server**: HTTP health checks, debugging endpoints
- **Startup Script**: Container initialization, environment validation
- **Application Code**: Task-specific logging from your business logic
## Log Configuration

### Environment Variables

Control logging behavior with these environment variables:
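The exact variable names depend on your deployment; a minimal sketch, with `LOG_LEVEL` and `CELERY_LOG_LEVEL` as assumed names:

```bash
# Assumed variable names -- adjust to match your deployment manifests
export LOG_LEVEL=INFO          # general application log level
export CELERY_LOG_LEVEL=INFO   # log level passed to the Celery worker
```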
**Available log levels:**

- `DEBUG`: Detailed debugging information
- `INFO`: General operational messages (recommended)
- `WARNING`: Warning messages for potential issues
- `ERROR`: Error conditions that don't stop execution
- `CRITICAL`: Serious errors that may stop execution
### Celery Logging Configuration

In the worker startup, Celery is configured with:
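The original invocation is not reproduced here; a sketch using Celery's standard CLI flags, with the app path `rhesis.worker` and the concurrency value as assumptions:

```bash
# Sketch of a worker launch; app module and concurrency are assumptions
celery -A rhesis.worker worker \
  --loglevel="${CELERY_LOG_LEVEL:-INFO}" \
  --concurrency=4
```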
## Log Sources and Formats

### 1. Startup Script Logs

- **Location**: Container stdout during initialization
- **Format**: Structured with emoji indicators and timestamps

### 2. Health Server Logs

- **Location**: Container stdout from the health server process
- **Format**: HTTP access logs with endpoint information

### 3. Celery Worker Logs

- **Location**: Container stdout from the Celery process
- **Format**: Celery's standard logging format with task information

### 4. Application Task Logs

- **Location**: Container stdout from your task code
- **Format**: Python logging format as configured in your tasks
## Accessing Logs

### Local Development
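Locally, worker logs go to container stdout; assuming a Docker Compose service named `worker` (an assumption), you can follow them with:

```bash
# Follow worker logs locally; service name "worker" is an assumption
docker compose logs -f worker

# Or for a standalone container
docker logs -f <container-name>
```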
### GKE Deployment

#### Basic Log Access
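The examples below assume worker pods labeled `app=worker` in a `workers` namespace; both are assumptions, so substitute your actual label and namespace:

```bash
# Logs from a specific pod
kubectl logs <pod-name> -n workers

# Last 100 lines from all worker pods
kubectl logs -l app=worker -n workers --tail=100
```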
#### Real-Time Monitoring
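To watch logs as they arrive, stream with `-f`:

```bash
kubectl logs -f <pod-name> -n workers
```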
#### Historical Logs
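kubectl retains the previous container instance's logs and supports time windows:

```bash
# Logs from the previous (restarted/crashed) container instance
kubectl logs <pod-name> -n workers --previous

# Logs from the last hour
kubectl logs <pod-name> -n workers --since=1h
```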
## Log Analysis Techniques

### 1. Finding Your Pods
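Using the same assumed label and namespace as above:

```bash
# List worker pods
kubectl get pods -n workers -l app=worker

# Include status, restarts, and node placement
kubectl get pods -n workers -l app=worker -o wide
```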
### 2. Filtering Logs

#### Search for Errors
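A simple grep over recent lines catches most error-level entries:

```bash
# Case-insensitive search for error-level entries
kubectl logs <pod-name> -n workers --tail=1000 | grep -iE "error|critical|traceback"
```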
#### Search for Task Activity
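Celery logs task lifecycle events; the exact wording varies by Celery version, but lines of the form `Task <name>[<id>] received/succeeded/raised` are typical:

```bash
kubectl logs <pod-name> -n workers --tail=1000 | grep -E "Task .* (received|succeeded|raised)"
```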
#### Search for Health Check Activity
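The endpoint paths come from this guide's health server (`/health`, `/debug`):

```bash
kubectl logs <pod-name> -n workers --tail=1000 | grep -E "/health|/debug"
```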
### 3. Advanced Log Analysis

#### Export Logs for Analysis
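Redirect logs to a file for offline tooling:

```bash
# Dump the last 24 hours of logs for offline analysis
kubectl logs <pod-name> -n workers --since=24h > worker-$(date +%Y%m%d).log
```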
#### Multi-Pod Log Aggregation
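With the assumed label, `--prefix` tags each line with its source pod so interleaved output stays attributable:

```bash
kubectl logs -l app=worker -n workers --prefix --tail=500 > all-workers.log
```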
## Log Patterns and What They Mean

### Healthy Worker Startup
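The exact lines depend on your startup script and Celery version; an illustrative (not verbatim) healthy sequence looks like:

```
[2025-01-01 12:00:00,000: INFO/MainProcess] Connected to redis://redis:6379/0
[2025-01-01 12:00:00,500: INFO/MainProcess] mingle: searching for neighbors
[2025-01-01 12:00:01,500: INFO/MainProcess] celery@worker-0 ready.
```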
### Common Warning Patterns

### Error Patterns to Investigate
#### Connection Errors

**Action**: Check Redis connectivity, network policies, and firewall rules.
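A quick reachability probe from inside the pod separates network issues from application issues; the Redis host and port below are assumptions:

```bash
# Probe the broker from inside a worker pod
kubectl exec -it <pod-name> -n workers -- python -c \
  "import socket; socket.create_connection(('redis', 6379), timeout=5); print('redis reachable')"
```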
#### Import Errors

**Action**: Check the Docker image build and `PYTHONPATH` configuration.
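To verify imports inside the container (the module path `rhesis` is an assumption):

```bash
# Confirm the application package imports
kubectl exec -it <pod-name> -n workers -- python -c "import rhesis; print(rhesis.__file__)"

# Inspect the effective PYTHONPATH
kubectl exec -it <pod-name> -n workers -- printenv PYTHONPATH
```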
#### Task Errors

**Action**: Check task code, input parameters, and database connectivity.
#### Health Check Errors

**Action**: Check Celery worker status and Redis connectivity.
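Hitting the endpoints directly separates worker problems from probe configuration; the port below is an assumption:

```bash
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -s localhost:8080/health
curl -s localhost:8080/debug/redis
```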
## Log Monitoring and Alerting

### Key Metrics to Monitor

- **Error Rate**: Frequency of `ERROR`/`CRITICAL` log entries
- **Health Check Failures**: HTTP 500 responses on `/health`
- **Connection Timeouts**: Redis/broker connectivity issues
- **Task Failure Rate**: Ratio of failed to successful tasks
- **Worker Restarts**: Container restart frequency
### Sample Monitoring Queries

#### Using kubectl and basic tools
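As a sketch, simple counts over a recent window approximate the metrics above:

```bash
# Error-level entries across worker pods in the last 5 minutes
kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical"

# Completed tasks in the same window (Celery wording may vary by version)
kubectl logs -l app=worker -n workers --since=5m | grep -c "succeeded"
```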
#### Log-based Health Check
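A minimal log-based check, assuming the label and namespace used throughout, fails when recent logs contain error-level entries:

```bash
#!/usr/bin/env bash
# Exit non-zero if any error-level entries appeared in the last 5 minutes
errors=$(kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical")
if [ "$errors" -gt 0 ]; then
  echo "FAIL: $errors error entries in the last 5 minutes"
  exit 1
fi
echo "OK: no recent errors"
```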
## Debugging with Logs

### Step-by-Step Debugging Process

1. Identify the problem (symptom, affected pods, time window)
2. Get recent logs from the affected pods
3. Search for specific issues (errors, timeouts, failed tasks)
4. Correlate with the health endpoints (`/health`, `/debug`, `/debug/redis`)

A combined sketch of these steps follows.
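Pod names, label, namespace, and port as assumed elsewhere in this guide:

```bash
# 1. Identify the affected pod
kubectl get pods -n workers -l app=worker

# 2. Get recent logs
kubectl logs <pod-name> -n workers --tail=500

# 3. Search for specific issues
kubectl logs <pod-name> -n workers --tail=500 | grep -iE "error|timeout|raised"

# 4. Correlate with health endpoints (port is an assumption)
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -s localhost:8080/debug
```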
### Common Debugging Scenarios

#### Scenario 1: Pod Won’t Start
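Start with pod events, then the previous container instance's logs:

```bash
# Scheduling, image pull, and probe events
kubectl describe pod <pod-name> -n workers

# Output from the crashed container, if it started at all
kubectl logs <pod-name> -n workers --previous
```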
#### Scenario 2: Health Checks Failing
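Probe failures surface as Kubernetes events; comparing them with a direct request (port assumed) narrows the cause:

```bash
# Recent probe-related events
kubectl get events -n workers --sort-by=.lastTimestamp | grep -i probe

# Hit the endpoint directly
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -v localhost:8080/health
```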
#### Scenario 3: Tasks Not Processing
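Check whether the worker is consuming at all; `celery inspect` talks to workers over the broker (the app path is an assumption):

```bash
# Is the worker alive and responding over the broker?
kubectl exec -it <pod-name> -n workers -- celery -A rhesis.worker inspect ping

# What is it working on right now?
kubectl exec -it <pod-name> -n workers -- celery -A rhesis.worker inspect active
```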
## Best Practices
### 1. Log Retention
- Keep logs for at least 7 days for troubleshooting
- Archive important logs before pod restarts
- Use log aggregation systems for production
### 2. Log Levels
- Use `INFO` for production (balance of detail vs. noise)
- Use `DEBUG` for development and troubleshooting
- Use `ERROR` only for actual errors that need attention
### 3. Structured Logging
- Include relevant context (organization_id, user_id, task_id)
- Use consistent log formats across tasks
- Include timing information for performance monitoring
### 4. Log Monitoring
- Set up alerts for error rate increases
- Monitor health check failure patterns
- Track connection timeout frequencies
### 5. Performance Considerations
- Avoid excessive logging in tight loops
- Use appropriate log levels to control verbosity
- Consider log sampling for high-volume operations
## Integration with Other Tools
### Health Endpoints
The logging system integrates with the health endpoints:
- `/debug`: Shows system status including recent errors
- `/debug/redis`: Shows Redis connectivity with error details
- All endpoints log their usage for monitoring
### Monitoring Systems
Logs can be integrated with:
- **Prometheus**: Metric extraction from log patterns
- **Grafana**: Log visualization and dashboards
- **ELK Stack**: Centralized log aggregation and search
- **Google Cloud Logging**: Native GKE log collection
### Alert Integration
Example alert conditions based on logs (a rough shell sketch of the first condition follows the list):
- Error rate > 10% over 5 minutes
- Health check failure rate > 20% over 2 minutes
- No successful task processing for 10 minutes
- Redis connection timeouts > 5 in 5 minutes
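A rough sketch of the first condition, reusing the assumed label and namespace; a real deployment would compute this from Prometheus or Cloud Logging metrics rather than shell:

```bash
# Alert if more than 10% of the last 5 minutes' log lines are error-level
total=$(kubectl logs -l app=worker -n workers --since=5m | wc -l)
errors=$(kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical")
if [ "$total" -gt 0 ] && [ $((errors * 100 / total)) -gt 10 ]; then
  echo "ALERT: worker error rate above 10% over 5 minutes"
fi
```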
This comprehensive logging system provides visibility into all aspects of worker operation, from startup through task processing to health monitoring.