Worker Logging Guide
This guide covers logging in the Rhesis worker system, from configuration through analysis and troubleshooting.
Overview
The worker system generates logs from multiple sources:
- Celery Worker: Task execution, queue processing, worker lifecycle
- Health Server: HTTP health checks, debugging endpoints
- Startup Script: Container initialization, environment validation
- Application Code: Task-specific logging from your business logic
Log Configuration
Environment Variables
Control logging behavior with these environment variables:
# Primary log level (affects all components)
LOG_LEVEL=INFO
# Celery-specific log level
CELERY_WORKER_LOGLEVEL=INFO
# Python logging configuration
PYTHONUNBUFFERED=1 # Ensures immediate log output
Available Log Levels:
- DEBUG: Detailed debugging information
- INFO: General operational messages (recommended)
- WARNING: Warning messages for potential issues
- ERROR: Error conditions that don’t stop execution
- CRITICAL: Serious errors that may stop execution
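As a minimal sketch of how application code might consume LOG_LEVEL (assuming the variable is read directly by Python's logging setup; the format string is illustrative):

```python
# Sketch: configure Python logging from the LOG_LEVEL environment variable.
import logging
import os

logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),  # accepts level names as strings
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```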
Celery Logging Configuration
In the worker startup, Celery is configured with:
celery -A rhesis.backend.worker.app worker \
--loglevel=${CELERY_WORKER_LOGLEVEL:-INFO} \
--concurrency=${CELERY_WORKER_CONCURRENCY:-8} \
# ... other options
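If you need more control than the CLI flags provide, Celery exposes logging setup signals. The sketch below is a hypothetical customization, not the worker's actual configuration:

```python
# Sketch: customize Celery's logger via the after_setup_logger signal
# (hypothetical example, not the worker's actual setup).
import logging

from celery.signals import after_setup_logger

@after_setup_logger.connect
def configure_logger(logger, *args, **kwargs):
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(processName)s: %(message)s")
    for handler in logger.handlers:
        handler.setFormatter(formatter)
```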
Log Sources and Formats
1. Startup Script Logs
Location: Container stdout during initialization
Format: Structured with emoji indicators and timestamps
# Successful operations
✅ Health server starting on port 8080
✅ Successfully imported Celery app from rhesis.backend.worker
✅ Redis connectivity: connected
# Warnings
⚠️ TLS detected in broker URL, adjusting timeouts to 5 seconds
# Errors
❌ Failed to import Celery app: No module named 'rhesis.backend.worker'
❌ Redis connectivity test failed: connection timeout
2. Health Server Logs
Location: Container stdout from health server process
Format: HTTP access logs with endpoint information
# Successful health checks
INFO: 127.0.0.1:35492 - "GET /health/basic HTTP/1.1" 200 OK
INFO: 127.0.0.1:35494 - "GET /health HTTP/1.1" 200 OK
# Health check failures
WARNING: 127.0.0.1:35496 - "GET /health HTTP/1.1" 500 Internal Server Error
ERROR: Health check failed: Redis connection timeout
# Debug endpoint usage
INFO: 127.0.0.1:35498 - "GET /debug HTTP/1.1" 200 OK
INFO: 127.0.0.1:35500 - "GET /debug/redis HTTP/1.1" 200 OK
3. Celery Worker Logs
Location: Container stdout from Celery process
Format: Celery’s standard logging format with task information
# Worker startup
[2024-01-15 10:30:00,123: INFO/MainProcess] Connected to redis://redis-host:6379/0
[2024-01-15 10:30:00,456: INFO/MainProcess] mingle: searching for available workers
[2024-01-15 10:30:01,789: INFO/MainProcess] celery@worker-pod ready.
# Task processing
[2024-01-15 10:30:15,234: INFO/MainProcess] Received task: rhesis.backend.tasks.execute_test[task-id-123]
[2024-01-15 10:30:15,456: INFO/ForkPoolWorker-1] Task rhesis.backend.tasks.execute_test[task-id-123] succeeded in 2.34s
[2024-01-15 10:30:16,789: INFO/MainProcess] Received task: rhesis.backend.tasks.collect_results[chord-id-456]
# Errors
[2024-01-15 10:30:20,123: ERROR/ForkPoolWorker-2] Task rhesis.backend.tasks.execute_test[task-id-789] raised unexpected: ConnectionError('Redis connection failed')
[2024-01-15 10:30:25,456: WARNING/MainProcess] Chord unlock task chord_unlock[chord-id-456] retry: Retry in 1.0s
4. Application Task Logs
Location: Container stdout from your task code
Format: Python logging format as configured in your tasks
# In your task code
import logging

logger = logging.getLogger(__name__)

@app.task(base=BaseTask, bind=True)  # bind=True exposes the task instance as `self`
def my_task(self):
    logger.info(f"Starting task for organization: {self.request.organization_id}")
    try:
        # Task logic
        result = process_data()
        logger.info(f"Task completed successfully: {result}")
        return result
    except Exception as e:
        logger.error(f"Task failed: {str(e)}")
        raise
Accessing Logs
Local Development
# Using Docker Compose
docker-compose logs worker
# Follow logs in real-time
docker-compose logs -f worker
# Get last N lines
docker-compose logs --tail=100 worker
GKE Deployment
Basic Log Access
# Get logs from worker container
kubectl logs <pod-name> -c worker -n <namespace>
# Get recent logs (last 100 lines)
kubectl logs <pod-name> -c worker -n <namespace> --tail=100
# Get logs from last hour
kubectl logs <pod-name> -c worker -n <namespace> --since=1h
Real-Time Monitoring
# Follow logs as they're generated
kubectl logs -f <pod-name> -c worker -n <namespace>
# Follow logs from all worker pods (targeting the deployment streams only one pod, so use a label selector)
kubectl logs -f -l app=rhesis-worker -n <namespace>
# Follow logs from all containers in pod
kubectl logs -f <pod-name> -n <namespace> --all-containers=true
Historical Logs
# Get logs from previous container restart (if crashed)
kubectl logs <pod-name> -c worker -n <namespace> --previous
# Get logs with timestamps
kubectl logs <pod-name> -c worker -n <namespace> --timestamps=true
Log Analysis Techniques
1. Finding Your Pods
# List all worker pods
kubectl get pods -n <namespace> -l app=rhesis-worker
# Get pod details including restart count
kubectl get pods -n <namespace> -o wide
2. Filtering Logs
Search for Errors
# Find all errors
kubectl logs <pod-name> -c worker -n <namespace> | grep -i error
# Find Redis connection issues
kubectl logs <pod-name> -c worker -n <namespace> | grep -i "redis\|connection\|timeout"
# Find task failures
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(failed|exception|error)"
Search for Task Activity
# Find task executions
kubectl logs <pod-name> -c worker -n <namespace> | grep "Received task"
# Find task completions
kubectl logs <pod-name> -c worker -n <namespace> | grep "succeeded in"
# Find chord activity
kubectl logs <pod-name> -c worker -n <namespace> | grep -i chord
Search for Health Check Activity
# Find health check requests
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(GET /health|GET /ping)"
# Find health check failures
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(500|timeout|failed)"
3. Advanced Log Analysis
Export Logs for Analysis
# Save logs to file
kubectl logs <pod-name> -c worker -n <namespace> --tail=1000 > worker-logs.txt
# Save logs with timestamps
kubectl logs <pod-name> -c worker -n <namespace> --timestamps=true --tail=1000 > worker-logs-timestamped.txt
# Save logs from specific time period
kubectl logs <pod-name> -c worker -n <namespace> --since=2h > recent-worker-logs.txt
Multi-Pod Log Aggregation
# Get logs from all worker pods
for pod in $(kubectl get pods -n <namespace> -l app=rhesis-worker -o jsonpath='{.items[*].metadata.name}'); do
    echo "=== Logs from $pod ===" >> all-worker-logs.txt
    kubectl logs "$pod" -c worker -n <namespace> --tail=100 >> all-worker-logs.txt
    echo "" >> all-worker-logs.txt
done
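The same aggregation can be scripted with the official kubernetes Python client; this sketch assumes kubeconfig access, the app=rhesis-worker label used above, and a hypothetical namespace name:

```python
# Sketch: aggregate worker logs with the official `kubernetes` client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

namespace = "rhesis-worker-dev"  # hypothetical namespace
pods = v1.list_namespaced_pod(namespace, label_selector="app=rhesis-worker")
for pod in pods.items:
    logs = v1.read_namespaced_pod_log(
        pod.metadata.name, namespace, container="worker", tail_lines=100
    )
    print(f"=== Logs from {pod.metadata.name} ===")
    print(logs)
```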
Log Patterns and What They Mean
Healthy Worker Startup
✅ Health server starting on port 8080
✅ Environment validation completed
✅ Successfully imported Celery app
✅ Redis connectivity: connected
✅ Health server ready, all endpoints responding
[INFO/MainProcess] Connected to rediss://...
[INFO/MainProcess] celery@worker-pod ready.
Common Warning Patterns
# TLS connection delay (normal for Redis TLS)
⚠️ TLS detected in broker URL, adjusting timeouts to 10 seconds
# Chord retries (may indicate failed subtasks)
[WARNING/MainProcess] Chord unlock task chord_unlock[...] retry: Retry in 1.0s
# Health check timeouts (may indicate Redis delays)
WARNING: Health check took 8.5 seconds (timeout: 10)
Error Patterns to Investigate
Connection Errors
❌ Redis connectivity test failed: connection timeout
[ERROR/MainProcess] consumer: Cannot connect to rediss://...: Error connecting
Action: Check Redis connectivity, network policies, firewall rules
Import Errors
❌ Failed to import Celery app: No module named 'rhesis.backend.worker'
Action: Check Docker image build, PYTHONPATH configuration
Task Errors
[ERROR/ForkPoolWorker-1] Task rhesis.backend.tasks.execute_test[...] raised unexpected: Exception('Task failed')
Action: Check task code, input parameters, database connectivity
Health Check Errors
ERROR: Health check failed: Celery ping timeout after 10 seconds
INFO: 127.0.0.1:42756 - "GET /health HTTP/1.1" 500 Internal Server Error
Action: Check Celery worker status, Redis connectivity
Log Monitoring and Alerting
Key Metrics to Monitor
- Error Rate: Frequency of ERROR/CRITICAL log entries
- Health Check Failures: HTTP 500 responses on /health
- Connection Timeouts: Redis/broker connectivity issues
- Task Failure Rate: Ratio of failed to successful tasks
- Worker Restarts: Container restart frequency
Sample Monitoring Queries
Using kubectl and basic tools
# Count errors in last 100 log lines
kubectl logs <pod-name> -c worker -n <namespace> --tail=100 | grep -c ERROR
# Check for recent connection issues
kubectl logs <pod-name> -c worker -n <namespace> --since=10m | grep -i "connection\|timeout"
# Monitor health check success rate
kubectl logs <pod-name> -c worker -n <namespace> --since=1h | grep "GET /health" | grep -c "200 OK"
Log-based Health Check
#!/bin/bash
# Simple health check based on logs
NAMESPACE="rhesis-worker-dev"
POD=$(kubectl get pods -n "$NAMESPACE" -l app=rhesis-worker -o jsonpath='{.items[0].metadata.name}')

# Check for recent errors
ERROR_COUNT=$(kubectl logs "$POD" -c worker -n "$NAMESPACE" --tail=50 | grep -c ERROR)
if [ "$ERROR_COUNT" -gt 5 ]; then
    echo "WARNING: $ERROR_COUNT errors found in recent logs"
fi

# Check for Redis connectivity issues (match error phrases only, so healthy
# "Connected to redis" lines don't trigger false positives)
REDIS_ERRORS=$(kubectl logs "$POD" -c worker -n "$NAMESPACE" --since=5m | grep -ci "connection timeout\|connection refused\|connection failed")
if [ "$REDIS_ERRORS" -gt 0 ]; then
    echo "WARNING: Redis connection issues detected"
fi
Debugging with Logs
Step-by-Step Debugging Process
1. Identify the Problem
# Check pod status
kubectl get pods -n <namespace>
# Look for restart indicators
kubectl describe pod <pod-name> -n <namespace>
2. Get Recent Logs
# Get current logs
kubectl logs <pod-name> -c worker -n <namespace> --tail=100
# Get crash logs if restarted
kubectl logs <pod-name> -c worker -n <namespace> --previous
3. Search for Specific Issues
# Connection problems
kubectl logs <pod-name> -c worker -n <namespace> | grep -i "connection\|redis\|timeout"
# Task problems
kubectl logs <pod-name> -c worker -n <namespace> | grep -i "task\|error\|failed"
4. Correlate with Health Endpoints
# Check current system state
kubectl exec -it <pod-name> -n <namespace> -- curl localhost:8080/debug | jq
Common Debugging Scenarios
Scenario 1: Pod Won’t Start
# Check startup logs
kubectl logs <pod-name> -c worker -n <namespace>
# Look for import errors, connection failures, configuration issues
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(❌|ERROR|Failed)"
Scenario 2: Health Checks Failing
# Check health endpoint logs
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(GET /health|500|timeout)"
# Test health endpoint directly
kubectl exec -it <pod-name> -n <namespace> -- curl -w "%{time_total}" localhost:8080/health
Scenario 3: Tasks Not Processing
# Check for task reception
kubectl logs <pod-name> -c worker -n <namespace> | grep "Received task"
# Check for task failures
kubectl logs <pod-name> -c worker -n <namespace> | grep -E "(failed|exception|error)"
# Check worker status
kubectl exec -it <pod-name> -n <namespace> -- curl localhost:8080/debug | jq '.celery_status'
Best Practices
1. Log Retention
- Keep logs for at least 7 days for troubleshooting
- Archive important logs before pod restarts
- Use log aggregation systems for production
2. Log Levels
- Use INFO for production (balance of detail vs. noise)
- Use DEBUG for development and troubleshooting
- Use ERROR only for actual errors that need attention
3. Structured Logging
- Include relevant context (organization_id, user_id, task_id)
- Use consistent log formats across tasks
- Include timing information for performance monitoring
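One way to attach that context consistently is a LoggerAdapter; the sketch below uses illustrative field names (organization_id, task_id), not an existing Rhesis helper:

```python
# Sketch: carry task context on every log record via LoggerAdapter.
# Render the fields by adding %(organization_id)s / %(task_id)s to the
# handler's format string.
import logging

logger = logging.getLogger(__name__)

def get_task_logger(organization_id, task_id):
    return logging.LoggerAdapter(
        logger, {"organization_id": organization_id, "task_id": task_id}
    )

task_logger = get_task_logger("org-123", "task-id-456")
task_logger.info("Processing started")
```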
4. Log Monitoring
- Set up alerts for error rate increases
- Monitor health check failure patterns
- Track connection timeout frequencies
5. Performance Considerations
- Avoid excessive logging in tight loops
- Use appropriate log levels to control verbosity
- Consider log sampling for high-volume operations
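For the sampling point, a small logging.Filter that passes only every Nth record is one possible approach (a sketch, not an existing Rhesis utility):

```python
# Sketch: sample high-volume log statements with a counting Filter.
import itertools
import logging

class SamplingFilter(logging.Filter):
    def __init__(self, every_n=100):
        super().__init__()
        self._counter = itertools.count()
        self._every_n = every_n

    def filter(self, record):
        # Pass one record out of every `every_n`.
        return next(self._counter) % self._every_n == 0

loop_logger = logging.getLogger("tight_loop")
loop_logger.addFilter(SamplingFilter(every_n=100))
```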
Integration with Other Tools
Health Endpoints
The logging system integrates with the health endpoints:
- /debug: Shows system status including recent errors
- /debug/redis: Shows Redis connectivity with error details
- All endpoints log their usage for monitoring
Monitoring Systems
Logs can be integrated with:
- Prometheus: Metric extraction from log patterns
- Grafana: Log visualization and dashboards
- ELK Stack: Centralized log aggregation and search
- Google Cloud Logging: Native GKE log collection
Alert Integration
Example alert conditions based on logs:
- Error rate > 10% over 5 minutes
- Health check failure rate > 20% over 2 minutes
- No successful task processing for 10 minutes
- Redis connection timeouts > 5 in 5 minutes
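As a rough illustration of the first condition, the error rate can be computed from exported log lines. This sketch assumes `kubectl logs --timestamps` output piped on stdin and the 10% threshold above:

```python
# Sketch: estimate error rate from `kubectl logs --timestamps` output.
# Usage: kubectl logs <pod> -c worker -n <ns> --timestamps --since=5m | python3 error_rate.py
import sys
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
now = datetime.now(timezone.utc)
total = errors = 0

for line in sys.stdin:
    ts_str = line.split(" ", 1)[0]
    try:
        # Parse the RFC 3339 prefix (ignore sub-second precision).
        ts = datetime.strptime(ts_str[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    except ValueError:
        continue  # skip lines without a parsable timestamp
    if now - ts <= WINDOW:
        total += 1
        errors += "ERROR" in line

if total and errors / total > 0.10:
    print(f"ALERT: error rate {errors / total:.0%} over the last 5 minutes")
```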
This comprehensive logging system provides visibility into all aspects of worker operation, from startup through task processing to health monitoring.