# Worker Logging Guide
This guide covers everything about logging in the Rhesis worker system, from configuration to analysis and troubleshooting.
## Overview

The worker system generates logs from multiple sources:

- **Celery Worker**: Task execution, queue processing, worker lifecycle
- **Health Server**: HTTP health checks, debugging endpoints
- **Startup Script**: Container initialization, environment validation
- **Application Code**: Task-specific logging from your business logic
## Log Configuration

### Environment Variables

Control logging behavior with these environment variables:
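The exact variable names depend on your deployment; a minimal sketch, with `LOG_LEVEL` and `CELERY_LOG_LEVEL` as assumed names:

```bash
# Assumed variable names -- adjust to match your deployment manifests
export LOG_LEVEL=INFO          # general application log level
export CELERY_LOG_LEVEL=INFO   # log level passed to the Celery worker
```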
**Available log levels:**

- `DEBUG`: Detailed debugging information
- `INFO`: General operational messages (recommended)
- `WARNING`: Warning messages for potential issues
- `ERROR`: Error conditions that don't stop execution
- `CRITICAL`: Serious errors that may stop execution
### Celery Logging Configuration

In the worker startup, Celery is configured with:
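The original invocation is not reproduced here; a sketch using Celery's standard CLI flags, with the app path `rhesis.worker` and the concurrency value as assumptions:

```bash
# Sketch of a worker launch; app module and concurrency are assumptions
celery -A rhesis.worker worker \
  --loglevel="${CELERY_LOG_LEVEL:-INFO}" \
  --concurrency=4
```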
## Log Sources and Formats

### 1. Startup Script Logs

- **Location**: Container stdout during initialization
- **Format**: Structured with emoji indicators and timestamps

### 2. Health Server Logs

- **Location**: Container stdout from the health server process
- **Format**: HTTP access logs with endpoint information

### 3. Celery Worker Logs

- **Location**: Container stdout from the Celery process
- **Format**: Celery's standard logging format with task information

### 4. Application Task Logs

- **Location**: Container stdout from your task code
- **Format**: Python logging format as configured in your tasks
## Accessing Logs

### Local Development
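Locally, worker logs go to container stdout; assuming a Docker Compose service named `worker` (an assumption), you can follow them with:

```bash
# Follow worker logs locally; service name "worker" is an assumption
docker compose logs -f worker

# Or for a standalone container
docker logs -f <container-name>
```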
### GKE Deployment

#### Basic Log Access
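The examples below assume worker pods labeled `app=worker` in a `workers` namespace; both are assumptions, so substitute your actual label and namespace:

```bash
# Logs from a specific pod
kubectl logs <pod-name> -n workers

# Last 100 lines from all worker pods
kubectl logs -l app=worker -n workers --tail=100
```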
#### Real-Time Monitoring
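To watch logs as they arrive, stream with `-f`:

```bash
kubectl logs -f <pod-name> -n workers
```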
#### Historical Logs
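kubectl retains the previous container instance's logs and supports time windows:

```bash
# Logs from the previous (restarted/crashed) container instance
kubectl logs <pod-name> -n workers --previous

# Logs from the last hour
kubectl logs <pod-name> -n workers --since=1h
```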
## Log Analysis Techniques

### 1. Finding Your Pods
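Using the same assumed label and namespace as above:

```bash
# List worker pods
kubectl get pods -n workers -l app=worker

# Include status, restarts, and node placement
kubectl get pods -n workers -l app=worker -o wide
```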
### 2. Filtering Logs

#### Search for Errors
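A simple grep over recent lines catches most error-level entries:

```bash
# Case-insensitive search for error-level entries
kubectl logs <pod-name> -n workers --tail=1000 | grep -iE "error|critical|traceback"
```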
#### Search for Task Activity
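Celery logs task lifecycle events; the exact wording varies by Celery version, but lines of the form `Task <name>[<id>] received/succeeded/raised` are typical:

```bash
kubectl logs <pod-name> -n workers --tail=1000 | grep -E "Task .* (received|succeeded|raised)"
```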
#### Search for Health Check Activity
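The endpoint paths come from this guide's health server (`/health`, `/debug`):

```bash
kubectl logs <pod-name> -n workers --tail=1000 | grep -E "/health|/debug"
```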
### 3. Advanced Log Analysis

#### Export Logs for Analysis
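Redirect logs to a file for offline tooling:

```bash
# Dump the last 24 hours of logs for offline analysis
kubectl logs <pod-name> -n workers --since=24h > worker-$(date +%Y%m%d).log
```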
#### Multi-Pod Log Aggregation
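With the assumed label, `--prefix` tags each line with its source pod so interleaved output stays attributable:

```bash
kubectl logs -l app=worker -n workers --prefix --tail=500 > all-workers.log
```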
## Log Patterns and What They Mean

### Healthy Worker Startup
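The exact lines depend on your startup script and Celery version; an illustrative (not verbatim) healthy sequence looks like:

```
[2025-01-01 12:00:00,000: INFO/MainProcess] Connected to redis://redis:6379/0
[2025-01-01 12:00:00,500: INFO/MainProcess] mingle: searching for neighbors
[2025-01-01 12:00:01,500: INFO/MainProcess] celery@worker-0 ready.
```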
### Common Warning Patterns

### Error Patterns to Investigate
#### Connection Errors

**Action**: Check Redis connectivity, network policies, and firewall rules.
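A quick reachability probe from inside the pod separates network issues from application issues; the Redis host and port below are assumptions:

```bash
# Probe the broker from inside a worker pod
kubectl exec -it <pod-name> -n workers -- python -c \
  "import socket; socket.create_connection(('redis', 6379), timeout=5); print('redis reachable')"
```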
#### Import Errors

**Action**: Check the Docker image build and `PYTHONPATH` configuration.
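To verify imports inside the container (the module path `rhesis` is an assumption):

```bash
# Confirm the application package imports
kubectl exec -it <pod-name> -n workers -- python -c "import rhesis; print(rhesis.__file__)"

# Inspect the effective PYTHONPATH
kubectl exec -it <pod-name> -n workers -- printenv PYTHONPATH
```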
#### Task Errors

**Action**: Check task code, input parameters, and database connectivity.
#### Health Check Errors

**Action**: Check Celery worker status and Redis connectivity.
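Hitting the endpoints directly separates worker problems from probe configuration; the port below is an assumption:

```bash
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -s localhost:8080/health
curl -s localhost:8080/debug/redis
```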
## Log Monitoring and Alerting

### Key Metrics to Monitor

- **Error Rate**: Frequency of `ERROR`/`CRITICAL` log entries
- **Health Check Failures**: HTTP 500 responses on `/health`
- **Connection Timeouts**: Redis/broker connectivity issues
- **Task Failure Rate**: Ratio of failed to successful tasks
- **Worker Restarts**: Container restart frequency
### Sample Monitoring Queries

#### Using kubectl and basic tools
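As a sketch, simple counts over a recent window approximate the metrics above:

```bash
# Error-level entries across worker pods in the last 5 minutes
kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical"

# Completed tasks in the same window (Celery wording may vary by version)
kubectl logs -l app=worker -n workers --since=5m | grep -c "succeeded"
```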
#### Log-based Health Check
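A minimal log-based check, assuming the label and namespace used throughout, fails when recent logs contain error-level entries:

```bash
#!/usr/bin/env bash
# Exit non-zero if any error-level entries appeared in the last 5 minutes
errors=$(kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical")
if [ "$errors" -gt 0 ]; then
  echo "FAIL: $errors error entries in the last 5 minutes"
  exit 1
fi
echo "OK: no recent errors"
```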
## Debugging with Logs

### Step-by-Step Debugging Process

1. Identify the problem (symptom, affected pods, time window)
2. Get recent logs from the affected pods
3. Search for specific issues (errors, timeouts, failed tasks)
4. Correlate with the health endpoints (`/health`, `/debug`, `/debug/redis`)

A combined sketch of these steps follows.
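Pod names, label, namespace, and port as assumed elsewhere in this guide:

```bash
# 1. Identify the affected pod
kubectl get pods -n workers -l app=worker

# 2. Get recent logs
kubectl logs <pod-name> -n workers --tail=500

# 3. Search for specific issues
kubectl logs <pod-name> -n workers --tail=500 | grep -iE "error|timeout|raised"

# 4. Correlate with health endpoints (port is an assumption)
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -s localhost:8080/debug
```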
### Common Debugging Scenarios

#### Scenario 1: Pod Won’t Start
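Start with pod events, then the previous container instance's logs:

```bash
# Scheduling, image pull, and probe events
kubectl describe pod <pod-name> -n workers

# Output from the crashed container, if it started at all
kubectl logs <pod-name> -n workers --previous
```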
#### Scenario 2: Health Checks Failing
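Probe failures surface as Kubernetes events; comparing them with a direct request (port assumed) narrows the cause:

```bash
# Recent probe-related events
kubectl get events -n workers --sort-by=.lastTimestamp | grep -i probe

# Hit the endpoint directly
kubectl port-forward <pod-name> -n workers 8080:8080 &
curl -v localhost:8080/health
```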
#### Scenario 3: Tasks Not Processing
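Check whether the worker is consuming at all; `celery inspect` talks to workers over the broker (the app path is an assumption):

```bash
# Is the worker alive and responding over the broker?
kubectl exec -it <pod-name> -n workers -- celery -A rhesis.worker inspect ping

# What is it working on right now?
kubectl exec -it <pod-name> -n workers -- celery -A rhesis.worker inspect active
```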
## Best Practices
### 1. Log Retention
- Keep logs for at least 7 days for troubleshooting
- Archive important logs before pod restarts
- Use log aggregation systems for production
### 2. Log Levels
- Use `INFO` for production (balance of detail vs. noise)
- Use `DEBUG` for development and troubleshooting
- Use `ERROR` only for actual errors that need attention
### 3. Structured Logging
- Include relevant context (organization_id, user_id, task_id)
- Use consistent log formats across tasks
- Include timing information for performance monitoring
### 4. Log Monitoring
- Set up alerts for error rate increases
- Monitor health check failure patterns
- Track connection timeout frequencies
### 5. Performance Considerations
- Avoid excessive logging in tight loops
- Use appropriate log levels to control verbosity
- Consider log sampling for high-volume operations
## Integration with Other Tools
### Health Endpoints
The logging system integrates with the health endpoints:
- `/debug`: Shows system status including recent errors
- `/debug/redis`: Shows Redis connectivity with error details
- All endpoints log their usage for monitoring
### Monitoring Systems
Logs can be integrated with:
- **Prometheus**: Metric extraction from log patterns
- **Grafana**: Log visualization and dashboards
- **ELK Stack**: Centralized log aggregation and search
- **Google Cloud Logging**: Native GKE log collection
### Alert Integration
Example alert conditions based on logs (a rough shell sketch of the first condition follows the list):
- Error rate > 10% over 5 minutes
- Health check failure rate > 20% over 2 minutes
- No successful task processing for 10 minutes
- Redis connection timeouts > 5 in 5 minutes
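A rough sketch of the first condition, reusing the assumed label and namespace; a real deployment would compute this from Prometheus or Cloud Logging metrics rather than shell:

```bash
# Alert if more than 10% of the last 5 minutes' log lines are error-level
total=$(kubectl logs -l app=worker -n workers --since=5m | wc -l)
errors=$(kubectl logs -l app=worker -n workers --since=5m | grep -ciE "error|critical")
if [ "$total" -gt 0 ] && [ $((errors * 100 / total)) -gt 10 ]; then
  echo "ALERT: worker error rate above 10% over 5 minutes"
fi
```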
This comprehensive logging system provides visibility into all aspects of worker operation, from startup through task processing to health monitoring.