GKE Worker Troubleshooting Guide
This guide covers troubleshooting Celery workers running in Google Kubernetes Engine (GKE), including how to use the built-in debugging tools.
Quick Start: Connect to Your Cluster
1. Find Your Cluster
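List the clusters in the project that runs the workers (the project ID is a placeholder):

```bash
gcloud container clusters list --project <project-id>
```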
2. Get Credentials
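Fetch credentials so kubectl can reach the cluster; use `--zone` instead of `--region` for zonal clusters:

```bash
gcloud container clusters get-credentials <cluster-name> \
  --region <region> \
  --project <project-id>
```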
3. Install kubectl (if needed)
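If kubectl is not installed locally, the gcloud component is the simplest option:

```bash
gcloud components install kubectl
```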
Health Check Endpoints
The worker includes several debugging endpoints:
| Endpoint | Purpose | Use Case |
|---|---|---|
| `/ping` | Basic connectivity | Quick server test |
| `/health/basic` | Server health (no dependencies) | Readiness probe |
| `/health` | Lightweight health (Celery + Redis, no worker ping) | Liveness probe |
| `/debug` | Comprehensive system info | General debugging |
| `/debug/env` | Environment variables (sanitized) | Config issues |
| `/debug/redis` | Redis connectivity details | Connection problems |
| `/debug/detailed` | Slow health check with worker ping | Deep troubleshooting |
Worker Registration Checking
Check Registered Workers with Python Script
Create a Python script to check registered Celery workers and Redis connectivity:
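A minimal sketch of such a script is shown below. It assumes the broker URL is exposed in a `CELERY_BROKER_URL` environment variable and that the `celery` and `redis` packages are importable; adjust the variable name and the example filename (`check_workers.py`) to match your deployment.

```python
#!/usr/bin/env python3
"""check_workers.py - verify Redis connectivity and registered Celery workers."""
import os
import sys

import redis
from celery import Celery

# The broker URL is assumed to be in CELERY_BROKER_URL; adjust as needed.
broker_url = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")

# 1. Raw Redis connectivity check.
try:
    redis.Redis.from_url(broker_url, socket_connect_timeout=5).ping()
    print("Redis connection: OK")
except Exception as exc:
    print(f"Redis connection FAILED: {exc}")
    sys.exit(1)

# 2. Ask the broker which workers are currently registered.
app = Celery(broker=broker_url)
registered = app.control.inspect(timeout=5).registered() or {}

if registered:
    print(f"Registered workers: {len(registered)}")
    for worker, tasks in registered.items():
        print(f"  {worker}: {len(tasks)} task(s)")
else:
    print("No workers registered with the broker")
    sys.exit(1)
```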
Usage:
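For example, run the script inside a worker pod (the filename is illustrative):

```bash
kubectl exec -it <pod-name> -n <namespace> -- python check_workers.py
```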
Expected Output (with workers running):
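Illustrative output from the sketch above (worker names and task counts will differ):

```text
Redis connection: OK
Registered workers: 2
  celery@rhesis-worker-7d9f8b6c4d-abcde: 12 task(s)
  celery@rhesis-worker-7d9f8b6c4d-fghij: 12 task(s)
```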
Expected Output (no workers):
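Again, illustrative output from the sketch:

```text
Redis connection: OK
No workers registered with the broker
```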
Cluster Management Commands
Scale workers down (for debugging):
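For example (the `rhesis-worker` deployment name matches the configuration example at the end of this guide):

```bash
kubectl scale deployment rhesis-worker --replicas=0 -n <namespace>
```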
Scale workers back up:
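Restore whatever replica count the deployment normally runs (2 is only an example):

```bash
kubectl scale deployment rhesis-worker --replicas=2 -n <namespace>
```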
Check current replica count:
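For example:

```bash
kubectl get deployment rhesis-worker -n <namespace>
```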
Common Troubleshooting Commands
Check Pod Status
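For example (the label selector is an assumption; adjust or omit it to match your deployment's labels):

```bash
kubectl get pods -n <namespace> -l app=rhesis-worker
```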
Expected Output:
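Illustrative healthy output (names, ages, and counts will differ):

```text
NAME                             READY   STATUS    RESTARTS   AGE
rhesis-worker-7d9f8b6c4d-abcde   2/2     Running   0          3h
rhesis-worker-7d9f8b6c4d-fghij   2/2     Running   0          3h
```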
Problem Indicators:
- `1/2 Ready`: Worker container failing, cloudsql-proxy working
- `0/2 Ready`: Both containers failing
- `CrashLoopBackOff`: Container repeatedly failing
- High restart count: Ongoing issues
Check Pod Events
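For example:

```bash
kubectl describe pod <pod-name> -n <namespace>
```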
Look for the events section at the bottom of the output:
- `Unhealthy`: Health check failures
- `Failed`: Container start failures
- `Killing`: Pod being terminated
Test Basic Connectivity
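For example, from inside a worker pod:

```bash
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/ping
```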
Expected: `pong`
If this fails:
- Health server not starting
- Port 8080 not listening
- Container networking issues
Test Health Endpoints
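For example:

```bash
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/health/basic
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/health
```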
Get Debug Information
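For example (`jq` runs locally and is optional):

```bash
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug | jq
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug/env | jq
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug/redis | jq
```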
Common Issues and Solutions
1. Pods Stuck at 1/2 Ready
Symptoms:
Diagnosis:
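A typical sequence (the `worker` container name is an assumption; check the pod spec for the real names):

```bash
# Inspect events and container states for the stuck pod
kubectl describe pod <pod-name> -n <namespace>

# Check the worker container's logs
kubectl logs <pod-name> -n <namespace> -c worker

# Test Redis connectivity from inside the pod
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug/redis
```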
Common Causes:
A. Redis Connection Issues
Solutions:
- Check Redis URL format: `rediss://` for TLS, `redis://` for standard
- Verify SSL parameters: `ssl_cert_reqs=CERT_NONE`
- Check network policies allowing outbound connections
- Verify the Redis service is accessible from GKE
B. Health Check Timeouts
Note: As of the latest update, the main `/health` endpoint uses a lightweight check that doesn't ping workers. If you're still seeing timeouts:
Solutions:
- Check Redis connectivity specifically: `curl localhost:8080/debug/redis`
- Use the detailed health check to test worker ping: `curl localhost:8080/debug/detailed`
- The `/health` endpoint should now be much faster since it doesn't wait for worker responses
- If `/health` is still slow, it's likely a Redis connection issue, not worker startup
C. Environment Configuration
Check for:
- Missing environment variables
- Incorrect secret references
- Malformed URLs
2. CrashLoopBackOff
Diagnosis:
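For example:

```bash
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous

# Exit code and restart reason
kubectl describe pod <pod-name> -n <namespace>
```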
Common Causes:
A. Import Errors
Solutions:
- Check PYTHONPATH in deployment
- Verify Docker image build process
- Ensure all dependencies installed
B. Connection Failures
Solutions:
- Check SSL certificate configuration
- Verify the `ssl_cert_reqs=CERT_NONE` parameter
- Test Redis connectivity outside GKE
3. High Memory Usage
Diagnosis:
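For example (`kubectl top` requires metrics-server, which GKE enables by default):

```bash
# Current CPU and memory usage per pod
kubectl top pod -n <namespace>

# Look for OOMKilled terminations in the container's last state
kubectl describe pod <pod-name> -n <namespace> | grep -i -A3 "last state"
```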
Solutions:
- Adjust `CELERY_WORKER_MAX_TASKS_PER_CHILD`
- Increase memory limits in the deployment
- Monitor for memory leaks in tasks
4. Task Processing Issues
Diagnosis:
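For example:

```bash
# Slow, detailed health check that pings workers
kubectl exec -it <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug/detailed | jq

# Watch the worker logs for task receipt and completion
kubectl logs -f <pod-name> -n <namespace> | grep -i task
```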
Advanced Debugging
Interactive Shell Access
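For example (fall back to `/bin/sh` if bash is not in the image):

```bash
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
```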
From inside the container:
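A few quick checks once you have a shell:

```bash
# Verify the health server locally
curl -s http://localhost:8080/ping

# Confirm the Python environment can import the broker client libraries
python -c "import celery, redis; print(celery.__version__, redis.__version__)"
```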
Monitor Logs in Real-Time
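For example:

```bash
kubectl logs -f <pod-name> -n <namespace>

# Or follow a pod of the deployment without looking up its name
kubectl logs -f deployment/rhesis-worker -n <namespace>
```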
Network Debugging
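Utilities such as `nc` or `dig` may not be present in the image, so a Python socket check from inside the pod is a safe fallback (the host is a placeholder; adjust the port to your Redis setup):

```bash
kubectl exec -it <pod-name> -n <namespace> -- \
  python -c "import socket; socket.create_connection(('<redis-host>', 6379), timeout=5); print('reachable')"
```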
Performance Monitoring
Resource Usage
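For example:

```bash
kubectl top pods -n <namespace>
kubectl top nodes
```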
Health Check Performance
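curl's built-in timing avoids needing `time` inside the container:

```bash
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -s -o /dev/null -w "total: %{time_total}s\n" http://localhost:8080/health
```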
Preventive Measures
1. Proper Resource Limits
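The values below are illustrative starting points, not tuned recommendations:

```bash
kubectl set resources deployment rhesis-worker -n <namespace> \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=1,memory=2Gi
```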
2. Appropriate Health Check Timeouts
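Probe settings live in the deployment spec; edit them in place and adjust `timeoutSeconds`, `periodSeconds`, and `failureThreshold` to match how long the endpoints actually take:

```bash
kubectl edit deployment rhesis-worker -n <namespace>
```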
3. Monitoring and Alerting
Emergency Procedures
Force Pod Restart
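For example:

```bash
# Rolling restart of all worker pods
kubectl rollout restart deployment/rhesis-worker -n <namespace>

# Or delete a single pod and let the deployment recreate it
kubectl delete pod <pod-name> -n <namespace>
```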
Scale Down/Up
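The same commands as in the cluster management section above:

```bash
kubectl scale deployment rhesis-worker --replicas=0 -n <namespace>
kubectl scale deployment rhesis-worker --replicas=2 -n <namespace>
```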
Emergency Debugging
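Capture state before pods churn (drop `-it` so the output redirects cleanly):

```bash
kubectl exec <pod-name> -n <namespace> -- curl -s http://localhost:8080/debug > debug.json
kubectl logs <pod-name> -n <namespace> --tail=500 > worker.log
```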
Getting Help
When reporting issues, include:
- Cluster Information:
  ```bash
  kubectl version
  kubectl get nodes
  ```
- Pod Status:
  ```bash
  kubectl get pods -n <namespace> -o wide
  kubectl describe pod <pod-name> -n <namespace>
  ```
- Debug Output:
  ```bash
  kubectl exec -it <pod-name> -n <namespace> -- \
    curl http://localhost:8080/debug | jq
  ```
- Recent Logs:
  ```bash
  kubectl logs <pod-name> -n <namespace> --tail=100
  ```
- Configuration:
  ```bash
  kubectl get deployment rhesis-worker -n <namespace> -o yaml
  ```
This comprehensive information will help quickly identify and resolve issues.