Multi-Worker RPC Coordination
When running multiple backend workers (e.g., `--workers 4`), RPC requests need to be routed to the specific worker that holds the WebSocket connection to the SDK. This is achieved through direct worker routing using Redis.
Architecture Overview
The system uses a routing registry in Redis to track which worker owns each SDK connection, then routes RPC requests directly to that worker via dedicated channels.
Implementation
Worker Registration
When a backend worker establishes a WebSocket connection with an SDK, it registers itself as the handler:
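A minimal sketch of the registration step, assuming redis-py's asyncio client; the key and worker-ID formats come from the key points below, while the function names are illustrative:

```python
import socket
import uuid

import redis.asyncio as redis

# Unique per-process worker ID in the backend@{hostname}-{uuid} format.
WORKER_ID = f"backend@{socket.gethostname()}-{uuid.uuid4()}"
ROUTING_TTL = 30  # seconds; refreshed by the heartbeat loop

async def register_connection(r: redis.Redis, project_id: str, environment: str) -> None:
    # Claim ownership of this SDK connection for the current worker.
    await r.set(f"ws:routing:{project_id}:{environment}", WORKER_ID, ex=ROUTING_TTL)

async def unregister_connection(r: redis.Redis, project_id: str, environment: str) -> None:
    # Release ownership on disconnect so clients fail fast.
    await r.delete(f"ws:routing:{project_id}:{environment}")
```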
Key points:
- Each backend worker has a unique ID: `backend@{hostname}-{uuid}`
- Registration key format: `ws:routing:{project_id}:{environment}`
- A 30-second TTL prevents stale registrations; it is refreshed by the heartbeat loop
- On disconnect, the worker unregisters by deleting the routing key
Direct Routing
RPC clients (Celery workers) use the routing registry to send requests directly to the correct worker:
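A client-side sketch under the same redis-py assumption; the per-request response channel name (`ws:rpc:response:{request_id}`) and the request payload shape are illustrative guesses, not confirmed names:

```python
import json
import uuid

import redis.asyncio as redis

async def send_rpc(r: redis.Redis, project_id: str, environment: str, payload: dict) -> dict:
    # 1. Look up which worker owns the connection.
    worker_id = await r.get(f"ws:routing:{project_id}:{environment}")
    if worker_id is None:
        # Fail fast: no worker is registered for this SDK connection.
        raise ConnectionError(f"SDK connection {project_id}:{environment} is not available")

    request_id = str(uuid.uuid4())
    response_channel = f"ws:rpc:response:{request_id}"  # hypothetical naming

    # 2. Subscribe to the response channel *before* pushing, so the reply
    #    cannot slip past us.
    pubsub = r.pubsub()
    await pubsub.subscribe(response_channel)
    try:
        # 3. Push the request onto the owning worker's dedicated queue.
        #    worker_id is bytes when decode_responses=False (the default).
        request = {"id": request_id, "response_channel": response_channel, "payload": payload}
        await r.rpush(f"ws:rpc:{worker_id.decode()}", json.dumps(request))
        # 4. Wait for the worker to publish the SDK's result. A production
        #    version would also enforce a timeout here.
        async for message in pubsub.listen():
            if message["type"] == "message":
                return json.loads(message["data"])
    finally:
        await pubsub.unsubscribe(response_channel)
        await pubsub.close()
```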
Benefits of direct routing:
- No broadcast to all workers - only the correct worker receives the request
- Fail-fast behavior when SDK is disconnected (no worker registered)
- No race conditions or spurious errors from workers without connections
Worker Request Processing
Each backend worker runs a listener loop for its dedicated channel:
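A sketch of the listener loop; `handle_rpc_request` is a hypothetical stand-in for the forwarding logic in manager.py:

```python
import asyncio
import json

import redis.asyncio as redis

async def handle_rpc_request(request: dict) -> None:
    # Forward to the SDK over the local WebSocket, then publish the result
    # to request["response_channel"] (details elided in this sketch).
    ...

async def rpc_listener(r: redis.Redis, worker_id: str, shutdown: asyncio.Event) -> None:
    queue = f"ws:rpc:{worker_id}"  # this worker's dedicated channel
    while not shutdown.is_set():
        # Block for at most 1 second so the shutdown flag is re-checked
        # regularly, allowing graceful shutdown.
        item = await r.blpop(queue, timeout=1)
        if item is None:
            continue  # timed out, nothing queued
        _queue_name, raw = item
        await handle_rpc_request(json.loads(raw))
```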
Why this works:
- Each worker only listens to its own channel: `ws:rpc:{worker_id}`
- Requests are queued in a Redis list, so no messages are lost
- BLPOP is blocking but efficient (the 1s timeout allows graceful shutdown)
Heartbeat Mechanism
Worker registrations expire after 30 seconds to prevent stale entries. A heartbeat loop refreshes the registration every 10 seconds:
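A sketch of the heartbeat loop under the same redis-py assumption; a real implementation might verify ownership before refreshing:

```python
import asyncio

import redis.asyncio as redis

ROUTING_TTL = 30         # seconds until an unrefreshed registration expires
HEARTBEAT_INTERVAL = 10  # seconds between refreshes

async def heartbeat_loop(r: redis.Redis, routing_key: str, worker_id: str,
                         shutdown: asyncio.Event) -> None:
    while not shutdown.is_set():
        # Re-assert ownership and reset the TTL. If this worker crashes,
        # the key expires within 30s and clients stop routing to it.
        await r.set(routing_key, worker_id, ex=ROUTING_TTL)
        await asyncio.sleep(HEARTBEAT_INTERVAL)
```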
This ensures:
- Crashed workers are automatically unregistered after 30s
- Active workers maintain their routing entries
- Clients don’t route to dead workers
Monitoring
Expected Log Patterns
RPC Client (Celery worker): logs the routing lookup and which worker the request was dispatched to.

Backend Worker (the one with the connection): logs that it is forwarding the request to the SDK.

Other backend workers: receive no requests for this connection - their listeners are idle. They only process requests for connections they own. Illustrative examples of all three patterns follow.
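The exact messages depend on the implementation; a hypothetical sketch (only the `Forwarding RPC request ... to SDK` wording also appears in the Testing section below):

```
# RPC client (Celery worker)
DEBUG - Routing RPC request for project123:prod to backend@host-1-...

# Backend worker that owns the connection
DEBUG - Forwarding RPC request ... to SDK

# Other backend workers
(no log output for this request)
```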
Problematic Patterns
SDK disconnected but routing key exists:
ERROR - SDK connection project123:prod is not available (no worker registered)

If this happens frequently, check:
- Worker heartbeat is running (should refresh routing every 10s)
- Connectivity between the workers and Redis
- Worker registration on connection establishment
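One way to check the first two items, using the example identifiers from the log line above:

```
redis-cli GET ws:routing:project123:prod   # which worker, if any, is registered
redis-cli TTL ws:routing:project123:prod   # should read 20-30 if the 10s heartbeat
                                           # is refreshing the 30s TTL
```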
Worker routing mismatch:
ERROR - Worker routing mismatch: received RPC for project123:prod but connection not found

This indicates a race condition where:
- Routing key points to this worker
- But WebSocket connection was closed/cleaned up
- Usually resolves when routing key expires (30s) or client retries
Testing
To verify multi-worker routing:
- Start multiple workers: `--workers 4`
- Enable DEBUG logging: set the log level to DEBUG
- Connect SDK: establish a WebSocket connection
- Check routing registration: `redis-cli GET ws:routing:project_id:environment` (should return `backend@hostname-uuid`)
- Execute test: trigger an SDK function invocation
- Verify logs:
  - RPC client logs show routing to the specific worker
  - Only that worker logs `Forwarding RPC request ... to SDK`
  - Other workers have no logs for this request
Key Implementation Files
- `manager.py`: Worker registration, RPC listener loop, request handling, heartbeat
- `rpc_client.py`: Routing lookup, direct request routing, response handling
- `redis_client.py`: Redis connection management and pub/sub infrastructure
Architecture Diagram
Flow:
1. Backend worker registers itself when the SDK connects
2. Celery worker looks up which backend worker has the connection
3. Request is pushed directly to that worker's queue
4. Worker polls its queue and receives the request
5. Celery worker subscribes to the response channel
6. Backend worker forwards the request to the SDK via WebSocket
7. SDK result is published to the response channel
8. Celery worker receives the result and returns it
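A rough text sketch of this flow, with step numbers matching the list above and key/channel names as described earlier:

```
SDK <== WebSocket ==> Backend worker (connection owner)
                           |
                           | (1) SET ws:routing:{project}:{env} = worker_id  (TTL 30s)
                           | (4) BLPOP ws:rpc:{worker_id}
                           | (6) forward request to SDK over WebSocket
                           | (7) PUBLISH result to response channel
                           v
                         Redis
                           ^
                           | (2) GET ws:routing:{project}:{env}
                           | (3) RPUSH ws:rpc:{worker_id}
                           | (5) SUBSCRIBE response channel
                           | (8) receive result
                           |
                 Celery worker (RPC client)
```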
Benefits of Direct Routing
Compared to broadcast-based approaches:
- Efficiency: Only the relevant worker receives requests (no wasted processing)
- Fail-fast: Immediate error if SDK is disconnected (no timeout waiting)
- No race conditions: Single source of truth for which worker has the connection
- Scalability: O(1) routing lookup regardless of number of workers
- Simplicity: No complex coordination logic or connection location checking
Reliability:
- Heartbeat keeps routing fresh (30s TTL, 10s refresh)
- Stale entries expire automatically
- BLPOP queuing ensures no dropped requests
- Worker-specific queues prevent message interference