Trace Lifecycle
This page covers the complete trace lifecycle, including the race condition problem and the hybrid linking solution.
The Complete Flow
Timeline for Fast Test (< 5 seconds)
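As a rough sketch of the fast-test case, reconstructed from the timing figures later on this page (the 1.2-second test duration is purely illustrative):

```text
t = 0.0s   test starts; the SDK creates spans carrying rhesis.test.* attributes (buffered locally)
t ~ 1.2s   test finishes; the backend creates the test result
t ~ 1.2s   linking point #1 runs and finds no spans yet (the fast-test case)
t ~ 5.0s   BatchSpanProcessor exports the buffered spans
t ~ 5.0s   the backend ingests the batch (~10-20ms) and linking point #2 attaches the spans
t ~ 5.1s   enrichment runs in the background (~50-100ms)
```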
Key Phases
The Race Condition Problem
The Issue
OpenTelemetry’s BatchSpanProcessor batches spans and exports every 5 seconds. This creates unpredictable timing:
- Fast Tests (< 5s): Test result created BEFORE spans exported
- Slow Tests (> 5s): Spans exported BEFORE test result created
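For context, here is a minimal sketch of how the SDK export side is typically wired with the OpenTelemetry Python SDK. The OTLP/HTTP exporter and the endpoint URL are assumptions, and `schedule_delay_millis=5000` simply makes the library default explicit:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # Placeholder endpoint; only the /telemetry/traces path comes from this page.
        OTLPSpanExporter(endpoint="http://localhost:8080/telemetry/traces"),
        schedule_delay_millis=5_000,  # spans are buffered and exported every 5 seconds by default
    )
)
trace.set_tracer_provider(provider)
```

Spans created during a fast test therefore sit in this buffer until the next export cycle, after the test result already exists.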
The Solution: Hybrid Linking
Link traces at TWO strategic points to handle both scenarios:
Linking Points
Point #1: After Test Result Creation
Location: results.py → link_traces_for_test_result()
Purpose: Catch traces that arrived BEFORE test result (slow tests)
Point #2: After Span Ingestion
Location: telemetry.py → link_traces_for_incoming_batch()
Purpose: Catch traces that arrived AFTER test result (fast tests)
Idempotency
Both linking points use the same CRUD operation with an idempotency check:
Safe to call multiple times:
- First call: Updates N traces
- Second call: Updates 0 traces (already linked)
Result: 100% linking success rate regardless of test duration.
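A minimal sketch of what that shared operation could look like with SQLAlchemy; the `Trace` model, the `test_run_id` correlation column, and the function name are assumptions, while the `traces` table, the `test_result_id` column, and the NULL-check idempotency come from this page:

```python
from sqlalchemy import update
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Trace(Base):
    __tablename__ = "traces"
    id: Mapped[str] = mapped_column(primary_key=True)
    test_run_id: Mapped[str | None]     # assumed correlation key carried by the spans
    test_result_id: Mapped[str | None]  # NULL until the trace is linked


def link_traces(db: Session, test_run_id: str, test_result_id: str) -> int:
    """Attach unlinked traces to a test result; called from both linking points."""
    result = db.execute(
        update(Trace)
        .where(Trace.test_run_id == test_run_id)
        .where(Trace.test_result_id.is_(None))  # idempotency check: skip already-linked rows
        .values(test_result_id=test_result_id)
    )
    db.commit()
    return result.rowcount  # N traces on the first call, 0 on any later call
```

Because the filter excludes already-linked rows, running the same operation again from the other linking point is a no-op, which is what makes it safe to call at both points.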
Test Execution Context
The test execution context is passed from the test executor through the SDK to the spans it creates, and is stored on each span as `rhesis.test.*` attributes.
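A hedged sketch of what attaching that context might look like with the OpenTelemetry API; the `rhesis.test.*` prefix comes from this page, but the individual keys and the tracer name below are illustrative, not the SDK's actual attribute names:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rhesis.sdk")  # illustrative tracer name


def run_with_test_context(test_id: str, test_run_id: str) -> None:
    with tracer.start_as_current_span("test_execution") as span:
        span.set_attribute("rhesis.test.test_id", test_id)          # illustrative key
        span.set_attribute("rhesis.test.test_run_id", test_run_id)  # illustrative key
        # ... the test executes here; child spans share the current trace context ...
```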
Timing Summary
Critical Timing Points
| Point | Timing | Impact |
|---|---|---|
| BatchSpanProcessor delay | 5 seconds | Largest delay, unavoidable |
| Test execution | Variable | Determines which race scenario applies |
| Span ingestion | ~10-20ms | Fast |
| Enrichment | ~50-100ms | Background |
| Query | ~10ms | Cached |
Why 5 Seconds?
The 5-second batch delay is a trade-off:
| Shorter Delay | Longer Delay |
|---|---|
| More HTTP requests | Fewer HTTP requests |
| Lower latency | Higher latency |
| Higher network overhead | Lower network overhead |
| Better real-time visibility | Better batching efficiency |
OpenTelemetry’s default of 5 seconds optimizes for production efficiency over real-time visibility.
Cannot Avoid: The 5-second delay is fundamental to OpenTelemetry’s batch processing design. The hybrid linking strategy is the mitigation.
Error Handling
Workers Unavailable
Detection: celery_app.control.inspect()
Fallback: Sync enrichment
Impact: Slower ingestion, no data loss
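A minimal sketch of that decision, assuming the Celery app object is available as `celery_app`; `enrich_trace` (the Celery task) and `enrich_trace_sync` are illustrative names:

```python
def workers_available(celery_app, timeout: float = 1.0) -> bool:
    # inspect().ping() returns a dict of worker replies, or None when no workers respond.
    replies = celery_app.control.inspect(timeout=timeout).ping()
    return bool(replies)


def schedule_enrichment(celery_app, trace_id: str) -> None:
    if workers_available(celery_app):
        enrich_trace.delay(trace_id)   # normal path: enrich in the background
    else:
        enrich_trace_sync(trace_id)    # fallback: enrich inline (slower ingestion, no data loss)
```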
Database Failure
Handling: Return 500 error
Impact: Spans lost (SDK retries once)
Linking Failure
Handling: Log error, don’t fail ingestion
Impact: Traces stored but not linked
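Put together, the ingestion path might look roughly like this. It is a sketch assuming a FastAPI-style route, with `store_spans` standing in for the real persistence helper; only `/telemetry/traces` and `link_traces_for_incoming_batch` are names taken from this page:

```python
import logging

from fastapi import APIRouter, HTTPException

logger = logging.getLogger(__name__)
router = APIRouter()


@router.post("/telemetry/traces")
async def ingest_traces(batch: dict):
    try:
        await store_spans(batch)  # hypothetical persistence helper
    except Exception:
        # Database failure: fail the request so the SDK's single retry can resend the batch.
        raise HTTPException(status_code=500, detail="failed to store spans")

    try:
        await link_traces_for_incoming_batch(batch)  # linking point #2
    except Exception:
        # Linking failure: spans are stored but unlinked; never fail ingestion because of this.
        logger.exception("trace linking failed; spans stored but not linked")

    return {"status": "accepted"}
```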
Enrichment Failure
Handling: Skip problematic spans
Impact: Partial enrichment (other spans still enriched)
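A minimal sketch of that isolation: each span is enriched independently, so one bad span cannot block the rest. The span dict shape and the `span_id` key are assumptions:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def enrich_spans(spans: list[dict], enrich_span: Callable[[dict], dict]) -> dict[str, dict]:
    """Enrich spans one at a time so a single failure cannot block the batch."""
    enriched: dict[str, dict] = {}
    for span in spans:
        try:
            enriched[span["span_id"]] = enrich_span(span)
        except Exception:
            # Partial enrichment: log and skip; the remaining spans are still enriched.
            logger.exception("skipping span %s during enrichment", span.get("span_id"))
    return enriched
```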
Debugging Guide
Traces Not Appearing?
- Check SDK export: Is `BatchSpanProcessor` configured?
- Check backend: Is `/telemetry/traces` receiving requests?
- Check database: Are spans stored in the `traces` table?
- Wait 5 seconds for the batching delay
Traces Not Linked to Test Results?
- Check test context: Are spans created with `rhesis.test.*` attributes?
- Check linking logs: Are both linking attempts running?
- Check database: Is `test_result_id` NULL or set?
Enrichment Not Happening?
- Check workers: Are Celery workers running?
- Check logs: Is enrichment being queued or run sync?
- Check database: Is the `enriched_data` column populated?
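For the database checks above, a small sketch using SQLAlchemy Core; the connection URL is a placeholder, while the `traces` table and the `test_result_id` / `enriched_data` columns are the ones named on this page:

```python
from sqlalchemy import create_engine, text

# Placeholder connection URL; point it at the backend's database.
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/rhesis")

with engine.connect() as conn:
    stored = conn.execute(text("SELECT COUNT(*) FROM traces")).scalar()
    unlinked = conn.execute(
        text("SELECT COUNT(*) FROM traces WHERE test_result_id IS NULL")
    ).scalar()
    unenriched = conn.execute(
        text("SELECT COUNT(*) FROM traces WHERE enriched_data IS NULL")
    ).scalar()
    print(f"stored={stored} unlinked={unlinked} unenriched={unenriched}")
```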