Tracing System
Technical documentation for the Rhesis tracing system architecture and implementation.
For SDK Users: See the Tracing documentation for usage guides. This section covers the internal architecture and design for developers and contributors.
Overview
The tracing system captures OpenTelemetry-compliant traces from SDK-instrumented applications. It supports two operating modes:
- Test Mode: Traces linked to test runs, test cases, and test results
- Production Mode: Traces from live application monitoring
High-Level Architecture
Key Technologies
| Component | Technology | Purpose |
|---|---|---|
| SDK Tracer | OpenTelemetry Python | Span creation with AI semantic conventions |
| Batch Processor | OTEL BatchSpanProcessor | Batches spans, exports every 5 seconds |
| Transport | OTLP/HTTP | JSON payload to /telemetry/traces |
| Storage | PostgreSQL + JSONB | Flexible span storage with full-text search |
| Enrichment | Celery + LiteLLM | Cost calculation, anomaly detection |
| Linking | Service Layer | Hybrid strategy for test context linking |
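The batching row above can be illustrated with a small standard-library sketch. This is not the OpenTelemetry `BatchSpanProcessor` and not the Rhesis SDK code; `SpanBatcher`, `export_fn`, and the parameter defaults are hypothetical names chosen to show the idea of queueing spans and exporting them on a timer.

```python
import threading
from typing import Callable, Dict, List

Span = Dict[str, object]  # simplified stand-in for a finished span


class SpanBatcher:
    """Toy batch processor: queue spans, export them every `schedule_delay_s`."""

    def __init__(
        self,
        export_fn: Callable[[List[Span]], None],
        schedule_delay_s: float = 5.0,  # mirrors the 5 second interval in the table
        max_batch_size: int = 512,      # assumed value, for illustration only
    ) -> None:
        self._export_fn = export_fn
        self._delay = schedule_delay_s
        self._max_batch = max_batch_size
        self._queue: List[Span] = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, span: Span) -> None:
        # Called when a span finishes; only enqueues, so the caller is not blocked.
        with self._lock:
            self._queue.append(span)

    def _drain(self) -> None:
        # Export everything currently queued, one batch at a time.
        while True:
            with self._lock:
                batch = self._queue[: self._max_batch]
                del self._queue[: self._max_batch]
            if not batch:
                return
            self._export_fn(batch)

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to flush) and True once stopped.
        while not self._stop.wait(self._delay):
            self._drain()

    def shutdown(self) -> None:
        # Stop the timer thread, then flush whatever is still queued.
        self._stop.set()
        self._worker.join()
        self._drain()


exported: List[List[Span]] = []
batcher = SpanBatcher(exported.append, schedule_delay_s=0.05)
batcher.on_end({"name": "llm.call"})
batcher.on_end({"name": "tool.call"})
batcher.shutdown()
total_exported = sum(len(batch) for batch in exported)
```

Because spans are only enqueued in `on_end`, span creation stays cheap and the network call happens off the application's hot path; the cost is the export delay discussed under Performance Characteristics.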
Communication Channels
The SDK uses two independent channels:
| Channel | Protocol | Purpose | Used By |
|---|---|---|---|
| Tracing | HTTP POST | Export OpenTelemetry spans | @observe, @endpoint |
| Testing | WebSocket | Remote test invocation | @endpoint only |
Design Principles
- OpenTelemetry Standard - Industry-standard OTLP protocol for interoperability
- Async-First with Sync Fallback - Optimal in production, works without workers in development
- Hybrid Linking - Two strategic linking points to handle race conditions
- Idempotent Operations - Safe to call linking multiple times
- Cache Enrichment - Compute once, query fast
- Graceful Degradation - System works even when components fail
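The hybrid linking and idempotency principles above can be sketched with in-memory data. This is an assumption about what "two linking points" means (one when spans are ingested, one when a test result is saved); the real logic lives in `linking_service.py` and is not reproduced here. `Store`, `Span`, `link`, `ingest_span`, `save_result`, and the `(test_run_id, test_id)` key are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (test_run_id, test_id): hypothetical linking key


@dataclass
class Span:
    span_id: str
    test_run_id: str
    test_id: str
    test_result_id: Optional[str] = None


@dataclass
class Store:
    spans: List[Span] = field(default_factory=list)
    results: Dict[Key, str] = field(default_factory=dict)


def link(store: Store, key: Key) -> int:
    """Attach the result id to still-unlinked spans for `key`; return rows changed."""
    result_id = store.results.get(key)
    if result_id is None:
        return 0  # result not saved yet; the other linking point will handle it
    changed = 0
    for span in store.spans:
        if (span.test_run_id, span.test_id) == key and span.test_result_id is None:
            span.test_result_id = result_id
            changed += 1
    return changed


def ingest_span(store: Store, span: Span) -> int:
    store.spans.append(span)
    return link(store, (span.test_run_id, span.test_id))  # linking point 1


def save_result(store: Store, key: Key, result_id: str) -> int:
    store.results[key] = result_id
    return link(store, key)  # linking point 2


key = ("run-1", "test-1")

# Order A: the span arrives before the test result exists.
a = Store()
a_at_ingest = ingest_span(a, Span("s1", *key))
a_at_result = save_result(a, key, "result-1")

# Order B: the test result is saved before the span arrives.
b = Store()
b_at_result = save_result(b, key, "result-1")
b_at_ingest = ingest_span(b, Span("s1", *key))

# Idempotency: linking again changes nothing.
a_repeat = link(a, key)
```

Whichever side arrives second performs the link, and repeated calls are no-ops because only spans without a `test_result_id` are updated.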
Performance Characteristics
| Operation | Timing | Notes |
|---|---|---|
| Span creation (SDK) | ~0.1ms | Per span, negligible overhead |
| BatchProcessor delay | 5000ms | OpenTelemetry BatchSpanProcessor default schedule delay |
| Span export (OTLP) | ~10ms | Network call |
| Backend ingestion | ~10-20ms | With async enrichment |
| Enrichment calculation | ~50-100ms | Background (Celery) |
| Trace query | ~10ms | Cached enrichment |
| End-to-end | ~5 seconds | Test start to queryable trace |
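As a rough check on the end-to-end figure, the stages from the table can be summed. The sketch below assumes the worst case, where a span waits the full batch interval, uses the upper bound of each range, and treats enrichment as off the critical path because it runs in the background. The variable names are illustrative.

```python
# Upper-bound figures from the table above, in milliseconds.
CRITICAL_PATH_MS = {
    "span_creation": 0.1,
    "batch_delay": 5000.0,      # worst case: span waits the full batch interval
    "span_export": 10.0,
    "backend_ingestion": 20.0,
}
ENRICHMENT_MS = 100.0  # background task, assumed not to block queryability

total_ms = sum(CRITICAL_PATH_MS.values())
batch_share = CRITICAL_PATH_MS["batch_delay"] / total_ms

print(f"critical path: {total_ms:.1f} ms")
print(f"batch delay share: {batch_share:.1%}")
```

Under these assumptions the batch delay accounts for over 99% of the critical path, which is why the end-to-end time is quoted as roughly 5 seconds.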
Key Files
SDK
| File | Purpose |
|---|---|
| sdk/src/rhesis/sdk/telemetry/tracer.py | Core Tracer class |
| sdk/src/rhesis/sdk/telemetry/exporter.py | OTLP HTTP exporter |
| sdk/src/rhesis/sdk/telemetry/attributes.py | AI semantic conventions |
| sdk/src/rhesis/sdk/decorators/observe.py | @observe decorator |
| sdk/src/rhesis/sdk/decorators/endpoint.py | @endpoint decorator |
Backend
| File | Purpose |
|---|---|
| apps/backend/.../routers/telemetry.py | Ingestion endpoint |
| apps/backend/.../services/telemetry/linking_service.py | Hybrid linking logic |
| apps/backend/.../services/telemetry/enricher.py | Cost/anomaly enrichment |
| apps/backend/.../tasks/execution/executors/results.py | Test result processing |
| apps/backend/.../crud.py | create_trace_spans(), update_traces_with_test_result_id() |
Next Steps
- Architecture - Detailed component architecture
- Trace Lifecycle - Complete flow and race condition handling
- Data Structures - Schemas and database design