
Workers

Target Architecture — Final-State Design

This page describes the final-state background workers of the Observability & Feedback Platform. Workers run as hosted services / Azure Functions, consume the canonical event envelope, and are idempotent on eventId.

The platform runs eight background workers that do the heavy, asynchronous work of turning raw telemetry into signals and feedback. Ingestion workers (TraceIngestionWorker, MetricAggregationWorker) handle factory-scale volume on Azure Functions / Container Apps with consumption scaling; reaction workers (AlertEvaluationWorker, IncidentAnalysisWorker, FeedbackCreationWorker, QualityScoreWorker, CostAnomalyWorker, TelemetryCorrelationWorker) run as scheduled or event-driven hosted services.

Worker Catalog

| Worker | Trigger | Purpose | Input | Output | Retry | Idempotency |
| --- | --- | --- | --- | --- | --- | --- |
| TraceIngestionWorker | OTLP stream / queue | Normalise and project incoming spans into correlated TraceRecord views. | OTLP spans | TraceRecord, TraceRecorded | Exponential backoff; DLQ after 5 | Dedup on span+traceId; upsert by traceId |
| MetricAggregationWorker | Schedule (1m/5m/1h) | Roll raw counters/histograms into MetricSeries. | Raw metrics | MetricSeries, MetricAggregated | Re-run window on failure (idempotent) | Deterministic per (metric, window, key) |
| AlertEvaluationWorker | Schedule (per rule cadence) | Evaluate AlertRule conditions against metrics/signals. | MetricSeries, AlertRule | AlertTriggered | Retry next cadence | Dedup trigger per (ruleId, window) |
| IncidentAnalysisWorker | Event AlertTriggered, SloBreached | Correlate triggers into incidents; enrich with trace lineage. | Triggers, correlation views | Incident, IncidentOpened | Backoff; DLQ after 5 | Dedup on (source, sourceId); idempotent key |
| FeedbackCreationWorker | Event IncidentResolved, runtime signals | Distil incidents/signals into FeedbackItem. | Incidents, signals | FeedbackItem, FeedbackItemCreated | Backoff; DLQ after 5 | Idempotency key from sourceId |
| QualityScoreWorker | Schedule + event FeedbackItemCreated | Recompute QualityScore per project/artifact. | Feedback, incidents, SLO status | QualityScore, QualityScoreComputed | Re-run (idempotent) | Deterministic per (project, window) |
| CostAnomalyWorker | Schedule (hourly/daily) | Attribute cost and detect anomalies. | Metering/usage, CostSignal | CostSignal, CostAnomalyDetected | Re-run window | Dedup anomaly per (category, day) |
| TelemetryCorrelationWorker | Event (any) | Stitch traces, logs, metrics, feedback into TelemetryCorrelation by traceId. | All platform events | TelemetryCorrelation snapshot | Backoff; DLQ after 5 | Upsert by traceId |
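
The "Deterministic per (metric, window, key)" entry for MetricAggregationWorker can be illustrated with a minimal Python sketch. It assumes windows are aligned to the Unix epoch in UTC and that the aggregate is a simple sum. `MetricSeriesStore`, `series_key`, and `aggregate` are hypothetical names used only for illustration; they are not part of the platform's API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, width: timedelta) -> datetime:
    """Floor a timezone-aware timestamp to the start of its window (UTC)."""
    elapsed = ts.astimezone(timezone.utc) - EPOCH
    return EPOCH + (elapsed // width) * width


def series_key(metric: str, ts: datetime, width: timedelta, key: str) -> str:
    """Deterministic key per (metric, window, key); same inputs, same key."""
    start = window_start(ts, width)
    return f"{metric}|{int(width.total_seconds())}s|{start.isoformat()}|{key}"


class MetricSeriesStore:
    """Hypothetical in-memory stand-in for the MetricSeries store."""

    def __init__(self) -> None:
        self.rows: dict[str, float] = {}

    def upsert(self, key: str, value: float) -> None:
        # Overwrite rather than increment, so a re-run cannot double count.
        self.rows[key] = value


def aggregate(store, metric, key, samples, width) -> None:
    """Sum (timestamp, value) samples per window and upsert each total."""
    totals: dict[str, float] = {}
    for ts, value in samples:
        k = series_key(metric, ts, width, key)
        totals[k] = totals.get(k, 0.0) + value
    for k, total in totals.items():
        store.upsert(k, total)
```

Because each total is recomputed from the raw samples and written with an overwrite, re-running the same window after a failure leaves the store unchanged. This is the property that makes "re-run window" a safe retry policy for the scheduled workers.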

Event-Flow Diagram

flowchart TB
    OTLP["OTLP spans / metrics<br/>from Runtime &amp; Agents"] --> TraceW["TraceIngestionWorker"]
    OTLP --> MetricW["MetricAggregationWorker"]

    TraceW -->|"TraceRecorded"| CorrW["TelemetryCorrelationWorker"]
    MetricW -->|"MetricAggregated"| CorrW
    MetricW -->|"series"| AlertW["AlertEvaluationWorker"]
    MetricW -->|"series"| SloEval["SloService (breach eval)"]

    AlertW -->|"AlertTriggered"| IncW["IncidentAnalysisWorker"]
    SloEval -->|"SloBreached"| IncW
    IncW -->|"IncidentOpened"| Store[("Incidents store")]
    IncW -->|"IncidentResolved"| FbW["FeedbackCreationWorker"]
    TraceW -->|"runtime signals"| FbW

    FbW -->|"FeedbackItemCreated"| QualW["QualityScoreWorker"]
    FbW -->|"FeedbackItemCreated"| KP["Knowledge Platform"]
    MetricW -->|"usage"| CostW["CostAnomalyWorker"]
    CostW -->|"CostAnomalyDetected"| FbW
    QualW -->|"QualityScoreComputed"| KP
    CorrW -->|"correlated views"| IncW
    CorrW -->|"correlated views"| QualW
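
The event-labelled edges of the diagram can also be read as a subscription table. The Python sketch below is an illustration only: it encodes the edges that carry an event name and walks them to list everything a given event can eventually reach. Edges labelled "series", "usage", "runtime signals", and "correlated views" are left out because the diagram does not name an event for them. `SUBSCRIPTIONS`, `EMITS`, and `downstream` are hypothetical names, not platform configuration.

```python
from collections import deque

# Event type -> consumers, read off the labelled edges of the diagram.
SUBSCRIPTIONS = {
    "TraceRecorded": ["TelemetryCorrelationWorker"],
    "MetricAggregated": ["TelemetryCorrelationWorker"],
    "AlertTriggered": ["IncidentAnalysisWorker"],
    "SloBreached": ["IncidentAnalysisWorker"],
    "IncidentOpened": ["Incidents store"],
    "IncidentResolved": ["FeedbackCreationWorker"],
    "CostAnomalyDetected": ["FeedbackCreationWorker"],
    "FeedbackItemCreated": ["QualityScoreWorker", "Knowledge Platform"],
    "QualityScoreComputed": ["Knowledge Platform"],
}

# Worker -> event types it emits, again taken from the diagram edges.
EMITS = {
    "TraceIngestionWorker": ["TraceRecorded"],
    "MetricAggregationWorker": ["MetricAggregated"],
    "AlertEvaluationWorker": ["AlertTriggered"],
    "IncidentAnalysisWorker": ["IncidentOpened", "IncidentResolved"],
    "FeedbackCreationWorker": ["FeedbackItemCreated"],
    "QualityScoreWorker": ["QualityScoreComputed"],
    "CostAnomalyWorker": ["CostAnomalyDetected"],
}


def downstream(event_type: str) -> list[str]:
    """Breadth-first list of consumers an event can eventually reach."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([event_type])
    while queue:
        for consumer in SUBSCRIPTIONS.get(queue.popleft(), []):
            if consumer in seen:
                continue
            seen.add(consumer)
            order.append(consumer)
            # Follow the events this consumer may emit in turn.
            queue.extend(EMITS.get(consumer, []))
    return order


print(downstream("AlertTriggered"))
```

For `AlertTriggered` the walk visits IncidentAnalysisWorker, the Incidents store, FeedbackCreationWorker, QualityScoreWorker, and the Knowledge Platform, in that order. This matches the alert-to-feedback path in the diagram.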

Worker Design

  • Idempotency — every worker deduplicates on the envelope eventId and computes a handler-specific idempotency key (e.g. eventId + handler name, or a deterministic key per aggregation window). Re-delivery never produces duplicate aggregates, incidents, or feedback.
  • Retry and poison handling — event-driven workers use exponential backoff and move unprocessable messages to a dead-letter subqueue with the full envelope preserved for replay (see event envelope consumer rules). Scheduled workers simply re-run the window, which is safe because their output is deterministic.
  • Tenant guard — every handler asserts tenantId from the envelope against the operation scope before touching a store.
  • Trace propagation — workers continue the inbound traceId into their OTEL spans and Serilog context, so the platform's own processing is itself traceable.
  • Scaling — TraceIngestionWorker and MetricAggregationWorker scale on queue depth / schedule concurrency; reaction workers scale on subscription backlog. See Deployment.
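
The idempotency, retry, and tenant-guard rules above can be sketched together as one delivery wrapper for an event-driven worker. This is a minimal Python illustration under stated assumptions, not the platform's implementation. "DLQ after 5" is read as five handler attempts, a tenant mismatch is treated as non-retryable and dead-lettered immediately, and the dedup store and dead-letter queue are in-memory stand-ins. `Envelope`, `Worker`, and every field or parameter name are hypothetical; the real envelope uses `eventId`, `tenantId`, and `traceId`. Trace propagation is not shown.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """Hypothetical subset of the canonical event envelope."""
    event_id: str
    tenant_id: str
    trace_id: str
    event_type: str
    payload: dict = field(default_factory=dict)


class Worker:
    def __init__(self, name, handler, scope_tenant_id,
                 max_attempts=5, base_delay=0.5, sleep=time.sleep):
        self.name = name
        self.handler = handler
        self.scope_tenant_id = scope_tenant_id
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests need not wait
        self.processed: set[str] = set()          # stand-in dedup store
        self.dead_letters: list[Envelope] = []    # stand-in DLQ

    def idempotency_key(self, env: Envelope) -> str:
        # eventId + handler name, as suggested in the Idempotency bullet.
        return f"{env.event_id}:{self.name}"

    def deliver(self, env: Envelope) -> str:
        key = self.idempotency_key(env)
        if key in self.processed:
            return "duplicate"  # re-delivery is a no-op
        if env.tenant_id != self.scope_tenant_id:
            # Tenant guard: assumed non-retryable, so no store is touched.
            self.dead_letters.append(env)
            return "dead-lettered"
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.handler(env)
            except Exception:
                if attempt == self.max_attempts:
                    # Keep the full envelope so it can be replayed later.
                    self.dead_letters.append(env)
                    return "dead-lettered"
                # Exponential backoff: base, 2x base, 4x base, ...
                self.sleep(self.base_delay * 2 ** (attempt - 1))
            else:
                self.processed.add(key)
                return "processed"
        return "skipped"  # only reachable when max_attempts < 1
```

The key is recorded only after the handler succeeds, so a crash mid-handler leads to a retry rather than a silently dropped event. In turn, this relies on the handler's own writes being upserts, as the catalog's Idempotency column describes.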