Observability

Target Architecture — Final-State Design

This page describes how the Observability & Feedback Platform observes itself. It dogfoods the same Serilog + OpenTelemetry + Application Insights stack it provides to the rest of the factory: if the platform that watches everything is not itself observable, the trust loop has a blind spot.

The platform applies its own discipline to itself. Every service and worker is instrumented with the shared ConnectSoft observability libraries, emits the required telemetry dimensions, and is monitored by the same dashboards, alerts, and SLOs it offers to generated SaaS products. This is consistent with the factory's observability-driven design principle.

Implemented

The instrumentation substrate is already implemented in the codebase as four libraries: ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry, ConnectSoft.Extensions.Logging.Serilog, and ConnectSoft.Extensions.Diagnostics.Metrics. Together they export OpenTelemetry (OTEL) traces, metrics, and logs to Application Insights / Log Analytics, with optional Grafana/Prometheus dashboards. See Platform Foundations — Observability and Factory Runtime — Observability.

Logs

  • Structured Serilog via ConnectSoft.Extensions.Logging.Serilog, sinking to Log Analytics.
  • Every log line is enriched with the required dimensions (traceId, executionId, tenantId, projectId, moduleId, agentId, skillId, artifactId, workflowId, environment, version).
  • Redaction enrichers scrub secrets and PII before egress (see Security).
  • The platform's own logs are queried through the same LogQueryService it exposes to others.
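The enrich-then-redact flow in the list above can be sketched as follows. This is a minimal, language-neutral illustration written with Python's standard `logging` module, not the ConnectSoft.Extensions.Logging.Serilog API. The `SECRET_PATTERN` regex, the class names, and the `enrich_and_redact` helper are assumptions made for the example.

```python
import logging
import re

# The required dimensions listed on this page.
REQUIRED_DIMENSIONS = (
    "traceId", "executionId", "tenantId", "projectId", "moduleId", "agentId",
    "skillId", "artifactId", "workflowId", "environment", "version",
)

# Hypothetical rule: treat "password=...", "secret=..." and "token=..." as secrets.
SECRET_PATTERN = re.compile(r"(?i)\b(password|secret|token)=\S+")


class DimensionEnricher(logging.Filter):
    """Attach every required dimension to the record (None when unknown)."""

    def __init__(self, context):
        super().__init__()
        self.context = context

    def filter(self, record):
        for name in REQUIRED_DIMENSIONS:
            if not hasattr(record, name):
                setattr(record, name, self.context.get(name))
        return True


class RedactionFilter(logging.Filter):
    """Scrub secret-looking key=value pairs before the record leaves the process."""

    def filter(self, record):
        rendered = record.getMessage()  # format first, so arguments are scrubbed too
        record.msg = SECRET_PATTERN.sub(lambda m: m.group(1) + "=***", rendered)
        record.args = ()
        return True


def enrich_and_redact(record, context):
    """Apply both steps in the order the page describes: enrich, then redact."""
    DimensionEnricher(context).filter(record)
    RedactionFilter().filter(record)
    return record
```

Redaction runs on the fully rendered message so that secrets passed as format arguments are scrubbed as well, at the cost of losing the original message template.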

Metrics

  • OpenTelemetry metrics via ConnectSoft.Extensions.Diagnostics.Metrics, exported to Application Insights.
  • Per-service operational metrics: ingestion throughput and lag (TraceIngestionWorker, MetricAggregationWorker), Service Bus subscription backlog, aggregation window latency, alert evaluation duration, incident open/resolve rates, feedback creation rate, cost-recompute duration.
  • All metrics are dimensioned by the required dimensions plus service and worker, so the platform's own health is sliceable the same way factory telemetry is.
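To illustrate what "sliceable by dimension" means in practice, here is a small in-memory sketch. It is not the ConnectSoft.Extensions.Diagnostics.Metrics or OpenTelemetry API; the `DimensionedCounter` class and the metric name are hypothetical, and a real exporter would hand this aggregation to the backend.

```python
from collections import defaultdict


class DimensionedCounter:
    """A counter that keeps one running total per unique set of dimensions."""

    def __init__(self, name):
        self.name = name
        self._series = defaultdict(float)

    def add(self, value, **dimensions):
        # Each distinct combination of dimension values is its own series.
        self._series[frozenset(dimensions.items())] += value

    def total(self, **where):
        # Sum every series whose dimensions include all of the requested pairs.
        wanted = set(where.items())
        return sum(v for key, v in self._series.items() if wanted <= key)


ingested = DimensionedCounter("ingested_events")
ingested.add(3, service="ingestion", worker="TraceIngestionWorker", tenantId="t-1")
ingested.add(2, service="ingestion", worker="MetricAggregationWorker", tenantId="t-1")
ingested.add(5, service="ingestion", worker="TraceIngestionWorker", tenantId="t-2")

per_worker = ingested.total(worker="TraceIngestionWorker")  # slices across tenants
per_tenant = ingested.total(tenantId="t-1")                 # slices across workers
```

Because `service` and `worker` are recorded alongside the required dimensions, the same query shape answers both "how is this worker doing?" and "how is this tenant doing?".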

Traces

  • OpenTelemetry traces via ConnectSoft.Extensions.Observability, exported to Application Insights.
  • Inbound traceId/correlationId from the event envelope are continued into worker spans, so processing of a runtime signal is itself part of the original trace — the platform's own work is traceable end to end.
  • Span attributes carry the required dimensions, enabling the platform's own TelemetryCorrelationService to correlate its self-telemetry.
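The trace continuation described above can be sketched like this. It is a simplified model rather than the OpenTelemetry SDK: the `Span` dataclass, the `continue_trace` function, and the envelope field name `spanId` are assumptions for illustration (the page itself only names `traceId` and `correlationId`).

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional

# Subset of the required dimensions copied onto the span in this sketch.
COPIED_DIMENSIONS = ("tenantId", "executionId", "workflowId")


@dataclass
class Span:
    name: str
    trace_id: str
    parent_span_id: Optional[str]
    # 8 random bytes rendered as 16 hex characters, as in W3C Trace Context.
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    attributes: dict = field(default_factory=dict)


def continue_trace(envelope, name):
    """Start a worker span inside the trace that produced the envelope."""
    trace_id = envelope.get("traceId")
    if not trace_id:
        raise ValueError("envelope has no traceId; cannot continue the trace")
    # Reuse the inbound trace id; only the span id is new.
    span = Span(name=name, trace_id=trace_id, parent_span_id=envelope.get("spanId"))
    for key in COPIED_DIMENSIONS:
        if key in envelope:
            span.attributes[key] = envelope[key]
    return span
```

The key point is that the worker never mints a new trace id: the span it creates is a child within the original trace, so the platform's processing shows up in the same end-to-end view as the signal it processed.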

Dashboards

  • Self-monitoring dashboards (App Insights workbooks, optional Grafana) cover ingestion pipeline health, event-flow latency across the workers, store health (App Insights, Log Analytics, Azure SQL, Blob), and the end-to-end loop latency from TraceRecorded to FeedbackItemCreated.
  • These dashboards are defined through the platform's own DashboardService.

Alerts

  • Alert rules (via the platform's own AlertRuleService) watch for: ingestion lag exceeding threshold, dead-letter accumulation, aggregation falling behind, loop latency regressions, and store unavailability.
  • Critical platform alerts page the platform on-call and open incidents through the platform's own IncidentService — the platform reacts to its own problems with its own machinery.
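A rough sketch of how such rules might be evaluated and escalated is shown below. The `AlertRule` shape, the metric names, the thresholds, and the `notify` action for non-critical rules are all assumptions for illustration; the page only states that critical alerts page the on-call and open incidents, and it does not describe the AlertRuleService or IncidentService interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str  # "critical" or "warning" (hypothetical levels)


def evaluate(rules, snapshot):
    """Return the rules whose metric exceeds its threshold in this snapshot."""
    return [r for r in rules if snapshot.get(r.metric, 0.0) > r.threshold]


def react(fired):
    """Map fired rules to actions: critical rules page and open an incident."""
    actions = []
    for rule in fired:
        if rule.severity == "critical":
            actions.append(("page_on_call", rule.name))
            actions.append(("open_incident", rule.name))
        else:
            actions.append(("notify", rule.name))  # assumed behavior
    return actions


rules = [
    AlertRule("ingestion-lag", "ingestion_lag_seconds", 60.0, "critical"),
    AlertRule("dead-letters", "dead_letter_count", 100.0, "warning"),
]
fired = evaluate(rules, {"ingestion_lag_seconds": 95.0, "dead_letter_count": 3})
actions = react(fired)
```

In this sketch a metric that is absent from the snapshot is treated as zero and therefore never fires; a production rule set would more likely treat missing data as its own alert condition.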

Required Dimensions

The platform both enforces and consumes the required telemetry dimensions:

| Dimension | Enforced as |
| --- | --- |
| traceId | OTEL trace context; envelope field; mandatory on ingest. |
| executionId | Span attribute + Serilog enricher. |
| tenantId | Span/log enricher; isolation boundary; rejected if absent. |
| projectId | Span/log enricher; query filter. |
| moduleId | Span/log enricher. |
| agentId | Span attribute (nullable for non-agent runtime). |
| skillId | Span attribute (nullable). |
| artifactId | Span attribute; feedback attribution key. |
| workflowId | Span attribute; ties to Control Plane workflow. |
| environment | OTEL resource attribute. |
| version | OTEL resource attribute. |

Telemetry missing traceId or tenantId is rejected at ingestion; the remaining dimensions are validated and back-filled where derivable, and gaps are themselves reported as a data-quality metric.
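The ingestion rule above (reject, back-fill, count gaps) can be sketched as follows. The function name, the `derive` callback, and the `RejectedTelemetry` exception are hypothetical. Leaving the nullable `agentId` and `skillId` out of the gap count is also an assumption, since the page does not say how nullable dimensions are scored.

```python
from collections import Counter

MANDATORY = ("traceId", "tenantId")
# Dimensions that may be back-filled; agentId and skillId are nullable and skipped here.
BACKFILLABLE = (
    "executionId", "projectId", "moduleId", "artifactId",
    "workflowId", "environment", "version",
)


class RejectedTelemetry(ValueError):
    """Raised when a record lacks a mandatory dimension."""


def validate_on_ingest(record, derive, gap_counts):
    """Reject, back-fill, or count gaps for one telemetry record.

    derive(dimension, record) returns a value when the dimension can be
    inferred from the rest of the record, or None when it cannot.
    """
    missing = [d for d in MANDATORY if not record.get(d)]
    if missing:
        raise RejectedTelemetry("missing mandatory dimensions: " + ", ".join(missing))

    result = dict(record)  # leave the caller's record untouched
    for dim in BACKFILLABLE:
        if result.get(dim):
            continue
        value = derive(dim, result)
        if value is not None:
            result[dim] = value      # back-filled from context
        else:
            gap_counts[dim] += 1     # surfaced as a data-quality metric
    return result
```

A caller would pass a shared `Counter` as `gap_counts` and export it periodically, so that a rising gap count for, say, `moduleId` is visible on the same dashboards as any other platform metric.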