Observability

The DevOps / GitOps Platform is instrumented end to end so that every commit, build, release, and reconciliation is observable. Telemetry is emitted via OpenTelemetry and Serilog (ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry) and flows into the Observability & Feedback Platform.

Target Architecture — Final-State Design

All signals carry the required dimensions below so they can be correlated across services and with the broader factory lifecycle via a single traceId.

Required Dimensions

Every log line, metric, span, and event carries these dimensions:

| Dimension | Source | Purpose |
|---|---|---|
| traceId | Envelope / span context | End-to-end correlation from artifact to runtime |
| tenantId | Tenant context | Tenant-scoped filtering and isolation |
| projectId | Operation scope | Per-project views |
| moduleId | Operation scope | Per-module delivery health |
| environment | Release/promotion context | dev / test / staging / prod segmentation |
| correlationId | Envelope | Workflow/saga correlation |
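
As an illustration of the rule above, the following minimal sketch copies the required dimensions from an ambient context onto a telemetry record and rejects records that are still incomplete. The dimension names come from the table; the function names `missing_dimensions` and `enrich` and the dictionary-based record shape are assumptions made for this example, not the API of ConnectSoft.Extensions.Observability.

```python
# Dimension names taken from the Required Dimensions table.
REQUIRED_DIMENSIONS = (
    "traceId",
    "tenantId",
    "projectId",
    "moduleId",
    "environment",
    "correlationId",
)


def missing_dimensions(record: dict) -> list[str]:
    """Return the required dimensions that are absent or empty in a record."""
    return [name for name in REQUIRED_DIMENSIONS if not record.get(name)]


def enrich(record: dict, context: dict) -> dict:
    """Copy required dimensions from the context onto the record, then validate.

    Hypothetical helper: values from the context override values already
    present on the record.
    """
    dims = {name: context[name] for name in REQUIRED_DIMENSIONS if name in context}
    enriched = {**record, **dims}
    missing = missing_dimensions(enriched)
    if missing:
        raise ValueError(f"telemetry record is missing dimensions: {missing}")
    return enriched
```

Validating at emission time, rather than at query time, keeps incomplete signals from reaching the dashboards where they could not be correlated.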

Logs

  • Structured JSON logs via Serilog, enriched with the required dimensions and actor identity.
  • Pipeline and build logs are streamed to Blob and tailed live over gRPC; structured summaries are logged per stage.
  • Log levels are governed centrally; sensitive values are masked (see Security).
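
The platform itself logs through Serilog; the sketch below uses Python's standard `logging` module only to illustrate the shape of a structured JSON log line with masked values. The `SENSITIVE_KEYS` set, the `fields` attribute, and the `***` mask are assumptions for this example; the actual masking rules are the ones defined in Security.

```python
import json
import logging

# Hypothetical set of field names whose values must never be logged in clear text.
SENSITIVE_KEYS = {"password", "token", "secret", "apiKey"}


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object, masking sensitive fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields are attached by the caller via extra={"fields": {...}}.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "***" if key in SENSITIVE_KEYS else value
        return json.dumps(payload, sort_keys=True)
```

A caller would attach the required dimensions and any stage summary through the `extra` argument, for example `logger.info("stage completed", extra={"fields": {"traceId": "...", "stage": "build"}})`.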

Metrics

| Metric | Type | Dimensions | Use |
|---|---|---|---|
| devops.commits.total | counter | tenant, project, module | Commit throughput |
| devops.pullrequests.open | gauge | tenant, project | Review backlog |
| devops.pipeline.run.duration | histogram | tenant, module, status | Build performance |
| devops.build.success.rate | gauge | tenant, module | Build health |
| devops.release.lead.time | histogram | tenant, module, environment | Commit-to-deploy lead time (DORA) |
| devops.deployment.frequency | counter | tenant, environment | Deployment frequency (DORA) |
| devops.change.failure.rate | gauge | tenant, environment | Change failure rate (DORA) |
| devops.mttr | histogram | tenant, environment | Mean time to restore (DORA) |
| devops.gitops.drift | gauge | tenant, environment | Drift incidents |
| devops.iac.apply.duration | histogram | tenant, environment | Pulumi apply time |
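
To make two of the DORA metrics concrete, the sketch below derives commit-to-deploy lead time and change failure rate from a list of deployment records. The `Deployment` record and its field names are hypothetical, and the platform may aggregate these values differently (for example per tenant and environment, as the table indicates).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one deployment to an environment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool


def lead_times_seconds(deployments: list[Deployment]) -> list[float]:
    """Commit-to-deploy lead time of each deployment, in seconds."""
    return [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that failed; 0.0 when there are none."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def median_lead_time_seconds(deployments: list[Deployment]) -> float:
    """Median lead time in seconds; raises StatisticsError on an empty list."""
    return median(lead_times_seconds(deployments))
```

With three deployments whose lead times are one, two, and three hours, of which one failed, the median lead time is 7200 seconds and the change failure rate is 1/3.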

Traces

  • Distributed traces span the full delivery flow; each service and worker emits spans that are linked by traceId and correlationId.
  • gRPC streams (run logs, sync state) propagate context so live tails are correlated to their parent trace.
  • Traces tie back to the originating artifact and forward to runtime signals, enabling prompt-to-production trace stitching.
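
The source does not state the wire format used for context propagation. Assuming the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default, the sketch below shows how a downstream worker can start a child span that keeps the parent's trace ID. The function names are hypothetical, and only version `00` of the header is handled.

```python
import re
import secrets

# traceparent, version 00: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> tuple[str, str, str]:
    """Split a traceparent header into (trace_id, parent_span_id, flags)."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("traceparent contains an all-zero id")
    return trace_id, parent_id, flags


def child_traceparent(header: str) -> str:
    """Build the header for a child span: same trace ID, new random span ID."""
    trace_id, _parent_id, flags = parse_traceparent(header)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because the trace ID is carried unchanged from hop to hop, a live log tail opened over gRPC can be attached to the same trace as the pipeline run that produced it.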

Dashboards

  • Delivery Health — DORA metrics (lead time, deployment frequency, change failure rate, MTTR) per tenant/module/environment.
  • Pipeline Overview — run durations, success rates, queue times, flaky-test signals.
  • Release & Promotion — releases by state, approval latency, promotion success, rollback counts.
  • GitOps Status — sync state and drift per environment, reconciliation latency.
  • Infrastructure — Pulumi apply durations, resource change counts, drift detections.

Alerts

| Alert | Condition | Routing |
|---|---|---|
| Build success rate drop | devops.build.success.rate < threshold over window | DevOps Engineer agent + incident |
| Release stuck | ReleasePlan in PendingApproval/Promoting beyond SLA | Release Manager agent + Studio banner |
| Deployment failure | DeploymentPromoted failed / rollback executed | Deployment Orchestrator agent + incident |
| Persistent drift | GitOps sync Degraded beyond reconcile window | Environment Manager agent + incident |
| Dead-letter growth | DLQ depth rising on any topic | On-call + replay tooling |
| IaC apply failure | Pulumi apply error | Cloud Provisioner agent + incident |
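
The "threshold over window" condition in the first row can be read as a sliding-window check. The sketch below is one possible interpretation, assuming a window counted in builds rather than in time; the class name, the window semantics, and the example threshold are assumptions, not the platform's alert engine.

```python
from collections import deque


class SuccessRateAlert:
    """Fire when the success rate over the last `window` builds drops below `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.results: deque[bool] = deque(maxlen=window)

    def record(self, succeeded: bool) -> bool:
        """Record one build outcome and return True if the alert should fire."""
        self.results.append(succeeded)
        # Stay silent until the window is full, to avoid firing on sparse data.
        if len(self.results) < self.results.maxlen:
            return False
        rate = sum(self.results) / len(self.results)
        return rate < self.threshold


# Routing target taken from the alerts table.
ROUTING = {"build_success_rate_drop": ("DevOps Engineer agent", "incident")}
```

With a threshold of 0.8 and a window of five builds, the outcomes success, success, failure, success, failure fire the alert on the fifth build, when the rate over the window is 0.6.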

Feedback Loop

Pillar Alignment

  • Observability — comprehensive logs, metrics, and traces with mandatory correlation dimensions.
  • Traceability — traceId correlation spans artifact → commit → build → release → runtime.
  • Autonomy — alerts route to the responsible agent first, escalating to humans only when needed.