Observability
The DevOps / GitOps Platform is instrumented end to end so that every commit, build, release, and reconciliation is observable. Telemetry is emitted via OpenTelemetry and Serilog (ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry) and flows into the Observability & Feedback Platform.
Target Architecture — Final-State Design
All signals carry the required dimensions below so they can be correlated across services and with the broader factory lifecycle via a single traceId.
Required Dimensions
Every log line, metric, span, and event carries these dimensions:
| Dimension | Source | Purpose |
|---|---|---|
| `traceId` | Envelope / span context | End-to-end correlation from artifact to runtime |
| `tenantId` | Tenant context | Tenant-scoped filtering and isolation |
| `projectId` | Operation scope | Per-project views |
| `moduleId` | Operation scope | Per-module delivery health |
| `environment` | Release/promotion context | dev / test / staging / prod segmentation |
| `correlationId` | Envelope | Workflow/saga correlation |
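
One way to enforce the table above is to check a signal's attributes before it is emitted. The following is a minimal sketch in Python, chosen only for brevity since the platform's own libraries are .NET-based. `missing_dimensions` is a hypothetical helper and is not part of ConnectSoft.Extensions.Observability; the sample values are made up.

```python
from typing import List, Mapping

# Dimension names taken from the Required Dimensions table.
REQUIRED_DIMENSIONS = (
    "traceId",
    "tenantId",
    "projectId",
    "moduleId",
    "environment",
    "correlationId",
)


def missing_dimensions(attributes: Mapping[str, object]) -> List[str]:
    """Return the required dimensions that are absent or empty (hypothetical helper)."""
    return [
        name
        for name in REQUIRED_DIMENSIONS
        if attributes.get(name) in (None, "")
    ]


# Illustrative signal attributes; moduleId is deliberately left out.
signal = {
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "tenantId": "tenant-a",
    "projectId": "proj-1",
    "environment": "staging",
    "correlationId": "saga-42",
}
print(missing_dimensions(signal))  # ['moduleId']
```

A check like this could run in a test suite or in an enrichment step, so that an incomplete signal is caught before it reaches dashboards that filter on these dimensions.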
Logs
- Structured JSON logs via Serilog, enriched with the required dimensions and actor identity.
- Pipeline and build logs are streamed to Blob and tailed live over gRPC; structured summaries are logged per stage.
- Log levels are governed centrally; sensitive values are masked (see Security).
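
The enrichment and masking described above can be illustrated with a small formatter. This is a language-neutral sketch written in Python rather than Serilog. `SENSITIVE_KEYS`, `JsonFormatter`, and `format_event` are hypothetical names, and the set of sensitive keys is an assumption; the actual masking rules are defined in the Security section.

```python
import json
import logging

# Hypothetical set of attribute names treated as sensitive; the real list
# would come from the platform's security policy.
SENSITIVE_KEYS = {"password", "token", "secret", "connectionString"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, masking sensitive fields."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in fields.items():
            payload[key] = "***" if key in SENSITIVE_KEYS else value
        return json.dumps(payload, sort_keys=True)


def format_event(message: str, **fields: object) -> str:
    """Format one log event without emitting it, so the result can be inspected."""
    record = logging.LogRecord(
        name="devops", level=logging.INFO, pathname="", lineno=0,
        msg=message, args=(), exc_info=None,
    )
    record.fields = fields  # enrichment: dimensions and actor identity
    return JsonFormatter().format(record)


line = format_event(
    "stage completed",
    traceId="4bf92f3577b34da6a3ce929d0e0e4736",
    tenantId="tenant-a",
    actor="build-agent",
    token="abc123",
)
print(line)
```

The printed line is a single JSON object in which `token` appears as `***`, while the dimensions and the actor pass through unchanged.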
Metrics
| Metric | Type | Dimensions | Use |
|---|---|---|---|
| `devops.commits.total` | counter | tenant, project, module | Commit throughput |
| `devops.pullrequests.open` | gauge | tenant, project | Review backlog |
| `devops.pipeline.run.duration` | histogram | tenant, module, status | Build performance |
| `devops.build.success.rate` | gauge | tenant, module | Build health |
| `devops.release.lead.time` | histogram | tenant, module, environment | Commit-to-deploy lead time (DORA) |
| `devops.deployment.frequency` | counter | tenant, environment | Deployment frequency (DORA) |
| `devops.change.failure.rate` | gauge | tenant, environment | Change failure rate (DORA) |
| `devops.mttr` | histogram | tenant, environment | Mean time to restore (DORA) |
| `devops.gitops.drift` | gauge | tenant, environment | Drift incidents |
| `devops.iac.apply.duration` | histogram | tenant, environment | Pulumi apply time |
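
To make two of the DORA-related metrics concrete, the sketch below derives commit-to-deploy lead time and change failure rate from a list of deployment records. The `Deployment` record shape, the function names, and the sample data are assumptions for illustration; the source does not specify how the platform computes these values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record shape; field names are illustrative only."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool


def lead_time_seconds(deployments: Sequence[Deployment]) -> List[float]:
    """Commit-to-deploy lead time of each deployment, in seconds."""
    return [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Fraction of deployments that failed; 0.0 when there are none."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


# Made-up sample: four deployments of the same commit time, one failed.
base = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(base, base + timedelta(hours=2), failed=False),
    Deployment(base, base + timedelta(hours=4), failed=False),
    Deployment(base, base + timedelta(hours=6), failed=True),
    Deployment(base, base + timedelta(hours=8), failed=False),
]
print(median(lead_time_seconds(sample)) / 3600)  # median lead time in hours: 5.0
print(change_failure_rate(sample))               # 0.25
```

In a real pipeline the individual lead times would be recorded into the `devops.release.lead.time` histogram and the ratio published as the `devops.change.failure.rate` gauge, each tagged with the dimensions listed in the table.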
Traces
- Distributed traces span the full delivery flow; each service and worker is a span linked by `traceId`/`correlationId`.
- gRPC streams (run logs, sync state) propagate context so live tails are correlated to their parent trace.
- Traces tie back to the originating artifact and forward to runtime signals, enabling prompt-to-production trace stitching.
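
Context propagation over a stream can be sketched as injecting a trace header into the call metadata on the sending side and extracting it on the receiving side. The example assumes the W3C Trace Context `traceparent` format, which is OpenTelemetry's default propagator; the source does not state which propagator the platform uses, and in practice the OpenTelemetry SDK performs this step automatically. The helper names are hypothetical.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random span id: 8 bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


def inject(trace_id: str, span_id: str, metadata: Dict[str, str]) -> None:
    """Write the current trace context into outgoing call metadata."""
    metadata["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def extract(metadata: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read (trace_id, parent_span_id) from incoming metadata, or None if invalid."""
    match = _TRACEPARENT.match(metadata.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id = match.group(1), match.group(2)
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid in the W3C format
    return trace_id, parent_id


# Sender: a pipeline run starts a live log tail and passes its context along.
trace_id = secrets.token_hex(16)
run_span = new_span_id()
metadata: Dict[str, str] = {}
inject(trace_id, run_span, metadata)

# Receiver: the log-tail service continues the same trace with a child span.
received = extract(metadata)
tail_span = new_span_id()
print(received == (trace_id, run_span))  # True
```

Because the receiver reuses the extracted trace id and records the sender's span as its parent, the live tail appears under the pipeline run in the same trace.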
Dashboards
- Delivery Health — DORA metrics (lead time, deployment frequency, change failure rate, MTTR) per tenant/module/environment.
- Pipeline Overview — run durations, success rates, queue times, flaky-test signals.
- Release & Promotion — releases by state, approval latency, promotion success, rollback counts.
- GitOps Status — sync state and drift per environment, reconciliation latency.
- Infrastructure — Pulumi apply durations, resource change counts, drift detections.
Alerts
| Alert | Condition | Routing |
|---|---|---|
| Build success rate drop | `devops.build.success.rate` < threshold over window | DevOps Engineer agent + incident |
| Release stuck | `ReleasePlan` in `PendingApproval`/`Promoting` beyond SLA | Release Manager agent + Studio banner |
| Deployment failure | `DeploymentPromoted` failed / rollback executed | Deployment Orchestrator agent + incident |
| Persistent drift | GitOps sync `Degraded` beyond reconcile window | Environment Manager agent + incident |
| Dead-letter growth | DLQ depth rising on any topic | On-call + replay tooling |
| IaC apply failure | Pulumi apply error | Cloud Provisioner agent + incident |
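
The "Build success rate drop" row can be read as a sliding-window rule. The sketch below is one possible evaluation, assuming a window counted in builds and a placeholder threshold of 0.8; neither value is given in the source. `BuildSuccessRateAlert` and `ROUTES` are hypothetical names, and only the routing targets come from the table.

```python
from collections import deque

# Routing targets copied from the Alerts table; the dictionary itself is illustrative.
ROUTES = {
    "Build success rate drop": ["DevOps Engineer agent", "incident"],
}


class BuildSuccessRateAlert:
    """Fire when the success rate over the last `window` builds drops below `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.results = deque(maxlen=window)  # oldest result drops out automatically

    def record(self, succeeded: bool) -> bool:
        """Record one build outcome; return True when the alert should fire."""
        self.results.append(succeeded)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data for a full window yet
        rate = sum(self.results) / len(self.results)
        return rate < self.threshold


# Placeholder threshold and window; real values would come from alert configuration.
alert = BuildSuccessRateAlert(threshold=0.8, window=5)
outcomes = [True, True, True, True, False, False]
fired = [alert.record(outcome) for outcome in outcomes]
print(fired)  # [False, False, False, False, False, True]
if fired[-1]:
    print("route to:", ROUTES["Build success rate drop"])
```

The fifth build leaves the rate at exactly 0.8, which does not breach the threshold; the sixth pushes the window to 0.6 and triggers routing to the agent first, in line with the autonomy pillar.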
Feedback Loop
- All telemetry feeds the Observability & Feedback Platform, which closes the loop back to agents and the Knowledge Platform — for example, recurring build failures inform future generation.
Pillar Alignment
- Observability — comprehensive logs, metrics, and traces with mandatory correlation dimensions.
- Traceability — `traceId` correlation spans artifact → commit → build → release → runtime.
- Autonomy — alerts route to the responsible agent first, escalating to humans only when needed.