Observability

Target Architecture — Final-State Design

Observability is built on Serilog and OpenTelemetry, exporting to Application Insights, the same stack the factory uses everywhere. The Runtime & Cloud Platform plays two roles: it is itself observed (through telemetry from its own services), and it is the conduit through which generated runtimes become observable to the rest of the factory.

The platform's job is to make running SaaS legible. It instruments its own services and standardizes telemetry from every generated workload, then forwards the operationally significant signals to the Observability & Feedback Platform where they drive SLOs, incidents, and the factory's learning loop.

Logs

  • Structured logging via Serilog with the canonical envelope context (tenantId, projectId, moduleId, traceId, correlationId) bound to every log entry.
  • Platform service logs and generated-workload logs flow to Application Insights; deployment step logs are also persisted to Azure Blob for long-term audit.
  • Logs are tenant-scoped and queryable by traceId for end-to-end correlation.
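
As a rough illustration of what binding the envelope to every entry means, the sketch below uses Python's standard `logging` module rather than Serilog, purely to keep the example short and self-contained. The formatter, the `bind_envelope` helper, the logger name, and all field values are hypothetical; only the five envelope field names come from the text above.

```python
import io
import json
import logging

# Envelope fields named in this section; every entry should carry all five.
ENVELOPE_FIELDS = ("tenantId", "projectId", "moduleId", "traceId", "correlationId")


class EnvelopeJsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the envelope."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for name in ENVELOPE_FIELDS:
            # LoggerAdapter's `extra` values arrive as attributes on the record.
            entry[name] = getattr(record, name, None)
        return json.dumps(entry)


def bind_envelope(logger: logging.Logger, **envelope: str) -> logging.LoggerAdapter:
    """Return a logger that stamps the given envelope on every entry (hypothetical helper)."""
    missing = [name for name in ENVELOPE_FIELDS if name not in envelope]
    if missing:
        raise ValueError(f"missing envelope fields: {missing}")
    return logging.LoggerAdapter(logger, envelope)


# Usage: capture output in memory so the example has no external dependencies.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(EnvelopeJsonFormatter())
base_logger = logging.getLogger("runtime.deployments")  # illustrative name
base_logger.addHandler(handler)
base_logger.setLevel(logging.INFO)
base_logger.propagate = False

log = bind_envelope(
    base_logger,
    tenantId="tenant-a",
    projectId="project-1",
    moduleId="module-7",
    traceId="trace-123",
    correlationId="corr-456",
)
log.info("Deployment requested")
print(stream.getvalue(), end="")
```

Because the envelope is attached once, at binding time, individual log calls cannot forget it, which is what makes querying by `traceId` reliable.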

Metrics

  • OpenTelemetry metrics exported to Application Insights, covering both platform operations and generated-runtime SLOs.

| Metric | Description |
| --- | --- |
| runtime.deployment.duration | Time from Requested to Completed/RolledBack |
| runtime.deployment.rollback.rate | Fraction of deployments that roll back |
| runtime.health.status | Healthy/Degraded/Unhealthy per service |
| runtime.scaling.replicas | Current vs desired replicas per service |
| runtime.scaling.actions | Scale-up/down actions per policy |
| runtime.drift.open | Open drift findings by severity |
| runtime.drift.remediation.time | Time to remediate a finding |
| runtime.secret.rotation.age | Age since last rotation per binding |
| Generated SaaS SLIs | Latency, throughput, error rate, saturation per generated component |
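
To make two of these definitions concrete, here is a minimal sketch of how `runtime.deployment.duration` and `runtime.deployment.rollback.rate` could be derived from deployment history. The `DeploymentRecord` shape and the sample data are assumptions made for illustration; the platform's actual record schema is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class DeploymentRecord:
    """Hypothetical record of one finished deployment."""
    requested_at: datetime
    finished_at: datetime
    final_state: str  # "Completed" or "RolledBack"


def deployment_duration_seconds(record: DeploymentRecord) -> float:
    """runtime.deployment.duration: time from Requested to the final state."""
    return (record.finished_at - record.requested_at).total_seconds()


def rollback_rate(records: Sequence[DeploymentRecord]) -> float:
    """runtime.deployment.rollback.rate: fraction of deployments that rolled back."""
    if not records:
        return 0.0  # assumption: report zero rather than fail on an empty window
    rolled_back = sum(1 for r in records if r.final_state == "RolledBack")
    return rolled_back / len(records)


# Usage with made-up data: four deployments, one of which rolled back.
start = datetime(2024, 1, 1, 12, 0, 0)
history = [
    DeploymentRecord(start, start + timedelta(seconds=90), "Completed"),
    DeploymentRecord(start, start + timedelta(seconds=120), "Completed"),
    DeploymentRecord(start, start + timedelta(seconds=300), "RolledBack"),
    DeploymentRecord(start, start + timedelta(seconds=75), "Completed"),
]
print(rollback_rate(history))                    # 0.25
print(deployment_duration_seconds(history[0]))   # 90.0
```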

Traces

  • Distributed tracing via OpenTelemetry; traceId propagates from the triggering command through every service, worker, and emitted event.
  • A single trace spans provision → configure → deploy → health gate → promote → scale → drift, linking back through the DevOps delivery trace to the originating business intent.
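
The propagation rule above can be sketched without any tracing library: a trace identifier is created when the triggering command arrives, and everything emitted while that command is being handled inherits it. This is a simplified stand-in for OpenTelemetry context propagation, not its API; `StageEvent`, `emit`, and `run_pipeline` are hypothetical names, and the stage list mirrors the lifecycle named in this section.

```python
import contextvars
import uuid
from dataclasses import dataclass
from typing import List, Optional

# Holds the trace identifier of the command currently being handled.
_current_trace_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_trace_id", default=None
)

LIFECYCLE_STAGES = (
    "provision", "configure", "deploy", "health gate", "promote", "scale", "drift",
)


@dataclass(frozen=True)
class StageEvent:
    """Hypothetical event emitted when a lifecycle stage finishes."""
    stage: str
    trace_id: str


def emit(stage: str) -> StageEvent:
    """Emit an event stamped with the active trace identifier."""
    trace_id = _current_trace_id.get()
    if trace_id is None:
        raise RuntimeError("no active trace: events must be emitted inside a command")
    return StageEvent(stage, trace_id)


def run_pipeline() -> List[StageEvent]:
    """Handle one triggering command; every stage event shares its trace id."""
    token = _current_trace_id.set(uuid.uuid4().hex)  # 32 hex characters
    try:
        return [emit(stage) for stage in LIFECYCLE_STAGES]
    finally:
        # Restore the previous context so trace ids never leak between commands.
        _current_trace_id.reset(token)


events = run_pipeline()
print(len({event.trace_id for event in events}))  # 1
```

In a real deployment the identifier would also cross process boundaries in message or request headers; the sketch only covers the in-process part of the idea.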

Dashboards

  • Runtime Center tiles (see UI) render live service inventory, health, deployments, scaling, and drift.
  • Application Insights workbooks provide per-environment and per-tenant operational dashboards.
  • Cross-platform SLO dashboards live in Observability & Feedback, fed by this platform's signals.

Alerts

| Alert | Condition | Routing |
| --- | --- | --- |
| Deployment failure | RuntimeDeploymentRolledBack or Failed state | On-call + Runtime Center incident |
| Service unhealthy | HealthCheckCompleted = Unhealthy beyond threshold | On-call + Observability incident |
| Scaling violation | ScalingPolicyViolated (can't meet target) | On-call + capacity review |
| High-severity drift | RuntimeDriftDetected severity = High in prod | Governance + on-call |
| Secret rotation overdue | runtime.secret.rotation.age exceeds policy | Security + Governance |
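
One way to read the alert table is as a classification step followed by a routing lookup. The sketch below encodes it that way. The signal dictionary shape (`type`, `deploymentState`, `status`, `consecutiveUnhealthy`, `severity`, `stage`, `value`) and both default thresholds are assumptions chosen for illustration; the source only says "beyond threshold" and "exceeds policy" without giving values.

```python
from typing import Dict, Optional, Tuple

# Routing targets copied from the alert table.
ROUTING: Dict[str, Tuple[str, ...]] = {
    "Deployment failure": ("On-call", "Runtime Center incident"),
    "Service unhealthy": ("On-call", "Observability incident"),
    "Scaling violation": ("On-call", "capacity review"),
    "High-severity drift": ("Governance", "on-call"),
    "Secret rotation overdue": ("Security", "Governance"),
}


def classify(
    signal: dict,
    unhealthy_threshold: int = 3,          # assumed: consecutive unhealthy checks
    max_rotation_age_days: float = 90.0,   # assumed rotation policy
) -> Optional[str]:
    """Map a signal to an alert name from the table, or None if no alert applies."""
    kind = signal.get("type")
    if kind == "RuntimeDeploymentRolledBack" or signal.get("deploymentState") == "Failed":
        return "Deployment failure"
    if kind == "HealthCheckCompleted" and signal.get("status") == "Unhealthy":
        if signal.get("consecutiveUnhealthy", 0) > unhealthy_threshold:
            return "Service unhealthy"
        return None
    if kind == "ScalingPolicyViolated":
        return "Scaling violation"
    if kind == "RuntimeDriftDetected":
        if signal.get("severity") == "High" and signal.get("stage") == "prod":
            return "High-severity drift"
        return None
    if kind == "runtime.secret.rotation.age":
        if signal.get("value", 0) > max_rotation_age_days:
            return "Secret rotation overdue"
    return None


def route(signal: dict) -> Tuple[str, ...]:
    """Return the routing targets for a signal; empty when no alert fires."""
    alert = classify(signal)
    return () if alert is None else ROUTING[alert]


print(route({"type": "RuntimeDriftDetected", "severity": "High", "stage": "prod"}))
```

Keeping classification separate from routing means thresholds can change with policy while the routing table stays a direct copy of the documented one.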

Required Dimensions

Every log, metric, trace, and alert carries the cross-cutting dimensions so signals are filterable and correlatable:

  • tenantId, projectId, moduleId, environmentId, serviceId
  • traceId, correlationId
  • stage (dev/test/staging/prod), region, computeTarget
  • version (image/deployment), deploymentId
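
A requirement like this is easiest to enforce at the point where signals enter the pipeline. The following is a minimal validation sketch, assuming signals are represented as flat dictionaries; the `dimension_problems` helper and the sample values are hypothetical, while the dimension names and stage values are taken from the list above.

```python
from typing import List, Mapping

# Dimension names from the list above, in the same grouping.
REQUIRED_DIMENSIONS = (
    "tenantId", "projectId", "moduleId", "environmentId", "serviceId",
    "traceId", "correlationId",
    "stage", "region", "computeTarget",
    "version", "deploymentId",
)
ALLOWED_STAGES = ("dev", "test", "staging", "prod")


def dimension_problems(signal: Mapping[str, object]) -> List[str]:
    """List missing or invalid dimensions; an empty list means the signal is valid."""
    problems = [
        f"missing: {name}"
        for name in REQUIRED_DIMENSIONS
        if signal.get(name) in (None, "")
    ]
    stage = signal.get("stage")
    if stage not in (None, "") and stage not in ALLOWED_STAGES:
        problems.append(f"invalid stage: {stage}")
    return problems


# Usage with illustrative values only.
complete_signal = {
    "tenantId": "tenant-a", "projectId": "project-1", "moduleId": "module-7",
    "environmentId": "env-3", "serviceId": "svc-orders",
    "traceId": "trace-123", "correlationId": "corr-456",
    "stage": "prod", "region": "region-1", "computeTarget": "target-1",
    "version": "1.4.2", "deploymentId": "dep-42",
}
print(dimension_problems(complete_signal))  # []

broken_signal = {**complete_signal, "region": "", "stage": "qa"}
print(dimension_problems(broken_signal))    # ['missing: region', 'invalid stage: qa']
```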

Feeding Observability & Feedback

```mermaid
flowchart LR
    Gen["Generated SaaS Runtimes"] -->|"logs, metrics, traces"| RC["Runtime & Cloud Platform"]
    RC -->|"HealthCheckCompleted, RuntimeDriftDetected, ScalingPolicyApplied, RuntimeDeploymentCompleted"| Obs["Observability & Feedback"]
    Obs -->|"SLOs, alert rules, scaling targets"| RC
    Obs -->|"runtime signals + incidents"| Knowledge["Knowledge Platform"]
```

Generated runtimes emit standardized telemetry because they are produced from factory templates that wire in ConnectSoft.Extensions.Observability / Extensions.Telemetry / Extensions.Logging.Serilog and ConnectSoft.Extensions.Diagnostics.HealthChecks. The Runtime & Cloud Platform normalizes the operationally significant events and forwards them as integration events. This closes the loop: runtime behaviour becomes feedback that the factory attributes back to the context and decisions that produced each component.
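
The normalize-and-forward step can be pictured as a filter plus a reshaping function. The sketch below is one possible reading, not the platform's implementation: the integration event layout (`eventType`, `context`, `payload`), the raw event shape, and the `LogLineWritten` example type are all assumptions. The four forwarded event names are the ones shown on the diagram edge from the Runtime & Cloud Platform to Observability & Feedback.

```python
from typing import Dict, Iterable, List, Optional

# Event types forwarded to Observability & Feedback, per the diagram.
FORWARDED_EVENT_TYPES = frozenset({
    "HealthCheckCompleted",
    "RuntimeDriftDetected",
    "ScalingPolicyApplied",
    "RuntimeDeploymentCompleted",
})

# Keys lifted into a common context block so every forwarded event correlates.
CONTEXT_KEYS = ("tenantId", "projectId", "moduleId", "traceId", "correlationId")


def to_integration_event(raw: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Normalize one raw runtime event, or return None if it is not forwarded."""
    event_type = raw.get("type")
    if event_type not in FORWARDED_EVENT_TYPES:
        return None
    context = {key: raw.get(key) for key in CONTEXT_KEYS}
    payload = {
        key: value
        for key, value in raw.items()
        if key != "type" and key not in CONTEXT_KEYS
    }
    return {"eventType": event_type, "context": context, "payload": payload}


def forward(raw_events: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only operationally significant events, in normalized form."""
    normalized = (to_integration_event(raw) for raw in raw_events)
    return [event for event in normalized if event is not None]


# Usage with illustrative events: the second type is hypothetical and is dropped.
raw_stream = [
    {"type": "HealthCheckCompleted", "tenantId": "tenant-a", "traceId": "trace-123",
     "status": "Healthy"},
    {"type": "LogLineWritten", "tenantId": "tenant-a", "traceId": "trace-123"},
]
for event in forward(raw_stream):
    print(event["eventType"], event["payload"])
```

Carrying the correlation context on every forwarded event is what lets downstream consumers attribute runtime behaviour back to the component and decisions that produced it.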