Observability

Target Architecture — Final-State Design

This page describes the final-state observability of the Governance, Security & Compliance Platform. Telemetry uses Serilog and OpenTelemetry, exported to Application Insights via ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry, and ConnectSoft.Extensions.Logging.Serilog, and the platform's own canonical events feed the Observability & Feedback Platform. Observability is distinct from audit: audit is the legal record of trust decisions, while observability is the operational view of platform health.

Telemetry Pillars

flowchart LR
    Services["Governance services + workers"] --> Logs["Structured logs (Serilog)"]
    Services --> Metrics["Metrics (OpenTelemetry)"]
    Services --> Traces["Distributed traces (OpenTelemetry)"]
    Logs --> AppInsights["Application Insights"]
    Metrics --> AppInsights
    Traces --> AppInsights
    Services -->|canonical events| ObsPlatform["Observability & Feedback Platform"]
    AppInsights --> Dashboards["Dashboards"]
    AppInsights --> Alerts["Alerts"]

Logs

  • Structured & contextual — Serilog emits structured JSON with the Metadata Schema fields enriched into every log entry (tenantId, traceId, correlationId, projectId, moduleId).
  • Decision logs — each PolicyDecision is logged with the matched policy/version, effect, and risk at Information; denies and gates are logged at Warning.
  • Security logs — finding ingestion, secret leak detection, and rotation events are logged with their severity.
  • No sensitive payloads — logs carry identifiers and digests, never secret values or restricted-classified content (redaction enforced by classification).
  • Separation from audit — operational logs are mutable/expiring; the immutable record is the audit trail.
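
The enrichment and redaction rules above can be illustrated with a minimal, language-neutral sketch. It uses Python's standard `logging` module as a stand-in for Serilog; it is not the platform's implementation. The class `JsonContextFormatter`, the helper names, and the `SENSITIVE_KEYS` set are hypothetical, and the real platform is described as redacting by classification rather than by property name.

```python
import json
import logging

# Metadata Schema fields named in the text above.
CONTEXT_FIELDS = ("tenantId", "traceId", "correlationId", "projectId", "moduleId")
# Hypothetical property names treated as sensitive for this sketch only.
SENSITIVE_KEYS = {"secretValue", "token", "password"}


def redact(props: dict) -> dict:
    """Replace the values of sensitive properties with a fixed marker."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in props.items()}


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON object with the context fields attached."""

    def __init__(self, context: dict):
        super().__init__()
        # A missing context field raises KeyError here, so gaps surface early.
        self.context = {k: context[k] for k in CONTEXT_FIELDS}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            **self.context,
            **redact(getattr(record, "props", {})),
        }
        return json.dumps(entry, sort_keys=True)


def decision_level(effect: str) -> int:
    """Allow maps to INFO (Information); deny and gate map to WARNING."""
    return logging.INFO if effect == "allow" else logging.WARNING
```

A caller would attach decision properties to the record (here through a hypothetical `props` attribute) and let the formatter add the tenant and trace context, so individual log statements never have to repeat those fields.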

Metrics

| Metric | Type | Purpose |
| --- | --- | --- |
| governance.policy.evaluations | counter | PDP call volume (by decision: allow/deny/gate). |
| governance.policy.evaluation.latency | histogram | Inline PDP latency (p50/p95/p99); guards the request-path budget. |
| governance.approval.pending | gauge | Open approval requests (by tier/age). |
| governance.approval.timeouts | counter | Auto-rejections/escalations from timeout. |
| governance.approval.decision.latency | histogram | Time from request to human decision. |
| governance.audit.entries | counter | Audit write throughput. |
| governance.audit.export.lag | gauge | Backlog between recorded and exported entries. |
| governance.security.findings.open | gauge | Open findings by severity. |
| governance.risk.score | histogram | Distribution of computed risk scores by band. |
| governance.classification.events | counter | Classifications assigned/changed. |
| governance.compliance.reports | counter | Reports generated (by framework/status). |
| governance.worker.dlq | gauge | Dead-lettered messages per worker. |
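
To make the counter and histogram rows concrete, here is a small in-memory sketch of how evaluation volume by decision and the p50/p95/p99 latency figures relate. It uses only the Python standard library rather than an OpenTelemetry SDK, and the class `PdpMetrics` is hypothetical. A real histogram instrument aggregates into buckets, so exported percentiles are approximations of what this exact computation returns.

```python
from collections import Counter
from statistics import quantiles


class PdpMetrics:
    """In-memory stand-in for two of the PDP metrics listed above."""

    def __init__(self) -> None:
        # Stands in for governance.policy.evaluations, keyed by decision.
        self.evaluations: Counter = Counter()
        # Stands in for governance.policy.evaluation.latency (raw samples, ms).
        self.latencies_ms: list = []

    def record(self, decision: str, latency_ms: float) -> None:
        """Count one evaluation and keep its latency sample."""
        self.evaluations[decision] += 1
        self.latencies_ms.append(latency_ms)

    def percentiles(self) -> dict:
        """Return p50/p95/p99 over all samples (needs at least two samples)."""
        # n=100 yields 99 cut points; index i-1 is the i-th percentile.
        cuts = quantiles(self.latencies_ms, n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Keeping the decision as a dimension on a single counter, rather than one counter per decision, is what lets a dashboard show the allow/deny/gate mix from the same series.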

Traces

  • End-to-end spans — the inline PDP call, supplier gRPC calls (isolation/classification/risk), approval lifecycle, and audit writes are all spans on the originating action's traceId, so a deployment promotion can be traced from request → evaluation → approval → audit.
  • Cross-platform stitching — traceId/correlationId propagate from the Control Plane / DevOps & GitOps caller into governance spans and out via emitted events, giving a single thread across platforms.
  • Worker spans — each worker creates a span linked to the consumed event's eventId for replay debugging.
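
The propagation described above can be sketched as follows. This assumes the W3C Trace Context `traceparent` header format, which OpenTelemetry uses by default; the source does not state which propagation format the platform uses. All function names and the event field names are hypothetical, and a real service would rely on the OpenTelemetry SDK's propagators instead of hand-rolled parsing.

```python
import re
import secrets
from dataclasses import dataclass

# traceparent, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # shared by every span of the originating action
    span_id: str   # unique per span


def parse_traceparent(header: str) -> SpanContext:
    """Extract the caller's trace and span ids from an incoming header."""
    match = _TRACEPARENT.fullmatch(header)
    if match is None or match.group(1) == "0" * 32 or match.group(2) == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return SpanContext(trace_id=match.group(1), span_id=match.group(2))


def child_span(parent: SpanContext) -> SpanContext:
    """Start a new span on the same trace as its parent."""
    return SpanContext(trace_id=parent.trace_id, span_id=secrets.token_hex(8))


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize a span context for outgoing calls or events (sampled flag set)."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-01"


def emit_event(ctx: SpanContext, event_id: str, correlation_id: str) -> dict:
    """Carry the trace out of the service inside an emitted event."""
    return {
        "eventId": event_id,
        "traceId": ctx.trace_id,
        "correlationId": correlation_id,
        "traceparent": to_traceparent(ctx),
    }


def worker_span(event: dict) -> dict:
    """Worker side: continue the trace and link the span to the consumed event."""
    span = child_span(parse_traceparent(event["traceparent"]))
    return {"traceId": span.trace_id, "spanId": span.span_id, "linkedEventId": event["eventId"]}
```

Because every hop reuses the incoming trace id and only mints a new span id, the request, evaluation, approval, and audit spans all resolve to one trace when queried.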

Dashboards

| Dashboard | Audience | Shows |
| --- | --- | --- |
| PDP Health | Platform operators | Evaluation volume, latency percentiles, allow/deny/gate mix, error rate. |
| Approval Operations | Compliance/release managers | Pending queue depth and age, timeout/escalation rate, decision latency. |
| Security Posture | Security engineers | Open findings by severity, leaked-secret rate, risk band distribution. |
| Audit & Compliance | Compliance officers | Audit throughput, export lag, chain-integrity status, reports generated. |
| Worker Health | Platform operators | Throughput, retry/DLQ counts, lag per worker. |

Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| PDP latency breach | p99 evaluation latency above budget for 5 min | High |
| PDP error spike | Evaluation error rate above threshold | High |
| Approval queue aging | Open requests exceeding SLA age | Medium |
| Excessive timeouts | Timeout/escalation rate spike | Medium |
| Critical finding | Any Critical SecurityFinding raised | High |
| Leaked secret | SecurityFindingRaised (leak category) | Critical |
| Audit export lag | Export lag above threshold | High |
| Chain integrity | Hash-chain verification failure | Critical |
| Worker DLQ growth | DLQ depth rising for any worker | Medium |
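
A condition such as "above budget for 5 min" is a sustained-breach check rather than a single-sample threshold. The sketch below shows one way to evaluate it over periodic p99 samples. The function name, the sample shape, and the example budget are assumptions; in practice the rule would be configured in the alerting backend, not hand-coded.

```python
def sustained_breach(samples, budget_ms, window_s=300.0):
    """Return True if the p99 stayed above budget for the whole latest window.

    samples: list of (timestamp_seconds, p99_ms) tuples sorted by time, oldest first.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample while the breach is unbroken.
    for ts, p99_ms in reversed(samples):
        if p99_ms <= budget_ms:
            break
        run_start = ts
    # Fire only when the unbroken run covers at least the full window.
    return run_start is not None and latest_ts - run_start >= window_s
```

With one sample per minute and a hypothetical 100 ms budget, six consecutive samples above 100 ms spanning 300 seconds would fire the alert, while a single sample at or below budget inside that span resets the run and suppresses it.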

Required Dimensions

Every metric, log, and trace carries — at minimum — the following dimensions so signals can be sliced and correlated, per the Metadata Schema:

tenantId, traceId, correlationId, projectId, moduleId, service, environment, plus domain dimensions: governanceDomain (one of the ten), decision, severity, riskBand, framework, and worker.
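
One way to keep this contract honest is to check signals for missing dimensions before they are exported. The sketch below is illustrative only. It assumes the seven common dimensions are mandatory on every signal and treats the domain dimensions as applicable only where relevant, which is an interpretation of the list above rather than something the source states. The function names are hypothetical.

```python
# Common dimensions listed above; assumed mandatory on every signal.
REQUIRED_DIMENSIONS = (
    "tenantId", "traceId", "correlationId", "projectId",
    "moduleId", "service", "environment",
)
# Domain dimensions listed above; assumed to apply only where relevant.
DOMAIN_DIMENSIONS = (
    "governanceDomain", "decision", "severity", "riskBand", "framework", "worker",
)


def missing_dimensions(signal: dict) -> list:
    """Return the required dimensions that are absent or empty on a signal."""
    return [name for name in REQUIRED_DIMENSIONS if not signal.get(name)]


def unknown_dimensions(signal: dict) -> list:
    """Return dimension names that are in neither list, to catch typos."""
    known = set(REQUIRED_DIMENSIONS) | set(DOMAIN_DIMENSIONS)
    return sorted(name for name in signal if name not in known)
```

Running such a check in tests or at service startup catches an unenriched signal before it reaches dashboards, where a missing tenantId would silently drop it from per-tenant views.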

How Observability Ties Back

  • Traceability — shared traceId makes any governance decision queryable end to end.
  • Autonomy — PDP latency and approval-queue metrics confirm the factory keeps moving autonomously and surfaces only true exceptions to humans.
  • Governance & compliance — export lag and chain-integrity alerts surface gaps or tampering in the evidence record as soon as they occur, so the record can be kept complete and provable.
  • Multi-tenant scale — every signal is tenant-dimensioned, so per-tenant health and load are visible.