Observability

Target Architecture — Final-State Design

This page describes the final-state observability design of the Governance, Security & Compliance Platform. Telemetry is produced with Serilog and OpenTelemetry and exported to Application Insights through ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry, and ConnectSoft.Extensions.Logging.Serilog; the platform's own canonical events feed the Observability & Feedback Platform. Observability is distinct from audit: audit is the legal record of trust decisions, while observability is the operational view of platform health.

Telemetry Pillars

flowchart LR
    Services["Governance services + workers"] --> Logs["Structured logs (Serilog)"]
    Services --> Metrics["Metrics (OpenTelemetry)"]
    Services --> Traces["Distributed traces (OpenTelemetry)"]
    Logs --> AppInsights["Application Insights"]
    Metrics --> AppInsights
    Traces --> AppInsights
    Services -->|canonical events| ObsPlatform["Observability & Feedback Platform"]
    AppInsights --> Dashboards["Dashboards"]
    AppInsights --> Alerts["Alerts"]

Logs

Structured & contextual — Serilog emits structured JSON with the Metadata Schema fields enriched into every log entry (tenantId, traceId, correlationId, projectId, moduleId).
Decision logs — each PolicyDecision logs the matched policy/version, effect, and risk at Information; denies and gates at Warning.
Security logs — finding ingestion, secret leak detection, and rotation events log with severity.
No sensitive payloads — logs carry identifiers and digests, never secret values or restricted-classified content (redaction enforced by classification).
Separation from audit — operational logs are mutable/expiring; the immutable record is the audit trail.
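
The logging rules above can be illustrated with a minimal sketch. It uses Python's standard `logging` module as a stand-in for Serilog, so it is not the platform's implementation. The `SENSITIVE_KEYS` set, the `props` attribute, and the `[REDACTED]` marker are assumptions made for the example; the platform is described as redacting by classification, which this key-based check only approximates.

```python
import json
import logging

# Context fields taken from the Metadata Schema list above.
CONTEXT_FIELDS = ("tenantId", "traceId", "correlationId", "projectId", "moduleId")
# Hypothetical keys treated as sensitive in this sketch only.
SENSITIVE_KEYS = {"secretValue", "payload"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the context fields attached."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in CONTEXT_FIELDS:
            # Missing context is emitted as null rather than silently dropped.
            entry[field] = getattr(record, field, None)
        for key, value in getattr(record, "props", {}).items():
            # Only identifiers and digests should reach the log.
            entry[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
        return json.dumps(entry, sort_keys=True)


def decision_level(effect: str) -> int:
    """Allow decisions log at Information; denies and gates log at Warning."""
    return logging.INFO if effect == "allow" else logging.WARNING


logger = logging.getLogger("governance.pdp")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.log(
    decision_level("deny"),
    "PolicyDecision evaluated",
    extra={
        "tenantId": "tenant-a",
        "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "props": {"policyVersion": "3", "effect": "deny", "secretValue": "hunter2"},
    },
)
```

The deny decision is written at Warning with the tenant and trace identifiers attached, and the secret value never appears in the output.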

Metrics

Metric	Type	Purpose
`governance.policy.evaluations`	counter	PDP call volume (by decision: allow/deny/gate).
`governance.policy.evaluation.latency`	histogram	Inline PDP latency (p50/p95/p99) — guards the request-path budget.
`governance.approval.pending`	gauge	Open approval requests (by tier/age).
`governance.approval.timeouts`	counter	Auto-rejections/escalations from timeout.
`governance.approval.decision.latency`	histogram	Time from request to human decision.
`governance.audit.entries`	counter	Audit write throughput.
`governance.audit.export.lag`	gauge	Backlog between recorded and exported entries.
`governance.security.findings.open`	gauge	Open findings by severity.
`governance.risk.score`	histogram	Distribution of computed risk scores by band.
`governance.classification.events`	counter	Classifications assigned/changed.
`governance.compliance.reports`	counter	Reports generated (by framework/status).
`governance.worker.dlq`	gauge	Dead-lettered messages per worker.
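
To make the counter and histogram rows concrete, the sketch below records PDP evaluations in memory and derives the p50/p95/p99 figures that the latency histogram is meant to expose. It is an illustration under stated assumptions: the in-memory `Counter` and list stand in for OpenTelemetry instruments, `record_evaluation` is a hypothetical helper, and the nearest-rank percentile method is one common choice, not necessarily what the metrics backend uses.

```python
import math
from collections import Counter

evaluations = Counter()  # stands in for governance.policy.evaluations
latencies_ms = []        # stands in for governance.policy.evaluation.latency


def record_evaluation(decision: str, latency_ms: float) -> None:
    """Count the evaluation by decision and keep its latency sample."""
    evaluations[decision] += 1
    latencies_ms.append(latency_ms)


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Simulated traffic: 100 evaluations with latencies of 1..100 ms.
for i in range(1, 101):
    record_evaluation("allow" if i % 10 else "deny", float(i))

summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With this synthetic data, `summary` maps 50, 95, and 99 to 50.0, 95.0, and 99.0 ms, and `evaluations` holds 90 allow and 10 deny decisions.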

Traces

End-to-end spans — the inline PDP call, supplier gRPC calls (isolation/classification/risk), approval lifecycle, and audit writes are all spans on the originating action's traceId, so a deployment promotion can be traced from request → evaluation → approval → audit.
Cross-platform stitching — traceId/correlationId propagate from the Control Plane / DevOps & GitOps caller into governance spans and out via emitted events, giving a single thread across platforms.
Worker spans — each worker creates a span linked to the consumed event's eventId for replay debugging.
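
The stitching described above can be sketched as follows, assuming the caller sends a W3C Trace Context `traceparent` header (the default propagation format in OpenTelemetry). The helper names `child_traceparent` and `stamp_event` and the event field names are hypothetical; in practice the OpenTelemetry SDK handles this propagation.

```python
import re
import secrets
from typing import Optional

# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: Optional[str]) -> str:
    """Continue the caller's trace when a valid header arrives, else start a new one."""
    match = TRACEPARENT.match(incoming or "")
    valid = (
        match is not None
        and match.group(1) != "0" * 32  # an all-zero trace id is invalid
        and match.group(2) != "0" * 16  # an all-zero parent id is invalid
    )
    if valid:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new id for the governance-side span
    return f"00-{trace_id}-{span_id}-{flags}"


def stamp_event(event: dict, traceparent: str, correlation_id: str) -> dict:
    """Copy trace context onto an outgoing event so downstream spans join the trace."""
    trace_id = traceparent.split("-")[1]
    return {**event, "traceId": trace_id, "correlationId": correlation_id}


caller_header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
governance_header = child_traceparent(caller_header)
event = stamp_event({"type": "PolicyDecision"}, governance_header, "corr-123")
```

The governance span keeps the caller's trace id but gets its own span id, and the emitted event carries the same trace id onward.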

Dashboards

Dashboard	Audience	Shows
PDP Health	Platform operators	Evaluation volume, latency percentiles, allow/deny/gate mix, error rate.
Approval Operations	Compliance/release managers	Pending queue depth and age, timeout/escalation rate, decision latency.
Security Posture	Security engineers	Open findings by severity, leaked-secret rate, risk band distribution.
Audit & Compliance	Compliance officers	Audit throughput, export lag, chain-integrity status, reports generated.
Worker Health	Platform operators	Throughput, retry/DLQ counts, lag per worker.

Alerts

Alert	Condition	Severity
PDP latency breach	p99 evaluation latency above budget for 5 min	High
PDP error spike	Evaluation error rate above threshold	High
Approval queue aging	Open requests exceeding SLA age	Medium
Excessive timeouts	Timeout/escalation rate spike	Medium
Critical finding	Any `Critical` `SecurityFinding` raised	High
Leaked secret	`SecurityFindingRaised` (leak category)	Critical
Audit export lag	Export lag above threshold	High
Chain integrity	Hash-chain verification failure	Critical
Worker DLQ growth	DLQ depth rising for any worker	Medium
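
A condition such as "p99 evaluation latency above budget for 5 min" fires only when the breach is sustained, not on a single spike. The sketch below shows one way to express that logic. The class name, the 50 ms budget, and the sampling timestamps are assumptions for illustration; a real deployment would define this as an alert rule in Application Insights, not in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedBreachRule:
    budget_ms: float                 # hypothetical latency budget
    hold_seconds: float = 300.0      # "for 5 min"
    breach_started_at: Optional[float] = None

    def observe(self, timestamp: float, p99_ms: float) -> bool:
        """Return True once p99 has stayed above budget for hold_seconds."""
        if p99_ms <= self.budget_ms:
            self.breach_started_at = None  # recovery resets the window
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.hold_seconds


rule = SustainedBreachRule(budget_ms=50.0)
# (seconds since start, observed p99 in ms)
samples = [(0, 80.0), (240, 90.0), (300, 70.0), (360, 40.0), (420, 80.0)]
fired = [rule.observe(t, p99) for t, p99 in samples]
```

Here `fired` is `[False, False, True, False, False]`: the alert fires at the 300-second mark, clears when latency recovers, and a new breach has to last a full five minutes before firing again.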

Required Dimensions

Every metric, log, and trace carries — at minimum — the following dimensions so signals can be sliced and correlated, per the Metadata Schema:

tenantId, traceId, correlationId, projectId, moduleId, service, environment, plus domain dimensions: governanceDomain (one of the ten), decision, severity, riskBand, framework, and worker.
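
A check like the following could enforce the minimum dimension set before a signal is emitted. It is a sketch only: `missing_dimensions` is a hypothetical helper, the signal is modelled as a plain dictionary, and the domain dimensions are left out because they apply only to some signals.

```python
# Minimum dimensions listed above; domain dimensions are signal-specific.
REQUIRED_DIMENSIONS = frozenset({
    "tenantId", "traceId", "correlationId", "projectId",
    "moduleId", "service", "environment",
})


def missing_dimensions(signal: dict) -> list:
    """Return the required dimensions that are absent or empty, sorted by name."""
    return sorted(name for name in REQUIRED_DIMENSIONS if not signal.get(name))


complete = {
    "tenantId": "tenant-a",
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "correlationId": "corr-123",
    "projectId": "proj-1",
    "moduleId": "mod-1",
    "service": "policy-decision",
    "environment": "prod",
    "decision": "deny",  # domain dimension, optional here
}
partial = {"tenantId": "tenant-a", "service": "policy-decision", "environment": ""}

complete_gaps = missing_dimensions(complete)
partial_gaps = missing_dimensions(partial)
```

`complete_gaps` is empty, while `partial_gaps` lists `correlationId`, `environment`, `moduleId`, `projectId`, and `traceId`; an empty string counts as missing.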

How Observability Ties Back

Traceability: a shared traceId makes any governance decision queryable end to end.
Autonomy: PDP latency and approval-queue metrics show whether the factory keeps moving autonomously and escalates only true exceptions to humans.
Governance & compliance: export-lag and chain-integrity alerts flag gaps or tampering early, so the evidence record stays complete and provable.
Multi-tenant scale: every signal is tenant-dimensioned, so per-tenant health and load are visible.
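
The chain-integrity alert depends on hash-chain verification of the audit trail. This page does not describe the audit entry format, so the sketch below assumes one: each entry stores the SHA-256 hash of the previous entry's hash concatenated with its own canonical JSON payload. The field names `payload`, `prevHash`, and `hash` and the all-zero genesis value are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    """Append an entry linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prevHash": prev, "hash": entry_hash(prev, payload)})


def first_broken_index(chain: list):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        expected = entry_hash(prev, entry["payload"])
        if entry["prevHash"] != prev or entry["hash"] != expected:
            return index
        prev = entry["hash"]
    return None


chain = []
for effect in ("allow", "deny", "gate"):
    append_entry(chain, {"decision": effect})

intact = first_broken_index(chain)         # None: the chain verifies
chain[1]["payload"]["decision"] = "allow"  # simulate tampering
broken_at = first_broken_index(chain)      # 1: the altered entry fails
```

A non-`None` result from a check like `first_broken_index` is the kind of condition that would raise the Critical chain-integrity alert.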

Security · Workers · Workflows · Deployment
Observability & Feedback Platform
Reference: Metadata Schema · Event Envelope