Observability
Target Architecture — Final-State Design
This page describes the final-state observability of the Governance, Security & Compliance Platform. Telemetry uses Serilog and OpenTelemetry, exported to Application Insights via `ConnectSoft.Extensions.Observability`, `ConnectSoft.Extensions.Telemetry`, and `ConnectSoft.Extensions.Logging.Serilog`; the platform's own canonical events feed Observability & Feedback. Observability is distinct from audit: audit is the legal record of trust decisions, while observability is the operational view of platform health.
Telemetry Pillars
flowchart LR
Services["Governance services + workers"] --> Logs["Structured logs (Serilog)"]
Services --> Metrics["Metrics (OpenTelemetry)"]
Services --> Traces["Distributed traces (OpenTelemetry)"]
Logs --> AppInsights["Application Insights"]
Metrics --> AppInsights
Traces --> AppInsights
Services -->|canonical events| ObsPlatform["Observability & Feedback Platform"]
AppInsights --> Dashboards["Dashboards"]
AppInsights --> Alerts["Alerts"]
Logs
- Structured & contextual: Serilog emits structured JSON, with the Metadata Schema fields (`tenantId`, `traceId`, `correlationId`, `projectId`, `moduleId`) enriched into every log entry.
- Decision logs: each `PolicyDecision` logs the matched policy/version, effect, and risk at `Information`; denies and gates log at `Warning`.
- Security logs: finding ingestion, secret leak detection, and rotation events log with severity.
- No sensitive payloads: logs carry identifiers and digests, never secret values or restricted-classified content (redaction enforced by classification).
- Separation from audit: operational logs are mutable/expiring; the immutable record is the audit trail.
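The logging rules above can be illustrated with a minimal sketch. It uses Python and the standard library for brevity rather than Serilog, and it is not the platform's implementation: the function names (`level_for`, `digest`, `build_log_entry`), the `SENSITIVE_KEYS` set, and the `sha256:` prefix are all assumptions made for illustration.

```python
import hashlib
import json

# Metadata Schema fields that the page says are enriched into every log entry.
CONTEXT_FIELDS = ("tenantId", "traceId", "correlationId", "projectId", "moduleId")

# Hypothetical field names treated as sensitive. The page says redaction is
# driven by classification, which is not modeled in this sketch.
SENSITIVE_KEYS = {"secretValue", "connectionString"}


def level_for(effect: str) -> str:
    """Allow decisions log at Information; denies and gates log at Warning."""
    return "Information" if effect == "allow" else "Warning"


def digest(value: str) -> str:
    """Replace a sensitive value with a truncated SHA-256 digest."""
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


def build_log_entry(context: dict, message: str, level: str, fields: dict) -> str:
    """Return one structured JSON log line with enrichment and redaction applied."""
    missing = [name for name in CONTEXT_FIELDS if name not in context]
    if missing:
        raise ValueError(f"missing context fields: {missing}")
    entry = {"level": level, "message": message}
    entry.update({name: context[name] for name in CONTEXT_FIELDS})
    for key, value in fields.items():
        # Sensitive values never reach the log; only their digest does.
        entry[key] = digest(str(value)) if key in SENSITIVE_KEYS else value
    return json.dumps(entry, sort_keys=True)
```

Calling `build_log_entry(ctx, "Policy decision", level_for("deny"), {"effect": "deny"})` with a complete context dictionary yields a single JSON line at `Warning` that carries all five context fields.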
Metrics
| Metric | Type | Purpose |
|---|---|---|
| `governance.policy.evaluations` | counter | PDP call volume (by decision: allow/deny/gate). |
| `governance.policy.evaluation.latency` | histogram | Inline PDP latency (p50/p95/p99); guards the request-path budget. |
| `governance.approval.pending` | gauge | Open approval requests (by tier/age). |
| `governance.approval.timeouts` | counter | Auto-rejections/escalations from timeout. |
| `governance.approval.decision.latency` | histogram | Time from request to human decision. |
| `governance.audit.entries` | counter | Audit write throughput. |
| `governance.audit.export.lag` | gauge | Backlog between recorded and exported entries. |
| `governance.security.findings.open` | gauge | Open findings by severity. |
| `governance.risk.score` | histogram | Distribution of computed risk scores by band. |
| `governance.classification.events` | counter | Classifications assigned/changed. |
| `governance.compliance.reports` | counter | Reports generated (by framework/status). |
| `governance.worker.dlq` | gauge | Dead-lettered messages per worker. |
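To make the counter and histogram rows concrete, here is a minimal in-memory sketch of a dimensioned counter and a nearest-rank percentile over latency samples. It stands in for OpenTelemetry instruments and does not use the OpenTelemetry API; the helper names (`increment`, `read`, `percentile`) are hypothetical, and nearest-rank is one common percentile definition chosen here for simplicity.

```python
import math
from collections import Counter

# Series are keyed by metric name plus the sorted dimension pairs, so the
# order in which dimensions are passed does not create a new series.
_counters: Counter = Counter()


def increment(name: str, **dimensions: str) -> None:
    """Add 1 to the counter series identified by name and dimensions."""
    _counters[(name, tuple(sorted(dimensions.items())))] += 1


def read(name: str, **dimensions: str) -> int:
    """Return the current value of one counter series (0 if never incremented)."""
    return _counters[(name, tuple(sorted(dimensions.items())))]


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, `increment("governance.policy.evaluations", decision="deny", tenantId="t1")` bumps one series, and `percentile(latencies_ms, 99)` gives the p99 value that the latency budget would be compared against.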
Traces
- End-to-end spans: the inline PDP call, supplier gRPC calls (isolation/classification/risk), approval lifecycle, and audit writes are all spans on the originating action's `traceId`, so a deployment promotion can be traced from request → evaluation → approval → audit.
- Cross-platform stitching: `traceId`/`correlationId` propagate from the Control Plane / DevOps & GitOps caller into governance spans and out via emitted events, giving a single thread across platforms.
- Worker spans: each worker creates a span linked to the consumed event's `eventId` for replay debugging.
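The propagation described above can be sketched as follows, assuming the caller passes context in a W3C Trace Context `traceparent` header (`00-<trace-id>-<parent-id>-<flags>`), which the page does not state explicitly. The `Span` class, its `linked_event_ids` field, and the helper functions are hypothetical and only illustrate how a child span keeps the caller's trace ID; a real service would rely on the OpenTelemetry SDK for this.

```python
import re
import secrets
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Version 00 traceparent: 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    # Hypothetical field: eventId values of consumed events, for replay debugging.
    linked_event_ids: List[str] = field(default_factory=list)


def parse_traceparent(header: str) -> Tuple[str, str, str]:
    """Split a traceparent header into (trace_id, parent_id, flags)."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError("malformed traceparent")
    return match.group(1), match.group(2), match.group(3)


def start_child_span(name: str, traceparent: str, event_id: Optional[str] = None) -> Span:
    """Create a span on the caller's trace, optionally linked to a consumed event."""
    trace_id, parent_id, _flags = parse_traceparent(traceparent)
    span = Span(name, trace_id, secrets.token_hex(8), parent_id)
    if event_id is not None:
        span.linked_event_ids.append(event_id)
    return span


def to_traceparent(span: Span, flags: str = "01") -> str:
    """Render the header that carries this span's context to the next hop."""
    return f"00-{span.trace_id}-{span.span_id}-{flags}"
```

Chaining `start_child_span("audit.write", to_traceparent(pdp_span))` keeps the audit write on the same `trace_id` as the PDP evaluation, which is what makes the request → evaluation → approval → audit path queryable as one trace.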
Dashboards
| Dashboard | Audience | Shows |
|---|---|---|
| PDP Health | Platform operators | Evaluation volume, latency percentiles, allow/deny/gate mix, error rate. |
| Approval Operations | Compliance/release managers | Pending queue depth and age, timeout/escalation rate, decision latency. |
| Security Posture | Security engineers | Open findings by severity, leaked-secret rate, risk band distribution. |
| Audit & Compliance | Compliance officers | Audit throughput, export lag, chain-integrity status, reports generated. |
| Worker Health | Platform operators | Throughput, retry/DLQ counts, lag per worker. |
Alerts
| Alert | Condition | Severity |
|---|---|---|
| PDP latency breach | p99 evaluation latency above budget for 5 min | High |
| PDP error spike | Evaluation error rate above threshold | High |
| Approval queue aging | Open requests exceeding SLA age | Medium |
| Excessive timeouts | Timeout/escalation rate spike | Medium |
| Critical finding | Any Critical `SecurityFinding` raised | High |
| Leaked secret | `SecurityFindingRaised` (leak category) | Critical |
| Audit export lag | Export lag above threshold | High |
| Chain integrity | Hash-chain verification failure | Critical |
| Worker DLQ growth | DLQ depth rising for any worker | Medium |
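A condition such as "p99 evaluation latency above budget for 5 min" is a sustained-threshold check rather than a single-sample check. The sketch below shows one way to evaluate it over a time-ordered series of p99 samples; the function name `sustained_breach`, the sample layout, and the budget value in the usage note are assumptions, and in practice the rule would be configured in the alerting backend rather than hand-written.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sustained_breach(
    samples: List[Tuple[datetime, float]],
    budget_ms: float,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if the most recent run of over-budget samples spans the window.

    `samples` must be ordered oldest to newest as (timestamp, p99 latency in ms).
    """
    if not samples:
        return False
    latest_time = samples[-1][0]
    streak_start = None
    # Walk backwards from the newest sample until one is within budget.
    for timestamp, value in reversed(samples):
        if value <= budget_ms:
            break
        streak_start = timestamp
    if streak_start is None:
        return False  # The newest sample is within budget.
    return latest_time - streak_start >= window
```

With one sample per minute and a hypothetical 100 ms budget, six consecutive over-budget samples (covering five minutes) fire the alert, while a single within-budget sample in between resets the streak.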
Required Dimensions
Every metric, log, and trace carries at least the following dimensions, per the Metadata Schema, so that signals can be sliced and correlated:
`tenantId`, `traceId`, `correlationId`, `projectId`, `moduleId`, `service`, `environment`, plus the domain dimensions `governanceDomain` (one of the ten), `decision`, `severity`, `riskBand`, `framework`, and `worker`.
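A simple way to enforce this rule is to check each signal's dimensions before it is emitted. The following is a minimal sketch under the assumption that a signal's dimensions are available as a plain dictionary; `missing_dimensions` and `require_dimensions` are hypothetical helpers, not part of any ConnectSoft library. Extra dimensions are allowed, since the listed set is a minimum.

```python
from typing import List

# The seven dimensions the page requires on every metric, log, and trace.
REQUIRED_DIMENSIONS = frozenset(
    {"tenantId", "traceId", "correlationId", "projectId", "moduleId", "service", "environment"}
)


def missing_dimensions(dimensions: dict) -> List[str]:
    """Return the required dimension names absent from a signal, sorted by name."""
    return sorted(REQUIRED_DIMENSIONS - set(dimensions))


def require_dimensions(dimensions: dict) -> dict:
    """Return the dimensions unchanged, or raise if any required one is missing."""
    missing = missing_dimensions(dimensions)
    if missing:
        raise ValueError(f"signal is missing required dimensions: {missing}")
    return dimensions
```

Domain dimensions such as `decision` or `severity` pass through untouched, so the same check applies to every signal type.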
How Observability Ties Back
- Traceability: the shared `traceId` makes any governance decision queryable end to end.
- Autonomy: PDP latency and approval-queue metrics confirm that the factory keeps moving autonomously and surfaces only true exceptions to humans.
- Governance & compliance: export-lag and chain-integrity alerts surface any break in the evidence record, so it stays complete and provable.
- Multi-tenant scale: every signal is tenant-dimensioned, so per-tenant health and load are visible.
Related
- Security · Workers · Workflows · Deployment
- Observability & Feedback Platform
- Reference: Metadata Schema · Event Envelope