Observability

Target Architecture — Final-State Design

This page describes the final-state observability model of the Integration Platform. Every service uses ConnectSoft.Extensions.Observability (Serilog + OpenTelemetry + Application Insights). Because this platform spans the boundary to external systems, its telemetry emphasises external API health, latency, error classification, and credential lifecycle — the signals that predict and explain vendor-side failures.

Observability here answers three questions continuously: Is the factory's connectivity to each vendor healthy? Why did a specific external call fail? Is any tenant or provider approaching a rate or cost limit? All signals carry the canonical envelope identity so they correlate with the Knowledge Platform and feed the Observability & Feedback Platform.

Logs

Structured Serilog logs with the canonical metadata enriched on every line: tenantId, projectId, traceId, correlationId, connectionId, providerId, runId / deliveryId.
Secret-safe. Logs carry credentialRef and SecretFingerprint only — never secret material, tokens, or full vendor payloads (payloads are referenced by Blob payloadRef).
Outbound call logs record provider, operation, attempt, HTTP status class, and latency; inbound webhook logs record direction, signature status, and normalisation outcome.
Log level follows the failure classification: Transient and RateLimited failures log at warning; Auth, Validation, and Poison failures log at error.
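
The level mapping and the secret-safety rule above can be sketched as follows. This is a minimal illustration, not the platform's Serilog configuration: the deny-list of field names, the fingerprint scheme (truncated SHA-256), and the choice to log unknown categories at error are all assumptions made for the example.

```python
import hashlib
import logging

# Failure categories named on this page, mapped to the level described above.
LEVEL_BY_CATEGORY = {
    "Transient": logging.WARNING,
    "RateLimited": logging.WARNING,
    "Auth": logging.ERROR,
    "Validation": logging.ERROR,
    "Poison": logging.ERROR,
}

# Hypothetical deny-list of field names that must never reach a log line.
FORBIDDEN_FIELDS = {"secret", "token", "password", "payload"}


def secret_fingerprint(secret: str) -> str:
    """Assumed fingerprint scheme: first 12 hex characters of SHA-256."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def build_log_event(category: str, fields: dict) -> tuple:
    """Return (level, safe_fields) for a failed outbound call."""
    # Unknown categories fall back to error (an assumption, not a stated rule).
    level = LEVEL_BY_CATEGORY.get(category, logging.ERROR)
    # Drop any field whose name is on the deny-list before it is logged.
    safe = {k: v for k, v in fields.items() if k.lower() not in FORBIDDEN_FIELDS}
    safe["category"] = category
    return level, safe
```

A caller would pass the returned level and fields to its logger, so that references such as credentialRef survive while raw tokens are dropped before the line is written.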

Metrics

Metric	Type	Purpose
`integration.run.count`	Counter	Outbound operations by provider/operation/outcome
`integration.run.latency`	Histogram	Vendor call latency distribution
`integration.run.failures`	Counter	Failures by `category` (Transient/RateLimited/Auth/Validation/Poison)
`webhook.delivery.count`	Counter	Inbound/outbound deliveries by status
`webhook.delivery.attempts`	Histogram	Retry attempts per delivery
`webhook.deadletter.count`	Counter	Dead-lettered deliveries
`external.health.status`	Gauge	Per-connection health (0=Unhealthy … 3=Healthy)
`external.health.latency`	Histogram	Probe latency per provider
`integration.ratelimit.remaining`	Gauge	Remaining budget per provider/tenant
`credential.rotation.count`	Counter	Rotations by outcome
`credential.expiry.days`	Gauge	Days until credential expiry
`circuit.state`	Gauge	Open/half-open/closed circuits per provider
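
To make the relationship between a metric and its dimensions concrete, here is a small sketch of how one outbound operation might increment `integration.run.count` and, on failure, `integration.run.failures`. The in-memory registry stands in for a real metrics backend, and the provider names and the outcome values "Success" and "Failure" are hypothetical.

```python
from collections import Counter

# In-memory stand-in for a metrics backend; purely illustrative.
_counters: Counter = Counter()


def increment(name: str, **dimensions: str) -> None:
    """Add 1 to the series identified by metric name and sorted dimensions."""
    _counters[(name, tuple(sorted(dimensions.items())))] += 1


def record_run(provider: str, operation: str, outcome: str, category: str = "") -> None:
    """Record one outbound operation, plus a failure sample when categorised."""
    increment("integration.run.count",
              provider=provider, operation=operation, outcome=outcome)
    if category:
        increment("integration.run.failures", provider=provider, category=category)


def total(name: str, **filters: str) -> int:
    """Sum a metric across every series whose dimensions match the filters."""
    return sum(
        value
        for (metric, dims), value in _counters.items()
        if metric == name
        and all(dict(dims).get(k) == v for k, v in filters.items())
    )
```

Keying each series by its full dimension set is what lets a dashboard slice the same counter by provider, operation, or category without extra instrumentation.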

Traces

OpenTelemetry spans wrap every outbound vendor call and inbound webhook processing stage; traceId/correlationId propagate from the event envelope into each span and onward into vendor request headers where supported.
A single trace links: agent decision → ExecuteIntegrationRun → vendor HTTP span → IntegrationRunCompleted → downstream factory consumer.
Inbound traces link: vendor webhook → signature verify span → normalise span → WebhookDelivered → Knowledge Platform ingestion.
Spans are exported to Application Insights with provider, operation, attempt, and outcome attributes (never secrets or full payloads).
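
Propagation into vendor request headers could look like the sketch below, which builds a W3C Trace Context `traceparent` header. It assumes the envelope traceId is already a 32-character lowercase hex value; the `X-Correlation-Id` header name and the `vendor_headers` helper are hypothetical and not taken from the platform.

```python
import re
import secrets
from typing import Dict, Optional

_TRACE_ID = re.compile(r"[0-9a-f]{32}")
_SPAN_ID = re.compile(r"[0-9a-f]{16}")


def vendor_headers(trace_id: str, correlation_id: str,
                   span_id: Optional[str] = None,
                   sampled: bool = True) -> Dict[str, str]:
    """Build outbound headers that carry the envelope identity to a vendor."""
    # A fresh 8-byte span id is generated when the caller does not supply one.
    span_id = span_id or secrets.token_hex(8)
    # W3C Trace Context rejects malformed and all-zero identifiers.
    if not _TRACE_ID.fullmatch(trace_id) or trace_id == "0" * 32:
        raise ValueError("traceId must be 32 lowercase hex chars and non-zero")
    if not _SPAN_ID.fullmatch(span_id) or span_id == "0" * 16:
        raise ValueError("span id must be 16 lowercase hex chars and non-zero")
    flags = "01" if sampled else "00"  # 01 marks the trace as sampled
    return {
        # Format: version-traceid-parentid-flags
        "traceparent": f"00-{trace_id}-{span_id}-{flags}",
        # Hypothetical header name for the correlationId.
        "X-Correlation-Id": correlation_id,
    }
```

Only identifiers travel in these headers, which keeps the rule that no secrets or payloads leave through telemetry channels.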

Dashboards

Vendor Health — health status, latency, and error rate per provider/connection; open circuits highlighted.
Integration Throughput — run and delivery volume, success rate, and latency percentiles by provider and tenant.
Failure Analytics — failures by category and provider over time; retry success rate; dead-letter trends.
Credential Lifecycle — upcoming expirations, rotation success/failure, and revocations.
Rate & Cost — remaining rate-limit budget and per-tenant usage against quotas.

Alerts

Alert	Condition	Action
Vendor outage	`external.health.status` Unhealthy for N intervals	Open circuit; page on-call; notify dependent platforms
Elevated failure rate	`integration.run.failures` rate > threshold per provider	Investigate; possible vendor incident
Dead-letter surge	`webhook.deadletter.count` spike	Inspect poison payloads; pause/replay after fix
Auth failures	`Auth`-category failures > threshold	Trigger credential rotation; review credential validity
Credential expiring	`credential.expiry.days` below warning window	Rotate ahead of expiry
Rate-limit exhaustion	`integration.ratelimit.remaining` near zero	Backpressure; review tenant quota

Alerts and incidents are forwarded to the Observability & Feedback Platform, where they become runtime memory attributable back to the producing connection and trace.
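
Two of the alert conditions in the table, "Unhealthy for N intervals" and "below warning window", can be sketched as simple predicates. The page does not fix N or the warning window, so the default of 14 days and the function names below are placeholders chosen for illustration.

```python
from datetime import date
from typing import List

UNHEALTHY = 0  # per the metrics table: 0 = Unhealthy


def vendor_outage(samples: List[int], n: int) -> bool:
    """True when the most recent n health samples are all Unhealthy."""
    if n <= 0 or len(samples) < n:
        return False  # not enough history to call it sustained
    return all(s == UNHEALTHY for s in samples[-n:])


def days_until_expiry(expires_on: date, today: date) -> int:
    """Value that a credential.expiry.days gauge would report."""
    return (expires_on - today).days


def credential_expiring(expires_on: date, today: date,
                        warning_days: int = 14) -> bool:
    """True when the credential is inside the (assumed) warning window."""
    return days_until_expiry(expires_on, today) < warning_days
```

Requiring n consecutive Unhealthy samples rather than a single one keeps a transient probe failure from paging on-call.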

Required Dimensions

Every log, metric, and trace MUST carry: tenantId, traceId, correlationId, providerId, and (where applicable) connectionId, runId / deliveryId, operation, outcome, and failure category. These dimensions are what make external behaviour sliceable by tenant, provider, and operation, and are mandatory per the factory metadata schema.
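
A check for this rule might look like the following sketch. The always-required names come from the paragraph above; how a caller decides which conditional dimensions are "applicable" is left open here, so the `applicable` parameter is an assumption of the example.

```python
from typing import Dict, Iterable, List

# Dimensions this page requires on every log, metric, and trace.
ALWAYS_REQUIRED = ("tenantId", "traceId", "correlationId", "providerId")


def missing_dimensions(dims: Dict[str, str],
                       applicable: Iterable[str] = ()) -> List[str]:
    """Return required dimension names that are absent or empty.

    `applicable` lists the conditional dimensions (for example connectionId
    or runId) that the caller knows apply to this particular signal.
    """
    required = ALWAYS_REQUIRED + tuple(applicable)
    return [name for name in required if dims.get(name) in (None, "")]
```

An emitter could reject or flag any signal for which this returns a non-empty list, for instance a run-scoped metric that lacks its runId.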

External API Health Monitoring

The ExternalApiHealthWorker continuously probes vendor endpoints and writes per-connection health that gates routing:

Probing — lightweight, provider-appropriate health checks at a configurable interval per connection.
Health write — latest-wins external.health.status with latency; sustained failure opens a circuit so dependent services fail fast.
Recovery — half-open probing restores routing once the vendor recovers; recovery emits an event so paused work resumes.
Correlation — health degradation links to the IntegrationFailure records observed in the same window for root-cause analysis.
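
The probe, open, half-open, and recover cycle described above can be sketched as a small state machine. This is an illustration under stated assumptions rather than the ExternalApiHealthWorker itself: the failure threshold, the time-based cooldown, and the event names "CircuitOpened" and "ConnectionRecovered" are hypothetical.

```python
from enum import Enum
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class ConnectionCircuit:
    """Per-connection circuit driven by health probe results."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allows_traffic(self) -> bool:
        """Routing gate: only a closed circuit carries normal traffic."""
        return self.state is CircuitState.CLOSED

    def record_probe(self, healthy: bool, now: float) -> Optional[str]:
        """Apply one probe result; return an event name on a transition."""
        if self.state is CircuitState.OPEN:
            if now - self.opened_at < self.cooldown_seconds:
                return None  # still cooling down; keep failing fast
            self.state = CircuitState.HALF_OPEN  # cooldown over: trial probe
        if self.state is CircuitState.HALF_OPEN:
            if healthy:
                self.state = CircuitState.CLOSED
                self.failures = 0
                return "ConnectionRecovered"  # lets paused work resume
            self.state = CircuitState.OPEN  # trial failed: restart cooldown
            self.opened_at = now
            return None
        if healthy:
            self.failures = 0
            return None
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.opened_at = now
            return "CircuitOpened"
        return None
```

Counting consecutive failures before opening matches the "sustained failure" wording, and the half-open trial means a single healthy probe after the cooldown is enough to restore routing.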