Observability
Target Architecture — Final-State Design
This page describes the final-state observability model of the Integration Platform. Every service uses ConnectSoft.Extensions.Observability (Serilog + OpenTelemetry + Application Insights). Because this platform spans the boundary to external systems, its telemetry emphasises external API health, latency, error classification, and credential lifecycle — the signals that predict and explain vendor-side failures.
Observability here answers three questions continuously: Is the factory's connectivity to each vendor healthy? Why did a specific external call fail? Is any tenant or provider approaching a rate or cost limit? All signals carry the canonical envelope identity so they correlate with the Knowledge Platform and feed the Observability & Feedback Platform.
Logs
- Structured Serilog logs with the canonical metadata enriched on every line: `tenantId`, `projectId`, `traceId`, `correlationId`, `connectionId`, `providerId`, `runId`/`deliveryId`.
- Secret-safe. Logs carry `credentialRef` and `SecretFingerprint` only, never secret material, tokens, or full vendor payloads (payloads are referenced by Blob `payloadRef`).
- Outbound call logs record provider, operation, attempt, HTTP status class, and latency; inbound webhook logs record direction, signature status, and normalisation outcome.
- Log levels escalate failures by classification: `Transient`/`RateLimited` at warning, `Auth`/`Validation`/`Poison` at error.
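The level mapping and the secret-safe rule above can be illustrated with a small sketch. It is written in Python with the standard library only, purely for illustration; it is not the platform's Serilog configuration. The allow-list contents beyond the fields named above (`operation`, `attempt`, `statusClass`, `latencyMs`, `category`) and the choice to log unknown categories at error are assumptions.

```python
import json
import logging

# Mapping taken from the bullet above: retryable categories log at warning,
# categories that need intervention log at error.
LEVEL_BY_CATEGORY = {
    "Transient": logging.WARNING,
    "RateLimited": logging.WARNING,
    "Auth": logging.ERROR,
    "Validation": logging.ERROR,
    "Poison": logging.ERROR,
}

# Assumed allow-list: only these keys may appear on a log line.
# Anything else (tokens, raw vendor payloads) is dropped before serialisation.
ALLOWED_FIELDS = {
    "tenantId", "projectId", "traceId", "correlationId", "connectionId",
    "providerId", "runId", "deliveryId", "credentialRef", "SecretFingerprint",
    "payloadRef", "operation", "attempt", "statusClass", "latencyMs", "category",
}


def level_for(category: str) -> int:
    """Return the log level for a failure category (unknown categories: error, by assumption)."""
    return LEVEL_BY_CATEGORY.get(category, logging.ERROR)


def safe_log_record(fields: dict) -> str:
    """Serialise only allow-listed fields so secret material never reaches the sink."""
    kept = {key: value for key, value in fields.items() if key in ALLOWED_FIELDS}
    return json.dumps(kept, sort_keys=True)
```

An allow-list is used rather than a deny-list so that a newly introduced sensitive field is excluded by default instead of leaking until someone remembers to block it.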
Metrics
| Metric | Type | Purpose |
|---|---|---|
| `integration.run.count` | Counter | Outbound operations by provider/operation/outcome |
| `integration.run.latency` | Histogram | Vendor call latency distribution |
| `integration.run.failures` | Counter | Failures by category (Transient/RateLimited/Auth/Validation/Poison) |
| `webhook.delivery.count` | Counter | Inbound/outbound deliveries by status |
| `webhook.delivery.attempts` | Histogram | Retry attempts per delivery |
| `webhook.deadletter.count` | Counter | Dead-lettered deliveries |
| `external.health.status` | Gauge | Per-connection health (0=Unhealthy … 3=Healthy) |
| `external.health.latency` | Histogram | Probe latency per provider |
| `integration.ratelimit.remaining` | Gauge | Remaining budget per provider/tenant |
| `credential.rotation.count` | Counter | Rotations by outcome |
| `credential.expiry.days` | Gauge | Days until credential expiry |
| `circuit.state` | Gauge | Open/half-open/closed circuits per provider |
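As a rough illustration of how the counters above stay sliceable by dimension, the sketch below records `integration.run.count` and `integration.run.failures` against an in-memory sink. `MetricSink` and `record_run` are hypothetical names used only for this example; a real service would emit through its metrics library instead.

```python
from collections import Counter
from typing import Optional

FAILURE_CATEGORIES = {"Transient", "RateLimited", "Auth", "Validation", "Poison"}


class MetricSink:
    """In-memory stand-in for a metrics backend (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    @staticmethod
    def _key(name: str, dims: dict) -> tuple:
        # Sort dimensions so the same label set always maps to the same series.
        return (name, tuple(sorted(dims.items())))

    def increment(self, name: str, **dims: str) -> None:
        self.counters[self._key(name, dims)] += 1

    def value(self, name: str, **dims: str) -> int:
        return self.counters[self._key(name, dims)]


def record_run(sink: MetricSink, provider: str, operation: str,
               outcome: str, category: Optional[str] = None) -> None:
    """Count one outbound operation and, if it failed, count the failure by category."""
    sink.increment("integration.run.count",
                   provider=provider, operation=operation, outcome=outcome)
    if category is not None:
        if category not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {category}")
        sink.increment("integration.run.failures", provider=provider, category=category)
```

Rejecting unknown categories at the call site keeps the failure counter limited to the five documented categories, which is what allows the Failure Analytics dashboard to group by category without a catch-all bucket.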
Traces
- OpenTelemetry spans wrap every outbound vendor call and inbound webhook processing stage; `traceId`/`correlationId` propagate from the event envelope into each span and onward into vendor request headers where supported.
- A single trace links: agent decision → `ExecuteIntegrationRun` → vendor HTTP span → `IntegrationRunCompleted` → downstream factory consumer.
- Inbound traces link: vendor webhook → signature verify span → normalise span → `WebhookDelivered` → Knowledge Platform ingestion.
- Spans are exported to Application Insights with provider, operation, attempt, and outcome attributes (never secrets or full payloads).
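Propagating the envelope identity into vendor request headers might look like the sketch below. It assumes the W3C Trace Context `traceparent` header format and a hypothetical `X-Correlation-Id` header name; the source does not state which headers the platform actually uses, and `outbound_headers` is an illustrative helper, not a platform API.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context: trace-id is 32 lowercase hex chars, parent-id is 16.
_TRACE_ID = re.compile(r"[0-9a-f]{32}")
_SPAN_ID = re.compile(r"[0-9a-f]{16}")


def outbound_headers(trace_id: str, correlation_id: str,
                     span_id: Optional[str] = None) -> dict:
    """Build request headers that carry the envelope's trace identity to a vendor."""
    if not _TRACE_ID.fullmatch(trace_id) or trace_id == "0" * 32:
        raise ValueError("traceId must be 32 lowercase hex characters and not all zeros")
    if span_id is None:
        span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    if not _SPAN_ID.fullmatch(span_id) or span_id == "0" * 16:
        raise ValueError("span id must be 16 lowercase hex characters and not all zeros")
    return {
        # version 00, then trace-id, parent-id, and flags (01 = sampled)
        "traceparent": f"00-{trace_id}-{span_id}-01",
        # Header name is an assumption made for this example.
        "X-Correlation-Id": correlation_id,
    }
```

In practice an OpenTelemetry propagator would normally inject `traceparent` automatically; the explicit construction here only shows what travels on the wire.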
Dashboards
- Vendor Health — health status, latency, and error rate per provider/connection; open circuits highlighted.
- Integration Throughput — run and delivery volume, success rate, and latency percentiles by provider and tenant.
- Failure Analytics — failures by category and provider over time; retry success rate; dead-letter trends.
- Credential Lifecycle — upcoming expirations, rotation success/failure, and revocations.
- Rate & Cost — remaining rate-limit budget and per-tenant usage against quotas.
Alerts
| Alert | Condition | Action |
|---|---|---|
| Vendor outage | `external.health.status` Unhealthy for N intervals | Open circuit; page on-call; notify dependent platforms |
| Elevated failure rate | `integration.run.failures` rate > threshold per provider | Investigate; possible vendor incident |
| Dead-letter surge | `webhook.deadletter.count` spike | Inspect poison payloads; pause/replay after fix |
| Auth failures | `Auth`-category failures > threshold | Trigger credential rotation; review credential validity |
| Credential expiring | `credential.expiry.days` below warning window | Rotate ahead of expiry |
| Rate-limit exhaustion | `integration.ratelimit.remaining` near zero | Backpressure; review tenant quota |
Alerts and incidents are forwarded to the Observability & Feedback Platform, where they become runtime memory attributable back to the producing connection and trace.
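Three of the conditions in the alert table can be sketched as small predicates. The function names, the 5% floor used for "near zero", and the treatment of N as a count of consecutive samples are assumptions made for illustration; the actual thresholds and windows are not specified in this page.

```python
from typing import Sequence

UNHEALTHY = 0  # external.health.status encodes 0 as Unhealthy


def vendor_outage(statuses: Sequence[int], n: int) -> bool:
    """True when the most recent n health samples are all Unhealthy."""
    if n <= 0:
        raise ValueError("n must be positive")
    recent = list(statuses)[-n:]
    return len(recent) == n and all(status == UNHEALTHY for status in recent)


def credential_expiring(days_left: float, warning_window_days: float) -> bool:
    """True when credential.expiry.days has dropped below the warning window."""
    return days_left < warning_window_days


def rate_limit_near_zero(remaining: int, limit: int, floor_ratio: float = 0.05) -> bool:
    """True when the remaining budget is at or below an assumed fraction of the limit."""
    return limit > 0 and remaining / limit <= floor_ratio
```

Requiring N consecutive Unhealthy samples, rather than any N within a window, avoids paging on a vendor that is flapping but mostly reachable; whether that matches the intended alert semantics is a design choice the source leaves open.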
Required Dimensions
Every log, metric, and trace MUST carry: tenantId, traceId, correlationId, providerId, and (where applicable) connectionId, runId / deliveryId, operation, outcome, and failure category. These dimensions are what make external behaviour sliceable by tenant, provider, and operation, and are mandatory per the factory metadata schema.
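A guard for the always-required dimensions could be as simple as the following sketch. `missing_dimensions` is a hypothetical helper; the conditional dimensions (`connectionId`, `runId`/`deliveryId`, `operation`, `outcome`, failure category) are left out because the rules for when they apply are not spelled out here.

```python
# Dimensions that every log, metric, and trace must carry, per the text above.
REQUIRED_DIMENSIONS = ("tenantId", "traceId", "correlationId", "providerId")


def missing_dimensions(signal: dict) -> list:
    """Return the required dimensions that are absent or empty on a telemetry signal."""
    return [name for name in REQUIRED_DIMENSIONS if not signal.get(name)]
```

For example, a signal carrying only `tenantId` and `traceId` yields `["correlationId", "providerId"]`, which a pipeline could reject or flag before export.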
External API Health Monitoring
The ExternalApiHealthWorker continuously probes vendor endpoints and writes per-connection health that gates routing:
- Probing: lightweight, provider-appropriate health checks at a configurable interval per connection.
- Health write: latest-wins `external.health.status` with latency; sustained failure opens a circuit so dependent services fail fast.
- Recovery: half-open probing restores routing once the vendor recovers; recovery emits an event so paused work resumes.
- Correlation: health degradation links to the `IntegrationFailure` records observed in the same window for root-cause analysis.
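The open, half-open, and closed behaviour described above follows the general circuit-breaker pattern, which can be sketched as a small state machine. This is not the `ExternalApiHealthWorker` implementation: the class name, the failure threshold, the cooldown, and the `ConnectionRecovered` event name are all assumptions chosen for the example.

```python
from enum import Enum
from typing import List, Optional


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class ConnectionCircuit:
    """Per-connection circuit driven by health probe results (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None
        self.events: List[str] = []  # stand-in for an event bus

    def tick(self, now: float) -> None:
        """Move an open circuit to half-open once the cooldown has elapsed."""
        if (self.state is CircuitState.OPEN and self.opened_at is not None
                and now - self.opened_at >= self.cooldown_seconds):
            self.state = CircuitState.HALF_OPEN

    def allows_routing(self) -> bool:
        """Only a closed circuit routes traffic; otherwise callers fail fast."""
        return self.state is CircuitState.CLOSED

    def record_probe(self, healthy: bool, now: float) -> None:
        """Apply one probe result to the circuit."""
        self.tick(now)
        if self.state is CircuitState.OPEN:
            return  # still cooling down: ignore probes and keep failing fast
        if healthy:
            if self.state is CircuitState.HALF_OPEN:
                self.events.append("ConnectionRecovered")  # hypothetical event name
            self.state = CircuitState.CLOSED
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        # A failed half-open trial reopens immediately; a closed circuit opens
        # only after sustained failure.
        if (self.state is CircuitState.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = CircuitState.OPEN
            self.opened_at = now
```

Passing `now` explicitly instead of reading the clock inside the class keeps the transitions deterministic, which makes the recovery path straightforward to test.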