
Observability

Target Architecture — Final-State Design

This page describes the final-state observability model of the Integration Platform. Every service uses ConnectSoft.Extensions.Observability (Serilog + OpenTelemetry + Application Insights). Because this platform spans the boundary to external systems, its telemetry emphasises external API health, latency, error classification, and credential lifecycle — the signals that predict and explain vendor-side failures.

Observability here answers three questions continuously: Is the factory's connectivity to each vendor healthy? Why did a specific external call fail? Is any tenant or provider approaching a rate or cost limit? All signals carry the canonical envelope identity so they correlate with the Knowledge Platform and feed the Observability & Feedback Platform.

Logs

  • Structured Serilog logs with the canonical metadata enriched on every line: tenantId, projectId, traceId, correlationId, connectionId, providerId, runId / deliveryId.
  • Secret-safe. Logs carry credentialRef and SecretFingerprint only — never secret material, tokens, or full vendor payloads (payloads are referenced by Blob payloadRef).
  • Outbound call logs record provider, operation, attempt, HTTP status class, and latency; inbound webhook logs record direction, signature status, and normalisation outcome.
  • Failure classification determines the log level: Transient and RateLimited failures log at Warning; Auth, Validation, and Poison failures log at Error.
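The logging rules above can be sketched in a language-neutral way. The following is a minimal illustration in Python, not the ConnectSoft.Extensions.Observability or Serilog API: the allow-list approach, the function names, and the field names statusClass, latencyMs, and category are assumptions made for the example.

```python
import json
import logging

# Failure categories from this page, mapped to the log levels the bullets describe.
LEVEL_BY_CATEGORY = {
    "Transient": logging.WARNING,
    "RateLimited": logging.WARNING,
    "Auth": logging.ERROR,
    "Validation": logging.ERROR,
    "Poison": logging.ERROR,
}

# Hypothetical allow-list: anything not named here (tokens, raw payloads) is dropped.
ALLOWED_FIELDS = {
    "tenantId", "projectId", "traceId", "correlationId", "connectionId",
    "providerId", "runId", "deliveryId", "credentialRef", "SecretFingerprint",
    "payloadRef", "operation", "attempt", "statusClass", "latencyMs",
}


def status_class(http_status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 503 -> '5xx'."""
    return f"{http_status // 100}xx"


def build_failure_event(category: str, fields: dict) -> tuple[int, str]:
    """Return (log level, JSON log line) for a failed outbound call."""
    level = LEVEL_BY_CATEGORY[category]
    # Keep only allow-listed fields so secret material cannot leak into the line.
    safe = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    safe["category"] = category
    return level, json.dumps(safe, sort_keys=True)
```

An allow-list is used here rather than a deny-list so that a newly introduced secret-bearing field is excluded by default.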

Metrics

| Metric | Type | Purpose |
|---|---|---|
| integration.run.count | Counter | Outbound operations by provider/operation/outcome |
| integration.run.latency | Histogram | Vendor call latency distribution |
| integration.run.failures | Counter | Failures by category (Transient/RateLimited/Auth/Validation/Poison) |
| webhook.delivery.count | Counter | Inbound/outbound deliveries by status |
| webhook.delivery.attempts | Histogram | Retry attempts per delivery |
| webhook.deadletter.count | Counter | Dead-lettered deliveries |
| external.health.status | Gauge | Per-connection health (0=Unhealthy … 3=Healthy) |
| external.health.latency | Histogram | Probe latency per provider |
| integration.ratelimit.remaining | Gauge | Remaining budget per provider/tenant |
| credential.rotation.count | Counter | Rotations by outcome |
| credential.expiry.days | Gauge | Days until credential expiry |
| circuit.state | Gauge | Open/half-open/closed circuits per provider |
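To illustrate how a single outbound run might feed the three integration.run.* metrics, here is a minimal sketch using an in-memory registry. It stands in for a real metrics SDK and is not the platform's implementation; InMemoryMetrics, record_run, and the outcome values "success" and "failure" are assumptions.

```python
from collections import defaultdict


class InMemoryMetrics:
    """Toy registry: counters and histograms keyed by (name, sorted dimensions)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.histograms = defaultdict(list)

    @staticmethod
    def key(name, dims):
        return (name, tuple(sorted(dims.items())))

    def add(self, name, dims, value=1):
        self.counters[self.key(name, dims)] += value

    def record(self, name, dims, value):
        self.histograms[self.key(name, dims)].append(value)


def record_run(metrics, provider_id, operation, latency_ms, failure_category=None):
    """Record one outbound operation against the metrics named in the table."""
    dims = {"providerId": provider_id, "operation": operation}
    outcome = "failure" if failure_category else "success"
    metrics.add("integration.run.count", {**dims, "outcome": outcome})
    metrics.record("integration.run.latency", dims, latency_ms)
    if failure_category:
        # Failures are counted separately so they can be sliced by category.
        metrics.add("integration.run.failures", {**dims, "category": failure_category})
```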

Traces

  • OpenTelemetry spans wrap every outbound vendor call and inbound webhook processing stage; traceId/correlationId propagate from the event envelope into each span and onward into vendor request headers where supported.
  • A single trace links: agent decision → ExecuteIntegrationRun → vendor HTTP span → IntegrationRunCompleted → downstream factory consumer.
  • Inbound traces link: vendor webhook → signature verify span → normalise span → WebhookDelivered → Knowledge Platform ingestion.
  • Spans are exported to Application Insights with provider, operation, attempt, and outcome attributes (never secrets or full payloads).
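Propagation into vendor request headers could look like the sketch below, which builds a W3C Trace Context traceparent header. This page does not state which header format the platform uses, so W3C Trace Context, the function name outbound_headers, and the X-Correlation-Id header name are all assumptions; the sketch also assumes the envelope traceId is a 32-character lowercase hex string.

```python
import re

_TRACE_ID = re.compile(r"^[0-9a-f]{32}$")
_SPAN_ID = re.compile(r"^[0-9a-f]{16}$")


def outbound_headers(trace_id: str, span_id: str, correlation_id: str,
                     sampled: bool = True) -> dict:
    """Build propagation headers for a vendor HTTP request."""
    # W3C Trace Context rejects malformed and all-zero identifiers.
    if not _TRACE_ID.match(trace_id) or trace_id == "0" * 32:
        raise ValueError("trace_id must be 32 lowercase hex chars and not all zeros")
    if not _SPAN_ID.match(span_id) or span_id == "0" * 16:
        raise ValueError("span_id must be 16 lowercase hex chars and not all zeros")
    flags = "01" if sampled else "00"
    return {
        # Format: version-traceid-parentid-flags
        "traceparent": f"00-{trace_id}-{span_id}-{flags}",
        # Hypothetical header name for the envelope correlationId.
        "X-Correlation-Id": correlation_id,
    }
```

In practice an OpenTelemetry SDK propagator would normally inject these headers; the manual version is shown only to make the wire format visible.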

Dashboards

  • Vendor Health — health status, latency, and error rate per provider/connection; open circuits highlighted.
  • Integration Throughput — run and delivery volume, success rate, and latency percentiles by provider and tenant.
  • Failure Analytics — failures by category and provider over time; retry success rate; dead-letter trends.
  • Credential Lifecycle — upcoming expirations, rotation success/failure, and revocations.
  • Rate & Cost — remaining rate-limit budget and per-tenant usage against quotas.

Alerts

| Alert | Condition | Action |
|---|---|---|
| Vendor outage | external.health.status Unhealthy for N intervals | Open circuit; page on-call; notify dependent platforms |
| Elevated failure rate | integration.run.failures rate > threshold per provider | Investigate; possible vendor incident |
| Dead-letter surge | webhook.deadletter.count spike | Inspect poison payloads; pause/replay after fix |
| Auth failures | Auth-category failures > threshold | Trigger credential rotation; review credential validity |
| Credential expiring | credential.expiry.days below warning window | Rotate ahead of expiry |
| Rate-limit exhaustion | integration.ratelimit.remaining near zero | Backpressure; review tenant quota |

Alerts and incidents are forwarded to the Observability & Feedback Platform, where they become runtime memory attributable back to the producing connection and trace.
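Two of the alert conditions, "Unhealthy for N intervals" and "below warning window", can be expressed as small predicates. This is a sketch only: the function names are hypothetical, the values of N and the warning window are left to the caller because this page does not specify them, and the value 0 for Unhealthy is taken from the external.health.status encoding in the metrics table.

```python
UNHEALTHY = 0  # external.health.status encoding: 0 = Unhealthy


def sustained_unhealthy(samples: list, n: int) -> bool:
    """True when the most recent n health samples are all Unhealthy."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Fewer than n samples means the condition cannot yet be confirmed.
    return len(samples) >= n and all(s == UNHEALTHY for s in samples[-n:])


def credential_expiring(days_until_expiry: float, warning_window_days: float) -> bool:
    """True when credential.expiry.days has dropped below the warning window."""
    return days_until_expiry < warning_window_days
```

Requiring n consecutive samples, rather than alerting on a single Unhealthy reading, is what keeps a one-off probe failure from paging on-call.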

Required Dimensions

Every log, metric, and trace MUST carry: tenantId, traceId, correlationId, providerId, and (where applicable) connectionId, runId / deliveryId, operation, outcome, and failure category. These dimensions are what make external behaviour sliceable by tenant, provider, and operation, and are mandatory per the factory metadata schema.
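A check for the mandatory dimensions might look like the following sketch. The function name and the decision to treat empty strings as missing are assumptions; the dimension names are the ones listed above.

```python
# Dimensions that every log, metric, and trace must carry.
REQUIRED_DIMENSIONS = ("tenantId", "traceId", "correlationId", "providerId")


def missing_dimensions(dims: dict) -> list:
    """Return the required dimension names that are absent or empty."""
    return [name for name in REQUIRED_DIMENSIONS if not dims.get(name)]
```

A telemetry pipeline could call this before emitting a signal and reject, or at least flag, any signal for which the returned list is non-empty.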

External API Health Monitoring

The ExternalApiHealthWorker continuously probes vendor endpoints and writes per-connection health that gates routing:

  • Probing — lightweight, provider-appropriate health checks at a configurable interval per connection.
  • Health write — latest-wins external.health.status with latency; sustained failure opens a circuit so dependent services fail fast.
  • Recovery — half-open probing restores routing once the vendor recovers; recovery emits an event so paused work resumes.
  • Correlation — health degradation links to the IntegrationFailure records observed in the same window for root-cause analysis.