Resilience, DR & Business Continuity
This page defines how the ConnectSoft AI Software Factory stays available through failures and recovers from disasters. It is the final-state architectural view. The Failure Modes & Recovery runtime page describes the concrete run- and job-level mechanics, and each template's resiliency.md page (for example, the microservice template's resiliency page) documents what every service generated from that template inherits.
Target Architecture — Final-State Design
Resilience is designed in, not bolted on. Every platform service and every generated SaaS service inherits the same resilience primitives from ConnectSoft.Extensions.* libraries and the templates, so behavior is uniform across the factory and its products.
Failure domains
```mermaid
flowchart TB
    subgraph region [Azure Region]
        subgraph az1 [Availability Zone 1]
            n1["Service replicas"]
        end
        subgraph az2 [Availability Zone 2]
            n2["Service replicas"]
        end
        subgraph az3 [Availability Zone 3]
            n3["Service replicas"]
        end
        bus["Service Bus (zone-redundant)"]
        sql["Azure SQL / PostgreSQL (zone-redundant)"]
        blob["Blob (GRS)"]
    end
    secondary["Secondary Region<br/>(DR target)"]
    n1 --> bus
    n2 --> bus
    n3 --> bus
    bus --> sql
    sql -.->|geo-replication| secondary
    blob -.->|geo-redundant| secondary
```
Failures are isolated by domain (a pod, a node, an availability zone, a dependency, or, in the worst case, a region) so that a fault in one domain does not take down the whole platform.
Resilience patterns
The following patterns are applied uniformly via shared libraries and the templates:
| Pattern | Purpose | Where it lives |
|---|---|---|
| Retry with exponential backoff + jitter | Absorb transient faults without thundering herd | ConnectSoft.Extensions.Resilience (Polly-based), worker pipelines |
| Circuit breaker | Stop calling a failing dependency, fail fast | HTTP/gRPC clients, LLM clients |
| Timeout | Bound every outbound call | All clients |
| Bulkhead / concurrency limits | Isolate resource pools so one dependency can't starve others | Service hosts, agent runtime |
| Idempotency | Safe retries and replay | Dedup on eventId per the Event Envelope |
| Dead-letter + replay | Preserve poison messages for inspection and reprocessing | Service Bus DLQ subqueues |
| Graceful degradation / fallback | Degrade to deterministic flow when agents/LLMs fail | Agent runtime fallback to template-based generation |
| Health probes | Remove unhealthy instances from rotation | ConnectSoft.Extensions.Diagnostics.HealthChecks |
| Saga compensation | Undo partial work on critical failure | Control Plane workflows / coordinators |
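
To make the first row concrete, here is a minimal sketch of retry with exponential backoff and full jitter. It is written in Python for brevity and is not the Polly-based ConnectSoft.Extensions.Resilience implementation; `TransientError`, `backoff_delay`, `retry`, and the default delay values are hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, throttling)."""


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(operation: Callable[[], T], max_attempts: int = 4,
          sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> T:
    """Call operation, retrying transient faults with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                # Retries exhausted: surface the fault so the caller can
                # open a circuit breaker, fall back, or dead-letter.
                raise
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

Randomizing each delay spreads retries from many callers over time, which is what prevents the thundering herd named in the table. Passing `sleep` and `rng` as parameters is a design choice for the sketch so that it can be exercised without real waiting.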
Recovery flow
```mermaid
flowchart TD
    fault["Fault detected"]
    classify{"Transient?"}
    retry["Retry with backoff"]
    breaker["Circuit breaker opens"]
    fallback["Fallback / degrade"]
    dlq["Dead-letter + alert"]
    resume["Resume from checkpoint"]
    compensate["Compensating workflow"]
    recovered["Recovered"]
    fault --> classify
    classify -->|yes| retry
    retry -->|success| recovered
    retry -->|exhausted| breaker
    breaker --> fallback
    fallback -->|critical| compensate
    fallback -->|non-critical| dlq
    classify -->|no| dlq
    dlq --> resume
    compensate --> recovered
    resume --> recovered
```
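
The "circuit breaker opens" and "fallback / degrade" steps in the flow can be sketched as a small state machine. This is an illustrative Python sketch under assumed thresholds, not the platform's client code; the `CircuitBreaker` class, its parameters, and the half-open behavior shown here are assumptions for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures, then fails fast until a cool-down passes."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: do not touch the dependency, degrade immediately.
                return fallback()
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            half_open_trial_failed = self.opened_at is not None
            if half_open_trial_failed or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers get the fallback result without waiting on a dependency that is known to be failing, which is the "fail fast" purpose listed for this pattern.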
State is persisted so runs resume from the last successful checkpoint rather than restarting; see State & Memory.
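
A minimal sketch of checkpoint-based resume follows. It assumes a run is an ordered list of named steps and uses a local JSON file as the checkpoint store purely for illustration; the actual persistence mechanism is defined on the State & Memory page, and `run_with_checkpoints` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: Sequence[Step], checkpoint_file: Path) -> List[str]:
    """Run steps in order, skipping any already recorded as completed."""
    done: List[str] = []
    if checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())
    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier attempt
        action()  # may raise; the checkpoint still holds the last successful step
        done.append(name)
        checkpoint_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

If a step raises, calling the function again with the same checkpoint file skips the steps that already succeeded and continues from the failed one. This only works safely when each step is idempotent, since a crash between `action()` and the checkpoint write causes that step to run again.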
RTO / RPO targets
| Scope | RTO (recovery time) | RPO (data loss) | Strategy |
|---|---|---|---|
| Single pod / node failure | Seconds | Zero | Replica reschedule, health probes |
| Availability-zone failure | < 1 min | Zero | Zone-redundant compute + data |
| Stateful store failure | Minutes | Near-zero | Zone-redundant Azure SQL/PostgreSQL, automatic failover |
| Regional disaster (factory) | < 4 hours | < 15 min | Geo-replicated data, IaC re-provisioning in secondary region |
| Generated prod SaaS | Per edition SLA | Per edition SLA | Inherited DR posture, per-tenant configuration |
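
As a worked example of how these targets can be checked after a DR drill, the sketch below computes achieved RTO and RPO from three timestamps and compares them with the regional-disaster row. The `DrillResult` fields and the `evaluate` function are hypothetical; only the 4-hour and 15-minute limits come from the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Target:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost writes


@dataclass(frozen=True)
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_replicated_write: datetime  # newest write present in the secondary


REGIONAL_DISASTER = Target(rto=timedelta(hours=4), rpo=timedelta(minutes=15))


def evaluate(drill: DrillResult, target: Target) -> dict:
    """Return achieved RTO/RPO and whether each is under its target."""
    achieved_rto = drill.service_restored - drill.outage_start
    achieved_rpo = drill.outage_start - drill.last_replicated_write
    return {
        "rto": achieved_rto,
        "rpo": achieved_rpo,
        "rto_met": achieved_rto < target.rto,
        "rpo_met": achieved_rpo < target.rpo,
    }
```

For a drill with the outage at 12:00, service restored at 14:30, and the last replicated write at 11:52, the achieved RTO is 2 h 30 min and the achieved RPO is 8 min, so both targets are met.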
Backup & restore
| Data | Backup mechanism | Restore approach |
|---|---|---|
| Transactional metadata (Azure SQL / PostgreSQL) | Automated PITR backups + geo-replication | Point-in-time restore or geo-failover |
| Artifacts (Blob) | Geo-redundant storage + version history | Restore prior version / secondary region |
| Source of truth (Git / Azure DevOps) | Distributed by nature + repo backups | Re-clone / restore repository |
| Vector memory (Qdrant) | Snapshot backups; rebuildable from artifacts | Restore snapshot or re-embed from source artifacts |
| Hot cache (Redis) | Not backed up (ephemeral) | Rebuild from durable stores on restart |
| Secrets (Key Vault) | Soft-delete + purge protection | Recover deleted secrets/keys |
| Telemetry (App Insights) | Retention windows | Not restored; rolling retention by policy |
Backups are verified by periodic restore drills, and recovery from artifacts/Git is preferred over relying solely on derived stores (vector, cache).
Multi-region & business continuity
- Active-passive DR: the factory runs active in a primary region with data geo-replicated to a secondary region; infrastructure in the secondary region is re-provisioned from Pulumi programs by the DevOps / GitOps IaCProvisioningService.
- Stateless services are reconstructable anywhere from container images in the registry; state comes from geo-replicated stores and Git.
- Continuity of generated SaaS: products inherit the same DR posture through the ConnectSoft.Saas.* templates and are bound per tenant via RuntimeTenantBinding (see Runtime Cloud).
- Runbooks for failover, restore, and DR drills live with each template (runbook pages) and platform deployment.md pages.
Chaos engineering
Resilience is validated, not assumed. The Resiliency & Chaos Engineer Agent injects controlled faults (latency, dependency failure, zone loss, message poisoning) against staging environments and feeds results back through the Observability & Feedback Platform. Findings that change a resilience target are recorded as an ADR.
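
Two of the fault types named above, latency and dependency failure, can be illustrated with a small wrapper around an outbound call. This is a sketch of the general technique, not the Resiliency & Chaos Engineer Agent's mechanism; `with_chaos`, `InjectedFault`, and the parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_chaos(operation: Callable[[], T], *, failure_rate: float = 0.0,
               added_latency: float = 0.0,
               rng: Callable[[], float] = random.random,
               sleep: Callable[[float], None] = time.sleep) -> Callable[[], T]:
    """Wrap operation so calls are delayed and fail with the given probability."""
    def wrapped() -> T:
        if added_latency > 0:
            sleep(added_latency)  # simulate a slow dependency
        if rng() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return operation()
    return wrapped
```

Wrapping a client call this way in a staging environment lets an experiment confirm that timeouts, retries, and fallbacks behave as designed. With `failure_rate=0.0` and `added_latency=0.0` the wrapper is a pass-through.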
Pillar alignment
- Traceability: every failure, retry, and compensation emits an event in the canonical envelope correlated by traceId.
- Reusability: all patterns ship in shared libraries and templates, so resilience is inherited, not re-implemented.
- Autonomy: chaos, recovery, and DR drills are agent-driven.
- Governance: DR posture, RTO/RPO, and risk acceptances are policy-governed and ADR-recorded.
- Observability: failure signals drive alerts, incidents, and the improvement loop.
- Multi-tenant scale: isolation holds during failover; one tenant's incident does not compromise another.