Resilience, DR & Business Continuity

This page defines how the ConnectSoft AI Software Factory stays available through failures and recovers from disasters. It is the final-state architectural view; the Failure Modes & Recovery runtime page describes the concrete run/job-level mechanics, and each template's resiliency.md page (for example the microservice template resiliency) documents what every generated service inherits.

Target Architecture — Final-State Design

Resilience is designed in, not bolted on. Every platform service and every generated SaaS service inherits the same resilience primitives from ConnectSoft.Extensions.* libraries and the templates, so behavior is uniform across the factory and its products.

Failure domains

flowchart TB
    subgraph region [Azure Region]
        subgraph az1 [Availability Zone 1]
            n1["Service replicas"]
        end
        subgraph az2 [Availability Zone 2]
            n2["Service replicas"]
        end
        subgraph az3 [Availability Zone 3]
            n3["Service replicas"]
        end
        bus["Service Bus (zone-redundant)"]
        sql["Azure SQL / PostgreSQL (zone-redundant)"]
        blob["Blob (GRS)"]
    end
    secondary["Secondary Region<br/>(DR target)"]

    n1 --> bus
    n2 --> bus
    n3 --> bus
    bus --> sql
    sql -.->|geo-replication| secondary
    blob -.->|geo-redundant| secondary


Failures are isolated by domain (a pod, a node, an availability zone, a dependency, or, in the worst case, a region) so that a single fault does not take down the platform.

Resilience patterns

Applied uniformly via shared libraries and the templates:

| Pattern | Purpose | Where it lives |
| --- | --- | --- |
| Retry with exponential backoff + jitter | Absorb transient faults without thundering herd | `ConnectSoft.Extensions.Resilience` (Polly-based), worker pipelines |
| Circuit breaker | Stop calling a failing dependency, fail fast | HTTP/gRPC clients, LLM clients |
| Timeout | Bound every outbound call | All clients |
| Bulkhead / concurrency limits | Isolate resource pools so one dependency can't starve others | Service hosts, agent runtime |
| Idempotency | Safe retries and replay | Dedup on `eventId` per the Event Envelope |
| Dead-letter + replay | Preserve poison messages for inspection and reprocessing | Service Bus DLQ subqueues |
| Graceful degradation / fallback | Degrade to deterministic flow when agents/LLMs fail | Agent runtime fallback to template-based generation |
| Health probes | Remove unhealthy instances from rotation | `ConnectSoft.Extensions.Diagnostics.HealthChecks` |
| Saga compensation | Undo partial work on critical failure | Control Plane workflows / coordinators |
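
The first row of the table can be illustrated with a minimal Python sketch of retry with exponential backoff and "full jitter". This is not the `ConnectSoft.Extensions.Resilience` implementation (which is Polly-based); the names `TransientError`, `backoff_delays`, and `call_with_retry`, and the default values for `base`, `cap`, and `max_attempts`, are assumptions made for illustration only.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def backoff_delays(max_attempts, base=0.2, cap=5.0, rng=random.random):
    """Yield one delay per retry, drawn uniformly from [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        # Randomising the whole interval spreads retries from many callers
        # over time, which is what avoids a thundering herd.
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_attempts=4, base=0.2, cap=5.0, sleep=time.sleep):
    """Run operation; retry on TransientError until attempts are exhausted."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # exhausted: a breaker or fallback would take over here
            sleep(next(delays))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; non-transient exceptions propagate immediately and are never retried.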

Recovery flow

flowchart TD
    fault["Fault detected"]
    classify{"Transient?"}
    retry["Retry with backoff"]
    breaker["Circuit breaker opens"]
    fallback["Fallback / degrade"]
    dlq["Dead-letter + alert"]
    resume["Resume from checkpoint"]
    compensate["Compensating workflow"]
    recovered["Recovered"]

    fault --> classify
    classify -->|yes| retry
    retry -->|success| recovered
    retry -->|exhausted| breaker
    breaker --> fallback
    fallback -->|critical| compensate
    fallback -->|non-critical| dlq
    classify -->|no| dlq
    dlq --> resume
    compensate --> recovered
    resume --> recovered
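
The branching in the flowchart above can be restated as a small pure function, which makes each path easy to check. This is only a sketch: the function name `recovery_path` and its three boolean inputs are hypothetical, and the returned step names simply reuse the node identifiers from the diagram.

```python
def recovery_path(transient, retry_succeeds, critical):
    """Return the ordered diagram nodes visited after a fault is detected."""
    path = ["fault", "classify"]
    if not transient:
        # Non-transient faults skip retry and go straight to dead-letter.
        return path + ["dlq", "resume", "recovered"]
    path.append("retry")
    if retry_succeeds:
        return path + ["recovered"]
    # Retries exhausted: the breaker opens and the fallback decides the branch.
    path += ["breaker", "fallback"]
    if critical:
        return path + ["compensate", "recovered"]
    return path + ["dlq", "resume", "recovered"]
```

For example, `recovery_path(True, False, True)` walks through `retry`, `breaker`, and `fallback` before ending in `compensate` and `recovered`.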


State is persisted so runs resume from the last successful checkpoint rather than restarting; see State & Memory.
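
A minimal sketch of checkpoint-based resume follows, assuming a run is an ordered list of named steps and the checkpoint is a JSON file listing completed step names. The function `run_with_checkpoints` and the file layout are illustrative assumptions, not the factory's actual state store.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, action) steps in order, skipping steps already checkpointed.

    Returns the names of the steps executed in this invocation.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier attempt
        action()  # an exception here leaves the checkpoint at the last success
        done.append(name)
        # Write to a temp file and swap it in, so a crash mid-write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
        executed.append(name)
    return executed
```

If the second of two steps fails, calling the function again with the same checkpoint path re-runs only that second step. Steps should still be idempotent, because a crash between `action()` and the checkpoint write causes that step to run again.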

RTO / RPO targets

| Scope | RTO (recovery time) | RPO (data loss) | Strategy |
| --- | --- | --- | --- |
| Single pod / node failure | Seconds | Zero | Replica reschedule, health probes |
| Availability-zone failure | < 1 min | Zero | Zone-redundant compute + data |
| Stateful store failure | Minutes | Near-zero | Zone-redundant Azure SQL/PostgreSQL, automatic failover |
| Regional disaster (factory) | < 4 hours | < 15 min | Geo-replicated data, IaC re-provisioning in secondary region |
| Generated prod SaaS | Per edition SLA | Per edition SLA | Inherited DR posture, per-tenant configuration |
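
As an illustration of how a numeric row could be checked after a DR drill, the sketch below encodes the regional-disaster targets (RTO < 4 hours, RPO < 15 min) and compares them against measured values. The `Target` type, the `meets_target` function, and the idea of feeding it drill measurements are assumptions for illustration, not part of the platform.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Target:
    rto: timedelta  # maximum tolerated time to recover
    rpo: timedelta  # maximum tolerated window of lost data


# Values taken from the "Regional disaster (factory)" row.
REGIONAL_DISASTER = Target(rto=timedelta(hours=4), rpo=timedelta(minutes=15))


def meets_target(target, recovery_time, data_loss_window):
    """True when both measurements are strictly below the target bounds."""
    return recovery_time < target.rto and data_loss_window < target.rpo
```

A drill that recovers in 3 hours 10 minutes with 4 minutes of replication lag passes; one with 20 minutes of lag fails on RPO even if recovery is fast.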

Backup & restore

| Data | Backup mechanism | Restore approach |
| --- | --- | --- |
| Transactional metadata (Azure SQL / PostgreSQL) | Automated PITR backups + geo-replication | Point-in-time restore or geo-failover |
| Artifacts (Blob) | Geo-redundant storage + version history | Restore prior version / secondary region |
| Source of truth (Git / Azure DevOps) | Distributed by nature + repo backups | Re-clone / restore repository |
| Vector memory (Qdrant) | Snapshot backups; rebuildable from artifacts | Restore snapshot or re-embed from source artifacts |
| Hot cache (Redis) | Not backed up (ephemeral) | Rebuild from durable stores on restart |
| Secrets (Key Vault) | Soft-delete + purge protection | Recover deleted secrets/keys |
| Telemetry (App Insights) | Retention windows | Not restored; rolling retention by policy |

Backups are verified by periodic restore drills, and recovery from artifacts/Git is preferred over relying solely on derived stores (vector, cache).

Multi-region & business continuity

Active-passive DR: the factory runs active in a primary region with data geo-replicated to a secondary region; infrastructure in the secondary region is re-provisioned from Pulumi programs by the DevOps / GitOps IaCProvisioningService.
Stateless services are reconstructable anywhere from container images in the registry; state comes from geo-replicated stores and Git.
Continuity of generated SaaS: products inherit the same DR posture through the ConnectSoft.Saas.* templates and are bound per tenant via RuntimeTenantBinding (see Runtime Cloud).
Runbooks for failover, restore, and DR drills live with each template (runbook pages) and in the platform deployment.md pages.

Chaos engineering

Resilience is validated, not assumed. The Resiliency & Chaos Engineer Agent injects controlled faults (latency, dependency failure, zone loss, message poisoning) against staging environments and feeds results back through the Observability & Feedback Platform. Findings that change a resilience target are recorded as an ADR.
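
Two of the fault types named above, latency and dependency failure, can be sketched as a wrapper around any outbound call. This is a generic illustration rather than the Resiliency & Chaos Engineer Agent's mechanism; `with_chaos`, `InjectedFault`, and the parameters `failure_rate` and `added_latency` are hypothetical names.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error during a chaos experiment."""


def with_chaos(call, failure_rate=0.0, added_latency=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap call so it is delayed and, with the given probability, fails."""
    def wrapped(*args, **kwargs):
        if added_latency > 0:
            sleep(added_latency)  # simulated slow dependency
        if rng() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped
```

With `failure_rate=0.0` and `added_latency=0.0` the wrapper is a pass-through, so the same code path can be used in environments where no faults should be injected.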

Pillar alignment

Traceability — every failure, retry, and compensation emits an event in the canonical envelope correlated by traceId.
Reusability — all patterns ship in shared libraries and templates, so resilience is inherited, not re-implemented.
Autonomy — chaos, recovery, and DR drills are agent-driven.
Governance — DR posture, RTO/RPO, and risk acceptances are policy-governed and ADR-recorded.
Observability — failure signals drive alerts, incidents, and the improvement loop.
Multi-tenant scale — isolation holds during failover; one tenant's incident does not compromise another.
