Resilience, DR & Business Continuity

This page defines how the ConnectSoft AI Software Factory stays available through failures and recovers from disasters. It is the final-state architectural view. The Failure Modes & Recovery runtime page describes the concrete run/job-level mechanics, and each template's resiliency.md page (for example, the microservice template's resiliency page) documents what every generated service inherits.

Target Architecture — Final-State Design

Resilience is designed in, not bolted on. Every platform service and every generated SaaS service inherits the same resilience primitives from ConnectSoft.Extensions.* libraries and the templates, so behavior is uniform across the factory and its products.

Failure domains

flowchart TB
    subgraph region [Azure Region]
        subgraph az1 [Availability Zone 1]
            n1["Service replicas"]
        end
        subgraph az2 [Availability Zone 2]
            n2["Service replicas"]
        end
        subgraph az3 [Availability Zone 3]
            n3["Service replicas"]
        end
        bus["Service Bus (zone-redundant)"]
        sql["Azure SQL / PostgreSQL (zone-redundant)"]
        blob["Blob (GRS)"]
    end
    secondary["Secondary Region<br/>(DR target)"]

    n1 --> bus
    n2 --> bus
    n3 --> bus
    bus --> sql
    sql -.->|geo-replication| secondary
    blob -.->|geo-redundant| secondary

Failures are isolated by domain (a pod, a node, an availability zone, a dependency, or, in the worst case, a region) so that a single fault is contained within its domain and does not take down the platform.

Resilience patterns

Applied uniformly via shared libraries and the templates:

| Pattern | Purpose | Where it lives |
|---|---|---|
| Retry with exponential backoff + jitter | Absorb transient faults without thundering herd | ConnectSoft.Extensions.Resilience (Polly-based), worker pipelines |
| Circuit breaker | Stop calling a failing dependency, fail fast | HTTP/gRPC clients, LLM clients |
| Timeout | Bound every outbound call | All clients |
| Bulkhead / concurrency limits | Isolate resource pools so one dependency can't starve others | Service hosts, agent runtime |
| Idempotency | Safe retries and replay | Dedup on eventId per the Event Envelope |
| Dead-letter + replay | Preserve poison messages for inspection and reprocessing | Service Bus DLQ subqueues |
| Graceful degradation / fallback | Degrade to deterministic flow when agents/LLMs fail | Agent runtime fallback to template-based generation |
| Health probes | Remove unhealthy instances from rotation | ConnectSoft.Extensions.Diagnostics.HealthChecks |
| Saga compensation | Undo partial work on critical failure | Control Plane workflows / coordinators |
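
To make the first pattern concrete, here is a minimal Python sketch of retry with exponential backoff and full jitter. It is an illustration only, not the ConnectSoft.Extensions.Resilience implementation (which is Polly-based): `TransientError`, `backoff_delay`, `retry`, and the default values for `base`, `cap`, and `max_attempts` are all hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)].

    Randomising the whole interval spreads retries from many callers,
    which is what avoids a thundering herd after a shared outage.
    """
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry(operation: Callable[[], T], max_attempts: int = 5,
          base: float = 0.2, cap: float = 10.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying only on TransientError.

    The last failure is re-raised once `max_attempts` is exhausted, so the
    caller (for example a circuit breaker) can react to it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the policy can be exercised in tests without real waiting. Non-transient exceptions propagate immediately, which matches the intent of routing them to dead-lettering rather than retrying.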

Recovery flow

flowchart TD
    fault["Fault detected"]
    classify{"Transient?"}
    retry["Retry with backoff"]
    breaker["Circuit breaker opens"]
    fallback["Fallback / degrade"]
    dlq["Dead-letter + alert"]
    resume["Resume from checkpoint"]
    compensate["Compensating workflow"]
    recovered["Recovered"]

    fault --> classify
    classify -->|yes| retry
    retry -->|success| recovered
    retry -->|exhausted| breaker
    breaker --> fallback
    fallback -->|critical| compensate
    fallback -->|non-critical| dlq
    classify -->|no| dlq
    dlq --> resume
    compensate --> recovered
    resume --> recovered

State is persisted so runs resume from the last successful checkpoint rather than restarting; see State & Memory.
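
The checkpoint behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions: the JSON file stands in for whatever durable store the platform actually uses (see State & Memory), and `run_with_checkpoints` and the step names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: Sequence[Step], checkpoint_file: Path) -> List[str]:
    """Run steps in order, skipping any already recorded as complete.

    The checkpoint is written after each successful step, so a crash in
    step N leaves steps 1..N-1 recorded and the next run starts at N.
    Returns the names of the steps executed during this invocation.
    """
    done: List[str] = []
    if checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())

    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier attempt
        action()  # an exception here leaves the checkpoint untouched
        done.append(name)
        checkpoint_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

Because a step can fail after doing its work but before the checkpoint is written, each step should itself be idempotent, which is why this pattern is paired with deduplication on eventId.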

RTO / RPO targets

| Scope | RTO (recovery time) | RPO (data loss) | Strategy |
|---|---|---|---|
| Single pod / node failure | Seconds | Zero | Replica reschedule, health probes |
| Availability-zone failure | < 1 min | Zero | Zone-redundant compute + data |
| Stateful store failure | Minutes | Near-zero | Zone-redundant Azure SQL/PostgreSQL, automatic failover |
| Regional disaster (factory) | < 4 hours | < 15 min | Geo-replicated data, IaC re-provisioning in secondary region |
| Generated prod SaaS | Per edition SLA | Per edition SLA | Inherited DR posture, per-tenant configuration |
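
As a worked example of how these targets might be checked after a DR drill, the sketch below compares measured timestamps against the regional-disaster row (RTO < 4 hours, RPO < 15 min). The `Target` and `DrillResult` types and their field names are assumptions made for illustration; the source does not describe how drill results are recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Target:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost writes


@dataclass(frozen=True)
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_replicated_write: datetime  # newest write present in the secondary

    @property
    def recovery_time(self) -> timedelta:
        return self.service_restored - self.outage_start

    @property
    def data_loss_window(self) -> timedelta:
        return self.outage_start - self.last_replicated_write


# Values taken from the "Regional disaster (factory)" row above.
REGIONAL_DISASTER = Target(rto=timedelta(hours=4), rpo=timedelta(minutes=15))


def meets_target(result: DrillResult, target: Target) -> bool:
    """True when both measurements are strictly inside the target."""
    return (result.recovery_time < target.rto
            and result.data_loss_window < target.rpo)
```

For instance, a drill that restores service 3 hours 10 minutes after the outage with a 4-minute replication gap meets the target, while a 5-hour recovery does not.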

Backup & restore

| Data | Backup mechanism | Restore approach |
|---|---|---|
| Transactional metadata (Azure SQL / PostgreSQL) | Automated PITR backups + geo-replication | Point-in-time restore or geo-failover |
| Artifacts (Blob) | Geo-redundant storage + version history | Restore prior version / secondary region |
| Source of truth (Git / Azure DevOps) | Distributed by nature + repo backups | Re-clone / restore repository |
| Vector memory (Qdrant) | Snapshot backups; rebuildable from artifacts | Restore snapshot or re-embed from source artifacts |
| Hot cache (Redis) | Not backed up (ephemeral) | Rebuild from durable stores on restart |
| Secrets (Key Vault) | Soft-delete + purge protection | Recover deleted secrets/keys |
| Telemetry (App Insights) | Retention windows | Not restored; rolling retention by policy |

Backups are verified by periodic restore drills, and recovery from artifacts/Git is preferred over relying solely on derived stores (vector, cache).

Multi-region & business continuity

  • Active-passive DR: the factory runs active in a primary region with data geo-replicated to a secondary region; infrastructure in the secondary region is re-provisioned from Pulumi programs by the DevOps / GitOps IaCProvisioningService.
  • Stateless services are reconstructable anywhere from container images in the registry; state comes from geo-replicated stores and Git.
  • Continuity of generated SaaS: products inherit the same DR posture through the ConnectSoft.Saas.* templates and are bound per tenant via RuntimeTenantBinding (see Runtime Cloud).
  • Runbooks for failover, restore, and DR drills live alongside each template (on its runbook pages) and on the platform deployment.md pages.

Chaos engineering

Resilience is validated, not assumed. The Resiliency & Chaos Engineer Agent injects controlled faults (latency, dependency failure, zone loss, message poisoning) against staging environments and feeds results back through the Observability & Feedback Platform. Findings that change a resilience target are recorded as an ADR.
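
To illustrate the kind of controlled fault described above, here is a minimal Python sketch of a wrapper that injects latency and dependency failures around a call. It is not the Resiliency & Chaos Engineer Agent's implementation: `with_chaos`, `InjectedFault`, and the parameters `failure_rate` and `added_latency` are hypothetical names introduced for the example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_chaos(call: Callable[..., Any],
               failure_rate: float = 0.0,
               added_latency: float = 0.0,
               rng: Optional[random.Random] = None,
               sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap `call` so it is delayed and sometimes fails.

    failure_rate:  probability in [0, 1] that a call raises InjectedFault.
    added_latency: seconds of delay added before every call.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    source = rng if rng is not None else random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if added_latency > 0:
            sleep(added_latency)
        # random() returns a value in [0, 1), so 0.0 never fires and 1.0 always does.
        if source.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)

    return wrapped
```

Passing a seeded `random.Random` makes an experiment reproducible, which helps when a finding needs to be replayed before it is recorded. A wrapper like this would be enabled only in staging, in line with the scope stated above.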

Pillar alignment

  • Traceability — every failure, retry, and compensation emits an event in the canonical envelope correlated by traceId.
  • Reusability — all patterns ship in shared libraries and templates, so resilience is inherited, not re-implemented.
  • Autonomy — chaos, recovery, and DR drills are agent-driven.
  • Governance — DR posture, RTO/RPO, and risk acceptances are policy-governed and ADR-recorded.
  • Observability — failure signals drive alerts, incidents, and the improvement loop.
  • Multi-tenant scale — isolation holds during failover; one tenant's incident does not compromise another.