Runtime & Control Plane Overview

Purpose

This section covers the operational architecture of the ConnectSoft AI Software Factory: the runtime view of how the Factory executes jobs, manages state, handles failures, and provides observability.

While other Factory documentation focuses on conceptual architecture, agent capabilities, and workflows, this section explains:

How the Factory runs jobs — from request to completion
How control plane and data plane are separated — enabling scalability and reliability
How agents, queues, state, and observability fit together — the operational mechanics

This documentation is intended for:

Architects designing Factory integrations and extensions
Engineers implementing Factory services and workers
Operations teams running and monitoring the Factory
SREs ensuring reliability and performance

Business Context

For business-oriented documentation covering reliability, scalability, cost, and operational excellence, see the Business Runtime Overview in the Company Documentation repository.

Big Picture: Control Plane vs Data Plane

The Factory runtime separates the control plane from the data plane, a pattern common in cloud-native systems that enables independent scaling, isolation, and reliability.

graph TD
    subgraph ControlPlane["Control Plane"]
        API[Factory API<br/>GraphQL / gRPC / REST]
        Orchestrator[Orchestrator Service]
        Scheduler[Schedulers]
        RunDB[Run State DB]
        Queue[Job Queue / Bus]
        Audit[Audit Sink]
    end

    subgraph DataPlane["Data Plane"]
        RepoGen[Repo Generator<br/>Workers]
        PipelineGen[Pipeline Generator<br/>Workers]
        SaaSGen[SaaS Scaffolder<br/>Workers]
        DevOps[Azure DevOps<br/>Git / CI Integration]
    end

    Client[CLI / UI / External Caller] --> API
    API --> Orchestrator
    Orchestrator --> Scheduler
    Orchestrator --> Queue
    Orchestrator --> RunDB
    Orchestrator --> Audit

    Queue --> RepoGen
    Queue --> PipelineGen
    Queue --> SaaSGen

    RepoGen --> DevOps
    PipelineGen --> DevOps
    SaaSGen --> DevOps

Control Plane vs Data Plane: Summary

Control Plane

The Control Plane is responsible for orchestration, coordination, and governance:

Orchestration — Coordinates agent workflows and manages run lifecycles
Scheduling — Determines when and how jobs are executed
State Management — Tracks run status, step progress, and metadata
Validation — Validates inputs and enforces policies before execution
Audit Logging — Records all actions for compliance and traceability
API Surface — Exposes REST, GraphQL, or gRPC APIs to UI, CLI, and external services

The Control Plane consists of long-running services (orchestrator, schedulers, API gateways) backed by databases (run state store) and message queues (job queue/bus). They are typically deployed as stateful, highly available services with redundancy and failover.
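
The responsibilities above can be sketched as a single submit path: validate, record state, enqueue, audit. This is a minimal in-memory illustration, not the Factory's implementation. The class `ControlPlane`, the method `submit`, the request fields `project` and `jobs`, and the default job type are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ControlPlane:
    """Hypothetical in-memory stand-in for the orchestrator and its stores."""
    run_db: dict = field(default_factory=dict)   # stands in for the Run State DB
    queue: deque = field(default_factory=deque)  # stands in for the Job Queue / Bus
    audit: list = field(default_factory=list)    # stands in for the Audit Sink

    def submit(self, request: dict) -> str:
        # Validation: reject bad input before any work is scheduled.
        if not request.get("project"):
            self.audit.append(("rejected", request))
            raise ValueError("request must name a project")

        run_id = str(uuid4())
        # State management: record the run before enqueuing its jobs.
        self.run_db[run_id] = {"status": "queued", "project": request["project"]}
        # Scheduling: hand work units to the data plane through the queue.
        for job_type in request.get("jobs", ["generate-repo"]):
            self.queue.append({"runId": run_id, "type": job_type})
        # Audit logging: record the accepted action.
        self.audit.append(("accepted", run_id))
        return run_id
```

Note that the control plane never executes a job itself in this sketch; it only decides what should run and records that decision.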

Data Plane

The Data Plane is responsible for execution of work units:

Repository Generation — Creates Git repositories, commits code, manages branches
Pipeline Generation — Generates CI/CD pipelines, infrastructure-as-code
SaaS Scaffolding — Creates microservices, libraries, and application components
External System Integration — Interacts with Azure DevOps, Git providers, cloud services

The Data Plane consists of stateless worker pools that scale horizontally based on workload. Workers consume jobs from queues, execute tasks, and report results back to the Control Plane. Workers can be autoscaled (scale up/down based on queue depth) and are typically deployed in separate namespaces, node pools, or even clusters for isolation.
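
As a rough illustration of the worker model, the sketch below pairs a queue-depth scaling rule with a stateless consume-execute-report loop. The function names, the `jobs_per_worker` ratio, the pool bounds, and the job and result field names are assumptions made for the example; a real deployment would delegate scaling to its platform's autoscaler.

```python
import math
from collections import deque


def desired_workers(queue_depth: int, jobs_per_worker: int = 5,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Size the pool from queue depth, clamped to [min_workers, max_workers]."""
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


def drain(queue: deque, handlers: dict) -> list:
    """A stateless worker loop: take a job, run its handler, report the result.

    The returned list stands in for results reported back to the Control Plane.
    """
    results = []
    while queue:
        job = queue.popleft()
        output = handlers[job["type"]](job)
        results.append({"jobId": job["jobId"], "status": "succeeded", "output": output})
    return results
```

Because the worker keeps nothing between jobs, any number of copies can consume from the same queue, which is what makes horizontal scaling straightforward.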

Key Runtime Concepts

Runs and Jobs

Run — A single execution request (e.g., "generate a microservice for project X")
Job — A discrete work unit within a run (e.g., "generate base repo", "apply overlays", "create pipelines")
Relationship — One run → multiple jobs (sequential or parallel)
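
One way to picture the run-to-job relationship is a run that owns a list of jobs, where each job names the jobs it must wait for. The `Run` and `Job` classes, the `depends_on` field, and the status values below are hypothetical, not the Factory's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    depends_on: tuple = ()   # job_ids that must succeed first
    status: str = "pending"  # pending | succeeded | failed


@dataclass
class Run:
    run_id: str
    jobs: list = field(default_factory=list)

    def ready_jobs(self) -> list:
        """Pending jobs whose dependencies have all succeeded."""
        done = {j.job_id for j in self.jobs if j.status == "succeeded"}
        return [j for j in self.jobs
                if j.status == "pending" and set(j.depends_on) <= done]


run = Run("run-1", [
    Job("generate base repo"),
    Job("apply overlays", depends_on=("generate base repo",)),
    Job("create pipelines", depends_on=("generate base repo",)),
])
```

Here only "generate base repo" is ready at first (sequential); once it succeeds, "apply overlays" and "create pipelines" become ready together and could run in parallel.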

State and Memory

Run State — Operational state stored in Run State DB (run status, step progress, timestamps)
AI Memory — Long-term knowledge stored in Knowledge & Memory System (patterns, historical runs, learnings)
Separation — Operational state (ephemeral, per-run) vs. long-term memory (persistent, cross-project)
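
The separation can be sketched as two unrelated stores with different keys and lifetimes. Both classes below are hypothetical in-memory stand-ins: `RunState` for a record in the Run State DB, `LongTermMemory` for the Knowledge & Memory System, whose real interface is not described in this section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunState:
    """Operational state: scoped to one run, tracks progress and timestamps."""
    run_id: str
    status: str = "running"
    steps: dict = field(default_factory=dict)  # step name -> status
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def record_step(self, step: str, status: str) -> None:
        self.steps[step] = status
        self.updated_at = datetime.now(timezone.utc)


class LongTermMemory:
    """Cross-project store: keeps learnings, not per-run progress."""

    def __init__(self) -> None:
        self._learnings: list = []

    def remember(self, project: str, learning: str) -> None:
        self._learnings.append((project, learning))

    def recall(self, keyword: str) -> list:
        # Plain substring match, used here only to keep the sketch small.
        return [text for _, text in self._learnings if keyword in text]
```

Keeping the two apart means run records can be expired or archived without touching what the Factory has learned across projects.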

Failure and Recovery

Automatic Retry — Transient failures are retried with exponential backoff
Idempotency — Jobs are designed so that a retry does not duplicate side effects already applied
Resume — Runs can resume from the last successful step after failures
Compensation — Failed runs can trigger compensating actions (cleanup, rollback)
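
The four mechanisms above can be combined in a few lines. The sketch below is illustrative only: `TransientError`, `retry_with_backoff`, `run_steps`, the attempt limit, and the base delay are assumptions, and the set of completed step names stands in for progress that would be persisted in the Run State DB.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying TransientError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)


def run_steps(steps, completed, compensate, sleep=time.sleep):
    """Run (name, fn) pairs in order, skipping names already in `completed`.

    Skipping finished steps is what makes a re-run both a resume and a safe
    retry: work that already succeeded is not repeated.
    """
    for name, fn in steps:
        if name in completed:
            continue
        try:
            retry_with_backoff(fn, sleep=sleep)
        except Exception:
            compensate(name)  # e.g. clean up partial output of the failed step
            raise
        completed.add(name)
```

Calling `run_steps` again with the same `completed` set after a failure picks up at the step that failed, which is the resume behavior described above.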

Observability

Traces — Distributed tracing across all services and agents
Metrics — Run success rates, queue depths, execution times, AI token usage
Logs — Structured logs with correlation IDs (runId, jobId, traceId)
Dashboards — Real-time visibility into Factory health, performance, and costs
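
For the logging part, a minimal sketch using only Python's standard `logging` module shows how every log line emitted by a job can carry the same correlation IDs. The logger name `factory.worker` and the helper `job_logger` are hypothetical; the field names `runId`, `jobId`, and `traceId` come from the list above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in ("runId", "jobId", "traceId"):
            payload[key] = getattr(record, key, None)
        return json.dumps(payload)


def job_logger(run_id: str, job_id: str, trace_id: str) -> logging.LoggerAdapter:
    """Return a logger that attaches the same IDs to every message it emits."""
    logger = logging.getLogger("factory.worker")  # hypothetical logger name
    return logging.LoggerAdapter(
        logger, {"runId": run_id, "jobId": job_id, "traceId": trace_id}
    )
```

With a handler configured to use `JsonFormatter`, a call such as `job_logger("run-1", "job-7", "trace-abc").info("repo generated")` produces a single JSON line that can be filtered by any of the three IDs.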

Documentation Structure

This runtime documentation is organized into focused topics:

Control Plane — Deep dive into control plane vs data plane separation, responsibilities, and deployment patterns
Execution Engine — Run lifecycle, queuing model, idempotency, and traceability
State & Memory — Run state store, artifacts, and integration with the Knowledge & Memory System
Failure & Recovery — Failure modes, retry policies, resume capabilities, and compensation patterns
Observability — Traces, metrics, logs, dashboards, and alerts

Factory Architecture

Overall Platform Architecture — High-level Factory architecture and design principles
Orchestration Layer — How orchestration coordinates agents and workflows
Agent System Overview — Multi-agent system design and agent capabilities

Knowledge & Memory

Knowledge and Memory System — How the Factory learns and improves over time
Knowledge Indices — Vector search and semantic retrieval
Knowledge Graph — Graph-based knowledge representation

Business Context

External Documentation