Runtime & Control Plane Overview

Purpose

This section covers the operational architecture of the ConnectSoft AI Software Factory — the runtime view of how the Factory executes jobs, manages state, handles failures, and provides observability.

While other Factory documentation focuses on conceptual architecture, agent capabilities, and workflows, this section explains:

  • How the Factory runs jobs — from request to completion
  • How control plane and data plane are separated — enabling scalability and reliability
  • How agents, queues, state, and observability fit together — the operational mechanics

This documentation is essential for:

  • Architects designing Factory integrations and extensions
  • Engineers implementing Factory services and workers
  • Operations teams running and monitoring the Factory
  • SREs ensuring reliability and performance

Business Context

For business-oriented documentation covering reliability, scalability, cost, and operational excellence, see the Business Runtime Overview in the Company Documentation repository.


Big Picture: Control Plane vs Data Plane

The Factory runtime is architected using a control plane / data plane separation, a pattern common in cloud-native systems that enables independent scaling, isolation, and reliability.

graph TD
    subgraph ControlPlane["Control Plane"]
        API[Factory API<br/>GraphQL / gRPC / REST]
        Orchestrator[Orchestrator Service]
        Scheduler[Schedulers]
        RunDB[Run State DB]
        Queue[Job Queue / Bus]
        Audit[Audit Sink]
    end

    subgraph DataPlane["Data Plane"]
        RepoGen[Repo Generator<br/>Workers]
        PipelineGen[Pipeline Generator<br/>Workers]
        SaaSGen[SaaS Scaffolder<br/>Workers]
        DevOps[Azure DevOps<br/>Git / CI Integration]
    end

    Client[CLI / UI / External Caller] --> API
    API --> Orchestrator
    Orchestrator --> Scheduler
    Orchestrator --> Queue
    Orchestrator --> RunDB
    Orchestrator --> Audit

    Queue --> RepoGen
    Queue --> PipelineGen
    Queue --> SaaSGen

    RepoGen --> DevOps
    PipelineGen --> DevOps
    SaaSGen --> DevOps

Control Plane vs Data Plane: Summary

Control Plane

The Control Plane is responsible for orchestration, coordination, and governance:

  • Orchestration — Coordinates agent workflows and manages run lifecycles
  • Scheduling — Determines when and how jobs are executed
  • State Management — Tracks run status, step progress, and metadata
  • Validation — Validates inputs and enforces policies before execution
  • Audit Logging — Records all actions for compliance and traceability
  • API Surface — Exposes REST, GraphQL, or gRPC APIs to UI, CLI, and external services

The Control Plane consists of long-running services (orchestrator, schedulers, API gateways) backed by databases (run state store) and message queues (job queue/bus). These services are typically deployed as stateful, highly available services with redundancy and failover capabilities.
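
The responsibilities listed above can be sketched as a single request path: validate, record state, enqueue jobs, and audit. This is a minimal in-memory illustration, not the Factory's implementation. `ControlPlane`, `submit_run`, the request fields, and the status strings are all hypothetical names chosen for the example, and a dictionary, a `queue.Queue`, and a list stand in for the Run State DB, the Job Queue, and the Audit Sink.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Toy stand-in for orchestrator + run state DB + job queue + audit sink."""

    run_db: dict = field(default_factory=dict)
    job_queue: queue.Queue = field(default_factory=queue.Queue)
    audit_log: list = field(default_factory=list)

    def submit_run(self, request: dict) -> str:
        # Validation: reject bad input before anything is persisted or queued.
        if not request.get("project"):
            self.audit_log.append(
                {"event": "run_rejected", "reason": "missing project"}
            )
            raise ValueError("request must name a project")

        run_id = str(uuid.uuid4())
        job_types = request.get("jobs", ["generate_repo"])

        # State management: record the run and its jobs before dispatching.
        self.run_db[run_id] = {
            "status": "queued",
            "jobs": {job_type: "pending" for job_type in job_types},
        }

        # Scheduling (simplified): hand each job to the data plane via the queue.
        for job_type in job_types:
            self.job_queue.put({"runId": run_id, "jobType": job_type})

        # Audit logging: every accepted run leaves a trace.
        self.audit_log.append({"event": "run_accepted", "runId": run_id})
        return run_id
```

Writing run state before enqueueing means a worker never picks up a job for a run the control plane has no record of. A real deployment would also need durable storage and a transactional hand-off between the two, which this sketch does not attempt.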

Data Plane

The Data Plane is responsible for execution of work units:

  • Repository Generation — Creates Git repositories, commits code, manages branches
  • Pipeline Generation — Generates CI/CD pipelines, infrastructure-as-code
  • SaaS Scaffolding — Creates microservices, libraries, and application components
  • External System Integration — Interacts with Azure DevOps, Git providers, cloud services

The Data Plane consists of stateless worker pools that scale horizontally based on workload. Workers consume jobs from queues, execute tasks, and report results back to the Control Plane. Workers can be autoscaled (scale up/down based on queue depth) and are typically deployed in separate namespaces, node pools, or even clusters for isolation.
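
As a rough illustration of the worker behaviour described above, the sketch below shows a stateless worker draining a queue and reporting results, plus a queue-depth scaling rule. The function names, the message fields (`jobId`, `jobType`), and the default thresholds are assumptions made for this example; the Factory's actual autoscaling policy is not specified here.

```python
import math
import queue
from typing import Callable


def desired_workers(
    queue_depth: int,
    jobs_per_worker: int = 5,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Scale on queue depth: one worker per `jobs_per_worker` waiting jobs, clamped."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


def run_worker(
    jobs: queue.Queue,
    handlers: dict[str, Callable[[dict], dict]],
    report: Callable[[dict], None],
) -> int:
    """Drain the queue once and return the number of jobs processed.

    The worker keeps no state between jobs: everything it needs is in the
    job message, and every outcome goes back through `report`.
    """
    processed = 0
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return processed
        try:
            result = handlers[job["jobType"]](job)
            report({"jobId": job["jobId"], "status": "succeeded", "result": result})
        except Exception as exc:  # any failure is reported, never swallowed
            report({"jobId": job["jobId"], "status": "failed", "error": str(exc)})
        finally:
            jobs.task_done()
        processed += 1
```

Because the worker holds no state of its own, any number of identical workers can consume from the same queue, which is what makes scaling on queue depth safe.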


Key Runtime Concepts

Runs and Jobs

  • Run — A single execution request (e.g., "generate a microservice for project X")
  • Job — A discrete work unit within a run (e.g., "generate base repo", "apply overlays", "create pipelines")
  • Relationship — One run → multiple jobs (sequential or parallel)
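
One way to picture the run-to-jobs relationship is a small data model in which each job lists the jobs it depends on. The `depends_on` field and the `ready_jobs` helper are assumptions introduced for this sketch; the source only states that jobs within a run may be sequential or parallel.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    name: str
    depends_on: list[str] = field(default_factory=list)  # hypothetical field
    status: JobStatus = JobStatus.PENDING


@dataclass
class Run:
    run_id: str
    description: str
    jobs: list[Job] = field(default_factory=list)

    def ready_jobs(self) -> list[Job]:
        """Pending jobs whose dependencies have all succeeded.

        Jobs returned together have no unmet dependencies, so they may run
        in parallel; a job that depends on another runs after it.
        """
        done = {j.job_id for j in self.jobs if j.status is JobStatus.SUCCEEDED}
        return [
            j
            for j in self.jobs
            if j.status is JobStatus.PENDING
            and all(dep in done for dep in j.depends_on)
        ]


run = Run(
    run_id="run-1",
    description="generate a microservice for project X",
    jobs=[
        Job("job-1", "generate base repo"),
        Job("job-2", "apply overlays", depends_on=["job-1"]),
        Job("job-3", "create pipelines", depends_on=["job-1"]),
    ],
)
```

In this example only "generate base repo" is ready at first. Once it succeeds, "apply overlays" and "create pipelines" become ready together and could be dispatched in parallel.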

State and Memory

  • Run State — Operational state stored in Run State DB (run status, step progress, timestamps)
  • AI Memory — Long-term knowledge stored in Knowledge & Memory System (patterns, historical runs, learnings)
  • Separation — Operational state (ephemeral, per-run) vs. long-term memory (persistent, cross-project)

Failure and Recovery

  • Automatic Retry — Transient failures are retried with exponential backoff
  • Idempotency — Jobs are designed so that a retry produces the same result without duplicating side effects
  • Resume — Runs can resume from the last successful step after failures
  • Compensation — Failed runs can trigger compensating actions (cleanup, rollback)
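
The retry and resume behaviour above can be sketched as follows. This is an illustration under stated assumptions: `TransientError`, `retry_with_backoff`, `resume_run`, and the default attempt count and delays are hypothetical, and the set of completed step names stands in for what would be read from the Run State DB. Compensation is not shown.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], object],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Retry `operation` on TransientError, doubling the delay each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def resume_run(
    steps: list[tuple[str, Callable[[], object]]],
    completed: set[str],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run steps in order, skipping any already recorded as completed.

    `completed` is updated after each successful step, so a later call with
    the same set picks up from the last successful step.
    """
    for name, action in steps:
        if name in completed:
            continue
        retry_with_backoff(action, sleep=sleep)
        completed.add(name)
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. Production retry policies commonly add random jitter and a maximum delay, which this sketch leaves out for brevity.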

Observability

  • Traces — Distributed tracing across all services and agents
  • Metrics — Run success rates, queue depths, execution times, AI token usage
  • Logs — Structured logs with correlation IDs (runId, jobId, traceId)
  • Dashboards — Real-time visibility into Factory health, performance, and costs
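
To make the correlation IDs concrete, here is a minimal sketch of structured logging using Python's standard `logging` module. The JSON field layout, the `JsonFormatter` class, the `make_logger` helper, and the logger name are assumptions for illustration; only the three ID names (runId, jobId, traceId) come from the list above.

```python
import io
import json
import logging

CORRELATION_KEYS = ("runId", "jobId", "traceId")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including any correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in CORRELATION_KEYS:
            # Values passed via `extra=` become attributes on the record.
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("factory.sketch")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info(
    "job started",
    extra={"runId": "run-1", "jobId": "job-7", "traceId": "trace-abc"},
)
```

With every line carrying the same three IDs, logs from the control plane and from workers can be joined on `runId` or `traceId` when following a single run across services.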

Documentation Structure

This runtime documentation is organized into focused topics:

  • Control Plane — Deep dive into control plane vs data plane separation, responsibilities, and deployment patterns
  • Execution Engine — Run lifecycle, queuing model, idempotency, and traceability
  • State & Memory — Run state store, artifacts, and integration with the Knowledge & Memory System
  • Failure & Recovery — Failure modes, retry policies, resume capabilities, and compensation patterns
  • Observability — Traces, metrics, logs, dashboards, and alerts
