Runtime & Control Plane Overview
Purpose
This section covers the operational architecture of the ConnectSoft AI Software Factory — the runtime view of how the Factory executes jobs, manages state, handles failures, and provides observability.
While other Factory documentation focuses on conceptual architecture, agent capabilities, and workflows, this section explains:
- How the Factory runs jobs — from request to completion
- How control plane and data plane are separated — enabling scalability and reliability
- How agents, queues, state, and observability fit together — the operational mechanics
This documentation is essential for:
- Architects designing Factory integrations and extensions
- Engineers implementing Factory services and workers
- Operations teams running and monitoring the Factory
- SREs ensuring reliability and performance
Business Context
For business-oriented documentation covering reliability, scalability, cost, and operational excellence, see the Business Runtime Overview in the Company Documentation repository.
Big Picture: Control Plane vs Data Plane
The Factory runtime is architected using a control plane / data plane separation, a pattern common in cloud-native systems that enables independent scaling, isolation, and reliability.
    graph TD
        subgraph ControlPlane["Control Plane"]
            API[Factory API<br/>GraphQL / gRPC / REST]
            Orchestrator[Orchestrator Service]
            Scheduler[Schedulers]
            RunDB[Run State DB]
            Queue[Job Queue / Bus]
            Audit[Audit Sink]
        end
        subgraph DataPlane["Data Plane"]
            RepoGen[Repo Generator<br/>Workers]
            PipelineGen[Pipeline Generator<br/>Workers]
            SaaSGen[SaaS Scaffolder<br/>Workers]
            DevOps[Azure DevOps<br/>Git / CI Integration]
        end
        Client[CLI / UI / External Caller] --> API
        API --> Orchestrator
        Orchestrator --> Scheduler
        Orchestrator --> Queue
        Orchestrator --> RunDB
        Orchestrator --> Audit
        Queue --> RepoGen
        Queue --> PipelineGen
        Queue --> SaaSGen
        RepoGen --> DevOps
        PipelineGen --> DevOps
        SaaSGen --> DevOps
Control Plane vs Data Plane: Summary
Control Plane
The Control Plane is responsible for orchestration, coordination, and governance:
- Orchestration — Coordinates agent workflows and manages run lifecycles
- Scheduling — Determines when and how jobs are executed
- State Management — Tracks run status, step progress, and metadata
- Validation — Validates inputs and enforces policies before execution
- Audit Logging — Records all actions for compliance and traceability
- API Surface — Exposes REST, GraphQL, or gRPC APIs to UI, CLI, and external services
The Control Plane consists of long-running services (orchestrator, schedulers, API gateways) backed by databases (run state store) and message queues (job queue/bus). These services are typically deployed as stateful, highly available services with redundancy and failover capabilities.
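The responsibilities listed above can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: `ControlPlane`, `submit_run`, the `project` field, and the job kind `generate-base-repo` are hypothetical names, and a dict, deque, and list stand in for the Run State DB, Job Queue, and Audit Sink. It is not the Factory's actual API.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """In-memory stand-ins for the Run State DB, Job Queue, and Audit Sink."""
    run_db: dict = field(default_factory=dict)
    queue: deque = field(default_factory=deque)
    audit: list = field(default_factory=list)

    def submit_run(self, request: dict) -> str:
        # Validation: reject bad input before any work is scheduled.
        if not request.get("project"):
            self.audit.append(("rejected", request))
            raise ValueError("request must name a project")

        run_id = str(uuid.uuid4())
        # State management: record the run before enqueuing its jobs.
        self.run_db[run_id] = {"status": "queued", "project": request["project"]}
        # Orchestration: hand a job to the data plane through the queue.
        self.queue.append({"run_id": run_id, "kind": "generate-base-repo"})
        # Audit logging: record the accepted action.
        self.audit.append(("accepted", run_id))
        return run_id


cp = ControlPlane()
run_id = cp.submit_run({"project": "X"})
```

Writing the run record before enqueuing means a worker that picks up the job immediately can always find the run it belongs to.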
Data Plane
The Data Plane is responsible for execution of work units:
- Repository Generation — Creates Git repositories, commits code, manages branches
- Pipeline Generation — Generates CI/CD pipelines, infrastructure-as-code
- SaaS Scaffolding — Creates microservices, libraries, and application components
- External System Integration — Interacts with Azure DevOps, Git providers, cloud services
The Data Plane consists of stateless worker pools that scale horizontally based on workload. Workers consume jobs from queues, execute tasks, and report results back to the Control Plane. Workers can be autoscaled (scale up/down based on queue depth) and are typically deployed in separate namespaces, node pools, or even clusters for isolation.
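As a rough sketch of the worker behavior described above, the following shows a stateless worker draining a queue and reporting results, plus a queue-depth scaling rule. The function names, the job dictionary shape, and the scaling parameters (`jobs_per_worker`, `min_workers`, `max_workers`) are assumptions made for illustration; the source does not specify the Factory's real autoscaling policy.

```python
import math
import queue


def desired_workers(queue_depth: int, jobs_per_worker: int = 5,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Scale the pool with queue depth, clamped to [min_workers, max_workers]."""
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


def run_worker(jobs: queue.Queue, results: list, handlers: dict) -> None:
    """Drain the queue: execute each job and report the outcome."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return  # a real worker would block or poll instead of exiting
        try:
            output = handlers[job["kind"]](job)
            results.append({"job_id": job["id"], "status": "succeeded", "output": output})
        except Exception as exc:  # report failures rather than crash the worker
            results.append({"job_id": job["id"], "status": "failed", "error": str(exc)})
        finally:
            jobs.task_done()


jobs = queue.Queue()
jobs.put({"id": "j1", "kind": "repo"})
jobs.put({"id": "j2", "kind": "pipeline"})
results = []  # stand-in for reporting back to the Control Plane
handlers = {"repo": lambda job: "repo created"}  # deliberately no "pipeline" handler
run_worker(jobs, results, handlers)
```

Because the worker keeps no state between jobs, any instance can pick up any job, which is what makes scaling on queue depth safe.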
Key Runtime Concepts
Runs and Jobs
- Run — A single execution request (e.g., "generate a microservice for project X")
- Job — A discrete work unit within a run (e.g., "generate base repo", "apply overlays", "create pipelines")
- Relationship — One run → multiple jobs (sequential or parallel)
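The run-to-jobs relationship can be modeled as a small data structure. This is a minimal sketch: the class names, the `depends_on` field, and the dependency layout among the example jobs are assumptions, not the Factory's actual schema. Jobs with no unmet dependencies can run in parallel, and the rest run in sequence.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    name: str
    depends_on: list = field(default_factory=list)  # job_ids that must finish first
    status: Status = Status.PENDING


@dataclass
class Run:
    run_id: str
    request: str
    jobs: list = field(default_factory=list)

    def ready_jobs(self) -> list:
        """Pending jobs whose dependencies have all succeeded."""
        done = {j.job_id for j in self.jobs if j.status is Status.SUCCEEDED}
        return [j for j in self.jobs
                if j.status is Status.PENDING and set(j.depends_on) <= done]


run = Run("run-1", "generate a microservice for project X", [
    Job("j1", "generate base repo"),
    Job("j2", "apply overlays", depends_on=["j1"]),
    Job("j3", "create pipelines", depends_on=["j1"]),
])
first = [j.name for j in run.ready_jobs()]   # only the base repo job is ready
run.jobs[0].status = Status.SUCCEEDED
second = [j.name for j in run.ready_jobs()]  # the two dependents are now ready
```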
State and Memory
- Run State — Operational state stored in Run State DB (run status, step progress, timestamps)
- AI Memory — Long-term knowledge stored in Knowledge & Memory System (patterns, historical runs, learnings)
- Separation — Operational state (ephemeral, per-run) vs. long-term memory (persistent, cross-project)
Failure and Recovery
- Automatic Retry — Transient failures are retried with exponential backoff
- Idempotency — Jobs are designed to be safely retried without side effects
- Resume — Runs can resume from the last successful step after failures
- Compensation — Failed runs can trigger compensating actions (cleanup, rollback)
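Retry with exponential backoff and resume from the last successful step can be sketched as follows. The names (`TransientError`, `retry_with_backoff`, `resume_run`) and the delay parameters are hypothetical, and a plain list stands in for the persisted record of completed steps. Production retry policies usually also add jitter and a maximum delay, which this sketch omits, and compensation is not shown.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying."""


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying TransientError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)


def resume_run(steps, completed, **retry_kwargs):
    """Run steps in order, skipping names already recorded in `completed`."""
    for name, fn in steps:
        if name in completed:
            continue  # finished in an earlier attempt; do not redo it
        retry_with_backoff(fn, **retry_kwargs)
        completed.append(name)  # checkpoint after each successful step
    return completed


calls = {"repo": 0, "pipelines": 0}
delays = []  # captures requested sleeps so the example runs instantly


def create_repo():
    calls["repo"] += 1


def create_pipelines():
    calls["pipelines"] += 1
    if calls["pipelines"] < 3:
        raise TransientError("temporary outage")


steps = [("repo", create_repo), ("pipelines", create_pipelines)]
completed = ["repo"]  # state left by a previous, partially successful attempt
resume_run(steps, completed, sleep=delays.append)
```

In this example the already completed `repo` step is skipped on resume, and the `pipelines` step succeeds on its third attempt after two backoff delays. Skipping completed steps is one way to keep a resumed run from repeating side effects; idempotent steps make it safe even if a checkpoint is lost.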
Observability
- Traces — Distributed tracing across all services and agents
- Metrics — Run success rates, queue depths, execution times, AI token usage
- Logs — Structured logs with correlation IDs (runId, jobId, traceId)
- Dashboards — Real-time visibility into Factory health, performance, and costs
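Structured logs with correlation IDs can be produced with Python's standard `logging` module. The sketch below is one possible approach, not the Factory's logging setup: the `JsonFormatter` class, the logger name `factory.worker`, and the ID values are illustrative assumptions, while the field names `runId`, `jobId`, and `traceId` come from the list above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "runId": getattr(record, "runId", None),
            "jobId": getattr(record, "jobId", None),
            "traceId": getattr(record, "traceId", None),
        })


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

base = logging.getLogger("factory.worker")
base.setLevel(logging.INFO)
base.addHandler(handler)
base.propagate = False  # keep the example's output out of the root logger

# LoggerAdapter attaches the same IDs to every record it emits.
log = logging.LoggerAdapter(base, {"runId": "run-1", "jobId": "j2", "traceId": "abc123"})
log.info("overlay applied")

entry = json.loads(stream.getvalue())
```

Binding the IDs once per job through the adapter means individual log calls cannot forget them, so every line for a run can be found by filtering on `runId`.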
Documentation Structure
This runtime documentation is organized into focused topics:
- Control Plane — Deep dive into control plane vs data plane separation, responsibilities, and deployment patterns
- Execution Engine — Run lifecycle, queuing model, idempotency, and traceability
- State & Memory — Run state store, artifacts, and integration with the Knowledge & Memory System
- Failure & Recovery — Failure modes, retry policies, resume capabilities, and compensation patterns
- Observability — Traces, metrics, logs, dashboards, and alerts
Related Documentation
Factory Architecture
- Overall Platform Architecture — High-level Factory architecture and design principles
- Orchestration Layer — How orchestration coordinates agents and workflows
- Agent System Overview — Multi-agent system design and agent capabilities
Knowledge & Memory
- Knowledge and Memory System — How the Factory learns and improves over time
- Knowledge Indices — Vector search and semantic retrieval
- Knowledge Graph — Graph-based knowledge representation