
Control Plane — Workflows

Workflow orchestration is the core domain of the Control Plane. A WorkflowInstance is a durable, event-sourced state machine that turns business intent into a governed, traceable sequence of agent tasks, validations, approvals, and artifacts, ending in a release. WorkflowOrchestrator (implemented as MassTransit sagas) drives each instance from a versioned WorkflowDefinition template. The design builds on the existing coordinators (Project Bootstrap, Sprint Execution, Milestone Lifecycle, Microservice Assembly, Release) and the existing orchestration domain.

Target Architecture — Final-State Design

Every transition emits a canonical event, so the full lifecycle is observable and replayable. Workflows advance autonomously by default and pause only at defined approval gates or on failure (human escalation).

Main Lifecycle

The end-to-end factory lifecycle for producing a module/service:

flowchart LR
    Intent["Project intent<br/>(Factory Studio)"] --> Bootstrap["Project Bootstrap"]
    Bootstrap --> Blueprint["Blueprint design & validation"]
    Blueprint --> Workflow["Workflow instance per module"]
    Workflow --> Tasks["Agent tasks<br/>(Agent Mesh)"]
    Tasks --> Artifacts["Artifacts produced & registered"]
    Artifacts --> Assembly["Microservice assembly"]
    Assembly --> Gate["Release approval gate"]
    Gate --> Release["Release / promotion<br/>(DevOps & GitOps)"]
    Release --> Running["Running SaaS"]
    Running --> Feedback["Runtime feedback"]
    Feedback --> Intent

Each stage is a workflow (or step) instantiated from a definition: Project Bootstrap creates project/environments/modules; Blueprint validates the specification; per-module Workflow instances assign agent tasks; Microservice Assembly integrates outputs; Release promotes through environments behind an approval gate.

WorkflowInstance State Machine

stateDiagram-v2
    [*] --> Created
    Created --> Running: WorkflowInstanceStarted
    Running --> AssigningTask: step ready
    AssigningTask --> AwaitingAgent: AgentTaskAssigned
    AwaitingAgent --> Running: AgentTaskCompleted
    AwaitingAgent --> Correcting: AgentTaskFailed (retryable)
    Correcting --> AwaitingAgent: reassigned (attempt <= max)
    Correcting --> Failed: max attempts exceeded
    Running --> AwaitingApproval: approval gate reached
    AwaitingApproval --> Running: ApprovalGranted
    AwaitingApproval --> Cancelled: ApprovalRejected
    AwaitingApproval --> Escalated: ApprovalExpired
    Running --> Compensating: step failed (compensation required)
    Compensating --> Failed: compensation complete
    Escalated --> Running: human resumes
    Escalated --> Cancelled: human cancels
    Running --> Completed: WorkflowInstanceCompleted
    Failed --> [*]
    Cancelled --> [*]
    Completed --> [*]
| State | Meaning | Emits |
| --- | --- | --- |
| Created | Instance materialized from a definition. | WorkflowInstanceStarted (on start) |
| Running | Advancing through steps. | WorkflowStepCompleted |
| AssigningTask | Translating a step into an agent task. | |
| AwaitingAgent | Agent task placed; awaiting execution result. | AgentTaskAssigned |
| Correcting | Retryable failure; reassigning with feedback. | AgentTaskReassigned |
| AwaitingApproval | Paused at a human gate. | ApprovalRequested |
| Compensating | Running compensating actions for a failed step. | CompensationStarted |
| Escalated | Handed to a human (timeout/expiry). | WorkflowEscalated |
| Completed | All steps done, no open gates. | WorkflowInstanceCompleted |
| Failed / Cancelled | Terminal failure / human cancellation. | WorkflowInstanceFailed / WorkflowInstanceCancelled |
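
The state machine can be read as a lookup from (current state, trigger) to next state. The following is a minimal Python sketch of that lookup, not MassTransit saga code. State names and named events come from the diagram; the trigger names StepReady, MaxAttemptsExceeded, ApprovalGateReached, StepFailed, CompensationCompleted, HumanResumed, and HumanCancelled are hypothetical labels for edges the diagram describes only in prose.

```python
from enum import Enum


class State(Enum):
    CREATED = "Created"
    RUNNING = "Running"
    ASSIGNING_TASK = "AssigningTask"
    AWAITING_AGENT = "AwaitingAgent"
    CORRECTING = "Correcting"
    AWAITING_APPROVAL = "AwaitingApproval"
    COMPENSATING = "Compensating"
    ESCALATED = "Escalated"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# (current state, trigger) -> next state, transcribed from the state diagram.
# Triggers without a named event in the diagram use hypothetical labels.
TRANSITIONS = {
    (State.CREATED, "WorkflowInstanceStarted"): State.RUNNING,
    (State.RUNNING, "StepReady"): State.ASSIGNING_TASK,
    (State.ASSIGNING_TASK, "AgentTaskAssigned"): State.AWAITING_AGENT,
    (State.AWAITING_AGENT, "AgentTaskCompleted"): State.RUNNING,
    (State.AWAITING_AGENT, "AgentTaskFailed"): State.CORRECTING,
    (State.CORRECTING, "AgentTaskReassigned"): State.AWAITING_AGENT,
    (State.CORRECTING, "MaxAttemptsExceeded"): State.FAILED,
    (State.RUNNING, "ApprovalGateReached"): State.AWAITING_APPROVAL,
    (State.AWAITING_APPROVAL, "ApprovalGranted"): State.RUNNING,
    (State.AWAITING_APPROVAL, "ApprovalRejected"): State.CANCELLED,
    (State.AWAITING_APPROVAL, "ApprovalExpired"): State.ESCALATED,
    (State.RUNNING, "StepFailed"): State.COMPENSATING,
    (State.COMPENSATING, "CompensationCompleted"): State.FAILED,
    (State.ESCALATED, "HumanResumed"): State.RUNNING,
    (State.ESCALATED, "HumanCancelled"): State.CANCELLED,
    (State.RUNNING, "WorkflowInstanceCompleted"): State.COMPLETED,
}

# Terminal states have no outgoing transitions.
TERMINAL = {State.COMPLETED, State.FAILED, State.CANCELLED}


def next_state(current, trigger):
    """Return the next state, or raise ValueError for an edge the diagram lacks."""
    try:
        return TRANSITIONS[(current, trigger)]
    except KeyError:
        raise ValueError(f"illegal transition: {current.value} on {trigger}") from None
```

Rejecting unknown (state, trigger) pairs rather than ignoring them is a design choice in this sketch; a real saga might instead log and discard out-of-order messages.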

Task Assignment Sequence

How a ready workflow step becomes an executed agent task:

sequenceDiagram
    participant Orchestrator as WorkflowOrchestrator
    participant TaskSvc as TaskAssignmentService
    participant Pool as AgentPoolManager
    participant Policy as ModelPolicyService
    participant Mesh as Agent Mesh
    participant Cost as CostUsageService

    Orchestrator->>TaskSvc: AssignAgentTask(step, role, skill)
    TaskSvc->>Policy: resolve model policy
    Policy-->>TaskSvc: modelPolicyId
    TaskSvc->>Pool: acquire lease(role, tenant)
    Pool-->>TaskSvc: lease granted
    TaskSvc-->>Orchestrator: AgentTaskAssigned
    TaskSvc->>Mesh: dispatch task (Agent Task Contract)
    Mesh->>Mesh: load context, execute skill, validate
    Mesh-->>TaskSvc: execution result (artifacts, tokens)
    TaskSvc->>Cost: RecordUsage(tokens, task)
    TaskSvc-->>Orchestrator: AgentTaskCompleted
    Orchestrator->>Pool: release lease

If no capacity is available, AgentPoolManager defers the lease and the TaskAssignmentWorker requeues the assignment with back-off. Assignment is idempotent on (workflowInstanceId, stepId), so a requeued or redelivered request does not create a duplicate task.
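
To illustrate the idempotency key and the requeue behaviour, here is a minimal Python sketch. The `AssignmentLedger` class, the `acquire_lease` callback, the task id format, and the exponential shape of `backoff_delay` are all assumptions made for illustration; the source states only that requeueing uses back-off and that assignment is idempotent on (workflowInstanceId, stepId).

```python
from dataclasses import dataclass, field


@dataclass
class AssignmentLedger:
    """In-memory stand-in (hypothetical) for a durable assignment store."""

    _tasks: dict = field(default_factory=dict)

    def assign(self, workflow_instance_id, step_id, acquire_lease):
        """Return (task_id, created). Repeated calls for the same key are no-ops."""
        key = (workflow_instance_id, step_id)
        if key in self._tasks:
            # Already assigned: return the existing task, create nothing new.
            return self._tasks[key], False
        if acquire_lease() is None:
            # No capacity: the caller is expected to requeue with back-off.
            return None, False
        task_id = f"task-{len(self._tasks) + 1}"
        self._tasks[key] = task_id
        return task_id, True


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=60.0):
    """Assumed exponential back-off: base * 2**attempt, capped at cap_seconds."""
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

A deferred lease leaves the ledger untouched, so the later retry for the same key still creates exactly one task.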

Approval Gate Sequence

How a sensitive transition (e.g. production release) passes a policy check and a human approval gate:

sequenceDiagram
    participant Orchestrator as WorkflowOrchestrator
    participant Policy as PolicyEngineService
    participant Approval as ApprovalService
    participant Studio as Factory Studio
    participant Reviewer as Human Reviewer
    participant Audit as AuditService

    Orchestrator->>Policy: EvaluatePolicy(action=release:promote, env=prod)
    Policy->>Audit: PolicyDecisionRecorded(effect=RequireApproval)
    Policy-->>Orchestrator: RequireApproval(role=ReleaseManager)
    Orchestrator->>Approval: RequestApproval(role=ReleaseManager)
    Approval-->>Studio: ApprovalRequested
    Studio->>Reviewer: surface gate in Human Review Center
    Reviewer->>Studio: Grant (comment)
    Studio->>Approval: GrantApproval(decidedBy)
    Approval->>Audit: AuditEntryRecorded(Granted)
    Approval-->>Orchestrator: ApprovalGranted
    Orchestrator->>Orchestrator: resume workflow (promote)

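A Deny decision from the policy engine fails the step immediately. Per the state machine, a rejected approval (ApprovalRejected) cancels the instance, and an expired approval (ApprovalExpired) moves it to Escalated.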

Failure Handling

  • Classification: failures are transient (retryable — network, throttling), validation (correctable — agent output failed checks), or terminal (unrecoverable — bad input, policy deny).
  • Transient failures use MassTransit exponential back-off at the message level.
  • Validation failures route to Correcting: the task is reassigned with validator feedback, bounded by maxCorrectionAttempts (per the Agent Task Contract).
  • Terminal failures trigger compensation and/or human escalation.
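
The routing rules above can be sketched as a single decision function. This is a minimal Python illustration under stated assumptions: the `FailureKind` enum, the `route_failure` name, and the returned action strings are hypothetical, and `correction_attempt` is taken to be the 1-based number of the reassignment about to be made. Exhausted corrections fall through to compensation, following the retry flowchart in this section.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"    # retryable: network, throttling
    VALIDATION = "validation"  # correctable: agent output failed checks
    TERMINAL = "terminal"      # unrecoverable: bad input, policy deny


def route_failure(kind, correction_attempt, max_correction_attempts):
    """Pick the next action for a failed step (action names are illustrative)."""
    if kind is FailureKind.TRANSIENT:
        return "retry_with_backoff"
    if kind is FailureKind.VALIDATION:
        # Reassign only while the attempt budget is not exhausted.
        if correction_attempt <= max_correction_attempts:
            return "reassign_with_feedback"
        return "compensate"
    # Terminal failures go straight to compensation and escalation.
    return "compensate"
```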

Retry & Compensation

flowchart TB
    StepFailed["Step failed"] --> Classify{Failure type}
    Classify -->|Transient| Retry["Retry with back-off"]
    Classify -->|Validation| Correct["Reassign with feedback<br/>(attempt <= max)"]
    Classify -->|Terminal| Compensate["Run compensating actions"]
    Retry --> Resume["Resume step"]
    Correct --> Resume
    Correct -->|max exceeded| Compensate
    Compensate --> Escalate["Escalate to human"]
    Escalate --> Decision{Human decision}
    Decision -->|Resume| Resume
    Decision -->|Cancel| Cancelled["WorkflowInstanceCancelled"]

Compensation is forward-recovery via compensating actions, not distributed rollback. Each WorkflowStepDefinition may declare a CompensationDefinition (e.g. retract a provisioned environment, mark an artifact superseded). Compensation actions are themselves idempotent and emit events.
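
As a rough sketch of what "idempotent and emits events" could look like, the Python example below runs a compensating action at most once per (instance, step) and records an event before and after. `CompensationRunner` and the `CompensationCompleted` event name are assumptions for illustration; only CompensationStarted appears in the state table.

```python
class CompensationRunner:
    """Runs each step's compensating action at most once (hypothetical helper)."""

    def __init__(self):
        self.completed = set()  # (instance_id, step_id) pairs already compensated
        self.events = []        # emitted events, in order

    def compensate(self, instance_id, step_id, action):
        """Run `action` unless this step was already compensated; return True if run."""
        key = (instance_id, step_id)
        if key in self.completed:
            # Redelivered request: nothing to do, nothing emitted.
            return False
        self.events.append(("CompensationStarted", instance_id, step_id))
        action()
        self.completed.add(key)
        # "CompensationCompleted" is an assumed event name.
        self.events.append(("CompensationCompleted", instance_id, step_id))
        return True
```

Because the key is only marked complete after the action returns, an action that raises would be retried on the next delivery, which is why the action itself should also tolerate being re-run.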

Human Escalation

When a step exceeds its deadline (detected by WorkflowTimeoutWorker) or an approval expires, the instance enters Escalated. The gate surfaces in Factory Studio's Human Review Center with full context (trace, failing step, prior attempts, policy decision). A human can resume, reassign, or cancel; the decision is audited.
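
The detection side of escalation amounts to comparing deadlines with the current time. The Python sketch below shows one way a worker such as WorkflowTimeoutWorker might select instances to escalate; the `PendingItem` record and `find_escalations` function are hypothetical, and treating a deadline equal to `now` as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class PendingItem(NamedTuple):
    instance_id: str
    kind: str            # "step" or "approval"
    deadline: datetime   # step deadline or approval expiry


def find_escalations(pending, now):
    """Return sorted instance ids whose step deadline or approval expiry has passed."""
    return sorted({p.instance_id for p in pending if p.deadline <= now})


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pending = [
    PendingItem("wf-1", "step", now - timedelta(minutes=5)),
    PendingItem("wf-2", "approval", now + timedelta(hours=1)),
    PendingItem("wf-3", "approval", now - timedelta(seconds=1)),
]
overdue = find_escalations(pending, now)  # wf-1 and wf-3 are past their deadlines
```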

Replay

Because the WorkflowInstance event store is append-only and immutable, the WorkflowReplayService can deterministically reconstruct any instance:

flowchart LR
    History["Append-only event store"] --> Replay["WorkflowReplayService"]
    Replay --> Shadow["Shadow instance / projection"]
    Shadow --> Inspect["Inspect state at any point"]
    Shadow --> Rederive["Re-derive outcome with new definition"]

Replay is used to debug a failure (reconstruct exact state at a step), re-derive an outcome under a newer workflow/agent definition, or rebuild the ProcessStateService projection. Replays write to a shadow stream and never mutate the source history, preserving the audit trail. This depends on envelopes being immutable once published (see Event Envelope).
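
Deterministic reconstruction is a fold over the event history. The Python sketch below illustrates the idea under simplifying assumptions: events are reduced to their names, the `APPLY` table covers only a subset of the state diagram (it skips the AssigningTask state), and `replay` is a hypothetical stand-in for WorkflowReplayService. Passing a different transition table models re-deriving an outcome under a newer definition.

```python
# Subset of the state diagram: (state, event name) -> next state.
APPLY = {
    ("Created", "WorkflowInstanceStarted"): "Running",
    ("Running", "AgentTaskAssigned"): "AwaitingAgent",
    ("AwaitingAgent", "AgentTaskCompleted"): "Running",
    ("AwaitingAgent", "AgentTaskFailed"): "Correcting",
    ("Correcting", "AgentTaskReassigned"): "AwaitingAgent",
    ("Running", "WorkflowInstanceCompleted"): "Completed",
}


def replay(events, upto=None, apply=APPLY):
    """Fold the first `upto` events (all if None) into a state name.

    The source history is only read, never modified.
    """
    state = "Created"
    for event in events[:upto]:
        # Events with no matching transition leave the state unchanged.
        state = apply.get((state, event), state)
    return state


# A tuple stands in for the immutable, append-only history.
history = (
    "WorkflowInstanceStarted",
    "AgentTaskAssigned",
    "AgentTaskFailed",
    "AgentTaskReassigned",
    "AgentTaskCompleted",
    "WorkflowInstanceCompleted",
)
final_state = replay(history)             # full replay
state_at_failure = replay(history, upto=3)  # state right after the failure event
```

Because `replay` is a pure function of the history and the transition table, running it twice on the same inputs yields the same state, which is the property the shadow projection relies on.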