Execution Engine¶
Overview¶
The Execution Engine is the core runtime component that manages the lifecycle of Factory runs, from request to completion. It handles job queuing, execution, idempotency, retries, and traceability, ensuring reliable and observable execution of Factory workflows.
Run Lifecycle¶
Run States¶
A Factory run progresses through the following states:
- Requested — Run request received and validated
- Validated — Input validation and policy checks passed
- Queued — Run is queued for execution (jobs are enqueued)
- Running — Run is actively executing (one or more jobs are running)
- Succeeded — All jobs completed successfully
- Failed — One or more jobs failed (after retries exhausted)
- Cancelled — Run was cancelled by user or system
stateDiagram-v2
[*] --> Requested
Requested --> Validated: Validation Passed
Requested --> Failed: Validation Failed
Validated --> Queued: Jobs Enqueued
Queued --> Running: Jobs Started
Running --> Running: Jobs In Progress
Running --> Succeeded: All Jobs Succeeded
Running --> Failed: Jobs Failed (Retries Exhausted)
Running --> Cancelled: User Cancelled
Succeeded --> [*]
Failed --> [*]
Cancelled --> [*]
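The allowed transitions in the diagram above can be encoded as a small guard. The sketch below is illustrative Python, not the Factory's actual API:

```python
from enum import Enum

class RunState(Enum):
    REQUESTED = "Requested"
    VALIDATED = "Validated"
    QUEUED = "Queued"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"

# Legal transitions, taken directly from the state diagram above.
TRANSITIONS = {
    RunState.REQUESTED: {RunState.VALIDATED, RunState.FAILED},
    RunState.VALIDATED: {RunState.QUEUED},
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {RunState.RUNNING, RunState.SUCCEEDED,
                       RunState.FAILED, RunState.CANCELLED},
    RunState.SUCCEEDED: set(),   # terminal
    RunState.FAILED: set(),      # terminal
    RunState.CANCELLED: set(),   # terminal
}

def transition(current: RunState, target: RunState) -> RunState:
    """Move a run to `target`, rejecting transitions the diagram forbids."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the diagram as data keeps the state machine auditable: any attempt to, say, resurrect a `Succeeded` run fails loudly instead of corrupting run state.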
Run Metadata¶
Every run includes the following metadata:
- runId — Unique identifier for the run
- tenantId — Tenant or customer identifier (for multi-tenant scenarios)
- projectId — Project identifier
- templateRecipeId — Template or recipe identifier
- requestedBy — User or system that requested the run
- requestedAt — Timestamp when run was requested
- startedAt — Timestamp when run started executing
- completedAt — Timestamp when run completed (succeeded/failed/cancelled)
- status — Current run state
- metadata — Additional run-specific metadata (configurations, parameters, etc.)
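As a sketch, the metadata fields above might map onto a record like the following; the field names follow the list, but the class itself is hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class RunRecord:
    """One persisted run; mirrors the metadata fields listed above."""
    run_id: str
    tenant_id: str
    project_id: str
    template_recipe_id: str
    requested_by: str
    requested_at: datetime
    status: str = "Requested"               # initial run state
    started_at: Optional[datetime] = None   # set when execution begins
    completed_at: Optional[datetime] = None # set on success/failure/cancel
    metadata: dict = field(default_factory=dict)  # configs, parameters, etc.
```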
Queueing Model¶
Run-to-Job Relationship¶
A single run is broken down into multiple jobs (discrete work units):
- One Run → Multiple Jobs — A run typically consists of multiple sequential or parallel jobs
- Job Types — Different job types (Repo Generation, Pipeline Generation, Documentation Generation, etc.)
- Dependencies — Jobs may have dependencies (e.g., "Generate Pipelines" depends on "Generate Repo")
Example Run Breakdown:
Run: "Generate Microservice for Project X"
├── Job 1: Generate Base Repository
├── Job 2: Apply Domain Template
├── Job 3: Apply Infrastructure Template
├── Job 4: Generate CI/CD Pipelines
└── Job 5: Create Documentation
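The dependency rules above can be resolved with a standard topological sort. The job names and edges below mirror the example breakdown but are otherwise illustrative:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each job maps to the jobs it depends on (its predecessors).
jobs = {
    "generate-base-repo": [],
    "apply-domain-template": ["generate-base-repo"],
    "apply-infra-template": ["generate-base-repo"],
    "generate-pipelines": ["apply-domain-template", "apply-infra-template"],
    "create-documentation": ["generate-pipelines"],
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(jobs).static_order())
```

Jobs with no mutual dependency (the two template jobs here) can be dispatched in parallel; only the ordering constraints need to hold.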
Topics and Queues¶
The Factory uses topic-based queuing to route jobs to appropriate workers:
- Job Type Topics — Different topics for different job types:
  - factory.jobs.repo-generation
  - factory.jobs.pipeline-generation
  - factory.jobs.saas-scaffolding
  - factory.jobs.documentation
  - factory.jobs.infrastructure
- Priority Queues — Jobs can be prioritized (high, normal, low)
- Dead Letter Queues (DLQ) — Failed jobs after retry exhaustion are moved to DLQ
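A routing helper might combine the job-type topic with a priority suffix. The topic names come from the list above; the `.<priority>` suffix is an assumed convention for illustration only:

```python
# Topic names follow the factory.jobs.<type> convention listed above.
TOPIC_BY_JOB_TYPE = {
    "repo-generation": "factory.jobs.repo-generation",
    "pipeline-generation": "factory.jobs.pipeline-generation",
    "saas-scaffolding": "factory.jobs.saas-scaffolding",
    "documentation": "factory.jobs.documentation",
    "infrastructure": "factory.jobs.infrastructure",
}

PRIORITIES = ("high", "normal", "low")

def route(job_type: str, priority: str = "normal") -> str:
    """Resolve the topic a job should be published to (suffix is assumed)."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"{TOPIC_BY_JOB_TYPE[job_type]}.{priority}"
```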
Job Execution Flow¶
sequenceDiagram
participant Client
participant API
participant Orchestrator
participant RunStore
participant Queue
participant Worker
Client->>API: POST /runs (template, config)
API->>Orchestrator: CreateRun(request)
Orchestrator->>Orchestrator: Validate + persist Run (Requested)
Orchestrator->>Orchestrator: Break down into Jobs
Orchestrator->>Queue: Enqueue Job[GenerateRepo] (idempotency key)
Orchestrator->>RunStore: Update Run state = Queued
Worker->>Queue: Dequeue Job[GenerateRepo]
Worker->>RunStore: Update Job state = Running
Worker->>Worker: Execute job (generate repo, push to Git)
Worker->>RunStore: Append step result (Succeeded)
Worker->>Queue: Enqueue next Job[GeneratePipelines]
Worker->>RunStore: Update Job state = Completed
Idempotency & Deduplication¶
Idempotency Keys¶
Every job is assigned an idempotency key to ensure safe retries:
- Format: {runId}:{stepName}:{attempt}
- Example: run-123:generate-repo:1
- Purpose: Prevents duplicate execution if the same job is enqueued multiple times
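The key format is mechanical enough to capture in a one-line helper (the function name is hypothetical):

```python
def idempotency_key(run_id: str, step_name: str, attempt: int) -> str:
    """Build the {runId}:{stepName}:{attempt} key described above."""
    return f"{run_id}:{step_name}:{attempt}"
```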
Deduplication Strategy¶
The queue system implements deduplication to prevent duplicate job execution:
- Idempotency Key Check — Before processing, workers check if a job with the same idempotency key was already processed
- State Verification — Workers verify job state in RunStore before execution
- Atomic Operations — Job state updates are atomic to prevent race conditions
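The three points above can be sketched as a claim-before-process check. This in-memory version stands in for the RunStore-backed implementation; a lock plays the role of the store's atomic update:

```python
import threading

class Deduplicator:
    """In-memory sketch; a real worker would check the RunStore instead."""
    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, key: str) -> bool:
        """Atomically claim an idempotency key; False means already processed."""
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True

def process(dedup: Deduplicator, key: str, handler) -> bool:
    """Run `handler` only if this key has not been claimed before."""
    if not dedup.claim(key):
        return False  # duplicate delivery: skip execution entirely
    handler()
    return True
```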
Idempotent Job Design¶
Jobs are designed to be idempotent (safe to retry):
- Check-Before-Act — Jobs check if work was already done before executing
- Conditional Operations — Use conditional Git operations (e.g., "create branch if not exists")
- State Verification — Verify current state before making changes
- No Side Effects — Retrying a job should not cause duplicate artifacts or side effects
Example: Idempotent Repo Generation
# Pseudo-code example
def generate_repo(run_id, project_id):
    # Check if the repo already exists (a retried job takes this path)
    if repo_exists(project_id):
        return get_existing_repo(project_id)
    # Create the repo only if it doesn't exist
    repo = create_repo(project_id)
    return repo
Retry Mechanisms¶
Retry Policy¶
The Factory implements automatic retries for transient failures:
- Max Attempts — Configurable per job type (typically 3-5 attempts)
- Backoff Strategy — Exponential backoff with jitter:
- Attempt 1: Immediate
- Attempt 2: 1 second delay
- Attempt 3: 2 seconds delay
- Attempt 4: 4 seconds delay
- Attempt 5: 8 seconds delay
- Retryable Errors — Only transient errors are retried (network timeouts, rate limits, temporary service unavailability)
- Non-Retryable Errors — Validation errors, authentication failures, and permanent errors are not retried
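The policy above (exponential backoff with full jitter, retrying only transient failures) can be sketched as a wrapper; `TransientError` is an assumed marker type:

```python
import random
import time

class TransientError(Exception):
    """Marker for retryable failures (timeouts, rate limits, brief outages)."""

def run_with_retries(operation, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry transient failures with exponential backoff and full jitter.

    Non-TransientError exceptions propagate immediately (not retried).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: caller routes the job to the DLQ
            # Attempt 2 waits up to ~1s, attempt 3 ~2s, attempt 4 ~4s, ...
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(random.uniform(0, delay))
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retries out so that many jobs failing at once do not retry in lockstep.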
Poison Queue Handling¶
After max retries are exhausted, failed jobs are moved to a Dead Letter Queue (DLQ):
- DLQ Storage — Failed jobs are stored in DLQ for manual inspection
- Alerting — DLQ entries trigger alerts for operations team
- Manual Retry — Operations can manually retry DLQ jobs after investigation
- Compensation — DLQ jobs may trigger compensating actions (cleanup, rollback)
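A minimal sketch of the DLQ behaviors listed above (park, alert, manual redrive); all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DeadLetterQueue:
    """Toy DLQ: stores poisoned jobs and notifies operations via a hook."""
    entries: list = field(default_factory=list)

    def poison(self, job: dict, error: str, alert) -> None:
        """Park a job whose retries are exhausted and raise an alert."""
        self.entries.append({"job": job, "error": error})
        alert(f"DLQ: job {job['id']} failed permanently: {error}")

    def redrive(self, job_id: str, enqueue) -> None:
        """Manual retry after investigation: re-enqueue a parked job."""
        entry = next(e for e in self.entries if e["job"]["id"] == job_id)
        self.entries.remove(entry)
        enqueue(entry["job"])
```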
Traceability¶
Correlation IDs¶
Every run and job includes correlation IDs for end-to-end traceability:
- runId — Unique run identifier (correlates all jobs in a run)
- jobId — Unique job identifier
- traceId — Distributed tracing identifier (OpenTelemetry trace ID)
- spanId — Span identifier within a trace
- tenantId — Tenant identifier for multi-tenant scenarios
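These identifiers travel together, so a single context object keeps them consistent across logs, events, and records. The sketch below uses assumed field names:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CorrelationContext:
    """The identifier set attached to every run and job (names assumed)."""
    run_id: str
    job_id: str
    trace_id: str
    span_id: str
    tenant_id: str

    def log_fields(self) -> dict:
        """Flatten into key/value pairs for structured-log enrichment."""
        return asdict(self)
```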
External System Correlation¶
The Factory correlates runs with external systems:
- Azure DevOps buildId — Links Factory runs to Azure DevOps builds
- repoId — Links runs to generated repositories
- pipelineId — Links runs to generated pipelines
- workItemId — Links runs to Azure DevOps work items
Trace Propagation¶
Correlation IDs are propagated through:
- HTTP Headers — Trace IDs in HTTP request/response headers
- Event Metadata — Trace IDs in event bus messages
- Log Context — Trace IDs in structured logs
- Database Records — Trace IDs stored in run state records
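For HTTP propagation specifically, OpenTelemetry defaults to the W3C Trace Context format. A minimal sketch of building the `traceparent` header for an outgoing call:

```python
def traceparent_header(trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Build a W3C Trace Context `traceparent` header.

    Format: version "00", 32-hex-char trace ID, 16-hex-char span ID,
    and a flags byte where 01 means the trace is sampled.
    """
    assert len(trace_id) == 32 and len(span_id) == 16  # lowercase hex expected
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```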
Execution Sequence Diagram¶
sequenceDiagram
participant Client
participant API
participant Orchestrator
participant RunStore
participant Queue
participant Worker
participant ExternalSystem
Client->>API: POST /runs
API->>Orchestrator: CreateRun(request)
Orchestrator->>RunStore: Create Run (Requested, runId, traceId)
Orchestrator->>Orchestrator: Validate Request
Orchestrator->>RunStore: Update Run (Validated)
Orchestrator->>Orchestrator: Break into Jobs
Orchestrator->>Queue: Enqueue Job1 (idempotency key)
Orchestrator->>Queue: Enqueue Job2 (idempotency key)
Orchestrator->>RunStore: Update Run (Queued)
Worker->>Queue: Dequeue Job1
Worker->>RunStore: Update Job1 (Running)
Worker->>ExternalSystem: Execute (Git, Azure DevOps)
ExternalSystem-->>Worker: Result
Worker->>RunStore: Update Job1 (Succeeded, result)
Worker->>Queue: Dequeue Job2
Worker->>RunStore: Update Job2 (Running)
Worker->>ExternalSystem: Execute
ExternalSystem-->>Worker: Error (transient)
Worker->>Queue: Retry Job2 (attempt 2)
Worker->>ExternalSystem: Retry Execute
ExternalSystem-->>Worker: Result
Worker->>RunStore: Update Job2 (Succeeded, result)
Orchestrator->>RunStore: Check All Jobs Complete
Orchestrator->>RunStore: Update Run (Succeeded)
Orchestrator->>Client: Run Complete Notification
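The orchestrator's final completion check can be summarized as a pure function over job states. The rules here are inferred from the diagram and are a sketch, not the orchestrator's actual logic:

```python
def run_status(job_states: list) -> str:
    """Derive the run state from its job states.

    Any permanently failed job fails the run; the run succeeds only once
    every job has succeeded; otherwise it is still running.
    """
    if any(s == "Failed" for s in job_states):
        return "Failed"
    if job_states and all(s == "Succeeded" for s in job_states):
        return "Succeeded"
    return "Running"
```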
Related Documentation¶
- Control Plane — How control plane orchestrates execution
- State & Memory — How run state is stored and managed
- Failure & Recovery — How failures are handled and recovered
- Observability — How execution is traced and monitored
- Orchestration Layer — High-level orchestration design