Execution Engine¶

Overview¶

The Execution Engine is the core runtime component that manages the lifecycle of Factory runs, from request to completion. It handles job queuing, execution, idempotency, retries, and traceability, ensuring reliable and observable execution of Factory workflows.

Run Lifecycle¶

Run States¶

A Factory run progresses through the following states:

Requested — Run request received and recorded; validation has not yet run
Validated — Input validation and policy checks passed
Queued — Run is queued for execution (jobs are enqueued)
Running — Run is actively executing (one or more jobs are running)
Succeeded — All jobs completed successfully
Failed — Validation failed, or one or more jobs failed after retries were exhausted
Cancelled — Run was cancelled by user or system

stateDiagram-v2
    [*] --> Requested
    Requested --> Validated: Validation Passed
    Requested --> Failed: Validation Failed
    Validated --> Queued: Jobs Enqueued
    Queued --> Running: Jobs Started
    Running --> Running: Jobs In Progress
    Running --> Succeeded: All Jobs Succeeded
    Running --> Failed: Jobs Failed (Retries Exhausted)
    Running --> Cancelled: Cancelled (User or System)
    Succeeded --> [*]
    Failed --> [*]
    Cancelled --> [*]
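
The transitions in the diagram can be written as a lookup table that rejects any edge the diagram does not draw. This is a minimal sketch, not the Factory's implementation: `RunState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names introduced for illustration.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    VALIDATED = "Validated"
    QUEUED = "Queued"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Edges copied from the state diagram; terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    RunState.REQUESTED: {RunState.VALIDATED, RunState.FAILED},
    RunState.VALIDATED: {RunState.QUEUED},
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {
        RunState.RUNNING,
        RunState.SUCCEEDED,
        RunState.FAILED,
        RunState.CANCELLED,
    },
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
    RunState.CANCELLED: set(),
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes it easy to compare against the diagram when either one changes.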


Run Metadata¶

Every run includes the following metadata:

runId — Unique identifier for the run
tenantId — Tenant or customer identifier (for multi-tenant scenarios)
projectId — Project identifier
templateRecipeId — Template or recipe identifier
requestedBy — User or system that requested the run
requestedAt — Timestamp when run was requested
startedAt — Timestamp when run started executing
completedAt — Timestamp when run completed (succeeded/failed/cancelled)
status — Current run state
metadata — Additional run-specific metadata (configurations, parameters, etc.)
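
As an illustration, the metadata fields above could be modelled as a single record. The field names come from the list; the Python types, the default values, and the sample values are assumptions, and `RunRecord` is a hypothetical name.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class RunRecord:
    # Field names follow the metadata list; the types are assumptions.
    runId: str
    tenantId: str
    projectId: str
    templateRecipeId: str
    requestedBy: str
    requestedAt: datetime
    status: str = "Requested"
    startedAt: Optional[datetime] = None    # set when the run starts executing
    completedAt: Optional[datetime] = None  # set on Succeeded, Failed, or Cancelled
    metadata: dict[str, Any] = field(default_factory=dict)


# Sample values are placeholders, not real identifiers.
run = RunRecord(
    runId="run-123",
    tenantId="tenant-a",
    projectId="project-x",
    templateRecipeId="example-template",
    requestedBy="user@example.com",
    requestedAt=datetime.now(timezone.utc),
    metadata={"exampleParameter": "value"},
)
```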

Queueing Model¶

Run-to-Job Relationship¶

A single run is broken down into multiple jobs (discrete work units):

One Run → Multiple Jobs — A run typically consists of multiple sequential or parallel jobs
Job Types — Each job has a type (Repo Generation, Pipeline Generation, Documentation Generation, etc.) that determines the topic it is routed to
Dependencies — Jobs may have dependencies (e.g., "Generate Pipelines" depends on "Generate Repo")

Example Run Breakdown:

Run: "Generate Microservice for Project X"
├── Job 1: Generate Base Repository
├── Job 2: Apply Domain Template
├── Job 3: Apply Infrastructure Template
├── Job 4: Generate CI/CD Pipelines
└── Job 5: Create Documentation
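
One way to turn a breakdown like this into an execution order is a topological sort over the job dependencies. The sketch below uses Python's standard `graphlib` module. The job identifiers are hypothetical, and apart from pipelines depending on the repository, the dependency edges are assumed for illustration.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on. Only the pipelines-on-repo relationship
# is stated in the text; the other edges are assumptions.
dependencies = {
    "generate-base-repository": set(),
    "apply-domain-template": {"generate-base-repository"},
    "apply-infrastructure-template": {"generate-base-repository"},
    "generate-cicd-pipelines": {
        "apply-domain-template",
        "apply-infrastructure-template",
    },
    "create-documentation": {"generate-cicd-pipelines"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises CycleError if the dependencies contain a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    batches.append(ready)               # jobs in one batch could run in parallel
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Each batch holds jobs with no unmet dependencies, so sequential and parallel jobs fall out of the same structure.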

Topics and Queues¶

The Factory uses topic-based queuing to route jobs to appropriate workers:

Job Type Topics — Different topics for different job types:
- factory.jobs.repo-generation
- factory.jobs.pipeline-generation
- factory.jobs.saas-scaffolding
- factory.jobs.documentation
- factory.jobs.infrastructure
Priority Queues — Jobs can be prioritized (high, normal, low)
Dead Letter Queues (DLQ) — Jobs that still fail after retries are exhausted are moved to a DLQ
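
The routing and prioritization described above can be sketched with an in-memory stand-in for the message broker. The topic names are taken from the list; the job-type keys, the `TopicQueues` class, and the first-in-first-out ordering within a priority level are assumptions.

```python
import heapq
import itertools

# Topic names are from the list above; the job-type keys are assumed labels.
TOPICS = {
    "repo-generation": "factory.jobs.repo-generation",
    "pipeline-generation": "factory.jobs.pipeline-generation",
    "saas-scaffolding": "factory.jobs.saas-scaffolding",
    "documentation": "factory.jobs.documentation",
    "infrastructure": "factory.jobs.infrastructure",
}
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}


class TopicQueues:
    """In-memory stand-in for a broker: one priority heap per topic."""

    def __init__(self):
        self._heaps = {topic: [] for topic in TOPICS.values()}
        self._sequence = itertools.count()  # tie-breaker: FIFO within a priority

    def enqueue(self, job_type, job_id, priority="normal"):
        topic = TOPICS[job_type]
        entry = (PRIORITY_RANK[priority], next(self._sequence), job_id)
        heapq.heappush(self._heaps[topic], entry)
        return topic

    def dequeue(self, topic):
        heap = self._heaps[topic]
        return heapq.heappop(heap)[2] if heap else None
```

A worker that subscribes to a single topic only ever sees jobs of the type it knows how to run.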

Job Execution Flow¶

sequenceDiagram
    participant Client
    participant API
    participant Orchestrator
    participant RunStore
    participant Queue
    participant Worker

    Client->>API: POST /runs (template, config)
    API->>Orchestrator: CreateRun(request)
    Orchestrator->>Orchestrator: Validate + persist Run (Requested)
    Orchestrator->>Orchestrator: Break down into Jobs
    Orchestrator->>Queue: Enqueue Job[GenerateRepo] (idempotency key)
    Orchestrator->>RunStore: Update Run state = Queued

    Worker->>Queue: Dequeue Job[GenerateRepo]
    Worker->>RunStore: Update Job state = Running
    Worker->>Worker: Execute job (generate repo, push to Git)
    Worker->>RunStore: Append step result (Succeeded)
    Worker->>Queue: Enqueue next Job[GeneratePipelines]
    Worker->>RunStore: Update Job state = Completed
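
The worker half of this flow can be sketched as a single step function. The dict and deque below are in-memory stand-ins for the RunStore and the queue, and `worker_step` and `NEXT_JOB` are hypothetical names; the order of operations follows the diagram.

```python
from collections import deque

run_store = {}   # job name -> state; stand-in for the RunStore
queue = deque()  # stand-in for the job queue
NEXT_JOB = {"GenerateRepo": "GeneratePipelines"}  # chain shown in the diagram


def worker_step(execute):
    """Process one job as in the diagram; return its name, or None if idle."""
    if not queue:
        return None
    job = queue.popleft()            # Dequeue Job
    run_store[job] = "Running"       # Update Job state = Running
    execute(job)                     # Execute job
    follow_up = NEXT_JOB.get(job)
    if follow_up is not None:
        queue.append(follow_up)      # Enqueue next Job
    run_store[job] = "Completed"     # Update Job state = Completed
    return job
```

Error handling and retries are left out here so that the happy path stays readable.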


Idempotency & Deduplication¶

Idempotency Keys¶

Every job is assigned an idempotency key to ensure safe retries:

Format: {runId}:{stepName}:{attempt}
Example: run-123:generate-repo:1
Purpose: Prevents duplicate execution if the same job attempt is enqueued more than once; because the attempt number is part of the key, each retry gets a new key
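
A small helper pair shows how keys in this format could be built and parsed. The function names are hypothetical, and the rule that attempt numbers start at 1 is an assumption based on the example key.

```python
def idempotency_key(run_id: str, step_name: str, attempt: int) -> str:
    """Build a key in the documented {runId}:{stepName}:{attempt} format."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")  # assumption
    return f"{run_id}:{step_name}:{attempt}"


def parse_idempotency_key(key: str) -> tuple[str, str, int]:
    # Split from the right so a runId containing ':' would still parse.
    run_id, step_name, attempt = key.rsplit(":", 2)
    return run_id, step_name, int(attempt)
```

For example, `idempotency_key("run-123", "generate-repo", 1)` produces the example key shown above.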

Deduplication Strategy¶

The queue system implements deduplication to prevent duplicate job execution:

Idempotency Key Check — Before processing, workers check if a job with the same idempotency key was already processed
State Verification — Workers verify job state in RunStore before execution
Atomic Operations — Job state updates are atomic to prevent race conditions
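
The three checks above come down to one atomic "claim" operation: a worker may execute a job only if it is the first to record the job's idempotency key. The sketch below uses a lock-protected set as a stand-in for the RunStore; `ProcessedKeys` and `process` are hypothetical names. A real store would more likely rely on a conditional write or a unique constraint, which is an assumption about the backing technology.

```python
import threading


class ProcessedKeys:
    """In-memory stand-in for the RunStore's record of claimed keys."""

    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def try_claim(self, key: str) -> bool:
        # Check and insert under one lock so two workers cannot both claim a key.
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def process(store: ProcessedKeys, key: str, work) -> str:
    """Run work() once per key; later deliveries of the same key are skipped."""
    if not store.try_claim(key):
        return "skipped-duplicate"
    work()
    return "executed"
```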

Idempotent Job Design¶

Jobs are designed to be idempotent (safe to retry):

Check-Before-Act — Jobs check if work was already done before executing
Conditional Operations — Use conditional Git operations (e.g., "create branch if not exists")
State Verification — Verify current state before making changes
No Duplicate Side Effects — Retrying a job should not create duplicate artifacts or repeat side effects

Example: Idempotent Repo Generation

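# Runnable sketch: an in-memory dict stands in for the Git hosting service
_repos = {}

def generate_repo(run_id, project_id):
    """Return the project's repo, creating it only on the first call."""
    # Check-before-act: a retried job finds the existing repo and returns it
    if project_id in _repos:
        return _repos[project_id]

    # Create the repo only if it doesn't exist, recording the run that created it
    repo = {"projectId": project_id, "createdByRun": run_id}
    _repos[project_id] = repo
    return repo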

Retry Mechanisms¶

Retry Policy¶

The Factory implements automatic retries for transient failures:

Max Attempts — Configurable per job type (typically 3-5 attempts)
Backoff Strategy — Exponential backoff with jitter; the base delay doubles with each retry, and jitter is applied on top of these values:
- Attempt 1: Immediate
- Attempt 2: 1-second delay
- Attempt 3: 2-second delay
- Attempt 4: 4-second delay
- Attempt 5: 8-second delay
Retryable Errors — Only transient errors are retried (network timeouts, rate limits, temporary service unavailability)
Non-Retryable Errors — Validation errors, authentication failures, and permanent errors are not retried
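
The delay schedule above follows a simple formula: no delay before attempt 1, then 2^(n-2) seconds before attempt n. The sketch below computes it and adds jitter. The jitter fraction and the mapping of "transient" to Python exception types are assumptions, since the policy does not specify them.

```python
import random

# Assumed mapping of "transient" errors to built-in exception types.
RETRYABLE = (TimeoutError, ConnectionError)


def base_delay(attempt: int) -> float:
    """Seconds to wait before the given attempt: 0, 1, 2, 4, 8, ..."""
    return 0.0 if attempt <= 1 else float(2 ** (attempt - 2))


def delay_with_jitter(attempt: int, jitter_fraction: float = 0.25, rng=random) -> float:
    # Shifts the base delay by up to +/- jitter_fraction of its value.
    base = base_delay(attempt)
    return base + rng.uniform(-jitter_fraction, jitter_fraction) * base


def is_retryable(error: Exception) -> bool:
    """Transient errors are retried; everything else fails immediately."""
    return isinstance(error, RETRYABLE)
```

Jitter spreads out retries from workers that failed at the same moment, so they do not all hit a recovering service at once.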

Poison Queue Handling¶

After max retries are exhausted, failed jobs are moved to a Dead Letter Queue (DLQ):

DLQ Storage — Failed jobs are stored in DLQ for manual inspection
Alerting — DLQ entries trigger alerts for the operations team
Manual Retry — Operations can manually retry DLQ jobs after investigation
Compensation — DLQ jobs may trigger compensating actions (cleanup, rollback)
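
A retry loop that parks a job in the DLQ once its attempts run out might look like the sketch below. The list-based DLQ, the entry fields, and the use of `TimeoutError` as the only transient error are assumptions; backoff delays are left out to keep the example short.

```python
def run_with_retries(job_id, execute, max_attempts, dead_letter_queue):
    """Call execute(attempt) up to max_attempts times.

    Returns the job's result, or None after moving the job to the DLQ.
    Non-transient errors are not caught, so they propagate without a retry.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(attempt)
        except TimeoutError as error:  # stands in for any transient error
            last_error = error
    # Retries exhausted: keep enough context for manual inspection.
    dead_letter_queue.append(
        {"jobId": job_id, "attempts": max_attempts, "error": repr(last_error)}
    )
    return None
```

Alerting and manual retry would then operate on the entries in `dead_letter_queue` rather than on the live queue.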

Traceability¶

Correlation IDs¶

Every run and job includes correlation IDs for end-to-end traceability:

runId — Unique run identifier (correlates all jobs in a run)
jobId — Unique job identifier
traceId — Distributed tracing identifier (OpenTelemetry trace ID)
spanId — Span identifier within a trace
tenantId — Tenant identifier for multi-tenant scenarios
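
The identifiers above can be carried together as one immutable value. In this sketch the trace and span ID sizes follow OpenTelemetry (16-byte trace IDs, 8-byte span IDs, written as lowercase hex). The `job-` prefix, the `Correlation` class, and the choice to share one trace ID across all jobs in a run are assumptions.

```python
import secrets
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Correlation:
    runId: str
    jobId: str
    traceId: str  # 32 lowercase hex chars (16 bytes)
    spanId: str   # 16 lowercase hex chars (8 bytes)
    tenantId: str


def new_job_correlation(run_id: str, trace_id: str, tenant_id: str) -> Correlation:
    """Reuse the run's runId and traceId; mint a fresh jobId and spanId."""
    return Correlation(
        runId=run_id,
        jobId=f"job-{uuid.uuid4()}",  # ID format is an assumption
        traceId=trace_id,
        spanId=secrets.token_hex(8),
        tenantId=tenant_id,
    )


trace_id = secrets.token_hex(16)
job1 = new_job_correlation("run-123", trace_id, "tenant-a")
job2 = new_job_correlation("run-123", trace_id, "tenant-a")
```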

External System Correlation¶

The Factory correlates runs with external systems:

Azure DevOps buildId — Links Factory runs to Azure DevOps builds
repoId — Links runs to generated repositories
pipelineId — Links runs to generated pipelines
workItemId — Links runs to Azure DevOps work items

Trace Propagation¶

Correlation IDs are propagated through:

HTTP Headers — Trace IDs in HTTP request/response headers
Event Metadata — Trace IDs in event bus messages
Log Context — Trace IDs in structured logs
Database Records — Trace IDs stored in run state records
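
For the HTTP and log channels, a common choice is the W3C Trace Context `traceparent` header, whose version-00 format is `00-{trace-id}-{parent-id}-{flags}`. The text does not say which header the Factory uses, so treat the header choice, the function names, and the log field names below as assumptions.

```python
import json
import re

TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def to_headers(trace_id: str, span_id: str, sampled: bool = True) -> dict[str, str]:
    """Encode the IDs as a W3C Trace Context traceparent header (version 00)."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def from_headers(headers: dict[str, str]) -> tuple[str, str]:
    """Extract (trace_id, span_id) on the receiving side."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        raise ValueError("missing or malformed traceparent header")
    return match.group(1), match.group(2)


def log_line(message: str, run_id: str, trace_id: str, span_id: str) -> str:
    # Structured log entry carrying the same IDs; field names are assumptions.
    return json.dumps(
        {"message": message, "runId": run_id, "traceId": trace_id, "spanId": span_id}
    )
```

The receiving service calls `from_headers` and passes the same trace ID into its own logs and outgoing requests, which is what keeps a run traceable across components.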

Execution Sequence Diagram¶

sequenceDiagram
    participant Client
    participant API
    participant Orchestrator
    participant RunStore
    participant Queue
    participant Worker
    participant ExternalSystem

    Client->>API: POST /runs
    API->>Orchestrator: CreateRun(request)
    Orchestrator->>RunStore: Create Run (Requested, runId, traceId)
    Orchestrator->>Orchestrator: Validate Request
    Orchestrator->>RunStore: Update Run (Validated)
    Orchestrator->>Orchestrator: Break into Jobs
    Orchestrator->>Queue: Enqueue Job1 (idempotency key)
    Orchestrator->>Queue: Enqueue Job2 (idempotency key)
    Orchestrator->>RunStore: Update Run (Queued)

    Worker->>Queue: Dequeue Job1
    Worker->>RunStore: Update Job1 (Running)
    Worker->>ExternalSystem: Execute (Git, Azure DevOps)
    ExternalSystem-->>Worker: Result
    Worker->>RunStore: Update Job1 (Succeeded, result)

    Worker->>Queue: Dequeue Job2
    Worker->>RunStore: Update Job2 (Running)
    Worker->>ExternalSystem: Execute
    ExternalSystem-->>Worker: Error (transient)
    Worker->>Queue: Retry Job2 (attempt 2)
    Worker->>ExternalSystem: Retry Execute
    ExternalSystem-->>Worker: Result
    Worker->>RunStore: Update Job2 (Succeeded, result)

    Orchestrator->>RunStore: Check All Jobs Complete
    Orchestrator->>RunStore: Update Run (Succeeded)
    Orchestrator->>Client: Run Complete Notification
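
The "Check All Jobs Complete" step amounts to rolling job states up into a run state. The sketch below follows the Run States definitions. Failing the run as soon as any job has failed, treating a partly finished run as Running, and treating an empty job list as Queued are assumptions, and `derive_run_status` is a hypothetical name.

```python
def derive_run_status(job_statuses):
    """Roll job states up into a run state."""
    if not job_statuses:
        return "Queued"  # assumption: no jobs recorded yet
    if any(status == "Failed" for status in job_statuses):
        return "Failed"  # a job only reaches Failed after retries are exhausted
    if all(status == "Succeeded" for status in job_statuses):
        return "Succeeded"
    if any(status in ("Running", "Succeeded") for status in job_statuses):
        return "Running"  # assumption: a partly finished run counts as Running
    return "Queued"
```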

Related Documentation¶

Control Plane — How the control plane orchestrates execution
State & Memory — How run state is stored and managed
Failure & Recovery — How failures are handled and recovered
Observability — How execution is traced and monitored
Orchestration Layer — High-level orchestration design
