Execution Engine

Overview

The Execution Engine is the core runtime component that manages the lifecycle of Factory runs, from request to completion. It handles job queuing, execution, idempotency, retries, and traceability, ensuring reliable and observable execution of Factory workflows.


Run Lifecycle

Run States

A Factory run progresses through the following states:

  1. Requested — Run request received and persisted; validation has not yet completed
  2. Validated — Input validation and policy checks passed
  3. Queued — Run is queued for execution (jobs are enqueued)
  4. Running — Run is actively executing (one or more jobs are running)
  5. Succeeded — All jobs completed successfully
  6. Failed — One or more jobs failed (after retries exhausted)
  7. Cancelled — Run was cancelled by user or system
stateDiagram-v2
    [*] --> Requested
    Requested --> Validated: Validation Passed
    Requested --> Failed: Validation Failed
    Validated --> Queued: Jobs Enqueued
    Queued --> Running: Jobs Started
    Running --> Running: Jobs In Progress
    Running --> Succeeded: All Jobs Succeeded
    Running --> Failed: Jobs Failed (Retries Exhausted)
    Running --> Cancelled: User Cancelled
    Succeeded --> [*]
    Failed --> [*]
    Cancelled --> [*]
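The transitions in the state diagram can be enforced with a small lookup table. The following is a minimal sketch, not the Factory's actual implementation: the names `RunState`, `TRANSITIONS`, `transition`, and `is_terminal` are hypothetical, and only the edges drawn in the diagram are allowed.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    VALIDATED = "Validated"
    QUEUED = "Queued"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    RunState.REQUESTED: {RunState.VALIDATED, RunState.FAILED},
    RunState.VALIDATED: {RunState.QUEUED},
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {
        RunState.RUNNING,
        RunState.SUCCEEDED,
        RunState.FAILED,
        RunState.CANCELLED,
    },
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
    RunState.CANCELLED: set(),
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return target if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_terminal(state: RunState) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not TRANSITIONS[state]
```

Keeping the table as data makes it easy to reject an out-of-order update (for example, Queued directly to Succeeded) before it reaches the run store.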

Run Metadata

Every run includes the following metadata:

  • runId — Unique identifier for the run
  • tenantId — Tenant or customer identifier (for multi-tenant scenarios)
  • projectId — Project identifier
  • templateRecipeId — Template or recipe identifier
  • requestedBy — User or system that requested the run
  • requestedAt — Timestamp when run was requested
  • startedAt — Timestamp when run started executing
  • completedAt — Timestamp when run completed (succeeded/failed/cancelled)
  • status — Current run state
  • metadata — Additional run-specific metadata (configurations, parameters, etc.)
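As an illustration, the metadata fields above could be modelled as a single record. This is a sketch under assumptions: the field names mirror the list, but the Python types, the UUID-based `runId`, and the `new_run` helper are hypothetical and not taken from the Factory codebase.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Run:
    # Field names mirror the metadata list; the types are assumptions.
    runId: str
    tenantId: str
    projectId: str
    templateRecipeId: str
    requestedBy: str
    requestedAt: datetime
    startedAt: Optional[datetime] = None    # set when execution starts
    completedAt: Optional[datetime] = None  # set on succeeded/failed/cancelled
    status: str = "Requested"
    metadata: dict[str, Any] = field(default_factory=dict)


def new_run(tenant_id: str, project_id: str, template_recipe_id: str,
            requested_by: str, **metadata: Any) -> Run:
    """Create a run in the Requested state with a fresh ID and a UTC timestamp."""
    return Run(
        runId=f"run-{uuid.uuid4()}",
        tenantId=tenant_id,
        projectId=project_id,
        templateRecipeId=template_recipe_id,
        requestedBy=requested_by,
        requestedAt=datetime.now(timezone.utc),
        metadata=dict(metadata),
    )
```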

Queueing Model

Run-to-Job Relationship

A single run is broken down into multiple jobs (discrete work units):

  • One Run → Multiple Jobs — A run typically consists of multiple sequential or parallel jobs
  • Job Types — Different job types (Repo Generation, Pipeline Generation, Documentation Generation, etc.)
  • Dependencies — Jobs may have dependencies (e.g., "Generate Pipelines" depends on "Generate Repo")

Example Run Breakdown:

Run: "Generate Microservice for Project X"
├── Job 1: Generate Base Repository
├── Job 2: Apply Domain Template
├── Job 3: Apply Infrastructure Template
├── Job 4: Generate CI/CD Pipelines
└── Job 5: Create Documentation
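Job dependencies determine which jobs can run in parallel. The sketch below groups the example jobs into execution waves using Python's standard `graphlib` module. The dependency edges and the job names are assumptions chosen for illustration; the only dependency stated above is that pipeline generation depends on repo generation.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (edges assumed for illustration)
JOB_DEPENDENCIES = {
    "generate-base-repository": set(),
    "apply-domain-template": {"generate-base-repository"},
    "apply-infrastructure-template": {"generate-base-repository"},
    "generate-cicd-pipelines": {
        "apply-domain-template",
        "apply-infrastructure-template",
    },
    "create-documentation": {"generate-cicd-pipelines"},
}


def execution_waves(dependencies: dict) -> list:
    """Group jobs into waves; jobs within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With these assumed edges, the two template jobs land in the same wave and could be dispatched in parallel once the base repository exists.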

Topics and Queues

The Factory uses topic-based queuing to route jobs to appropriate workers:

  • Job Type Topics — Different topics for different job types:
    • factory.jobs.repo-generation
    • factory.jobs.pipeline-generation
    • factory.jobs.saas-scaffolding
    • factory.jobs.documentation
    • factory.jobs.infrastructure
  • Priority Queues — Jobs can be prioritized (high, normal, low)
  • Dead Letter Queues (DLQ) — Failed jobs after retry exhaustion are moved to DLQ
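Topic routing and prioritization can be sketched with a name lookup and an in-memory heap. This is an illustration only: `topic_for` and `PriorityJobQueue` are hypothetical names, and a real deployment would rely on the message broker's own topics and priority support.

```python
import heapq
import itertools

TOPIC_PREFIX = "factory.jobs"
KNOWN_JOB_TYPES = {
    "repo-generation",
    "pipeline-generation",
    "saas-scaffolding",
    "documentation",
    "infrastructure",
}
PRIORITIES = ("high", "normal", "low")  # lower index is served first


def topic_for(job_type: str) -> str:
    """Map a job type to its topic name, rejecting unknown types."""
    if job_type not in KNOWN_JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type}")
    return f"{TOPIC_PREFIX}.{job_type}"


class PriorityJobQueue:
    """In-memory queue: higher priority first, FIFO within a priority."""

    def __init__(self) -> None:
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, job_id: str, priority: str = "normal") -> None:
        rank = PRIORITIES.index(priority)  # ValueError on unknown priority
        heapq.heappush(self._heap, (rank, next(self._sequence), job_id))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```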

Job Execution Flow

sequenceDiagram
    participant Client
    participant API
    participant Orchestrator
    participant RunStore
    participant Queue
    participant Worker

    Client->>API: POST /runs (template, config)
    API->>Orchestrator: CreateRun(request)
    Orchestrator->>Orchestrator: Validate + persist Run (Requested)
    Orchestrator->>Orchestrator: Break down into Jobs
    Orchestrator->>Queue: Enqueue Job[GenerateRepo] (idempotency key)
    Orchestrator->>RunStore: Update Run state = Queued

    Worker->>Queue: Dequeue Job[GenerateRepo]
    Worker->>RunStore: Update Job state = Running
    Worker->>Worker: Execute job (generate repo, push to Git)
    Worker->>RunStore: Append step result (Succeeded)
    Worker->>Queue: Enqueue next Job[GeneratePipelines]
    Worker->>RunStore: Update Job state = Succeeded
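The worker side of this flow can be approximated by a simple loop. The sketch below is a simplified, single-process illustration: the queue is a `deque`, the run store is a plain dict, the `NEXT_JOB` chain and `run_worker` are hypothetical, and error handling and retries are left out on purpose.

```python
from collections import deque

# Hypothetical chain: the job to enqueue once each job succeeds (None ends it).
NEXT_JOB = {"GenerateRepo": "GeneratePipelines", "GeneratePipelines": None}


def run_worker(queue: deque, job_states: dict, results: list, handlers: dict) -> None:
    """Drain the queue, following the worker steps in the sequence diagram."""
    while queue:
        job = queue.popleft()            # dequeue the job
        job_states[job] = "Running"      # update job state
        result = handlers[job]()         # execute the job
        results.append((job, result))    # append the step result
        next_job = NEXT_JOB.get(job)
        if next_job is not None:
            queue.append(next_job)       # enqueue the next job in the chain
        job_states[job] = "Succeeded"    # record the final job state
```

A usage example with stub handlers:

```python
queue = deque(["GenerateRepo"])
states, results = {}, []
handlers = {
    "GenerateRepo": lambda: "repo-url",
    "GeneratePipelines": lambda: "pipeline-id",
}
run_worker(queue, states, results, handlers)
```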

Idempotency & Deduplication

Idempotency Keys

Every job is assigned an idempotency key to ensure safe retries:

  • Format: {runId}:{stepName}:{attempt}
  • Example: run-123:generate-repo:1
  • Purpose: Prevents duplicate execution if the same job is enqueued multiple times
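Building and parsing keys in this format is straightforward. The sketch below assumes that step names and attempt numbers never contain a colon; the `IdempotencyKey` class and `parse_key` function are hypothetical helpers, not part of a documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdempotencyKey:
    run_id: str
    step_name: str
    attempt: int

    def __str__(self) -> str:
        # Matches the documented format {runId}:{stepName}:{attempt}
        return f"{self.run_id}:{self.step_name}:{self.attempt}"


def parse_key(raw: str) -> IdempotencyKey:
    """Parse a key, splitting from the right so only the last two colons count."""
    run_id, step_name, attempt = raw.rsplit(":", 2)  # ValueError if too few parts
    return IdempotencyKey(run_id, step_name, int(attempt))
```

Because the attempt number is part of the key, each retry gets its own key, while a duplicate delivery of the same attempt maps to the same key.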

Deduplication Strategy

The queue system implements deduplication to prevent duplicate job execution:

  1. Idempotency Key Check — Before processing, workers check if a job with the same idempotency key was already processed
  2. State Verification — Workers verify job state in RunStore before execution
  3. Atomic Operations — Job state updates are atomic to prevent race conditions
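The three steps above amount to an atomic "claim" on the idempotency key. The following sketch uses an in-memory store guarded by a lock; `InMemoryRunStore`, `try_claim`, and `process` are hypothetical names, and a real RunStore would more likely use a conditional write or a unique-key constraint in its database.

```python
import threading


class InMemoryRunStore:
    """Hypothetical stand-in for RunStore, keyed by idempotency key."""

    def __init__(self) -> None:
        self._states = {}
        self._lock = threading.Lock()

    def try_claim(self, key: str) -> bool:
        """Atomically mark the key as Running; False if it was already seen."""
        with self._lock:
            if key in self._states:
                return False
            self._states[key] = "Running"
            return True

    def mark(self, key: str, state: str) -> None:
        with self._lock:
            self._states[key] = state

    def state_of(self, key: str):
        with self._lock:
            return self._states.get(key)


def process(store: InMemoryRunStore, key: str, work) -> str:
    """Run work() at most once per key; duplicate deliveries are skipped."""
    if not store.try_claim(key):
        return "skipped"
    try:
        work()
    except Exception:
        store.mark(key, "Failed")
        raise
    store.mark(key, "Succeeded")
    return "executed"
```

The check and the state update happen under one lock, so two workers that receive the same message cannot both pass the check.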

Idempotent Job Design

Jobs are designed to be idempotent (safe to retry):

  • Check-Before-Act — Jobs check if work was already done before executing
  • Conditional Operations — Use conditional Git operations (e.g., "create branch if not exists")
  • State Verification — Verify current state before making changes
  • No Side Effects — Retrying a job should not cause duplicate artifacts or side effects

Example: Idempotent Repo Generation

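# Illustrative example: an in-memory dict stands in for the real Git provider
_repos = {}

def generate_repo(run_id, project_id):
    """Return the repo for project_id, creating it only if it is missing."""
    # Check-before-act: a retried job finds the existing repo and reuses it
    existing = _repos.get(project_id)
    if existing is not None:
        return existing

    # Create the repo only if it doesn't exist yet
    repo = {"project_id": project_id, "created_by_run": run_id}
    _repos[project_id] = repo
    return repo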

Retry Mechanisms

Retry Policy

The Factory implements automatic retries for transient failures:

  • Max Attempts — Configurable per job type (typically 3-5 attempts)
  • Backoff Strategy — Exponential backoff with jitter:
    • Attempt 1: Immediate
    • Attempt 2: 1 second delay
    • Attempt 3: 2 seconds delay
    • Attempt 4: 4 seconds delay
    • Attempt 5: 8 seconds delay
  • Retryable Errors — Only transient errors are retried (network timeouts, rate limits, temporary service unavailability)
  • Non-Retryable Errors — Validation errors, authentication failures, and permanent errors are not retried
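The retry policy can be sketched as a loop around the job. This sketch treats the listed delays as base values and adds up to 25% random jitter on top; the jitter fraction, the `TransientError` and `PermanentError` classes, and the `run_with_retries` function are assumptions made for illustration.

```python
import random
import time

BASE_DELAY_SECONDS = 1.0
JITTER_FRACTION = 0.25  # assumption: up to 25% extra delay


class TransientError(Exception):
    """Retryable: network timeouts, rate limits, temporary unavailability."""


class PermanentError(Exception):
    """Not retryable: validation or authentication failures."""


def base_delay(attempt: int) -> float:
    """Delay before a 1-based attempt, before jitter: 0, 1, 2, 4, 8, ..."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return 0.0 if attempt == 1 else BASE_DELAY_SECONDS * 2 ** (attempt - 2)


def delay_with_jitter(attempt: int) -> float:
    """Base delay stretched by a random factor in [1, 1 + JITTER_FRACTION]."""
    return base_delay(attempt) * (1 + random.uniform(0, JITTER_FRACTION))


def run_with_retries(job, max_attempts: int = 5, sleep=time.sleep):
    """Call job(attempt) until it succeeds or retries are exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        delay = delay_with_jitter(attempt)
        if delay > 0:
            sleep(delay)
        try:
            return job(attempt)
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller can move the job to the DLQ
        # Any other exception, including PermanentError, propagates immediately.
```

Jitter spreads out retries from workers that failed at the same moment, which avoids hitting a recovering service with synchronized bursts.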

Poison Queue Handling

After max retries are exhausted, failed jobs are moved to a Dead Letter Queue (DLQ):

  • DLQ Storage — Failed jobs are stored in DLQ for manual inspection
  • Alerting — DLQ entries trigger alerts for operations team
  • Manual Retry — Operations can manually retry DLQ jobs after investigation
  • Compensation — DLQ jobs may trigger compensating actions (cleanup, rollback)
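A minimal in-memory model of this behaviour is sketched below. The `DeadLetterQueue` class, its entry layout, and the alert text are hypothetical; a real system would typically use the broker's built-in dead-letter support and an alerting integration instead of a list.

```python
from dataclasses import dataclass, field


@dataclass
class DeadLetterQueue:
    entries: list = field(default_factory=list)  # jobs awaiting inspection
    alerts: list = field(default_factory=list)   # stand-in for an alerting hook

    def add(self, job_id: str, error: str, attempts: int) -> None:
        """Store a job whose retries are exhausted and raise an alert."""
        self.entries.append({"jobId": job_id, "error": error, "attempts": attempts})
        self.alerts.append(f"DLQ: job {job_id} failed after {attempts} attempts: {error}")

    def retry(self, job_id: str, enqueue) -> bool:
        """Manual retry: remove the entry and hand the job back to enqueue()."""
        for index, entry in enumerate(self.entries):
            if entry["jobId"] == job_id:
                del self.entries[index]
                enqueue(job_id)
                return True
        return False
```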

Traceability

Correlation IDs

Every run and job includes correlation IDs for end-to-end traceability:

  • runId — Unique run identifier (correlates all jobs in a run)
  • jobId — Unique job identifier
  • traceId — Distributed tracing identifier (OpenTelemetry trace ID)
  • spanId — Span identifier within a trace
  • tenantId — Tenant identifier for multi-tenant scenarios

External System Correlation

The Factory correlates runs with external systems:

  • Azure DevOps buildId — Links Factory runs to Azure DevOps builds
  • repoId — Links runs to generated repositories
  • pipelineId — Links runs to generated pipelines
  • workItemId — Links runs to Azure DevOps work items

Trace Propagation

Correlation IDs are propagated through:

  • HTTP Headers — Trace IDs in HTTP request/response headers
  • Event Metadata — Trace IDs in event bus messages
  • Log Context — Trace IDs in structured logs
  • Database Records — Trace IDs stored in run state records
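For HTTP propagation, OpenTelemetry commonly uses the W3C Trace Context `traceparent` header, which carries a 32-hex-digit trace ID and a 16-hex-digit span ID. The sketch below builds and parses that header and emits a structured log line. It assumes the Factory uses this header format, which the text above does not confirm, and the helper names are hypothetical.

```python
import json
import re
import secrets

# Version 00 layout: 00-<trace-id: 32 hex>-<parent span id: 16 hex>-<flags: 2 hex>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_context() -> dict:
    """Generate a random 16-byte trace ID and 8-byte span ID, hex-encoded."""
    return {"traceId": secrets.token_hex(16), "spanId": secrets.token_hex(8)}


def to_traceparent(ctx: dict, sampled: bool = True) -> str:
    """Serialize the context as a traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{ctx['traceId']}-{ctx['spanId']}-{flags}"


def from_traceparent(header: str) -> dict:
    """Extract trace and span IDs from a traceparent header value."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return {"traceId": match.group(1), "spanId": match.group(2)}


def log_line(ctx: dict, run_id: str, message: str) -> str:
    """Structured log entry that carries the correlation IDs."""
    return json.dumps({"runId": run_id, **ctx, "message": message}, sort_keys=True)
```

The same `traceId` value can then be copied into event metadata and stored on the run record, so that logs, messages, and database rows can be joined on it.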

Execution Sequence Diagram

sequenceDiagram
    participant Client
    participant API
    participant Orchestrator
    participant RunStore
    participant Queue
    participant Worker
    participant ExternalSystem

    Client->>API: POST /runs
    API->>Orchestrator: CreateRun(request)
    Orchestrator->>RunStore: Create Run (Requested, runId, traceId)
    Orchestrator->>Orchestrator: Validate Request
    Orchestrator->>RunStore: Update Run (Validated)
    Orchestrator->>Orchestrator: Break into Jobs
    Orchestrator->>Queue: Enqueue Job1 (idempotency key)
    Orchestrator->>Queue: Enqueue Job2 (idempotency key)
    Orchestrator->>RunStore: Update Run (Queued)

    Worker->>Queue: Dequeue Job1
    Worker->>RunStore: Update Job1 (Running)
    Worker->>ExternalSystem: Execute (Git, Azure DevOps)
    ExternalSystem-->>Worker: Result
    Worker->>RunStore: Update Job1 (Succeeded, result)

    Worker->>Queue: Dequeue Job2
    Worker->>RunStore: Update Job2 (Running)
    Worker->>ExternalSystem: Execute
    ExternalSystem-->>Worker: Error (transient)
    Worker->>Queue: Retry Job2 (attempt 2)
    Worker->>ExternalSystem: Retry Execute
    ExternalSystem-->>Worker: Result
    Worker->>RunStore: Update Job2 (Succeeded, result)

    Orchestrator->>RunStore: Check All Jobs Complete
    Orchestrator->>RunStore: Update Run (Succeeded)
    Orchestrator->>Client: Run Complete Notification