
Failure Modes & Recovery

Overview

The Factory is designed to handle failures gracefully, ensuring that transient errors don't cause permanent failures and that runs can recover from partial failures. This document describes failure types, retry policies, resume capabilities, and compensation patterns.


Failure Types

Validation Errors

Description: Errors that occur during input validation or pre-flight checks.

Examples:

  • Invalid template recipe ID
  • Missing required configuration parameters
  • Policy violations (e.g., resource quota exceeded)
  • Authentication/authorization failures

Characteristics:

  • Non-Retryable — Validation errors are permanent and won't succeed on retry
  • Immediate Failure — Failures occur before job execution starts
  • User Action Required — Typically require the user to fix the input and retry

Handling:

  • Fail immediately without retry
  • Return detailed error messages to the user
  • Log validation failures for analysis
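
A minimal sketch of how a pre-flight check might fail fast with a non-retryable error. The ValidationError class, field names, and validate_request function are illustrative, not actual Factory APIs.

    class ValidationError(Exception):
        """Permanent input error; the orchestrator fails the run without retrying."""
        retryable = False

    def validate_request(request: dict, known_recipes: set) -> None:
        # Fail fast, before any job is queued, with a detailed user-facing message.
        if request.get("recipe_id") not in known_recipes:
            raise ValidationError(f"Unknown template recipe ID: {request.get('recipe_id')!r}")
        missing = [k for k in ("project", "repo_name") if not request.get(k)]
        if missing:
            raise ValidationError(f"Missing required configuration parameters: {', '.join(missing)}")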

External System Errors

Description: Errors from external systems (Azure DevOps, Git providers, cloud services).

Examples:

  • Azure DevOps API rate limits
  • Git repository creation failures
  • Network timeouts
  • Service unavailability (temporary)

Characteristics:

  • Often Retryable — Many external system errors are transient
  • Backoff Required — Rate limits require exponential backoff
  • Timeout Handling — Network timeouts may succeed on retry

Handling:

  • Automatic retry with exponential backoff
  • Respect rate limits and retry-after headers
  • Circuit breaker pattern for persistent failures
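
A sketch of how a caller might honor a rate-limit response before falling back to exponential backoff. Retry-After is the standard HTTP header (assumed here in its delta-seconds form); the function and parameter names are illustrative.

    import random

    def next_delay(attempt: int, headers: dict, base: float = 1.0, cap: float = 60.0) -> float:
        """Seconds to wait before the next call to an external system."""
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            # Respect the server's explicit rate-limit window when it provides one.
            return min(float(retry_after), cap)
        # Otherwise use exponential backoff with jitter (see Backoff Strategy below).
        delay = min(base * (2 ** (attempt - 1)), cap)
        return delay + random.uniform(0, delay / 2)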

Agent Errors

Description: Errors from AI agents during execution (LLM issues, tool errors, reasoning failures).

Examples:

  • LLM API failures (timeouts, rate limits, service errors)
  • Tool execution errors (Git operations, file system errors)
  • Agent reasoning failures (invalid output, parsing errors)
  • Context window exceeded

Characteristics:

  • Mixed Retryability — Some agent errors are retryable (API failures), others are not (reasoning failures)
  • Context Dependent — May succeed with different context or prompt
  • Fallback Options — Can fall back to simpler deterministic flows

Handling:

  • Retry transient API failures
  • Re-run with same context for reasoning failures
  • Fall back to deterministic flow if agent fails repeatedly
  • Escalate to human review if all retries fail

Infrastructure Errors

Description: Errors from Factory infrastructure (worker crashes, queue issues, database failures).

Examples:

  • Worker process crashes
  • Queue unavailability
  • Database connection failures
  • Kubernetes pod evictions

Characteristics:

  • Highly Retryable — Infrastructure errors are typically transient
  • Automatic Recovery — Infrastructure can recover automatically (pod restarts, queue reconnection)
  • State Preservation — Run state is preserved in the database, enabling resume

Handling:

  • Automatic worker restart and job reassignment
  • Queue reconnection with message replay
  • Database retry with connection pooling
  • Resume from last successful checkpoint
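
A sketch of checkpoint-based resume after an infrastructure failure: progress is persisted after every step, so a restarted worker skips work that already completed. The in-memory dict stands in for the run database, and all names are illustrative.

    from typing import Callable

    completed_steps: dict[str, set[str]] = {}  # run_id -> completed step names (a database table in practice)

    def run_steps(run_id: str, steps: list[tuple[str, Callable[[], None]]]) -> None:
        done = completed_steps.setdefault(run_id, set())
        for name, step in steps:
            if name in done:
                continue       # checkpointed before the crash; skipped on resume
            step()             # may raise if the worker dies or a dependency fails
            done.add(name)     # checkpoint immediately after the step succeeds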


Retry Policy

Which Failures Are Retried

The Factory automatically retries transient failures; permanent errors fail immediately and are not retried (a classification sketch follows this list):

  • Network Timeouts — Retry with exponential backoff
  • Rate Limits — Retry after rate limit window expires
  • Service Unavailability — Retry with exponential backoff
  • Infrastructure Failures — Retry after infrastructure recovers
  • Validation Errors — Not retried (permanent errors)
  • Authentication Failures — Not retried (require user action)
  • Permanent API Errors — Not retried (e.g., "resource not found")
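
One way to express this decision as a predicate over error categories. The category strings are stand-ins for however the Factory actually models failures.

    RETRYABLE = {"network_timeout", "rate_limit", "service_unavailable", "infrastructure"}
    PERMANENT = {"validation", "authentication", "not_found"}

    def is_retryable(error_category: str) -> bool:
        if error_category in PERMANENT:
            return False
        return error_category in RETRYABLE  # unknown categories default to non-retryable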

Backoff Strategy

The Factory uses exponential backoff with jitter:

  • Attempt 1: Immediate (0 seconds)
  • Attempt 2: 1 second + jitter (0-500ms)
  • Attempt 3: 2 seconds + jitter (0-1s)
  • Attempt 4: 4 seconds + jitter (0-2s)
  • Attempt 5: 8 seconds + jitter (0-4s)

Jitter prevents thundering herd problems when multiple workers retry simultaneously.
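
A minimal sketch of the schedule above: attempt 1 runs immediately, and each later attempt waits 2^(n-2) seconds plus a random jitter of up to half that value. The function name is illustrative.

    import random

    def backoff_delay(attempt: int) -> float:
        """Delay in seconds before the given attempt (attempts are numbered from 1)."""
        if attempt <= 1:
            return 0.0                               # attempt 1 is immediate
        base = 2.0 ** (attempt - 2)                  # 1s, 2s, 4s, 8s, ...
        return base + random.uniform(0, base / 2)    # jitter avoids thundering herds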

Max Retry Counts

Retry counts are configurable per job type:

  • High Priority Jobs: 5 attempts
  • Normal Priority Jobs: 3 attempts
  • Low Priority Jobs: 2 attempts

Escalation to "Failed"

After max retries are exhausted (see the escalation sketch after this list):

  1. Job Status → Failed
  2. Run Status → Failed (if critical job) or continues (if non-critical)
  3. Dead Letter Queue → Job moved to DLQ for manual inspection
  4. Alerting → Operations team notified
  5. User Notification → User notified of failure
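
A sketch of that escalation path, reusing is_retryable and backoff_delay from the earlier sketches. The per-priority limits mirror the counts above; the queue, alerting, and notification helpers are hypothetical.

    MAX_ATTEMPTS = {"high": 5, "normal": 3, "low": 2}

    def handle_job_failure(job, error) -> None:
        if is_retryable(error.category) and job.attempts < MAX_ATTEMPTS[job.priority]:
            requeue_with_delay(job, backoff_delay(job.attempts + 1))
            return
        job.status = "Failed"
        dead_letter_queue.put(job)       # preserve the payload for manual inspection
        alert_operations(job, error)     # notify the operations team
        notify_user(job.run_id, error)   # surface the failure to the requesting user
        if job.critical:
            fail_run(job.run_id)         # non-critical failures let the run continue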

Resume & Partial Completion

Resume Capabilities

The Factory supports resuming runs from the last successful step:

  • Per Run — Resume entire run from last successful job
  • Per Step/Job — Resume individual jobs that failed

Resume Rules

Idempotent Steps

Idempotent steps can be safely retried:

  • Generate Repository — Check if repo exists before creating
  • Create Branch — Check if branch exists before creating
  • Generate Files — Overwrite existing files (idempotent)
  • Update Work Items — Update operations are idempotent

Resume Strategy:

  • Skip already-completed steps
  • Retry failed steps from the beginning
  • Continue from next pending step
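
A sketch of making the repository step idempotent by checking for the resource before creating it. repo_exists, get_repo_url, and create_repo stand in for whatever Git-provider client is actually used.

    def ensure_repository(project: str, name: str) -> str:
        # Safe to retry: a second invocation finds the existing repo and returns it unchanged.
        if repo_exists(project, name):
            return get_repo_url(project, name)
        return create_repo(project, name)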

Non-Idempotent Steps

Non-idempotent steps require special handling:

  • ⚠️ Delete Resources — Cannot safely retry (may delete already-deleted resources)
  • ⚠️ Append Operations — May cause duplicates on retry
  • ⚠️ Stateful Operations — May have side effects

Resume Strategy:

  • Option 1: Fully compensate — Roll back and retry from the beginning
  • Option 2: Manual intervention — Require human review before retry
  • Option 3: Skip and continue — Mark as failed, continue with next step

Partial Completion

Runs can partially complete (some jobs succeed, others fail):

  • Non-Critical Jobs — Run can succeed even if some non-critical jobs fail
  • Critical Jobs — Run fails if any critical job fails
  • Dependency Handling — Failed jobs block dependent jobs

Example:

Run: Generate Microservice
├── Job 1: Generate Repo ✅ (Succeeded)
├── Job 2: Generate Code ✅ (Succeeded)
├── Job 3: Generate Tests ❌ (Failed - non-critical)
├── Job 4: Generate Pipelines ✅ (Succeeded)
└── Job 5: Generate Docs ⚠️ (Skipped - depends on Job 3)

Result: Run Partially Succeeded (3/5 jobs succeeded)
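
A sketch of how the outcome above could be derived from individual job results; the job records and status strings are illustrative.

    def run_outcome(jobs: list[dict]) -> str:
        """Each job: {'status': 'Succeeded' | 'Failed' | 'Skipped', 'critical': bool}."""
        if any(j["status"] == "Failed" and j["critical"] for j in jobs):
            return "Failed"                          # any critical failure fails the run
        if all(j["status"] == "Succeeded" for j in jobs):
            return "Succeeded"
        return "Partially Succeeded"                 # non-critical failures or skipped jobs

    jobs = [
        {"status": "Succeeded", "critical": True},   # Job 1: Generate Repo
        {"status": "Succeeded", "critical": True},   # Job 2: Generate Code
        {"status": "Failed",    "critical": False},  # Job 3: Generate Tests
        {"status": "Succeeded", "critical": True},   # Job 4: Generate Pipelines
        {"status": "Skipped",   "critical": False},  # Job 5: Generate Docs (blocked by Job 3)
    ]
    assert run_outcome(jobs) == "Partially Succeeded"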


Rollback & Compensation

Soft Rollback

Soft rollback marks the run as failed but leaves external artifacts in place:

  • Run Status → Failed
  • Artifacts Preserved — Generated repos, pipelines remain (for debugging)
  • User Action — User can manually clean up or fix artifacts
  • Audit Trail — Failure is logged for analysis

Use Cases:

  • Non-critical failures
  • Partial completion scenarios
  • When artifacts may be useful for debugging

Compensating Workflows

Compensating workflows automatically undo or clean up work:

  • Delete Generated Repos — Remove repositories created by failed run
  • Cancel Pipelines — Cancel or delete generated pipelines
  • Clean Up Resources — Remove cloud resources (if any)
  • Revert Commits — Revert commits made during failed run

Use Cases:

  • Critical failures that require cleanup
  • When leaving artifacts would cause confusion
  • Compliance requirements (data cleanup)

Compensation Patterns

flowchart TD
    RunStart[Run Started]
    Job1[Job 1: Create Repo]
    Job2[Job 2: Generate Code]
    Job3[Job 3: Create Pipeline]
    Failure{Job 3 Failed}
    Compensate[Compensating Workflow]
    DeleteRepo[Delete Repo]
    DeleteCode[Delete Code]
    RunFailed[Run Failed]

    RunStart --> Job1
    Job1 --> Job2
    Job2 --> Job3
    Job3 --> Failure
    Failure --> Compensate
    Compensate --> DeleteCode
    DeleteCode --> DeleteRepo
    DeleteRepo --> RunFailed
Hold "Alt" / "Option" to enable pan & zoom

Agent Failures

Handling AI Agent Failures

AI agents can fail in different ways:

LLM API Failures

  • Retry Strategy: Retry with exponential backoff
  • Fallback: Use cached responses or simpler models
  • Escalation: Escalate to human if all retries fail

Agent Reasoning Failures

  • Re-run Strategy: Re-run agent with same context
  • Context Adjustment: Adjust prompt or context if reasoning fails
  • Fallback: Fall back to deterministic template-based generation
  • Escalation: Escalate to human review if agent consistently fails

Tool Execution Failures

  • Retry Strategy: Retry tool execution (if idempotent)
  • Error Handling: Return detailed error to agent for retry
  • Fallback: Skip tool if non-critical, or escalate if critical
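
A sketch of the recovery behavior described above: transient LLM errors are retried with backoff (reusing backoff_delay from the Backoff Strategy sketch), invalid output triggers a re-run, tool errors are returned to the agent, and repeated failure falls back to the deterministic flow before escalating. The exception classes, agent/context interfaces, and fallback helpers are illustrative.

    import time

    class TransientLLMError(Exception): ...   # timeouts, rate limits, 5xx from the LLM API
    class ReasoningError(Exception): ...      # invalid or unparseable agent output
    class ToolError(Exception): ...           # Git / file-system tool failures

    def run_agent_step(agent, context, max_attempts: int = 3):
        for attempt in range(1, max_attempts + 1):
            try:
                return agent.run(context)                  # success: valid, parsed output
            except TransientLLMError:
                time.sleep(backoff_delay(attempt + 1))     # retry API failures with backoff
            except ReasoningError:
                continue                                   # re-run with the same context first
            except ToolError as err:
                context = context.with_tool_error(err)     # return the error to the agent and retry
        try:
            return deterministic_template_flow(context)    # fall back to template-based generation
        except Exception:
            return escalate_to_human_review(context)       # last resort: human review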

Agent Failure Recovery Flow

stateDiagram-v2
    [*] --> AgentExecuting
    AgentExecuting --> LLMFailure: LLM API Error
    AgentExecuting --> ReasoningFailure: Invalid Output
    AgentExecuting --> ToolFailure: Tool Error
    AgentExecuting --> Success: Success

    LLMFailure --> Retry: Retry with Backoff
    ReasoningFailure --> ReRun: Re-run with Context
    ToolFailure --> RetryTool: Retry Tool

    Retry --> AgentExecuting
    ReRun --> AgentExecuting
    RetryTool --> AgentExecuting

    Retry --> Fallback: Max Retries
    ReRun --> Fallback: Max Retries
    RetryTool --> Fallback: Max Retries

    Fallback --> DeterministicFlow: Use Template
    Fallback --> HumanReview: Escalate

    Success --> [*]
    DeterministicFlow --> [*]
    HumanReview --> [*]
Hold "Alt" / "Option" to enable pan & zoom

Failure State Diagram

stateDiagram-v2
    [*] --> Requested
    Requested --> Validated: Validation Passed
    Requested --> Failed: Validation Failed
    Validated --> Queued
    Queued --> Running
    Running --> JobRunning
    JobRunning --> JobSucceeded: Job Success
    JobRunning --> JobFailed: Job Failure
    JobFailed --> Retrying: Retry Available
    JobFailed --> JobFailedFinal: Max Retries
    Retrying --> JobRunning
    JobSucceeded --> AllJobsComplete: All Jobs Done
    JobFailedFinal --> RunFailed: Critical Job
    JobFailedFinal --> PartialSuccess: Non-Critical
    AllJobsComplete --> Succeeded
    RunFailed --> Compensating: Compensation Needed
    RunFailed --> Failed: No Compensation
    Compensating --> Failed
    PartialSuccess --> Succeeded
    Succeeded --> [*]
    Failed --> [*]
Hold "Alt" / "Option" to enable pan & zoom