Failure Modes & Recovery¶
Overview¶
The Factory is designed to handle failures gracefully: transient errors are retried rather than becoming permanent failures, and runs can recover from partial completion. This document describes failure types, retry policies, resume capabilities, and compensation patterns.
Failure Types¶
Validation Errors¶
Description: Errors that occur during input validation or pre-flight checks.
Examples:
- Invalid template recipe ID
- Missing required configuration parameters
- Policy violations (e.g., resource quota exceeded)
- Authentication/authorization failures
Characteristics:
- Non-Retryable — Validation errors are permanent and won't succeed on retry
- Immediate Failure — Failures occur before job execution starts
- User Action Required — Typically require user to fix input and retry
Handling:
- Fail immediately without retry
- Return detailed error messages to user
- Log validation failures for analysis
External System Errors¶
Description: Errors from external systems (Azure DevOps, Git providers, cloud services).
Examples:
- Azure DevOps API rate limits
- Git repository creation failures
- Network timeouts
- Service unavailability (temporary)
Characteristics:
- Often Retryable — Many external system errors are transient
- Backoff Required — Rate limits require exponential backoff
- Timeout Handling — Network timeouts may succeed on retry
Handling:
- Automatic retry with exponential backoff
- Respect rate limits and retry-after headers
- Circuit breaker pattern for persistent failures (see the sketch below)
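The circuit breaker mentioned above can be sketched as follows. This is a minimal illustrative implementation, not the Factory's actual code; the class name, thresholds, and timeouts are assumptions.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: stops calling an external system after
    repeated failures, then allows a trial request once a cool-down elapses."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        # Closed circuit: requests pass through normally.
        if self.opened_at is None:
            return True
        # Open circuit: allow a trial request once the reset timeout has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()
```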
Agent Errors¶
Description: Errors from AI agents during execution (LLM issues, tool errors, reasoning failures).
Examples:
- LLM API failures (timeouts, rate limits, service errors)
- Tool execution errors (Git operations, file system errors)
- Agent reasoning failures (invalid output, parsing errors)
- Context window exceeded
Characteristics:
- Mixed Retryability — Some agent errors are retryable (API failures), others are not (reasoning failures)
- Context Dependent — May succeed with different context or prompt
- Fallback Options — Can fall back to simpler deterministic flows
Handling:
- Retry transient API failures
- Re-run with same context for reasoning failures
- Fall back to deterministic flow if agent fails repeatedly
- Escalate to human review if all retries fail
Infrastructure Errors¶
Description: Errors from Factory infrastructure (worker crashes, queue issues, database failures).
Examples:
- Worker process crashes
- Queue unavailability
- Database connection failures
- Kubernetes pod evictions
Characteristics:
- Highly Retryable — Infrastructure errors are typically transient
- Automatic Recovery — Infrastructure can recover automatically (pod restarts, queue reconnection)
- State Preservation — Run state is preserved in database, enabling resume
Handling:
- Automatic worker restart and job reassignment
- Queue reconnection with message replay
- Database retry with connection pooling
- Resume from last successful checkpoint
Retry Policy¶
Which Failures Are Retried¶
The Factory retries transient failures automatically:
- ✅ Network Timeouts — Retry with exponential backoff
- ✅ Rate Limits — Retry after rate limit window expires
- ✅ Service Unavailability — Retry with exponential backoff
- ✅ Infrastructure Failures — Retry after infrastructure recovers
- ❌ Validation Errors — Not retried (permanent errors)
- ❌ Authentication Failures — Not retried (require user action)
- ❌ Permanent API Errors — Not retried (e.g., "resource not found")
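As a sketch of this classification, a worker might map error types to a retry decision roughly as follows. The exception names are hypothetical placeholders for whatever error types the Factory actually raises.

```python
# Hypothetical error types, used only for illustration.
class ValidationError(Exception): ...
class AuthenticationError(Exception): ...
class RateLimitError(Exception): ...
class ServiceUnavailableError(Exception): ...

NON_RETRYABLE = (ValidationError, AuthenticationError)
RETRYABLE = (RateLimitError, ServiceUnavailableError, TimeoutError, ConnectionError)

def is_retryable(error: Exception) -> bool:
    """Permanent errors fail immediately; transient errors are retried."""
    if isinstance(error, NON_RETRYABLE):
        return False
    return isinstance(error, RETRYABLE)
```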
Backoff Strategy¶
The Factory uses exponential backoff with jitter:
- Attempt 1: Immediate (0 seconds)
- Attempt 2: 1 second + jitter (0-500ms)
- Attempt 3: 2 seconds + jitter (0-1s)
- Attempt 4: 4 seconds + jitter (0-2s)
- Attempt 5: 8 seconds + jitter (0-4s)
Jitter prevents thundering herd problems when multiple workers retry simultaneously.
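A minimal sketch of this schedule in Python, assuming a configurable base delay and cap (the function name and defaults are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before a given retry attempt (attempt 1 runs immediately).

    Matches the schedule above: 0s, then 1s, 2s, 4s, 8s, ..., each with
    up to 50% additional random jitter to avoid thundering-herd retries.
    """
    if attempt <= 1:
        return 0.0
    delay = min(cap, base * 2 ** (attempt - 2))   # 1, 2, 4, 8, ...
    return delay + random.uniform(0, delay / 2)   # jitter in [0, delay/2]
```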
Max Retry Counts¶
Retry counts are configurable per job type:
- High Priority Jobs: 5 attempts
- Normal Priority Jobs: 3 attempts
- Low Priority Jobs: 2 attempts
Escalation to "Failed"¶
After max retries are exhausted:
- Job Status → Failed
- Run Status → Failed (if critical job) or continues (if non-critical)
- Dead Letter Queue → Job moved to DLQ for manual inspection
- Alerting → Operations team notified
- User Notification → User notified of failure
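Putting these steps together, a worker's exhausted-retries handler might look roughly like this. The `job`, `run`, `dlq`, and `alerts` objects are hypothetical stand-ins for the Factory's actual components.

```python
def handle_exhausted_retries(job, run, dlq, alerts) -> None:
    """Illustrative escalation path once a job has used up its retry budget."""
    job.status = "Failed"
    dlq.publish(job)                           # Dead-letter queue for manual inspection
    alerts.notify_operations(job)              # Alert the operations team
    alerts.notify_user(run.requested_by, job)  # Notify the requesting user
    if job.is_critical:
        run.status = "Failed"                  # A critical job failure fails the whole run
    # Non-critical failures leave the run in progress; remaining jobs continue.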
Resume & Partial Completion¶
Resume Capabilities¶
The Factory supports resuming runs from the last successful step:
- Per Run — Resume entire run from last successful job
- Per Step/Job — Resume individual jobs that failed
Resume Rules¶
Idempotent Steps¶
Idempotent steps can be safely retried:
- ✅ Generate Repository — Check if repo exists before creating
- ✅ Create Branch — Check if branch exists before creating
- ✅ Generate Files — Overwrite existing files (idempotent)
- ✅ Update Work Items — Update operations are idempotent
Resume Strategy:
- Skip already-completed steps
- Retry failed steps from beginning
- Continue from next pending step
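For example, the Generate Repository step can be made idempotent by checking for an existing repository before creating one, so a resumed run that already completed this step simply skips the create call. The `client` API below is a hypothetical stand-in, not the actual Azure DevOps client.

```python
def ensure_repository(client, project: str, repo_name: str):
    """Idempotent 'Generate Repository' step: safe to retry or resume."""
    existing = client.get_repository(project, repo_name)   # hypothetical client API
    if existing is not None:
        return existing                                     # already created on a prior attempt
    return client.create_repository(project, repo_name)
```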
Non-Idempotent Steps¶
Non-idempotent steps require special handling:
- ⚠️ Delete Resources — Cannot safely retry (may delete already-deleted resources)
- ⚠️ Append Operations — May cause duplicates on retry
- ⚠️ Stateful Operations — May have side effects
Resume Strategy:
- Option 1: Fully compensating — Rollback and retry from beginning
- Option 2: Manual intervention — Require human review before retry
- Option 3: Skip and continue — Mark as failed, continue with next step
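A sketch of how one of these options might be selected for a non-idempotent step; the step attributes and strategy names are illustrative assumptions, not the Factory's actual model.

```python
from enum import Enum

class ResumeStrategy(Enum):
    COMPENSATE_AND_RETRY = "compensate_and_retry"  # Option 1: roll back, retry from the beginning
    MANUAL_INTERVENTION = "manual_intervention"    # Option 2: hold for human review
    SKIP_AND_CONTINUE = "skip_and_continue"        # Option 3: mark failed, continue

def resume_strategy_for(step) -> ResumeStrategy:
    """Pick a resume strategy for a non-idempotent step (illustrative rules)."""
    if step.has_compensation:
        return ResumeStrategy.COMPENSATE_AND_RETRY
    if step.is_critical:
        return ResumeStrategy.MANUAL_INTERVENTION
    return ResumeStrategy.SKIP_AND_CONTINUE
```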
Partial Completion¶
Runs can partially complete (some jobs succeed, others fail):
- Non-Critical Jobs — Run can succeed even if some non-critical jobs fail
- Critical Jobs — Run fails if any critical job fails
- Dependency Handling — Failed jobs block dependent jobs
Example:
Run: Generate Microservice
├── Job 1: Generate Repo ✅ (Succeeded)
├── Job 2: Generate Code ✅ (Succeeded)
├── Job 3: Generate Tests ❌ (Failed - non-critical)
├── Job 4: Generate Pipelines ✅ (Succeeded)
└── Job 5: Generate Docs ⚠️ (Skipped - depends on Job 3)
Result: Run Partially Succeeded (3/5 jobs succeeded)
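The run outcome in the example above can be derived mechanically from the job outcomes. A minimal sketch, assuming each job exposes a `status` ("Succeeded", "Failed", or "Skipped") and an `is_critical` flag:

```python
def evaluate_run(jobs) -> str:
    """Derive the run outcome from job outcomes, per the rules above."""
    if any(j.status == "Failed" and j.is_critical for j in jobs):
        return "Failed"                 # any critical failure fails the run
    if all(j.status == "Succeeded" for j in jobs):
        return "Succeeded"
    return "Partially Succeeded"        # some non-critical jobs failed or were skipped
```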
Rollback & Compensation¶
Soft Rollback¶
Soft rollback marks the run as failed but leaves external artifacts:
- Run Status → Failed
- Artifacts Preserved — Generated repos, pipelines remain (for debugging)
- User Action — User can manually clean up or fix artifacts
- Audit Trail — Failure is logged for analysis
Use Cases:
- Non-critical failures
- Partial completion scenarios
- When artifacts may be useful for debugging
Compensating Workflows¶
Compensating workflows automatically undo or clean up work:
- Delete Generated Repos — Remove repositories created by failed run
- Cancel Pipelines — Cancel or delete generated pipelines
- Clean Up Resources — Remove cloud resources (if any)
- Revert Commits — Revert commits made during failed run
Use Cases:
- Critical failures that require cleanup
- When leaving artifacts would cause confusion
- Compliance requirements (data cleanup)
Compensation Patterns¶
flowchart TD
RunStart[Run Started]
Job1[Job 1: Create Repo]
Job2[Job 2: Generate Code]
Job3[Job 3: Create Pipeline]
Failure{Job 3 Failed}
Compensate[Compensating Workflow]
DeleteRepo[Delete Repo]
DeleteCode[Delete Code]
RunFailed[Run Failed]
RunStart --> Job1
Job1 --> Job2
Job2 --> Job3
Job3 --> Failure
Failure --> Compensate
Compensate --> DeleteCode
DeleteCode --> DeleteRepo
DeleteRepo --> RunFailed
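A minimal sketch of the compensation pattern shown in the flowchart above, assuming each step exposes hypothetical `execute()` and `compensate()` methods (a simple saga-style rollback):

```python
def run_with_compensation(steps) -> None:
    """Execute steps in order; on failure, compensate completed steps in reverse.

    Mirrors the flowchart: if Job 3 (Create Pipeline) fails, the generated
    code is deleted first, then the repository, and the run is marked failed.
    """
    completed = []
    try:
        for step in steps:
            step.execute()
            completed.append(step)
    except Exception:
        for step in reversed(completed):   # LIFO: undo most recent work first
            step.compensate()
        raise                              # surface the failure so the run is marked Failed
```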
Agent Failures¶
Handling AI Agent Failures¶
AI agents can fail in different ways:
LLM API Failures¶
- Retry Strategy: Retry with exponential backoff
- Fallback: Use cached responses or simpler models
- Escalation: Escalate to human if all retries fail
Agent Reasoning Failures¶
- Re-run Strategy: Re-run agent with same context
- Context Adjustment: Adjust prompt or context if reasoning fails
- Fallback: Fall back to deterministic template-based generation
- Escalation: Escalate to human review if agent consistently fails
Tool Execution Failures¶
- Retry Strategy: Retry tool execution (if idempotent)
- Error Handling: Return detailed error to agent for retry
- Fallback: Skip tool if non-critical, or escalate if critical
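Tying these strategies together, an agent step's recovery loop might look roughly like the sketch below (see the recovery flow diagram that follows). The exception types, `agent` interface, and fallback callable are assumptions, not the Factory's actual API.

```python
import random
import time

class LLMAPIError(Exception): ...          # hypothetical: transient LLM/API failure
class AgentReasoningError(Exception): ...  # hypothetical: invalid or unparseable output

def run_agent_step(agent, context, deterministic_fallback, max_retries: int = 3):
    """Retry transient LLM errors, re-run reasoning failures, then fall back."""
    for attempt in range(1, max_retries + 1):
        try:
            return agent.execute(context)                      # hypothetical agent API
        except LLMAPIError:
            time.sleep(2 ** (attempt - 1) + random.random())   # retry with backoff + jitter
        except AgentReasoningError:
            continue                                           # re-run with the same context
    # Retries exhausted: fall back to the deterministic template-based flow
    # (or escalate to human review if no fallback is available).
    return deterministic_fallback(context)
```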
Agent Failure Recovery Flow¶
stateDiagram-v2
[*] --> AgentExecuting
AgentExecuting --> LLMFailure: LLM API Error
AgentExecuting --> ReasoningFailure: Invalid Output
AgentExecuting --> ToolFailure: Tool Error
AgentExecuting --> Success: Success
LLMFailure --> Retry: Retry with Backoff
ReasoningFailure --> ReRun: Re-run with Context
ToolFailure --> RetryTool: Retry Tool
Retry --> AgentExecuting
ReRun --> AgentExecuting
RetryTool --> AgentExecuting
Retry --> Fallback: Max Retries
ReRun --> Fallback: Max Retries
RetryTool --> Fallback: Max Retries
Fallback --> DeterministicFlow: Use Template
Fallback --> HumanReview: Escalate
Success --> [*]
DeterministicFlow --> [*]
HumanReview --> [*]
Failure State Diagram¶
stateDiagram-v2
[*] --> Requested
Requested --> Validated: Validation Passed
Requested --> Failed: Validation Failed
Validated --> Queued
Queued --> Running
Running --> JobRunning
JobRunning --> JobSucceeded: Job Success
JobRunning --> JobFailed: Job Failure
JobFailed --> Retrying: Retry Available
JobFailed --> JobFailedFinal: Max Retries
Retrying --> JobRunning
JobSucceeded --> AllJobsComplete: All Jobs Done
JobFailedFinal --> RunFailed: Critical Job
JobFailedFinal --> PartialSuccess: Non-Critical
AllJobsComplete --> Succeeded
RunFailed --> Compensating: Compensation Needed
RunFailed --> Failed: No Compensation
Compensating --> Failed
PartialSuccess --> Succeeded
Succeeded --> [*]
Failed --> [*]
Related Documentation¶
- Execution Engine — How runs and jobs are executed
- Control Plane — How control plane handles failures
- Observability — How failures are monitored and alerted
- State & Memory — How state enables resume and recovery