
Observability

Overview

Observability is a first-class requirement for the Factory runtime. Every run, job, agent action, and external call must be fully observable through traces, metrics, and logs. This enables rapid detection, diagnosis, and resolution of issues, as well as performance optimization and cost tracking.


Traces

Required Spans

Every Factory operation generates distributed tracing spans:

Overall Run Span

  • Span Name: factory.run.execute
  • Duration: Entire run lifecycle (from request to completion)
  • Attributes:
      • run.id — Run identifier
      • run.tenant_id — Tenant identifier
      • run.project_id — Project identifier
      • run.template_recipe_id — Template recipe identifier
      • run.status — Run status (Requested, Running, Succeeded, Failed)
      • run.duration_ms — Total run duration
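
A minimal sketch of emitting this span with the OpenTelemetry Python SDK, assuming a hypothetical run object and do_work helper (run.duration_ms can also be derived from the span's own start/end timestamps):

from opentelemetry import trace

tracer = trace.get_tracer("factory.orchestrator")

def execute_run(run):
    # Span covers the entire run lifecycle; attributes mirror the list above.
    with tracer.start_as_current_span("factory.run.execute") as span:
        span.set_attribute("run.id", run.id)
        span.set_attribute("run.tenant_id", run.tenant_id)
        span.set_attribute("run.project_id", run.project_id)
        span.set_attribute("run.template_recipe_id", run.template_recipe_id)
        status = do_work(run)                     # hypothetical run executor
        span.set_attribute("run.status", status)  # Requested/Running/Succeeded/Failed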

Per Step/Job Span

  • Span Name: factory.job.execute
  • Duration: Individual job execution
  • Attributes:
      • job.id — Job identifier
      • job.step_name — Step name (e.g., "generate-repo")
      • job.attempt — Retry attempt number
      • job.status — Job status
      • job.duration_ms — Job execution duration
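
Job spans follow the same pattern; because they are opened while the run span is current, they nest under it automatically in the trace. A brief sketch, again with hypothetical job and run_step helpers:

from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")

def execute_job(job):
    with tracer.start_as_current_span("factory.job.execute") as span:
        span.set_attribute("job.id", job.id)
        span.set_attribute("job.step_name", job.step_name)  # e.g. "generate-repo"
        span.set_attribute("job.attempt", job.attempt)      # retry attempt number
        span.set_attribute("job.status", run_step(job))     # hypothetical step runner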

External Calls Span

  • Span Name: factory.external.{service}
  • Duration: External system call duration
  • Examples:
      • factory.external.azure_devops — Azure DevOps API calls
      • factory.external.git — Git operations
      • factory.external.llm — LLM API calls
      • factory.external.storage — Storage operations
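
One convenient way to get the factory.external.{service} naming is a small wrapper around outbound calls. The sketch below assumes a hypothetical call_llm client:

from contextlib import contextmanager
from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")

@contextmanager
def external_span(service: str):
    # Produces span names such as factory.external.llm or factory.external.git
    with tracer.start_as_current_span(f"factory.external.{service}") as span:
        yield span

def generate_code(prompt):
    with external_span("llm"):
        return call_llm(prompt)  # hypothetical LLM client call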

Correlation IDs

Every span includes correlation IDs for end-to-end traceability:

  • traceId — OpenTelemetry trace ID (correlates all spans in a trace)
  • spanId — OpenTelemetry span ID (unique within a trace)
  • runId — Factory run identifier
  • jobId — Factory job identifier
  • tenantId — Tenant identifier

Trace Propagation

Correlation IDs are propagated through:

  • HTTP Headers — traceparent and tracestate headers
  • Event Metadata — Trace IDs in event bus messages
  • Database Records — Trace IDs stored in run state
  • Log Context — Trace IDs in structured logs
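
A sketch of propagating the W3C traceparent/tracestate context through event metadata with the OpenTelemetry propagation API; the bus and event objects are assumptions, with event.metadata treated as a plain dict:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("factory.orchestrator")

def publish_event(bus, event):
    inject(event.metadata)   # writes traceparent/tracestate keys into the carrier
    bus.publish(event)

def handle_event(event):
    ctx = extract(event.metadata)  # restores the upstream trace context
    with tracer.start_as_current_span("factory.job.execute", context=ctx):
        ...                        # child spans now join the original trace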

Metrics

Core Metrics

Run Metrics

  • run_duration_seconds — Histogram of run durations
      • Labels: status, template_recipe_id, tenant_id
  • run_success_total — Counter of successful runs
      • Labels: template_recipe_id, tenant_id
  • run_failure_total — Counter of failed runs
      • Labels: template_recipe_id, tenant_id, failure_type
  • run_cancelled_total — Counter of cancelled runs
      • Labels: template_recipe_id, tenant_id
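
A sketch of these run metrics with the prometheus_client library; the label sets mirror the list above and the on_run_completed hook is an assumption:

from prometheus_client import Counter, Histogram

RUN_DURATION = Histogram("run_duration_seconds", "Run duration in seconds",
                         ["status", "template_recipe_id", "tenant_id"])
RUN_SUCCESS = Counter("run_success_total", "Successful runs",
                      ["template_recipe_id", "tenant_id"])
RUN_FAILURE = Counter("run_failure_total", "Failed runs",
                      ["template_recipe_id", "tenant_id", "failure_type"])

def on_run_completed(run, status, duration_seconds, failure_type=None):
    RUN_DURATION.labels(status, run.template_recipe_id, run.tenant_id).observe(duration_seconds)
    if status == "Succeeded":
        RUN_SUCCESS.labels(run.template_recipe_id, run.tenant_id).inc()
    elif status == "Failed":
        RUN_FAILURE.labels(run.template_recipe_id, run.tenant_id, failure_type).inc()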

Job Metrics

  • jobs_in_queue — Gauge of jobs waiting in queue
      • Labels: job_type, priority
  • jobs_processing — Gauge of jobs currently processing
      • Labels: job_type, worker_pool
  • job_duration_seconds — Histogram of job durations
      • Labels: job_type, status
  • retries_total — Counter of job retries
      • Labels: job_type, attempt

Queue Metrics

  • queue_depth — Gauge of queue depth
      • Labels: queue_name, priority
  • queue_processing_rate — Rate of jobs processed per second
      • Labels: queue_name
  • dlq_size — Gauge of dead letter queue size
      • Labels: queue_name
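
The queue metrics are gauges, typically set from the queue backend on each scrape or polling interval. A sketch with a hypothetical queue client:

from prometheus_client import Gauge

QUEUE_DEPTH = Gauge("queue_depth", "Jobs waiting in queue", ["queue_name", "priority"])
DLQ_SIZE = Gauge("dlq_size", "Dead letter queue size", ["queue_name"])

def report_queue_stats(queue):
    # queue is a hypothetical client exposing pending/DLQ counts per priority
    for priority, count in queue.pending_by_priority().items():
        QUEUE_DEPTH.labels(queue.name, priority).set(count)
    DLQ_SIZE.labels(queue.name).set(queue.dlq_count())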

Token Usage

  • tokens_used_total — Counter of AI tokens used
      • Labels: model, operation_type, tenant_id, run_id
  • tokens_input_total — Counter of input tokens
      • Labels: model, operation_type
  • tokens_output_total — Counter of output tokens
      • Labels: model, operation_type
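
Token counters can be incremented directly from the model response; the response shape used here (usage.input_tokens / usage.output_tokens) is an assumption, not a specific provider's API:

from prometheus_client import Counter

TOKENS_INPUT = Counter("tokens_input_total", "Input tokens", ["model", "operation_type"])
TOKENS_OUTPUT = Counter("tokens_output_total", "Output tokens", ["model", "operation_type"])

def record_token_usage(response, operation_type):
    TOKENS_INPUT.labels(response.model, operation_type).inc(response.usage.input_tokens)
    TOKENS_OUTPUT.labels(response.model, operation_type).inc(response.usage.output_tokens)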

Cost Metrics

  • cost_estimate_per_run — Estimated cost per run
      • Labels: template_recipe_id, tenant_id
  • cost_ai_total — Total AI costs
      • Labels: model, tenant_id, time_period
  • cost_infrastructure_total — Total infrastructure costs
      • Labels: resource_type, tenant_id, time_period

System Metrics

  • worker_cpu_usage — CPU usage per worker
      • Labels: worker_pool, worker_id
  • worker_memory_usage — Memory usage per worker
      • Labels: worker_pool, worker_id
  • worker_errors_total — Counter of worker errors
      • Labels: worker_pool, error_type

Logs

Structured Logging

All logs are structured (JSON format) with consistent fields:

Required Fields

  • timestamp — ISO 8601 timestamp
  • level — Log level (DEBUG, INFO, WARN, ERROR, CRITICAL)
  • message — Human-readable log message
  • runId — Run identifier (if applicable)
  • jobId — Job identifier (if applicable)
  • stepName — Step name (if applicable)
  • tenantId — Tenant identifier
  • templateRecipeId — Template recipe identifier
  • traceId — Distributed trace ID
  • spanId — Span ID
  • severity — Severity level (for filtering)

Context Fields

  • workerId — Worker instance identifier
  • workerPool — Worker pool name
  • operation — Operation name (e.g., "generate_repo")
  • duration_ms — Operation duration
  • error — Error details (if error occurred)
  • metadata — Additional context (key-value pairs)
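
A minimal sketch of a JSON log formatter carrying these fields with Python's standard logging module; the field values shown are illustrative placeholders:

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # runId, jobId, traceId, etc. are passed per call via the extra mechanism
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("factory")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.info("Job completed", extra={"context": {
    "runId": "run-123", "jobId": "job-456", "stepName": "generate-repo",
    "tenantId": "tenant-a", "duration_ms": 1834}})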

Log Levels

  • DEBUG — Detailed diagnostic information (development only)
  • INFO — General informational messages (run started, job completed)
  • WARN — Warning messages (retries, degraded performance)
  • ERROR — Error messages (job failures, external system errors)
  • CRITICAL — Critical errors (system failures, data corruption)

Dashboards

Factory Health Dashboard

Purpose: Overall Factory health and performance

Metrics:

  • Run success rate (last 1 hour, 24 hours, 7 days)
  • Run failure rate by failure type
  • Average run duration
  • Jobs in queue (by type)
  • Worker pool health
  • Queue processing rate

Visualizations:

  • Success/failure rate trends
  • Run duration percentiles (p50, p95, p99)
  • Queue depth over time
  • Worker utilization

Queues & Workers Dashboard

Purpose: Queue and worker pool performance

Metrics:

  • Queue depth by queue name
  • Jobs processing by worker pool
  • Worker error rate
  • Job retry rate
  • Dead letter queue size

Visualizations:

  • Queue depth trends
  • Worker pool utilization
  • Job processing rate
  • Error rate by worker pool

AI Usage Dashboard

Purpose: AI token usage and costs

Metrics:

  • Tokens used per tenant
  • Tokens used per template
  • Tokens used per run type
  • Cost per run
  • Cost per tenant
  • Cost trends over time

Visualizations:

  • Token usage trends
  • Cost per tenant/project
  • Model usage distribution
  • Cost optimization opportunities

Run Details Dashboard

Purpose: Individual run execution details

Metrics:

  • Run status and progress
  • Job execution timeline
  • External call durations
  • Error details and stack traces
  • Trace visualization

Visualizations:

  • Run execution timeline
  • Distributed trace view
  • Job dependency graph
  • Error occurrence timeline


Alerts

High-Level Alert Rules

Run Failure Rate

  • Alert: factory_run_failure_rate_high
  • Condition: Run failure rate > 5% over 5 minutes
  • Severity: Critical
  • Action: Page on-call engineer, investigate failure patterns

Queue Depth

  • Alert: factory_queue_depth_high
  • Condition: Queue depth > 1000 jobs for 10 minutes
  • Severity: Warning
  • Action: Scale up workers, investigate bottleneck

Worker Error Rate

  • Alert: factory_worker_error_rate_spike
  • Condition: Worker error rate > 10% over 5 minutes
  • Severity: Critical
  • Action: Page on-call engineer, investigate worker health

AI Cost Anomalies

  • Alert: factory_ai_cost_anomaly
  • Condition: AI cost over the last hour > 2x the historical hourly average
  • Severity: Warning
  • Action: Investigate token usage, check for runaway agents

Control Plane Availability

  • Alert: factory_control_plane_down
  • Condition: Control plane health check fails for 2 minutes
  • Severity: Critical
  • Action: Page on-call engineer, failover to backup

Alert Routing

  • Critical Alerts → PagerDuty / On-call rotation
  • Warning Alerts → Slack / Email notifications
  • Info Alerts → Dashboard notifications only

Observability Pipeline

graph LR
    Orchestrator[Orchestrator]
    Workers[Workers]
    Otel[OpenTelemetry<br/>Collector]
    Logs[(Logs Store)]
    Metrics[(Metrics Store)]
    Traces[(Traces Store)]
    Dashboards[Dashboards<br/>Grafana]
    Alerts[Alerts<br/>Prometheus AlertManager]

    Orchestrator --> Otel
    Workers --> Otel
    Otel --> Logs
    Otel --> Metrics
    Otel --> Traces
    Metrics --> Dashboards
    Metrics --> Alerts
    Logs --> Dashboards
    Traces --> Dashboards
Hold "Alt" / "Option" to enable pan & zoom

Data Flow:

  1. Instrumentation — Services emit traces, metrics, and logs
  2. Collection — OpenTelemetry Collector aggregates telemetry
  3. Storage — Telemetry stored in appropriate backends (logs, metrics, traces)
  4. Visualization — Dashboards query storage for visualization
  5. Alerting — Alert rules evaluate metrics and trigger alerts
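
A minimal sketch of steps 1 and 2: a service configures the OpenTelemetry SDK to batch-export spans to the Collector over OTLP. The collector endpoint and service name are assumptions:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "factory-orchestrator"}))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)))
trace.set_tracer_provider(provider)
# Spans created via trace.get_tracer(...) are now shipped to the Collector,
# which fans them out to the traces store; metrics and logs follow the same path.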