Observability¶
Overview¶
Observability is a first-class requirement for the Factory runtime. Every run, job, agent action, and external call must be fully observable through traces, metrics, and logs. This enables rapid detection, diagnosis, and resolution of issues, as well as performance optimization and cost tracking.
Traces¶
Required Spans¶
Every Factory operation generates distributed tracing spans:
Run Overall Span¶
- Span Name: factory.run.execute
- Duration: Entire run lifecycle (from request to completion)
- Attributes:
    - run.id — Run identifier
    - run.tenant_id — Tenant identifier
    - run.project_id — Project identifier
    - run.template_recipe_id — Template recipe identifier
    - run.status — Run status (Requested, Running, Succeeded, Failed)
    - run.duration_ms — Total run duration
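As a rough illustration, the orchestrator could emit this span with the OpenTelemetry Python API along the lines of the sketch below. The run object's fields, the run_all_steps call, and the tracer name are hypothetical placeholders, not part of the documented interface.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.orchestrator")  # illustrative tracer name

def execute_run(run):
    # Run-level span covering the whole lifecycle, tagged with the required attributes.
    with tracer.start_as_current_span(
        "factory.run.execute",
        attributes={
            "run.id": run.id,
            "run.tenant_id": run.tenant_id,
            "run.project_id": run.project_id,
            "run.template_recipe_id": run.template_recipe_id,
            "run.status": "Requested",
        },
    ) as span:
        result = run_all_steps(run)  # hypothetical orchestration call
        span.set_attribute("run.status", result.status)        # e.g. "Succeeded" or "Failed"
        span.set_attribute("run.duration_ms", result.duration_ms)
        return result
```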
Per Step/Job Span¶
- Span Name: factory.job.execute
- Duration: Individual job execution
- Attributes:
    - job.id — Job identifier
    - job.step_name — Step name (e.g., "generate-repo")
    - job.attempt — Retry attempt number
    - job.status — Job status
    - job.duration_ms — Job execution duration
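A worker-side sketch of the per-job span, nesting under factory.run.execute when the propagated run context is active; execute_step, the job fields, and the tracer name are assumed names used only for illustration.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")  # illustrative tracer name

def run_job(job, attempt: int):
    with tracer.start_as_current_span(
        "factory.job.execute",
        attributes={"job.id": job.id, "job.step_name": job.step_name, "job.attempt": attempt},
    ) as span:
        try:
            outcome = execute_step(job)          # hypothetical step execution
            span.set_attribute("job.status", outcome.status)
            return outcome
        except Exception as exc:
            span.record_exception(exc)           # surfaces the failure in the trace
            span.set_attribute("job.status", "Failed")
            raise
```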
External Calls Span¶
- Span Name: factory.external.{service}
- Duration: External system call duration
- Examples:
    - factory.external.azure_devops — Azure DevOps API calls
    - factory.external.git — Git operations
    - factory.external.llm — LLM API calls
    - factory.external.storage — Storage operations
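Wrapping each outbound call in its own span attributes dependency latency and failures to the external system rather than to the job. In the sketch below, only the factory.external.{service} naming comes from this spec; the client object, create_repository, and the external.operation attribute are illustrative assumptions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")

def create_ado_repository(client, project, name):
    # Child span dedicated to the Azure DevOps call; duration is recorded automatically.
    with tracer.start_as_current_span("factory.external.azure_devops") as span:
        span.set_attribute("external.operation", "create_repository")  # illustrative attribute
        return client.create_repository(project=project, name=name)    # hypothetical client call
```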
Correlation IDs¶
Every span includes correlation IDs for end-to-end traceability:
- traceId — OpenTelemetry trace ID (correlates all spans in a trace)
- spanId — OpenTelemetry span ID (unique within a trace)
- runId — Factory run identifier
- jobId — Factory job identifier
- tenantId — Tenant identifier
Trace Propagation¶
Correlation IDs are propagated through:
- HTTP Headers — traceparent and tracestate headers
- Event Metadata — Trace IDs in event bus messages
- Database Records — Trace IDs stored in run state
- Log Context — Trace IDs in structured logs
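A minimal sketch of carrying the context across the event bus with the OpenTelemetry propagation API: inject() writes the W3C traceparent/tracestate keys from the current span context into a carrier dict, and extract() restores it on the consumer side. The event envelope shape and event_bus are assumptions.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

def publish_with_trace(event_bus, payload):
    metadata = {}
    inject(metadata)  # adds traceparent/tracestate entries to the carrier dict
    event_bus.publish({"payload": payload, "metadata": metadata})

def handle_event(event):
    ctx = extract(event["metadata"])  # rebuild the upstream trace context
    tracer = trace.get_tracer("factory.worker")
    with tracer.start_as_current_span("factory.job.execute", context=ctx):
        ...  # job execution continues the same trace
```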
Metrics¶
Core Metrics¶
Run Metrics¶
- run_duration_seconds — Histogram of run durations
    - Labels: status, template_recipe_id, tenant_id
- run_success_total — Counter of successful runs
    - Labels: template_recipe_id, tenant_id
- run_failure_total — Counter of failed runs
    - Labels: template_recipe_id, tenant_id, failure_type
- run_cancelled_total — Counter of cancelled runs
    - Labels: template_recipe_id, tenant_id
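A sketch of how these run metrics could be defined and recorded with prometheus_client. The metric and label names follow the list above; record_run_result and the run object are illustrative.

```python
from prometheus_client import Counter, Histogram

RUN_DURATION = Histogram(
    "run_duration_seconds", "Run duration in seconds",
    ["status", "template_recipe_id", "tenant_id"],
)
RUN_SUCCESS = Counter(
    "run_success_total", "Successful runs",
    ["template_recipe_id", "tenant_id"],
)
RUN_FAILURE = Counter(
    "run_failure_total", "Failed runs",
    ["template_recipe_id", "tenant_id", "failure_type"],
)

def record_run_result(run, status, duration_s, failure_type=None):
    RUN_DURATION.labels(status, run.template_recipe_id, run.tenant_id).observe(duration_s)
    if status == "Succeeded":
        RUN_SUCCESS.labels(run.template_recipe_id, run.tenant_id).inc()
    else:
        RUN_FAILURE.labels(run.template_recipe_id, run.tenant_id, failure_type or "unknown").inc()
```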
Job Metrics¶
- jobs_in_queue — Gauge of jobs waiting in queue
    - Labels: job_type, priority
- jobs_processing — Gauge of jobs currently processing
    - Labels: job_type, worker_pool
- job_duration_seconds — Histogram of job durations
    - Labels: job_type, status
- retries_total — Counter of job retries
    - Labels: job_type, attempt
Queue Metrics¶
- queue_depth — Gauge of queue depth
    - Labels: queue_name, priority
- queue_processing_rate — Rate of jobs processed per second
    - Labels: queue_name
- dlq_size — Gauge of dead letter queue size
    - Labels: queue_name
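Queue depth and DLQ size are point-in-time values, so they are exposed as gauges that get refreshed on each scrape or polling cycle. A minimal sketch with prometheus_client follows; poll_queue_stats and the queue objects are hypothetical helpers.

```python
from prometheus_client import Gauge

QUEUE_DEPTH = Gauge("queue_depth", "Jobs waiting in the queue", ["queue_name", "priority"])
DLQ_SIZE = Gauge("dlq_size", "Dead letter queue size", ["queue_name"])

def refresh_queue_gauges(queues):
    for q in queues:
        stats = poll_queue_stats(q)                    # hypothetical backend call
        QUEUE_DEPTH.labels(q.name, q.priority).set(stats.depth)
        DLQ_SIZE.labels(q.name).set(stats.dlq_depth)
```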
AI-Related Metrics¶
Token Usage¶
- tokens_used_total — Counter of AI tokens used
    - Labels: model, operation_type, tenant_id, run_id
- tokens_input_total — Counter of input tokens
    - Labels: model, operation_type
- tokens_output_total — Counter of output tokens
    - Labels: model, operation_type
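A sketch of accounting for tokens right after an LLM call returns. The usage object shape (input_tokens / output_tokens) is an assumption about the provider response, not something this spec defines.

```python
from prometheus_client import Counter

TOKENS_INPUT = Counter("tokens_input_total", "Input tokens", ["model", "operation_type"])
TOKENS_OUTPUT = Counter("tokens_output_total", "Output tokens", ["model", "operation_type"])

def record_token_usage(model, operation_type, usage):
    # usage is assumed to carry the provider-reported token counts for the call
    TOKENS_INPUT.labels(model, operation_type).inc(usage.input_tokens)
    TOKENS_OUTPUT.labels(model, operation_type).inc(usage.output_tokens)
```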
Cost Metrics¶
- cost_estimate_per_run — Estimated cost per run
    - Labels: template_recipe_id, tenant_id
- cost_ai_total — Total AI costs
    - Labels: model, tenant_id, time_period
- cost_infrastructure_total — Total infrastructure costs
    - Labels: resource_type, tenant_id, time_period
System Metrics¶
- worker_cpu_usage — CPU usage per worker
    - Labels: worker_pool, worker_id
- worker_memory_usage — Memory usage per worker
    - Labels: worker_pool, worker_id
- worker_errors_total — Counter of worker errors
    - Labels: worker_pool, error_type
Logs¶
Structured Logging¶
All logs are structured (JSON format) with consistent fields:
Required Fields¶
- timestamp — ISO 8601 timestamp
- level — Log level (DEBUG, INFO, WARN, ERROR, CRITICAL)
- message — Human-readable log message
- runId — Run identifier (if applicable)
- jobId — Job identifier (if applicable)
- stepName — Step name (if applicable)
- tenantId — Tenant identifier
- templateRecipeId — Template recipe identifier
- traceId — Distributed trace ID
- spanId — Span ID
- severity — Severity level (for filtering)
Context Fields¶
- workerId — Worker instance identifier
- workerPool — Worker pool name
- operation — Operation name (e.g., "generate_repo")
- duration_ms — Operation duration
- error — Error details (if error occurred)
- metadata — Additional context (key-value pairs)
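A minimal, standard-library-only sketch of emitting these fields as one JSON object per log line; a real deployment would more likely use a dedicated JSON logging library, and the field values shown are placeholders.

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge in the per-event context (runId, jobId, traceId, ...) passed via extra=
        entry.update(getattr(record, "factory", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("factory")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "job completed",
    extra={"factory": {
        "runId": "run-123", "jobId": "job-456", "stepName": "generate-repo",
        "tenantId": "tenant-a", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "duration_ms": 842,
    }},
)
```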
Log Levels¶
- DEBUG — Detailed diagnostic information (development only)
- INFO — General informational messages (run started, job completed)
- WARN — Warning messages (retries, degraded performance)
- ERROR — Error messages (job failures, external system errors)
- CRITICAL — Critical errors (system failures, data corruption)
Dashboards¶
Factory Health Dashboard¶
Purpose: Overall Factory health and performance
Metrics:
- Run success rate (last 1 hour, 24 hours, 7 days)
- Run failure rate by failure type
- Average run duration
- Jobs in queue (by type)
- Worker pool health
- Queue processing rate

Visualizations:
- Success/failure rate trends
- Run duration percentiles (p50, p95, p99)
- Queue depth over time
- Worker utilization
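As an example of the kind of query behind a success-rate panel, the sketch below runs a PromQL expression against the Prometheus HTTP API. The Prometheus URL is a placeholder, and the expression assumes the run_success_total / run_failure_total counters defined earlier in this page.

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder endpoint

# 1-hour run success rate: successes divided by all completed runs
SUCCESS_RATE_1H = (
    "sum(increase(run_success_total[1h]))"
    " / "
    "(sum(increase(run_success_total[1h])) + sum(increase(run_failure_total[1h])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": SUCCESS_RATE_1H})
print(resp.json()["data"]["result"])
```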
Queues & Workers Dashboard¶
Purpose: Queue and worker pool performance
Metrics:
- Queue depth by queue name
- Jobs processing by worker pool
- Worker error rate
- Job retry rate
- Dead letter queue size

Visualizations:
- Queue depth trends
- Worker pool utilization
- Job processing rate
- Error rate by worker pool
AI Usage Dashboard¶
Purpose: AI token usage and costs
Metrics:
- Tokens used per tenant
- Tokens used per template
- Tokens used per run type
- Cost per run
- Cost per tenant
- Cost trends over time

Visualizations:
- Token usage trends
- Cost per tenant/project
- Model usage distribution
- Cost optimization opportunities
Run Details Dashboard¶
Purpose: Individual run execution details
Metrics:
- Run status and progress
- Job execution timeline
- External call durations
- Error details and stack traces
- Trace visualization

Visualizations:
- Run execution timeline
- Distributed trace view
- Job dependency graph
- Error occurrence timeline
Alerts¶
High-Level Alert Rules¶
Run Failure Rate¶
- Alert: factory_run_failure_rate_high
- Condition: Run failure rate > 5% over 5 minutes
- Severity: Critical
- Action: Page on-call engineer, investigate failure patterns
Queue Depth¶
- Alert: factory_queue_depth_high
- Condition: Queue depth > 1000 jobs for 10 minutes
- Severity: Warning
- Action: Scale up workers, investigate bottleneck
Worker Error Rate¶
- Alert: factory_worker_error_rate_spike
- Condition: Worker error rate > 10% over 5 minutes
- Severity: Critical
- Action: Page on-call engineer, investigate worker health
AI Cost Anomalies¶
- Alert: factory_ai_cost_anomaly
- Condition: AI cost > 2x average over 1 hour
- Severity: Warning
- Action: Investigate token usage, check for runaway agents
Control Plane Availability¶
- Alert: factory_control_plane_down
- Condition: Control plane health check fails for 2 minutes
- Severity: Critical
- Action: Page on-call engineer, failover to backup
Alert Routing¶
- Critical Alerts → PagerDuty / On-call rotation
- Warning Alerts → Slack / Email notifications
- Info Alerts → Dashboard notifications only
Observability Pipeline¶
graph LR
Orchestrator[Orchestrator]
Workers[Workers]
Otel[OpenTelemetry<br/>Collector]
Logs[(Logs Store)]
Metrics[(Metrics Store)]
Traces[(Traces Store)]
Dashboards[Dashboards<br/>Grafana]
Alerts[Alerts<br/>Prometheus AlertManager]
Orchestrator --> Otel
Workers --> Otel
Otel --> Logs
Otel --> Metrics
Otel --> Traces
Metrics --> Dashboards
Metrics --> Alerts
Logs --> Dashboards
Traces --> Dashboards
Data Flow:
- Instrumentation — Services emit traces, metrics, and logs
- Collection — OpenTelemetry Collector aggregates telemetry
- Storage — Telemetry stored in appropriate backends (logs, metrics, traces)
- Visualization — Dashboards query storage for visualization
- Alerting — Alert rules evaluate metrics and trigger alerts
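A sketch of the instrumentation-to-collection step in this pipeline: the service configures the OpenTelemetry SDK to export spans over OTLP to the collector. The endpoint and service name are placeholders; metrics and logs pipelines are wired up analogously.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the emitting service and batch-export spans to the OpenTelemetry Collector.
provider = TracerProvider(resource=Resource.create({"service.name": "factory-orchestrator"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```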
Related Documentation¶
- Execution Engine — How execution generates telemetry
- Control Plane — How control plane is observed
- State & Memory — How state changes are logged
- Failure & Recovery — How failures are monitored and alerted
- Platform Foundations - Observability — Overall observability strategy