Observability¶
Overview¶
Observability is a first-class requirement for the Factory runtime. Every run, job, agent action, and external call must be fully observable through traces, metrics, and logs. This enables rapid detection, diagnosis, and resolution of issues, as well as performance optimization and cost tracking.
Traces¶
Required Spans¶
Every Factory operation generates distributed tracing spans:
Run Overall Span¶
- Span Name: factory.run.execute
- Duration: Entire run lifecycle (from request to completion)
- Attributes:
    - run.id — Run identifier
    - run.tenant_id — Tenant identifier
    - run.project_id — Project identifier
    - run.template_recipe_id — Template recipe identifier
    - run.status — Run status (Requested, Running, Succeeded, Failed)
    - run.duration_ms — Total run duration
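As a rough illustration, the orchestrator could emit this span with the OpenTelemetry Python API along the lines of the sketch below. The run object's fields, the run_all_steps call, and the tracer name are hypothetical placeholders, not part of the documented interface.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.orchestrator")  # illustrative tracer name

def execute_run(run):
    # Run-level span covering the whole lifecycle, tagged with the required attributes.
    with tracer.start_as_current_span(
        "factory.run.execute",
        attributes={
            "run.id": run.id,
            "run.tenant_id": run.tenant_id,
            "run.project_id": run.project_id,
            "run.template_recipe_id": run.template_recipe_id,
            "run.status": "Requested",
        },
    ) as span:
        result = run_all_steps(run)  # hypothetical orchestration call
        span.set_attribute("run.status", result.status)        # e.g. "Succeeded" or "Failed"
        span.set_attribute("run.duration_ms", result.duration_ms)
        return result
```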
Per Step/Job Span¶
- Span Name: factory.job.execute
- Duration: Individual job execution
- Attributes:
    - job.id — Job identifier
    - job.step_name — Step name (e.g., "generate-repo")
    - job.attempt — Retry attempt number
    - job.status — Job status
    - job.duration_ms — Job execution duration
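A worker-side sketch of the per-job span, nesting under factory.run.execute when the propagated run context is active; execute_step, the job fields, and the tracer name are assumed names used only for illustration.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")  # illustrative tracer name

def run_job(job, attempt: int):
    with tracer.start_as_current_span(
        "factory.job.execute",
        attributes={"job.id": job.id, "job.step_name": job.step_name, "job.attempt": attempt},
    ) as span:
        try:
            outcome = execute_step(job)          # hypothetical step execution
            span.set_attribute("job.status", outcome.status)
            return outcome
        except Exception as exc:
            span.record_exception(exc)           # surfaces the failure in the trace
            span.set_attribute("job.status", "Failed")
            raise
```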
External Calls Span¶
- Span Name: factory.external.{service}
- Duration: External system call duration
- Examples:
    - factory.external.azure_devops — Azure DevOps API calls
    - factory.external.git — Git operations
    - factory.external.llm — LLM API calls
    - factory.external.storage — Storage operations
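Wrapping each outbound call in its own span attributes dependency latency and failures to the external system rather than to the job. In the sketch below, only the factory.external.{service} naming comes from this spec; the client object, create_repository, and the external.operation attribute are illustrative assumptions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("factory.worker")

def create_ado_repository(client, project, name):
    # Child span dedicated to the Azure DevOps call; duration is recorded automatically.
    with tracer.start_as_current_span("factory.external.azure_devops") as span:
        span.set_attribute("external.operation", "create_repository")  # illustrative attribute
        return client.create_repository(project=project, name=name)    # hypothetical client call
```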
Correlation IDs¶
Every span includes correlation IDs for end-to-end traceability:
- traceId — OpenTelemetry trace ID (correlates all spans in a trace)
- spanId — OpenTelemetry span ID (unique within a trace)
- runId — Factory run identifier
- jobId — Factory job identifier
- tenantId — Tenant identifier
Trace Propagation¶
Correlation IDs are propagated through:
- HTTP Headers — traceparent and tracestate headers
- Event Metadata — Trace IDs in event bus messages
- Database Records — Trace IDs stored in run state
- Log Context — Trace IDs in structured logs
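A minimal sketch of carrying the context across the event bus with the OpenTelemetry propagation API: inject() writes the W3C traceparent/tracestate keys from the current span context into a carrier dict, and extract() restores it on the consumer side. The event envelope shape and event_bus are assumptions.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

def publish_with_trace(event_bus, payload):
    metadata = {}
    inject(metadata)  # adds traceparent/tracestate entries to the carrier dict
    event_bus.publish({"payload": payload, "metadata": metadata})

def handle_event(event):
    ctx = extract(event["metadata"])  # rebuild the upstream trace context
    tracer = trace.get_tracer("factory.worker")
    with tracer.start_as_current_span("factory.job.execute", context=ctx):
        ...  # job execution continues the same trace
```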
Metrics¶
Core Metrics¶
Run Metrics¶
- run_duration_seconds — Histogram of run durations
    - Labels: status, template_recipe_id, tenant_id
- run_success_total — Counter of successful runs
    - Labels: template_recipe_id, tenant_id
- run_failure_total — Counter of failed runs
    - Labels: template_recipe_id, tenant_id, failure_type
- run_cancelled_total — Counter of cancelled runs
    - Labels: template_recipe_id, tenant_id
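A sketch of how these run metrics could be defined and recorded with prometheus_client. The metric and label names follow the list above; record_run_result and the run object are illustrative.

```python
from prometheus_client import Counter, Histogram

RUN_DURATION = Histogram(
    "run_duration_seconds", "Run duration in seconds",
    ["status", "template_recipe_id", "tenant_id"],
)
RUN_SUCCESS = Counter(
    "run_success_total", "Successful runs",
    ["template_recipe_id", "tenant_id"],
)
RUN_FAILURE = Counter(
    "run_failure_total", "Failed runs",
    ["template_recipe_id", "tenant_id", "failure_type"],
)

def record_run_result(run, status, duration_s, failure_type=None):
    RUN_DURATION.labels(status, run.template_recipe_id, run.tenant_id).observe(duration_s)
    if status == "Succeeded":
        RUN_SUCCESS.labels(run.template_recipe_id, run.tenant_id).inc()
    else:
        RUN_FAILURE.labels(run.template_recipe_id, run.tenant_id, failure_type or "unknown").inc()
```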
Job Metrics¶
- jobs_in_queue — Gauge of jobs waiting in queue
    - Labels: job_type, priority
- jobs_processing — Gauge of jobs currently processing
    - Labels: job_type, worker_pool
- job_duration_seconds — Histogram of job durations
    - Labels: job_type, status
- retries_total — Counter of job retries
    - Labels: job_type, attempt
Queue Metrics¶
- queue_depth — Gauge of queue depth
    - Labels: queue_name, priority
- queue_processing_rate — Rate of jobs processed per second
    - Labels: queue_name
- dlq_size — Gauge of dead letter queue size
    - Labels: queue_name
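Queue depth and DLQ size are point-in-time values, so they are exposed as gauges that get refreshed on each scrape or polling cycle. A minimal sketch with prometheus_client follows; poll_queue_stats and the queue objects are hypothetical helpers.

```python
from prometheus_client import Gauge

QUEUE_DEPTH = Gauge("queue_depth", "Jobs waiting in the queue", ["queue_name", "priority"])
DLQ_SIZE = Gauge("dlq_size", "Dead letter queue size", ["queue_name"])

def refresh_queue_gauges(queues):
    for q in queues:
        stats = poll_queue_stats(q)                    # hypothetical backend call
        QUEUE_DEPTH.labels(q.name, q.priority).set(stats.depth)
        DLQ_SIZE.labels(q.name).set(stats.dlq_depth)
```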
AI-Related Metrics¶
Token Usage¶
- tokens_used_total — Counter of AI tokens used
    - Labels: model, operation_type, tenant_id, run_id
- tokens_input_total — Counter of input tokens
    - Labels: model, operation_type
- tokens_output_total — Counter of output tokens
    - Labels: model, operation_type
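A sketch of accounting for tokens right after an LLM call returns. The usage object shape (input_tokens / output_tokens) is an assumption about the provider response, not something this spec defines.

```python
from prometheus_client import Counter

TOKENS_INPUT = Counter("tokens_input_total", "Input tokens", ["model", "operation_type"])
TOKENS_OUTPUT = Counter("tokens_output_total", "Output tokens", ["model", "operation_type"])

def record_token_usage(model, operation_type, usage):
    # usage is assumed to carry the provider-reported token counts for the call
    TOKENS_INPUT.labels(model, operation_type).inc(usage.input_tokens)
    TOKENS_OUTPUT.labels(model, operation_type).inc(usage.output_tokens)
```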
Cost Metrics¶
- cost_estimate_per_run — Estimated cost per run
    - Labels: template_recipe_id, tenant_id
- cost_ai_total — Total AI costs
    - Labels: model, tenant_id, time_period
- cost_infrastructure_total — Total infrastructure costs
    - Labels: resource_type, tenant_id, time_period
System Metrics¶
- worker_cpu_usage — CPU usage per worker
    - Labels: worker_pool, worker_id
- worker_memory_usage — Memory usage per worker
    - Labels: worker_pool, worker_id
- worker_errors_total — Counter of worker errors
    - Labels: worker_pool, error_type
Logs¶
Structured Logging¶
All logs are structured (JSON format) with consistent fields:
Required Fields¶
- timestamp — ISO 8601 timestamp
- level — Log level (DEBUG, INFO, WARN, ERROR, CRITICAL)
- message — Human-readable log message
- runId — Run identifier (if applicable)
- jobId — Job identifier (if applicable)
- stepName — Step name (if applicable)
- tenantId — Tenant identifier
- templateRecipeId — Template recipe identifier
- traceId — Distributed trace ID
- spanId — Span ID
- severity — Severity level (for filtering)
Context Fields¶
- workerId — Worker instance identifier
- workerPool — Worker pool name
- operation — Operation name (e.g., "generate_repo")
- duration_ms — Operation duration
- error — Error details (if error occurred)
- metadata — Additional context (key-value pairs)
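A minimal, standard-library-only sketch of emitting these fields as one JSON object per log line; a real deployment would more likely use a dedicated JSON logging library, and the field values shown are placeholders.

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge in the per-event context (runId, jobId, traceId, ...) passed via extra=
        entry.update(getattr(record, "factory", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("factory")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "job completed",
    extra={"factory": {
        "runId": "run-123", "jobId": "job-456", "stepName": "generate-repo",
        "tenantId": "tenant-a", "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "duration_ms": 842,
    }},
)
```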
Log Levels¶
- DEBUG — Detailed diagnostic information (development only)
- INFO — General informational messages (run started, job completed)
- WARN — Warning messages (retries, degraded performance)
- ERROR — Error messages (job failures, external system errors)
- CRITICAL — Critical errors (system failures, data corruption)
Dashboards¶
Factory Health Dashboard¶
Purpose: Overall Factory health and performance
Metrics:
- Run success rate (last 1 hour, 24 hours, 7 days)
- Run failure rate by failure type
- Average run duration
- Jobs in queue (by type)
- Worker pool health
- Queue processing rate

Visualizations:
- Success/failure rate trends
- Run duration percentiles (p50, p95, p99)
- Queue depth over time
- Worker utilization
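As an example of the kind of query behind a success-rate panel, the sketch below runs a PromQL expression against the Prometheus HTTP API. The Prometheus URL is a placeholder, and the expression assumes the run_success_total / run_failure_total counters defined earlier in this page.

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder endpoint

# 1-hour run success rate: successes divided by all completed runs
SUCCESS_RATE_1H = (
    "sum(increase(run_success_total[1h]))"
    " / "
    "(sum(increase(run_success_total[1h])) + sum(increase(run_failure_total[1h])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": SUCCESS_RATE_1H})
print(resp.json()["data"]["result"])
```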
Queues & Workers Dashboard¶
Purpose: Queue and worker pool performance
Metrics:
- Queue depth by queue name
- Jobs processing by worker pool
- Worker error rate
- Job retry rate
- Dead letter queue size

Visualizations:
- Queue depth trends
- Worker pool utilization
- Job processing rate
- Error rate by worker pool
AI Usage Dashboard¶
Purpose: AI token usage and costs
Metrics:
- Tokens used per tenant
- Tokens used per template
- Tokens used per run type
- Cost per run
- Cost per tenant
- Cost trends over time

Visualizations:
- Token usage trends
- Cost per tenant/project
- Model usage distribution
- Cost optimization opportunities
Run Details Dashboard¶
Purpose: Individual run execution details
Metrics:
- Run status and progress
- Job execution timeline
- External call durations
- Error details and stack traces
- Trace visualization

Visualizations:
- Run execution timeline
- Distributed trace view
- Job dependency graph
- Error occurrence timeline
Alerts¶
High-Level Alert Rules¶
Run Failure Rate¶
- Alert: factory_run_failure_rate_high
- Condition: Run failure rate > 5% over 5 minutes
- Severity: Critical
- Action: Page on-call engineer, investigate failure patterns
Queue Depth¶
- Alert: factory_queue_depth_high
- Condition: Queue depth > 1000 jobs for 10 minutes
- Severity: Warning
- Action: Scale up workers, investigate bottleneck
Worker Error Rate¶
- Alert: factory_worker_error_rate_spike
- Condition: Worker error rate > 10% over 5 minutes
- Severity: Critical
- Action: Page on-call engineer, investigate worker health
AI Cost Anomalies¶
- Alert: factory_ai_cost_anomaly
- Condition: AI cost > 2x average over 1 hour
- Severity: Warning
- Action: Investigate token usage, check for runaway agents
Control Plane Availability¶
- Alert: factory_control_plane_down
- Condition: Control plane health check fails for 2 minutes
- Severity: Critical
- Action: Page on-call engineer, failover to backup
Alert Routing¶
- Critical Alerts → PagerDuty / On-call rotation
- Warning Alerts → Slack / Email notifications
- Info Alerts → Dashboard notifications only
Observability Pipeline¶
graph LR
Orchestrator[Orchestrator]
Workers[Workers]
Otel[OpenTelemetry<br/>Collector]
Logs[(Logs Store)]
Metrics[(Metrics Store)]
Traces[(Traces Store)]
Dashboards[Dashboards<br/>Grafana]
Alerts[Alerts<br/>Prometheus AlertManager]
Orchestrator --> Otel
Workers --> Otel
Otel --> Logs
Otel --> Metrics
Otel --> Traces
Metrics --> Dashboards
Metrics --> Alerts
Logs --> Dashboards
Traces --> Dashboards
Data Flow:
- Instrumentation — Services emit traces, metrics, and logs
- Collection — OpenTelemetry Collector aggregates telemetry
- Storage — Telemetry stored in appropriate backends (logs, metrics, traces)
- Visualization — Dashboards query storage for visualization
- Alerting — Alert rules evaluate metrics and trigger alerts
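A sketch of the instrumentation-to-collection step in this pipeline: the service configures the OpenTelemetry SDK to export spans over OTLP to the collector. The endpoint and service name are placeholders; metrics and logs pipelines are wired up analogously.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the emitting service and batch-export spans to the OpenTelemetry Collector.
provider = TracerProvider(resource=Resource.create({"service.name": "factory-orchestrator"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```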
Related Documentation¶
- Execution Engine — How execution generates telemetry
- Control Plane — How control plane is observed
- State & Memory — How state changes are logged
- Failure & Recovery — How failures are monitored and alerted
- Platform Foundations - Observability — Overall observability strategy