# Observability Blueprint

## What Is an Observability Blueprint?

An Observability Blueprint is a structured, agent-generated artifact that defines the observability posture for a ConnectSoft-generated component, whether it's a microservice, module, API gateway, library, or infrastructure resource.
It represents the observability definition of record, created during the generation pipeline and continuously evaluated by downstream monitoring agents, incident response systems, and CI/CD pipelines.
In the AI Software Factory, the Observability Blueprint is not just documentation; it's a machine-readable contract for telemetry expectations, alerting behavior, SLO compliance, and operational visibility.
## Blueprint Roles in the Factory

The Observability Blueprint plays a pivotal role in making monitoring composable, alerting consistent, and operations auditable:

- Defines metrics taxonomy, custom counters, gauges, and histograms with naming conventions
- Maps alert rules to severity levels, escalation policies, and on-call routing
- Encodes SLO/SLA targets, error budgets, burn rate alerts, and compliance windows
- Drives dashboard-as-code definitions for Grafana, Azure Monitor, and custom UIs
- Specifies distributed trace topology, span definitions, and sampling strategies
- Defines structured logging schemas, retention policies, and anomaly detection rules
- Configures on-call rotations, escalation chains, and notification channels
- Links incident playbooks, automated response actions, and runbook references

It ensures that observability is not an afterthought but a first-class agent responsibility in the generation pipeline.
## Blueprint Consumers and Usage

| Stakeholder / Agent | Usage |
|---|---|
| Observability Engineer Agent | Designs dashboards, metrics taxonomy, trace topology |
| Alerting and Incident Manager Agent | Defines alert rules, on-call routing, incident triggers |
| SLO/SLA Compliance Agent | Defines SLO targets, error budgets, compliance reports |
| Log Analysis Agent | Configures log patterns, anomaly detection rules, retention policies |
| DevOps Engineer Agent | Integrates observability into deployment pipelines |
| Incident Response Agent | Uses telemetry for containment, triage, and post-mortem |
| CI/CD Pipeline | Validates observability readiness before deploy |
| Security Architect Agent | Consumes security telemetry signals and audit trail bindings |
| Infrastructure Engineer Agent | Uses resource metrics and health probes for capacity planning |
## Output Shape

Each Observability Blueprint is saved as:

- **Markdown**: human-readable form for inspection, design validation, and documentation
- **JSON**: machine-readable structure for automated enforcement and agent consumption
- **YAML**: configuration files for Prometheus, Grafana, OpenTelemetry, and alert managers
- **Embedding**: vector-encoded for memory graph and context tracking
## Storage Convention

```text
blueprints/observability/{component-name}/observability-blueprint.md
blueprints/observability/{component-name}/observability-blueprint.json
blueprints/observability/{component-name}/alert-rules.yaml
blueprints/observability/{component-name}/slo-definitions.yaml
blueprints/observability/{component-name}/dashboards/
blueprints/observability/{component-name}/otel-config.yaml
blueprints/observability/{component-name}/on-call.yaml
```
## Purpose and Motivation

The Observability Blueprint exists to solve one of the most persistent problems in modern distributed software delivery:

"Monitoring is either fragmented across tools, inconsistently configured across services, or entirely absent from the design phase, leading to blind spots in production."

In the ConnectSoft AI Software Factory, observability is integrated at the blueprint level, making it:

- **Deterministic**: agent-generated, based on traceable inputs
- **Repeatable**: diffable and validated through CI/CD
- **Composable**: aligned with service, security, and infrastructure blueprints
- **Actionable**: directly drives dashboards, alerts, and incident response
- **Compliant**: SLO/SLA tracking with error budgets and burn rate alerts
## Problems It Solves

| Problem Area | How the Observability Blueprint Helps |
|---|---|
| Fragmented Monitoring | Centralizes metrics, logs, and traces into a single declarative contract |
| Inconsistent Alerting | Standardizes alert rules, severity levels, and escalation policies |
| No SLO Tracking | Encodes SLO targets, error budgets, and burn rate alerts as first-class data |
| Opaque Log Patterns | Defines structured logging schemas with correlation and retention rules |
| Disconnected Traces | Configures distributed trace topology with span definitions and sampling |
| Dashboard Drift | Generates dashboards-as-code, versioned and diffable alongside services |
| Alert Fatigue | Implements deduplication, grouping, and intelligent routing strategies |
| Slow Incident Response | Links telemetry to playbooks, runbooks, and automated containment actions |
| Lack of Operational Visibility | Makes system health observable across all layers and environments |
| Missing Security Telemetry | Integrates security signals into the unified observability pipeline |
## Why Blueprints, Not Just Configs?

While traditional environments rely on ad hoc monitoring scripts, scattered dashboard JSONs, or manually maintained alert configs, the Factory approach uses blueprints because:

- Blueprints are memory-linked to every module and trace ID
- They are machine-generated and human-readable
- They support forward/backward analysis across versions and changes
- They coordinate multiple agents across Monitoring, Ops, and SRE clusters
- They validate observability readiness before any deployment reaches production

This allows observability to be treated as code, but also as a living architectural asset.
## Agent-Created, Trace-Ready Artifact

In the ConnectSoft AI Software Factory, the Observability Blueprint is not written manually; it is generated, enriched, and validated by multiple agents, then stored as part of the system's memory graph.

This ensures every observability contract is:

- Traceable to its origin prompt or product feature
- Regenerable with context-aware mutation
- Auditable through observability-first design
- Embedded into the long-term agentic memory system
### Agents Involved in Creation

| Agent | Responsibility |
|---|---|
| Observability Engineer Agent | Designs metrics taxonomy, dashboard layouts, and trace topology |
| Alerting and Incident Manager Agent | Defines alert rules, severity mappings, escalation chains |
| SLO/SLA Compliance Agent | Sets SLO targets, error budgets, burn rate thresholds |
| Log Analysis Agent | Configures structured logging schemas and anomaly detection |
| Distributed Tracing Agent | Designs span definitions, sampling strategies, context propagation |
| Pipeline Agent | Applies CI/CD validation gates and observability readiness checks |
| Security Architect Agent | Integrates security telemetry requirements into the observability posture |

Each agent contributes signals, decisions, and enriched metadata to create a complete, executable artifact.
### Memory Traceability

Observability Blueprints are:

- Linked to the project-wide trace ID
- Associated with the microservice, module, or gateway
- Indexed in vector memory for AI reasoning and enforcement
- Versioned and tagged (`v1`, `approved`, `drifted`, `incident-updated`, etc.)

This makes the blueprint machine-auditable, AI-searchable, and human-explainable.
### Example Storage and Trace Metadata

```yaml
traceId: trc_92ab_OrderService_v1
agentId: obs-engineer-001
serviceName: OrderService
observabilityProfile: comprehensive
tags:
  - metrics
  - alerting
  - slo
  - tracing
  - dashboards
  - production
version: v1
state: approved
createdAt: "2025-08-14T09:30:00Z"
lastModifiedBy: slo-compliance-agent
```
## What It Captures

The Observability Blueprint encodes a comprehensive set of observability dimensions that affect a service or module throughout its lifecycle, from build to runtime to incident response.
It defines what needs to be monitored, how, and under what thresholds, making it a living contract between the generated component and its operational environment.
### Core Observability Elements Captured
| Category | Captured Details |
|---|---|
| Metrics Taxonomy | Custom counters, gauges, histograms with naming conventions and label standards |
| Alert Rules | Thresholds, severity levels, escalation policies, deduplication strategies |
| SLO/SLA Definitions | Targets, error budgets, burn rate alerts, compliance windows |
| Dashboard Layouts | Grafana/Azure Monitor panel definitions, row grouping, variable templates |
| Log Aggregation | Structured logging schemas, retention policies, correlation rules |
| Trace Topology | Distributed trace configuration, span definitions, sampling strategies |
| On-Call Configuration | Rotation schedules, escalation chains, notification channels |
| Incident Playbooks | Automated response actions, runbook references, containment steps |
| Health Probes | Liveness, readiness, and startup probe configurations |
| Capacity Indicators | Resource utilization thresholds, scaling triggers, saturation metrics |
### Blueprint Snippet (Example)

```yaml
metrics:
  namespace: connectsoft.orderservice
  counters:
    - name: http_requests_total
      description: "Total HTTP requests processed"
      labels: [method, status_code, endpoint]
    - name: orders_created_total
      description: "Total orders successfully created"
      labels: [payment_method, region]
  histograms:
    - name: http_request_duration_seconds
      description: "HTTP request latency distribution"
      labels: [method, endpoint]
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
  gauges:
    - name: active_connections
      description: "Current number of active connections"
      labels: [protocol]

alerts:
  - name: HighErrorRate
    expr: "rate(http_requests_total{status_code=~'5..'}[5m]) / rate(http_requests_total[5m]) > 0.05"
    severity: critical
    for: "5m"
    annotations:
      summary: "Error rate exceeds 5% for {{ $labels.service }}"
      runbook: "https://runbooks.connectsoft.io/high-error-rate"

slo:
  - name: availability
    target: 99.9
    indicator: "1 - (rate(http_requests_total{status_code=~'5..'}[30d]) / rate(http_requests_total[30d]))"
    window: 30d
    errorBudget: 0.1
    burnRateAlert:
      fast: { factor: 14.4, window: "1h", severity: critical }
      slow: { factor: 6, window: "6h", severity: warning }
```
### Cross-Blueprint Intersections

- **Security Blueprint**: defines security telemetry signals, audit trail metrics, threat detection alerts
- **Infrastructure Blueprint**: defines resource metrics, health probes, capacity indicators, scaling triggers
- **Test Blueprint**: defines test observability, coverage metrics, regression alerts
- **Pipeline Blueprint**: defines CI/CD telemetry, deployment metrics, rollback triggers
- **Service Blueprint**: defines business metrics, domain event counters, SLA contracts

The Observability Blueprint aggregates, links, and applies monitoring rules across all of these, ensuring coherence and alignment.
## Output Formats and Structure

The Observability Blueprint is generated and consumed across multiple layers of the AI Software Factory, from human-readable design reviews to machine-enforced CI/CD validations to runtime telemetry configuration.
To support both automation and collaboration, it is produced in four coordinated formats, each aligned with a different set of use cases.
### Human-Readable Markdown (.md)
Used in Studio, code reviews, audits, and documentation layers.
- Sectioned by category: metrics, alerts, SLOs, dashboards, traces, logs
- Rich formatting with annotations and cross-references
- Includes YAML code samples and configuration excerpts
- Links to upstream and downstream blueprints
### Machine-Readable JSON (.json)
Used by agents, pipelines, and enforcement scripts.
- Flattened and typed
- Includes metadata and trace headers
- Validated against a shared schema
- Compatible with observability-as-code validators
Example excerpt:
```json
{
  "traceId": "trc_92ab_order_service",
  "serviceName": "OrderService",
  "metrics": {
    "namespace": "connectsoft.orderservice",
    "counters": [
      {
        "name": "http_requests_total",
        "labels": ["method", "status_code", "endpoint"],
        "description": "Total HTTP requests processed"
      }
    ],
    "histograms": [
      {
        "name": "http_request_duration_seconds",
        "labels": ["method", "endpoint"],
        "buckets": [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
      }
    ]
  },
  "slo": {
    "availability": {
      "target": 99.9,
      "window": "30d",
      "errorBudget": 0.1
    }
  },
  "alertRules": {
    "count": 12,
    "critical": 3,
    "warning": 5,
    "info": 4
  }
}
```
### CI/CD Compatible Snippets (.yaml fragments)
Used to inject observability logic into pipelines, sidecars, and runtime configurations.
- Prometheus alert rule files
- Grafana dashboard provisioning JSON
- OpenTelemetry Collector configuration
- SLO definition files for Sloth or Pyrra
- On-call rotation manifests for PagerDuty/OpsGenie
### Embedded Memory Shape (Vectorized)

- Captured in agent long-term memory
- Indexed by concept (e.g., `slo`, `alerting`, `tracing`, `dashboards`)
- Linked to all agent discussions, generations, and validations
- Enables trace-based enforcement and reuse
### Naming Convention

```text
blueprints/observability/{service-name}/observability-blueprint.md
blueprints/observability/{service-name}/observability-blueprint.json
blueprints/observability/{service-name}/alert-rules.yaml
blueprints/observability/{service-name}/slo-definitions.yaml
blueprints/observability/{service-name}/dashboards/overview.json
blueprints/observability/{service-name}/otel-config.yaml
blueprints/observability/{service-name}/on-call.yaml
```

Each blueprint instance is traceable to a single component.
## Metrics Taxonomy and Naming

Consistent, well-structured metrics are the foundation of effective observability. The Observability Blueprint defines a rigorous metrics taxonomy with naming conventions, label standards, and cardinality management rules that every generated service must follow.
### Naming Conventions

All metrics follow the OpenTelemetry Semantic Conventions and adhere to the Factory's naming standard:

| Component | Convention | Examples |
|---|---|---|
| `namespace` | Organization or product prefix | `connectsoft`, `factory` |
| `subsystem` | Service or domain identifier | `orderservice`, `gateway`, `auth` |
| `metric_name` | snake_case, descriptive, action-oriented | `requests_total`, `duration_seconds` |
| `unit` | Suffix indicating unit of measurement | `_seconds`, `_bytes`, `_total`, `_ratio` |
### Metric Types
| Type | Use Case | Examples |
|---|---|---|
| Counter | Monotonically increasing values | http_requests_total, errors_total |
| Gauge | Values that go up and down | active_connections, queue_depth |
| Histogram | Distribution of values across buckets | http_request_duration_seconds, payload_size_bytes |
| Summary | Pre-calculated quantiles (use sparingly) | gc_pause_seconds |
### Label Standards
Labels (dimensions) provide context for metrics but must be managed carefully to prevent cardinality explosion:
| Rule | Description |
|---|---|
| Bounded cardinality | Labels must have a known, finite set of values |
| No high-cardinality values | Never use user IDs, request IDs, or UUIDs as label values |
| Consistent naming | Use snake_case: status_code, http_method, service_name |
| Standard labels | Always include service, environment, version where relevant |
| Avoid label proliferation | Maximum 7 labels per metric to control storage costs |
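These rules are mechanically checkable before a metric definition is accepted into the blueprint. A minimal validation sketch in Python; the `validate_labels` helper, the regex, and the high-cardinality hint list are illustrative, not part of the blueprint schema:

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")
# Label names that usually signal unbounded cardinality; illustrative patterns only.
HIGH_CARDINALITY_HINTS = ("user_id", "request_id", "session_id", "uuid")
MAX_LABELS = 7  # the blueprint's label-proliferation ceiling

def validate_labels(metric_name: str, labels: list[str]) -> list[str]:
    """Return a list of rule violations for a metric's label set."""
    violations = []
    if len(labels) > MAX_LABELS:
        violations.append(f"{metric_name}: {len(labels)} labels exceeds max of {MAX_LABELS}")
    for label in labels:
        if not SNAKE_CASE.match(label):
            violations.append(f"{metric_name}: label '{label}' is not snake_case")
        if any(hint in label for hint in HIGH_CARDINALITY_HINTS):
            violations.append(f"{metric_name}: label '{label}' looks high-cardinality")
    return violations

print(validate_labels("http_requests_total", ["method", "status_code", "endpoint"]))  # []
print(validate_labels("bad_metric", ["UserId", "request_id"]))  # two violations
```

A check like this would naturally run in the CI/CD observability-readiness gate, rejecting blueprints whose metric definitions break the label standard.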
### Blueprint Metrics Definition Example

```yaml
metricsDefinition:
  namespace: connectsoft.orderservice
  standardLabels:
    - service: orderservice
    - environment: "{{ .Environment }}"
    - version: "{{ .AppVersion }}"
  counters:
    - name: connectsoft_orderservice_http_requests_total
      description: "Total number of HTTP requests received"
      labels: [method, status_code, endpoint]
    - name: connectsoft_orderservice_orders_created_total
      description: "Total number of orders successfully created"
      labels: [payment_method, region, order_type]
    - name: connectsoft_orderservice_orders_failed_total
      description: "Total number of order creation failures"
      labels: [failure_reason, region]
    - name: connectsoft_orderservice_events_published_total
      description: "Total domain events published to message bus"
      labels: [event_type, destination]
  gauges:
    - name: connectsoft_orderservice_active_connections
      description: "Current number of active client connections"
      labels: [protocol]
    - name: connectsoft_orderservice_queue_depth
      description: "Current depth of the processing queue"
      labels: [queue_name, priority]
    - name: connectsoft_orderservice_circuit_breaker_state
      description: "Current circuit breaker state (0=closed, 1=half-open, 2=open)"
      labels: [dependency_name]
  histograms:
    - name: connectsoft_orderservice_http_request_duration_seconds
      description: "HTTP request latency distribution"
      labels: [method, endpoint, status_code]
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
    - name: connectsoft_orderservice_db_query_duration_seconds
      description: "Database query execution time distribution"
      labels: [operation, table]
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
    - name: connectsoft_orderservice_message_processing_duration_seconds
      description: "Message consumer processing time distribution"
      labels: [message_type, consumer]
      buckets: [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 30.0, 60.0]
  cardinalityBudget:
    maxTimeSeriesPerMetric: 1000
    maxTotalTimeSeries: 50000
    alertOnCardinalityExceeded: true
```
### Cardinality Management
| Strategy | Description |
|---|---|
| Label allowlisting | Only approved label values are permitted |
| Cardinality budgets | Per-metric and per-service limits on unique time series |
| Aggregation rules | High-cardinality metrics are pre-aggregated before storage |
| Recording rules | Frequently-queried expressions are pre-computed |
| Metric lifecycle | Unused or deprecated metrics are decommissioned systematically |
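A cardinality budget of this kind boils down to counting unique label-value combinations per metric. A rough Python sketch of the check; the `check_cardinality` helper and its sample shape are hypothetical, since real enforcement would query the TSDB's series index:

```python
from collections import defaultdict

def check_cardinality(samples, max_series_per_metric=1000, max_total=50000):
    """Count unique time series (metric name + label-value combination)
    and flag metrics that exceed their per-metric or total budget."""
    series = defaultdict(set)
    for metric, labels in samples:
        series[metric].add(tuple(sorted(labels.items())))
    over_budget = {m: len(s) for m, s in series.items() if len(s) > max_series_per_metric}
    total = sum(len(s) for s in series.values())
    return {"total_series": total, "over_budget": over_budget, "total_exceeded": total > max_total}

# Simulate a metric that sprouted one series per status code 200..599.
samples = [("http_requests_total", {"method": "GET", "status_code": str(code)})
           for code in range(200, 600)]
report = check_cardinality(samples, max_series_per_metric=100)
print(report["over_budget"])  # {'http_requests_total': 400}
```

When `alertOnCardinalityExceeded` is set, a report like this would feed the same alerting pipeline as any other blueprint-defined rule.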
### Agent Collaboration

| Agent | Role |
|---|---|
| Observability Engineer Agent | Designs metrics taxonomy and naming conventions |
| SLO/SLA Compliance Agent | Validates metrics support SLI calculations |
| Infrastructure Engineer Agent | Ensures metric exporters are deployed and scraped |
| DevOps Engineer Agent | Configures Prometheus scrape targets and recording rules |
Metrics are not just numbers; they are typed, labeled, budgeted, and lifecycle-managed observability assets.
## Alert Rules Lifecycle

Alerting is the bridge between passive monitoring and active incident response. The Observability Blueprint defines not just what alerts exist, but their entire lifecycle: from definition through testing, deployment, tuning, and eventual retirement.
### Alert Rule Structure

Every alert rule in the blueprint follows a standardized structure:

```yaml
alertRules:
  - name: HighErrorRate
    expr: >
      rate(connectsoft_orderservice_http_requests_total{status_code=~"5.."}[5m])
      / rate(connectsoft_orderservice_http_requests_total[5m]) > 0.05
    for: "5m"
    severity: critical
    team: platform-sre
    labels:
      component: orderservice
      category: availability
    annotations:
      summary: "Error rate exceeds 5% for OrderService"
      description: "HTTP 5xx error rate has been above 5% for 5 minutes"
      runbook: "https://runbooks.connectsoft.io/high-error-rate"
      dashboard: "https://grafana.connectsoft.io/d/orderservice/overview"
    routing:
      notifyChannels: [pagerduty, slack]
      escalationPolicy: sre-oncall-escalation
  - name: HighLatencyP99
    expr: >
      histogram_quantile(0.99, rate(connectsoft_orderservice_http_request_duration_seconds_bucket[5m])) > 2.0
    for: "10m"
    severity: warning
    team: platform-sre
    labels:
      component: orderservice
      category: latency
    annotations:
      summary: "P99 latency exceeds 2 seconds for OrderService"
      description: "99th percentile HTTP latency has been above 2s for 10 minutes"
      runbook: "https://runbooks.connectsoft.io/high-latency"
    routing:
      notifyChannels: [slack]
      escalationPolicy: sre-oncall-soft
  - name: ErrorBudgetBurnRateFast
    expr: >
      slo:connectsoft_orderservice_availability:burn_rate_1h > 14.4
    for: "2m"
    severity: critical
    team: platform-sre
    labels:
      component: orderservice
      category: slo
    annotations:
      summary: "Fast error budget burn detected for OrderService"
      description: "1-hour burn rate is 14.4x normal; at this rate the 30-day error budget exhausts in about 2 days"
      runbook: "https://runbooks.connectsoft.io/error-budget-burn"
    routing:
      notifyChannels: [pagerduty, slack, email]
      escalationPolicy: sre-oncall-critical
```
### Severity Levels

| Severity | Response Time | Notification Channel | Escalation |
|---|---|---|---|
| `critical` | < 5 minutes | PagerDuty + Slack + Email | Immediate on-call page |
| `warning` | < 30 minutes | Slack + Email | On-call notification, no page |
| `info` | Best effort | Slack only | Dashboard annotation, no notification |
| `none` | N/A | Recording rule only | Used for pre-computation, not routing |
### Alert Lifecycle Stages

```mermaid
flowchart LR
  Draft["Draft"] --> Review["Review"]
  Review --> Test["Test"]
  Test --> Deploy["Deploy"]
  Deploy --> Active["Active"]
  Active --> Tune["Tune"]
  Tune --> Active
  Active --> Silence["Silence"]
  Silence --> Active
  Active --> Retire["Retire"]
```
| Stage | Description |
|---|---|
| Draft | Alert rule defined in blueprint, not yet validated |
| Review | Reviewed by SRE and owning team for correctness and noise potential |
| Test | Tested against historical data and synthetic scenarios |
| Deploy | Pushed to Prometheus/AlertManager via CI/CD |
| Active | Live in production, routing to notification channels |
| Tune | Thresholds or routing adjusted based on operational feedback |
| Silence | Temporarily suppressed during planned maintenance or known issues |
| Retire | Decommissioned when the underlying metric or service is deprecated |
### Alert Fatigue Reduction

| Strategy | Description |
|---|---|
| Grouping | Related alerts grouped by service/component into single notifications |
| Deduplication | Identical firing alerts suppressed within configurable window |
| Inhibition rules | Lower-severity alerts suppressed when higher-severity fires |
| Minimum `for` duration | Alerts must persist before firing to avoid transient noise |
| Rate-limited routing | Maximum notification frequency per channel per service |
| Actionability review | Periodic audit: every alert must have a runbook and clear next step |
### Escalation Policy Example

```yaml
escalationPolicies:
  - name: sre-oncall-critical
    steps:
      - delayMinutes: 0
        targets:
          - type: oncall-schedule
            id: sre-primary
      - delayMinutes: 15
        targets:
          - type: oncall-schedule
            id: sre-secondary
      - delayMinutes: 30
        targets:
          - type: user
            id: engineering-manager
      - delayMinutes: 60
        targets:
          - type: team
            id: platform-leadership
  - name: sre-oncall-soft
    steps:
      - delayMinutes: 0
        targets:
          - type: slack-channel
            id: "#sre-alerts"
      - delayMinutes: 30
        targets:
          - type: oncall-schedule
            id: sre-primary
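The `delayMinutes` values read as cumulative offsets from the moment the alert fires: each step's targets join the notification set once its delay has elapsed. A small sketch of how a router might resolve the currently active targets; the `active_targets` helper is hypothetical, since real escalation is executed by PagerDuty/OpsGenie:

```python
def active_targets(policy: dict, minutes_since_fire: float) -> list[dict]:
    """Return every target whose escalation step has been reached,
    following the cumulative delayMinutes semantics."""
    targets = []
    for step in policy["steps"]:
        if minutes_since_fire >= step["delayMinutes"]:
            targets.extend(step["targets"])
    return targets

critical_policy = {
    "name": "sre-oncall-critical",
    "steps": [
        {"delayMinutes": 0, "targets": [{"type": "oncall-schedule", "id": "sre-primary"}]},
        {"delayMinutes": 15, "targets": [{"type": "oncall-schedule", "id": "sre-secondary"}]},
        {"delayMinutes": 30, "targets": [{"type": "user", "id": "engineering-manager"}]},
    ],
}

# 20 minutes in, primary and secondary are engaged; the manager is not yet.
print([t["id"] for t in active_targets(critical_policy, 20)])
# ['sre-primary', 'sre-secondary']
```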
### Agent Collaboration

| Agent | Role |
|---|---|
| Alerting and Incident Manager Agent | Defines alert rules, severity mappings, routing configuration |
| Observability Engineer Agent | Validates alerts align with metrics taxonomy |
| SLO/SLA Compliance Agent | Creates burn rate alerts tied to error budgets |
| DevOps Engineer Agent | Deploys alert rules to AlertManager via CI/CD |
| Incident Response Agent | Validates runbook links and response procedures |

Alerts are not just thresholds; they are lifecycle-managed, routable, actionable artifacts that evolve with the system.
## SLO/SLA Specification

Service Level Objectives (SLOs) and Service Level Agreements (SLAs) are the quantitative contracts that define acceptable system behavior. The Observability Blueprint encodes these as first-class, measurable, alertable specifications.
### SLO Structure

Each SLO specification includes:

| Field | Description |
|---|---|
| `name` | Descriptive SLO identifier |
| `sli` | Service Level Indicator: the metric expression |
| `target` | Target percentage (e.g., 99.9%) |
| `window` | Compliance window (e.g., 30 days rolling) |
| `errorBudget` | Allowed failure percentage (100% minus target) |
| `burnRateAlerts` | Multi-window burn rate alerting thresholds |
| `complianceReports` | Automated report generation schedule |
### SLO Definition Example

```yaml
sloDefinitions:
  - name: orderservice-availability
    description: "OrderService HTTP availability"
    sli:
      type: availability
      goodEvents: "http_requests_total{status_code!~'5..'}"
      totalEvents: "http_requests_total"
    target: 99.9
    window: 30d
    errorBudget:
      total: 0.1       # percent
      remaining: 0.073
      consumed: 27     # percent of budget consumed
    burnRateAlerts:
      - name: fast-burn
        factor: 14.4
        shortWindow: "1h"
        longWindow: "5m"
        severity: critical
        pageOnCall: true
      - name: slow-burn
        factor: 6.0
        shortWindow: "6h"
        longWindow: "30m"
        severity: warning
        pageOnCall: false
      - name: gradual-burn
        factor: 3.0
        shortWindow: "1d"
        longWindow: "2h"
        severity: info
        pageOnCall: false
    complianceReports:
      frequency: weekly
      recipients: [sre-team, product-manager, engineering-lead]
      format: markdown
      includeGraphs: true
  - name: orderservice-latency
    description: "OrderService HTTP P99 latency"
    sli:
      type: latency
      threshold: 500ms
      expression: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
    target: 99.0
    window: 30d
    errorBudget:
      total: 1.0
    burnRateAlerts:
      - name: fast-burn
        factor: 14.4
        shortWindow: "1h"
        longWindow: "5m"
        severity: critical
        pageOnCall: true
      - name: slow-burn
        factor: 6.0
        shortWindow: "6h"
        longWindow: "30m"
        severity: warning
        pageOnCall: false
  - name: orderservice-throughput
    description: "OrderService minimum throughput"
    sli:
      type: throughput
      expression: "rate(http_requests_total[5m])"
      minimumRps: 100
    target: 99.5
    window: 7d
```
### Error Budget Calculations
The error budget quantifies how much unreliability is tolerable within a given window:
For a 99.9% SLO over 30 days:
| Metric | Value |
|---|---|
| Total minutes in window | 43,200 |
| Allowed downtime | 43.2 minutes |
| Error budget (%) | 0.1% |
| Budget per day | ~1.44 minutes |
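These figures follow from simple arithmetic on the window length and target. A small sketch that reproduces the table; the `error_budget` helper name is illustrative:

```python
def error_budget(target_pct: float, window_days: int) -> dict:
    """Derive the error-budget figures shown in the table above."""
    total_minutes = window_days * 24 * 60
    budget_pct = round(100.0 - target_pct, 6)          # e.g. 0.1% for a 99.9% target
    allowed_downtime = round(total_minutes * budget_pct / 100.0, 2)
    return {
        "total_minutes": total_minutes,                # 43,200 for a 30-day window
        "budget_pct": budget_pct,
        "allowed_downtime_minutes": allowed_downtime,  # 43.2 for 99.9% over 30 days
        "budget_per_day_minutes": round(allowed_downtime / window_days, 2),
    }

budget = error_budget(99.9, 30)
print(budget["allowed_downtime_minutes"])  # 43.2
print(budget["budget_per_day_minutes"])    # 1.44
```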
### Burn Rate Alert Mathematics
Burn rate measures how fast the error budget is being consumed relative to the window:
| Burn Rate Factor | Meaning | Budget Exhaustion Time |
|---|---|---|
| 1.0 | Consuming budget at exactly the allowed rate | 30 days (full window) |
| 3.0 | 3x normal consumption | 10 days |
| 6.0 | 6x normal consumption | 5 days |
| 14.4 | Critical burn β budget will exhaust soon | ~2 days |
| 36.0 | Severe incident β immediate budget drain | ~20 hours |
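The exhaustion times above follow from dividing the window by the burn rate factor, since a factor-N burn consumes the budget N times faster than planned. A one-line derivation, with the helper name being illustrative:

```python
def budget_exhaustion_days(window_days: float, burn_rate_factor: float) -> float:
    """A burn rate of N consumes the error budget N times faster than allowed,
    so a budget sized to last the full window exhausts in window / N days."""
    return window_days / burn_rate_factor

for factor in (1.0, 3.0, 6.0, 14.4, 36.0):
    days = budget_exhaustion_days(30, factor)
    print(f"burn rate {factor:>4}: budget gone in {days:.1f} days ({days * 24:.0f} hours)")
```

At factor 14.4 this gives roughly 2.1 days, and at factor 36 roughly 20 hours, matching the table.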
### SLA Breach Notifications

When SLO compliance drops below SLA-contractual thresholds, the blueprint triggers:

```yaml
slaBreachPolicy:
  thresholds:
    - level: warning
      condition: "error_budget_remaining < 30%"
      notify: [sre-team, product-manager]
    - level: critical
      condition: "error_budget_remaining < 10%"
      notify: [sre-team, engineering-director, customer-success]
    - level: breach
      condition: "slo_compliance < sla_target"
      notify: [executive-team, legal, customer-success]
  actions:
    - createIncident
    - freezeDeployments
    - generateComplianceReport
```
### Agent Collaboration

| Agent | Role |
|---|---|
| SLO/SLA Compliance Agent | Defines SLO targets, calculates error budgets, sets burn rate thresholds |
| Alerting and Incident Manager Agent | Creates burn rate alert rules and routing |
| Observability Engineer Agent | Validates SLI metric expressions against metrics taxonomy |
| DevOps Engineer Agent | Deploys SLO recording rules and dashboards |
| Incident Response Agent | Triggers automated actions on SLA breach |

SLOs are not aspirational targets; they are error-budget-backed, burn-rate-alerted, compliance-reported contracts.
## Dashboard-as-Code

Dashboards in the ConnectSoft AI Software Factory are not manually created; they are declaratively defined in the Observability Blueprint, generated from templates, and versioned alongside the service they monitor.
### Dashboard Architecture

```mermaid
flowchart TD
  Blueprint["Observability Blueprint"] --> DashboardDef["Dashboard Definition"]
  DashboardDef --> Generator["Dashboard Generator Agent"]
  Generator --> GrafanaJSON["Grafana JSON"]
  Generator --> AzureDash["Azure Dashboard ARM"]
  GrafanaJSON --> Provisioning["Grafana Provisioning"]
  AzureDash --> ArmDeploy["ARM Deployment"]
  Provisioning --> LiveDash["Live Dashboard"]
  ArmDeploy --> LiveDash
```
### Dashboard Definition Format

Dashboards are defined declaratively in the blueprint and then rendered into provider-specific formats:

```yaml
dashboards:
  - name: orderservice-overview
    title: "OrderService Overview"
    description: "Primary operational dashboard for OrderService"
    tags: [orderservice, production, sre]
    refresh: "30s"
    timeRange: "6h"
    variables:
      - name: environment
        type: query
        query: "label_values(connectsoft_orderservice_http_requests_total, environment)"
        default: production
      - name: method
        type: custom
        values: [GET, POST, PUT, DELETE]
        includeAll: true
    rows:
      - title: "Traffic & Availability"
        panels:
          - type: stat
            title: "Request Rate"
            expr: "sum(rate(connectsoft_orderservice_http_requests_total[5m]))"
            unit: "reqps"
            thresholds: { green: 0, yellow: 500, red: 1000 }
          - type: stat
            title: "Error Rate"
            expr: "sum(rate(connectsoft_orderservice_http_requests_total{status_code=~'5..'}[5m])) / sum(rate(connectsoft_orderservice_http_requests_total[5m])) * 100"
            unit: "percent"
            thresholds: { green: 0, yellow: 1, red: 5 }
          - type: gauge
            title: "SLO Compliance"
            expr: "slo:connectsoft_orderservice_availability:compliance"
            unit: "percent"
            thresholds: { red: 0, yellow: 99, green: 99.9 }
      - title: "Latency Distribution"
        panels:
          - type: heatmap
            title: "Request Duration Heatmap"
            expr: "sum(rate(connectsoft_orderservice_http_request_duration_seconds_bucket[5m])) by (le)"
            yAxisUnit: "seconds"
          - type: timeseries
            title: "Latency Percentiles"
            queries:
              - expr: "histogram_quantile(0.50, sum(rate(connectsoft_orderservice_http_request_duration_seconds_bucket[5m])) by (le))"
                legend: "P50"
              - expr: "histogram_quantile(0.90, sum(rate(connectsoft_orderservice_http_request_duration_seconds_bucket[5m])) by (le))"
                legend: "P90"
              - expr: "histogram_quantile(0.99, sum(rate(connectsoft_orderservice_http_request_duration_seconds_bucket[5m])) by (le))"
                legend: "P99"
      - title: "Error Budget"
        panels:
          - type: timeseries
            title: "Error Budget Remaining"
            expr: "slo:connectsoft_orderservice_availability:error_budget_remaining"
            unit: "percent"
          - type: stat
            title: "Budget Burn Rate (1h)"
            expr: "slo:connectsoft_orderservice_availability:burn_rate_1h"
            thresholds: { green: 0, yellow: 6, red: 14.4 }
```
### Grafana Panel Definition Example (Generated JSON)

```json
{
  "id": 1,
  "type": "timeseries",
  "title": "Request Rate by Method",
  "datasource": "Prometheus",
  "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
  "targets": [
    {
      "expr": "sum(rate(connectsoft_orderservice_http_requests_total{environment=\"$environment\"}[5m])) by (method)",
      "legendFormat": "{{ method }}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 500 },
          { "color": "red", "value": 1000 }
        ]
      }
    }
  },
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "placement": "bottom" }
  }
}
```
### Dashboard Catalog
Each service generates a standard set of dashboards:
| Dashboard | Purpose |
|---|---|
| Overview | Traffic, errors, latency, saturation at a glance |
| SLO Compliance | Error budget burn, compliance trends, burn rate history |
| Latency Deep Dive | Percentile breakdowns, endpoint-level latency, slow queries |
| Error Analysis | Error classification, status code distribution, retries |
| Resource Utilization | CPU, memory, disk, network per pod/container |
| Dependency Health | Upstream/downstream service health, circuit breaker state |
| Business Metrics | Domain-specific counters and KPIs |
### Agent Collaboration

| Agent | Role |
|---|---|
| Observability Engineer Agent | Designs dashboard layouts, panel configurations |
| SLO/SLA Compliance Agent | Adds SLO compliance panels and error budget visualizations |
| DevOps Engineer Agent | Deploys dashboards via Grafana provisioning or ARM templates |
| Infrastructure Engineer Agent | Adds resource utilization panels from infrastructure metrics |

Dashboards are not manually crafted; they are generated, versioned, and deployed as code from the Observability Blueprint.
π Log Aggregation and Analysis¶
Structured logging is the diagnostic backbone of any observable system. The Observability Blueprint defines how logs are structured, correlated, stored, and analyzed β transforming raw log lines into queryable, actionable intelligence.
π Structured Logging Schema¶
All services emit logs in a standardized JSON schema:
logSchema:
format: json
standardFields:
- name: timestamp
type: datetime
format: "ISO8601"
required: true
- name: level
type: enum
values: [Trace, Debug, Information, Warning, Error, Critical]
required: true
- name: message
type: string
required: true
- name: service
type: string
source: "environment"
required: true
- name: traceId
type: string
source: "W3C traceparent"
required: true
- name: spanId
type: string
source: "W3C traceparent"
required: true
- name: correlationId
type: string
source: "x-correlation-id header"
required: true
- name: userId
type: string
piiRedacted: true
required: false
- name: tenantId
type: string
required: false
- name: environment
type: enum
values: [development, staging, production]
required: true
- name: version
type: string
required: true
- name: exception
type: object
fields: [type, message, stackTrace]
required: false
π Example Structured Log Entry¶
{
"timestamp": "2025-08-14T09:32:15.123Z",
"level": "Error",
"message": "Failed to process order: payment gateway timeout",
"service": "orderservice",
"traceId": "abc123def456",
"spanId": "span789",
"correlationId": "corr-001-xyz",
"tenantId": "tenant-acme",
"environment": "production",
"version": "2.3.1",
"exception": {
"type": "TimeoutException",
"message": "Payment gateway did not respond within 30s",
"stackTrace": "at OrderService.ProcessPayment() in PaymentHandler.cs:line 42..."
},
"metadata": {
"orderId": "ORD-12345",
"paymentMethod": "credit_card",
"gatewayResponseCode": null
}
}
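A sketch of how an entry like this could be checked against the declared schema. The field sets below are transcribed from the `logSchema` above; a real validator would be generated from `log-schema.yaml` rather than hard-coded:

```python
# Required fields and allowed levels transcribed from the logSchema above.
REQUIRED_FIELDS = {
    "timestamp", "level", "message", "service",
    "traceId", "spanId", "correlationId", "environment", "version",
}
ALLOWED_LEVELS = {"Trace", "Debug", "Information", "Warning", "Error", "Critical"}

def validate_log_entry(entry: dict) -> list:
    """Return a list of violations; an empty list means the entry is compliant."""
    errors = [f"missing required field: {f}"
              for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if "level" in entry and entry["level"] not in ALLOWED_LEVELS:
        errors.append(f"invalid level: {entry['level']}")
    return errors

sample = {
    "timestamp": "2025-08-14T09:32:15.123Z", "level": "Error",
    "message": "Failed to process order", "service": "orderservice",
    "traceId": "abc123def456", "spanId": "span789",
    "correlationId": "corr-001-xyz", "environment": "production",
    "version": "2.3.1",
}
```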
π Log-Trace Correlation¶
Logs and traces are linked through shared context fields:
| Field | Purpose |
|---|---|
| `traceId` | Links log entry to the distributed trace |
| `spanId` | Links to the specific operation span within the trace |
| `correlationId` | Business-level correlation across multiple service calls |
| `parentSpanId` | Enables reconstruction of the call hierarchy |
This enables jump-to-trace from any log entry and jump-to-logs from any trace span.
π¦ Retention Policies¶
logRetention:
policies:
- level: [Error, Critical]
retentionDays: 365
archiveTo: coldStorage
- level: [Warning]
retentionDays: 90
- level: [Information]
retentionDays: 30
- level: [Debug, Trace]
retentionDays: 7
environments: [development, staging]
productionRules:
minLevel: Information
prohibitedLevels: [Trace, Debug]
piiRedaction: enabled
maxLogSizeKb: 64
π Anomaly Detection Rules¶
logAnomalyDetection:
enabled: true
rules:
- name: error-rate-spike
condition: "count(level == 'Error') in 5m > 3x baseline"
action: createAlert
severity: warning
- name: new-exception-type
condition: "exception.type NOT IN known_exceptions"
action: createTicket
severity: info
- name: repeated-timeout-pattern
condition: "count(message CONTAINS 'timeout') in 10m > 50"
action: createAlert
severity: critical
- name: log-volume-anomaly
condition: "log_volume in 5m deviates > 2 stddev from rolling_avg"
action: annotate
severity: info
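Two of the rules above can be sketched as plain predicates. How the detection engine supplies the baseline count and the rolling window of volume samples is an assumption here:

```python
from statistics import mean, stdev

def error_rate_spike(current_count: int, baseline_count: float,
                     factor: float = 3.0) -> bool:
    """error-rate-spike: error count in the current 5m window exceeds
    factor x the baseline (floored at 1 to avoid a zero baseline)."""
    return current_count > factor * max(baseline_count, 1.0)

def log_volume_anomaly(history: list, current: float,
                       n_stddev: float = 2.0) -> bool:
    """log-volume-anomaly: current volume deviates more than n standard
    deviations from the rolling average of past windows."""
    return abs(current - mean(history)) > n_stddev * stdev(history)
```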
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| Log Analysis Agent | Defines log schema, anomaly detection rules, retention policies |
| Observability Engineer Agent | Ensures log-trace correlation is properly configured |
| Security Architect Agent | Enforces PII redaction, audit log requirements |
| DevOps Engineer Agent | Configures log shipping, storage backends, indexing |
π Logs are not unstructured noise β they are schema-validated, trace-correlated, anomaly-monitored observability signals.
π Distributed Tracing Configuration¶
Distributed tracing provides the end-to-end visibility needed to understand request flows across microservices, queues, databases, and external dependencies. The Observability Blueprint defines the trace topology, instrumentation rules, and sampling strategies for every generated component.
π Trace Architecture¶
flowchart LR
Client["π Client"] --> Gateway["πͺ API Gateway"]
Gateway --> ServiceA["π§± Order Service"]
ServiceA --> ServiceB["π§± Payment Service"]
ServiceA --> Queue["π¨ Message Queue"]
Queue --> ServiceC["π§± Notification Service"]
ServiceB --> DB["ποΈ Database"]
ServiceA --> Cache["β‘ Redis Cache"]
Gateway -.->|spans| Collector["π‘ OTEL Collector"]
ServiceA -.->|spans| Collector
ServiceB -.->|spans| Collector
ServiceC -.->|spans| Collector
Collector --> Jaeger["π Jaeger / Tempo"]
Collector --> Analytics["π Trace Analytics"]
π OpenTelemetry Configuration¶
otelConfiguration:
serviceName: "orderservice"
serviceVersion: "{{ .AppVersion }}"
environment: "{{ .Environment }}"
exporters:
otlp:
endpoint: "otel-collector.observability.svc:4317"
protocol: grpc
headers:
x-api-key: "{{ .OtelApiKey }}"
compression: gzip
timeout: "10s"
retry:
enabled: true
maxElapsedTime: "300s"
tracing:
sampler:
type: parentBasedTraceIdRatio
ratio: 0.1 # 10% of traces in production
overrides:
- condition: "http.status_code >= 500"
ratio: 1.0 # always sample errors
- condition: "span.duration > 5s"
ratio: 1.0 # always sample slow requests
- condition: "http.route == '/health'"
ratio: 0.0 # never sample health checks
propagation:
formats: [tracecontext, baggage]
customHeaders:
- x-correlation-id
- x-tenant-id
spanLimits:
maxAttributes: 128
maxEvents: 128
maxLinks: 128
maxAttributeLength: 1024
instrumentation:
autoInstrument:
- aspnetcore
- httpclient
- sqlclient
- entityframeworkcore
- masstransit
- redis
- grpc
customSpans:
- name: "order.process"
type: internal
attributes: [orderId, paymentMethod, orderTotal]
- name: "payment.authorize"
type: client
attributes: [gatewayProvider, amount, currency]
- name: "notification.send"
type: producer
attributes: [notificationType, recipientCount]
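The sampler overrides above resolve to a simple precedence check evaluated in config order. A sketch, using a plain dict in place of real OpenTelemetry SDK span attributes:

```python
import random

BASE_RATIO = 0.1  # 10% production baseline, as configured above

def effective_ratio(span: dict) -> float:
    """Resolve the sampling ratio for a span, applying overrides in config order."""
    if span.get("http.status_code", 0) >= 500:
        return 1.0   # always sample errors
    if span.get("duration_s", 0) > 5:
        return 1.0   # always sample slow requests
    if span.get("http.route") == "/health":
        return 0.0   # never sample health checks
    return BASE_RATIO

def should_sample(span: dict, rng=random.random) -> bool:
    """Probabilistic decision against the resolved ratio."""
    return rng() < effective_ratio(span)
```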
π Span Definitions¶
| Span Name | Type | Key Attributes | Purpose |
|---|---|---|---|
| `http.server` | Server | method, route, status_code, user_agent | Incoming HTTP request processing |
| `http.client` | Client | method, url, status_code | Outgoing HTTP calls to dependencies |
| `db.query` | Client | db.system, db.statement, db.operation | Database query execution |
| `messaging.publish` | Producer | messaging.system, destination, message_id | Publishing messages to queues/topics |
| `messaging.consume` | Consumer | messaging.system, source, message_id | Consuming messages from queues/topics |
| `cache.get` / `cache.set` | Client | cache.system, key_pattern, hit | Cache operations |
| `order.process` | Internal | orderId, paymentMethod, total | Business logic span |
ποΈ Sampling Strategies¶
| Strategy | Use Case | Configuration |
|---|---|---|
| AlwaysOn | Development and staging environments | ratio: 1.0 |
| Probability | Production baseline sampling | ratio: 0.1 (10%) |
| Error-biased | Always capture errors regardless of sampling | ratio: 1.0 on 5xx status codes |
| Latency-biased | Always capture slow requests | ratio: 1.0 on spans > threshold |
| Head-based | Decision made at trace root | parentBasedTraceIdRatio |
| Tail-based | Decision made after all spans collected | Requires collector-side sampling |
π Context Propagation¶
contextPropagation:
w3cTraceContext: true
w3cBaggage: true
customPropagation:
headers:
- name: x-correlation-id
inject: true
extract: true
- name: x-tenant-id
inject: true
extract: true
- name: x-user-context
inject: true
extract: true
redactInLogs: true
messageBusPropagation:
masstransit:
headers: [TraceParent, TraceState, CorrelationId, TenantId]
rawRabbitMq:
headers: [traceparent, x-correlation-id]
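Inject/extract for this propagation config can be sketched as follows. The `traceparent` layout is the standard W3C `version-traceid-spanid-flags` format, and the custom header names mirror the block above:

```python
# Custom headers propagated alongside W3C trace context, per the config above.
CUSTOM_HEADERS = ["x-correlation-id", "x-tenant-id", "x-user-context"]

def inject(ctx: dict) -> dict:
    """Build outgoing request headers from the current context."""
    headers = {"traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-01"}
    for name in CUSTOM_HEADERS:
        if name in ctx:
            headers[name] = ctx[name]
    return headers

def extract(headers: dict) -> dict:
    """Parse incoming headers back into a context dict."""
    _, trace_id, span_id, _ = headers["traceparent"].split("-")
    ctx = {"trace_id": trace_id, "span_id": span_id}
    for name in CUSTOM_HEADERS:
        if name in headers:
            ctx[name] = headers[name]
    return ctx

ctx = {
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
    "x-tenant-id": "tenant-acme",
}
```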
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| Observability Engineer Agent | Designs trace topology, span definitions, sampling strategies |
| DevOps Engineer Agent | Deploys OTEL Collector, configures exporters and pipelines |
| Security Architect Agent | Ensures sensitive data is not leaked through trace attributes |
| Infrastructure Engineer Agent | Provisions tracing backends (Jaeger, Tempo, Azure Monitor) |
π Traces are not just diagnostic tools β they are topology maps of runtime behavior, structured and sampled by design.
π On-Call and Incident Management¶
The Observability Blueprint extends beyond passive monitoring into active operational response β defining how alerts translate into human action through on-call rotations, escalation chains, notification channels, and incident creation workflows.
π₯ On-Call Rotation Definition¶
onCallRotations:
- name: sre-primary
description: "Primary SRE on-call rotation"
timezone: "UTC"
rotationType: weekly
handoffDay: Monday
handoffTime: "09:00"
participants:
- name: Alice Chen
id: user-alice
contactMethods:
- type: phone
value: "+1-555-0101"
- type: sms
value: "+1-555-0101"
- type: email
value: "alice@connectsoft.io"
- type: slack
value: "@alice.chen"
- name: Bob Martinez
id: user-bob
contactMethods:
- type: phone
value: "+1-555-0102"
- type: email
value: "bob@connectsoft.io"
- type: slack
value: "@bob.martinez"
- name: Carol Kim
id: user-carol
contactMethods:
- type: phone
value: "+1-555-0103"
- type: email
value: "carol@connectsoft.io"
overrides:
- startDate: "2025-12-24"
endDate: "2025-12-26"
participant: user-bob
reason: "Holiday coverage swap"
- name: sre-secondary
description: "Secondary SRE escalation rotation"
timezone: "UTC"
rotationType: weekly
handoffDay: Monday
handoffTime: "09:00"
participants:
- name: David Okafor
id: user-david
- name: Eva Johansson
id: user-eva
π‘ Notification Channels¶
notificationChannels:
- name: slack-sre-alerts
type: slack
target: "#sre-alerts"
severities: [critical, warning]
templates:
critical: |
:rotating_light: *CRITICAL ALERT*
*Service:* {{ .Labels.service }}
*Alert:* {{ .Annotations.summary }}
*Runbook:* {{ .Annotations.runbook }}
warning: |
:warning: *Warning Alert*
*Service:* {{ .Labels.service }}
*Alert:* {{ .Annotations.summary }}
- name: pagerduty-sre
type: pagerduty
integrationKey: "{{ .PagerDutyKey }}"
severities: [critical]
dedupKeyTemplate: "{{ .Labels.alertname }}-{{ .Labels.service }}"
- name: email-engineering
type: email
recipients:
- engineering@connectsoft.io
severities: [critical]
throttle: "15m"
- name: teams-operations
type: microsoftTeams
webhookUrl: "{{ .TeamsWebhookUrl }}"
severities: [critical, warning]
π¨ Incident Creation from Alerts¶
incidentCreation:
enabled: true
triggers:
- condition: "severity == 'critical' AND alertState == 'firing' AND duration > '5m'"
action: createIncident
template:
title: "[{{ .Severity }}] {{ .AlertName }} β {{ .Labels.service }}"
description: "{{ .Annotations.description }}"
severity: "{{ .Severity }}"
assignTo: currentOnCall
runbook: "{{ .Annotations.runbook }}"
tags: [auto-created, {{ .Labels.component }}, {{ .Labels.category }}]
slackChannel: "#incident-{{ .Labels.service }}"
- condition: "error_budget_remaining < 10%"
action: createIncident
template:
title: "SLO Breach Risk β {{ .Labels.service }}"
description: "Error budget is below 10% for {{ .SloName }}"
severity: warning
assignTo: sre-primary
tags: [slo-breach, error-budget]
postMortem:
autoGenerate: true
template: "postmortem-template-v2"
requiredSections:
- timeline
- rootCause
- impact
- actionItems
- lessonsLearned
dueAfterIncidentClose: "72h"
π Escalation Flow¶
flowchart TD
Alert["π¨ Alert Fires"] --> Dedup["π Deduplication"]
Dedup --> Route["π‘ Route by Severity"]
Route -->|Critical| PagePrimary["π Page Primary On-Call"]
Route -->|Warning| SlackNotify["π¬ Slack Notification"]
Route -->|Info| Dashboard["π Dashboard Annotation"]
PagePrimary -->|No ACK in 15m| PageSecondary["π Page Secondary On-Call"]
PageSecondary -->|No ACK in 15m| PageManager["π Page Engineering Manager"]
PageManager -->|No ACK in 30m| PageLeadership["π Page Platform Leadership"]
PagePrimary -->|ACK| Investigate["π Investigate"]
Investigate --> Resolve["β Resolve"]
Investigate -->|Incident| CreateIncident["π« Create Incident"]
CreateIncident --> Mitigate["π‘οΈ Mitigate"]
Mitigate --> Resolve
Resolve --> PostMortem["π Post-Mortem"]
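The escalation timing in the flowchart can be expressed as a lookup over minutes without acknowledgement. The thresholds are cumulative: 15 minutes to the secondary, 15 more to the manager, 30 more to leadership:

```python
# Cumulative escalation thresholds mirroring the flowchart above.
ESCALATION_CHAIN = [
    (0, "primary-on-call"),
    (15, "secondary-on-call"),
    (30, "engineering-manager"),
    (60, "platform-leadership"),
]

def current_escalation_target(minutes_unacked: float) -> str:
    """Return the highest escalation level reached for an unacknowledged page."""
    target = ESCALATION_CHAIN[0][1]
    for threshold, who in ESCALATION_CHAIN:
        if minutes_unacked >= threshold:
            target = who
    return target
```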
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| Alerting and Incident Manager Agent | Defines on-call rotations, notification channels, escalation |
| Incident Response Agent | Creates incidents, triggers containment playbooks |
| Observability Engineer Agent | Links alert telemetry to incident context |
| DevOps Engineer Agent | Configures PagerDuty/OpsGenie integrations |
π On-call is not just a schedule β it's an automated, escalation-driven response pipeline that turns alerts into actions.
π Cross-Blueprint Intersections¶
The Observability Blueprint does not exist in isolation. It integrates deeply with other blueprints in the ConnectSoft AI Software Factory, consuming their signals and providing telemetry that enriches the entire platform.
π‘οΈ Security Blueprint Integration¶
| From Security Blueprint | Used in Observability Blueprint |
|---|---|
| Auth failure events | Metrics: auth_failures_total, alerts on spike patterns |
| Secret access audit logs | Log correlation, anomaly detection for unauthorized access |
| Threat model risk tags | Dashboard panels for security posture, threat signal tracking |
| Penetration test results | SLO impact analysis, security incident creation triggers |
| WAF/firewall events | Real-time security dashboards, rate-limit effectiveness metrics |
securityTelemetryIntegration:
metrics:
- name: connectsoft_security_auth_failures_total
source: securityBlueprint.authentication
labels: [auth_method, failure_reason, source_ip]
- name: connectsoft_security_threat_detections_total
source: securityBlueprint.threatModel
labels: [threat_vector, severity, service]
alerts:
- name: AuthFailureSpike
expr: "rate(connectsoft_security_auth_failures_total[5m]) > 10"
severity: warning
dashboards:
- name: security-posture
panels: [auth-failures, threat-detections, secret-access-audit]
π¦ Infrastructure Blueprint Integration¶
| From Infrastructure Blueprint | Used in Observability Blueprint |
|---|---|
| Resource definitions (CPU, memory) | Capacity metrics, utilization dashboards, scaling alerts |
| Health probe configurations | Liveness/readiness monitoring, uptime tracking |
| Network policies | Network latency metrics, connection pool monitoring |
| Scaling rules (HPA/KEDA) | Autoscaling event dashboards, capacity burn rate tracking |
| Container specifications | Container resource metrics, OOM kill tracking |
π§ͺ Test Blueprint Integration¶
| From Test Blueprint | Used in Observability Blueprint |
|---|---|
| Test coverage metrics | Quality dashboards, regression alert triggers |
| Test execution telemetry | Test duration trends, flaky test detection |
| Chaos test results | Resilience score metrics, fault injection impact tracking |
| Security test findings | Vulnerability count metrics, compliance dashboards |
π Pipeline Blueprint Integration¶
| From Pipeline Blueprint | Used in Observability Blueprint |
|---|---|
| Deployment events | Deployment markers on dashboards, change-failure rate metrics |
| Build duration metrics | CI/CD efficiency dashboards, build time trend alerts |
| Rollback events | Rollback frequency metrics, deployment health scoring |
| Pipeline gate results | Observability readiness compliance tracking |
π§± Service Blueprint Integration¶
| From Service Blueprint | Used in Observability Blueprint |
|---|---|
| API endpoint definitions | Per-endpoint metrics, latency breakdowns, error classification |
| Domain event contracts | Event processing metrics, consumer lag monitoring |
| Dependency declarations | Dependency health dashboards, circuit breaker monitoring |
| Business operation definitions | Business KPI metrics, domain-level SLOs |
π€ Agent Collaboration for Cross-Blueprint¶
| Agent | Role |
|---|---|
| Observability Engineer Agent | Aggregates signals from all connected blueprints |
| Security Architect Agent | Provides security telemetry requirements |
| Infrastructure Engineer Agent | Provides resource and infrastructure metric definitions |
| Pipeline Agent | Provides deployment event schemas for correlation |
π The Observability Blueprint is the connective tissue of the entire blueprint ecosystem β every other blueprint both feeds and consumes it.
π Blueprint Location, Traceability, and Versioning¶
An Observability Blueprint is not just content β it's a traceable artifact, part of a multi-agent lineage graph, and lives at a predictable location in the Factory's file and memory hierarchy.
This enables cross-agent validation, rollback, comparison, and regeneration.
π File System Location¶
Each blueprint is stored in a consistent location within the Factory workspace:
blueprints/observability/{service-name}/observability-blueprint.md
blueprints/observability/{service-name}/observability-blueprint.json
blueprints/observability/{service-name}/alert-rules.yaml
blueprints/observability/{service-name}/slo-definitions.yaml
blueprints/observability/{service-name}/dashboards/overview.json
blueprints/observability/{service-name}/dashboards/slo-compliance.json
blueprints/observability/{service-name}/dashboards/latency-deep-dive.json
blueprints/observability/{service-name}/otel-config.yaml
blueprints/observability/{service-name}/on-call.yaml
blueprints/observability/{service-name}/log-schema.yaml
- Markdown is human-readable and Studio-rendered.
- JSON is parsed by orchestrators and enforcement agents.
- YAML files are directly deployable configuration artifacts.
π§ Traceability Fields¶
Each blueprint includes a set of required metadata fields for trace alignment:
| Field | Purpose |
|---|---|
| `traceId` | Links blueprint to full generation pipeline |
| `agentId` | Records which agent(s) emitted the artifact |
| `originPrompt` | Captures human-initiated signal or intent |
| `createdAt` | ISO timestamp for audit |
| `observabilityProfile` | Level of observability depth (minimal, standard, comprehensive) |
| `sloCount` | Number of SLO definitions in the blueprint |
| `alertCount` | Number of alert rules defined |
| `dashboardCount` | Number of dashboard definitions |
These fields ensure full trace and observability for regeneration, validation, and compliance review.
π Versioning and Mutation Tracking¶
| Mechanism | Purpose |
|---|---|
| `v1`, `v2`, ... | Manual or automatic version bumping by agents |
| `diff-link` metadata | References upstream and downstream changes |
| GitOps snapshot tags | Bind blueprint versions to commit hashes or releases |
| Drift monitors | Alert if effective observability config deviates from blueprint |
| Incident-triggered updates | Auto-update blueprint after post-mortem action items |
π Mutation History Example¶
metadata:
traceId: "trc_92ab_orderservice_obs"
agentId: "obs-engineer-agent"
originPrompt: "Add P99 latency SLO for OrderService"
createdAt: "2025-08-14T09:30:00Z"
version: "v3"
diffFrom: "v2"
changedFields:
- "sloDefinitions[1]"
- "alertRules[4]"
- "dashboards.slo-compliance.panels[2]"
changeReason: "Post-mortem action item: INC-2025-0847"
approvedBy: "sre-lead"
These mechanisms ensure that observability is not an afterthought, but a tracked, versioned, observable system artifact.
β Observability-First Validation in CI/CD¶
The Observability Blueprint is not just a design artifact β it is actively validated in the CI/CD pipeline to ensure every deployment meets observability readiness requirements before reaching production.
π¦ Observability Gates¶
flowchart LR
Build["π¨ Build"] --> UnitTest["π§ͺ Unit Tests"]
UnitTest --> ObsValidation["π‘ Observability Validation"]
ObsValidation -->|Pass| IntegrationTest["π Integration Tests"]
ObsValidation -->|Fail| Block["π Block Deploy"]
IntegrationTest --> SLOCheck["π― SLO Readiness Check"]
SLOCheck -->|Pass| Deploy["π Deploy to Staging"]
SLOCheck -->|Fail| Block
Deploy --> SmokeTest["π¨ Smoke Tests"]
SmokeTest --> TelemetryVerify["π Telemetry Verification"]
TelemetryVerify -->|Pass| Production["π Promote to Production"]
TelemetryVerify -->|Fail| Rollback["βͺ Rollback"]
π Validation Checklist¶
| Gate | Validation | Blocks Deploy? |
|---|---|---|
| Blueprint exists | observability-blueprint.md and .json present | β Yes |
| Metrics defined | At least RED metrics (Rate, Errors, Duration) are specified | β Yes |
| Alert rules present | Critical alerts with runbooks are defined | β Yes |
| SLOs defined | At least one availability SLO with error budget | β Yes |
| Dashboard provisioned | Overview dashboard JSON is valid and deployable | β Yes |
| OTEL config valid | OpenTelemetry config passes schema validation | β Yes |
| On-call configured | On-call rotation references valid schedules | β οΈ Warning |
| Log schema compliant | Service logs match the declared structured schema | β Yes |
| Trace propagation tested | End-to-end trace context verified in integration tests | β οΈ Warning |
| No metric naming violations | All metrics follow naming conventions | β Yes |
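A hedged sketch of the kind of check `scripts/validate-metrics-naming.py` might perform. The actual conventions file is not shown here, so these rules are inferred from metric names used elsewhere in this blueprint (the `connectsoft_` prefix, snake_case, `_total` on counters, base-unit suffixes on histograms):

```python
import re

# Inferred convention: connectsoft_ prefix followed by snake_case segments.
NAME_RE = re.compile(r"^connectsoft_[a-z0-9]+(_[a-z0-9]+)*$")

def naming_violations(name: str, metric_type: str) -> list:
    """Return naming-convention violations for one metric definition."""
    errors = []
    if not NAME_RE.match(name):
        errors.append(f"{name}: must be snake_case with a connectsoft_ prefix")
    if metric_type == "counter" and not name.endswith("_total"):
        errors.append(f"{name}: counters should end with _total")
    if metric_type == "histogram" and not name.endswith(("_seconds", "_bytes")):
        errors.append(f"{name}: histograms should carry a base-unit suffix")
    return errors
```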
π Pipeline Step Example¶
- stage: ObservabilityValidation
displayName: "π‘ Observability Readiness"
jobs:
- job: ValidateBlueprint
displayName: "Validate Observability Blueprint"
steps:
- task: Bash@3
displayName: "Check blueprint exists"
inputs:
targetType: inline
script: |
if [ ! -f "blueprints/observability/$(ServiceName)/observability-blueprint.json" ]; then
echo "##vso[task.logissue type=error]Observability blueprint not found"
exit 1
fi
- task: Bash@3
displayName: "Validate metrics naming"
inputs:
targetType: inline
script: |
python scripts/validate-metrics-naming.py \
--blueprint "blueprints/observability/$(ServiceName)/observability-blueprint.json" \
--conventions "standards/metrics-naming.yaml"
- task: Bash@3
displayName: "Validate alert rules"
inputs:
targetType: inline
script: |
promtool check rules \
"blueprints/observability/$(ServiceName)/alert-rules.yaml"
- task: Bash@3
displayName: "Validate SLO definitions"
inputs:
targetType: inline
script: |
python scripts/validate-slo-definitions.py \
--slo-file "blueprints/observability/$(ServiceName)/slo-definitions.yaml" \
--min-availability 99.0
- task: Bash@3
displayName: "Validate dashboard JSON"
inputs:
targetType: inline
script: |
python scripts/validate-grafana-dashboard.py \
--dashboard-dir "blueprints/observability/$(ServiceName)/dashboards/"
- stage: TelemetryVerification
displayName: "π Post-Deploy Telemetry Check"
dependsOn: DeployStaging
jobs:
- job: VerifyTelemetry
displayName: "Verify Telemetry Emission"
steps:
- task: Bash@3
displayName: "Verify metrics are being scraped"
inputs:
targetType: inline
script: |
python scripts/verify-metrics-emission.py \
--service "$(ServiceName)" \
--prometheus-url "$(PrometheusUrl)" \
--expected-metrics "blueprints/observability/$(ServiceName)/observability-blueprint.json"
- task: Bash@3
displayName: "Verify traces are flowing"
inputs:
targetType: inline
script: |
python scripts/verify-trace-emission.py \
--service "$(ServiceName)" \
--tempo-url "$(TempoUrl)" \
--timeout 300
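The post-deploy metric check boils down to comparing blueprint-declared names against what the service actually exposes. A sketch that parses Prometheus text exposition format — the real script would more likely query the Prometheus HTTP API, and histogram series suffixes such as `_bucket` are ignored here for brevity:

```python
def parse_exposition(text: str) -> set:
    """Extract metric names from Prometheus text exposition format."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # Name ends at the label block "{" or, without labels, at the first space.
        names.add(line.split("{")[0].split(" ")[0])
    return names

def missing_metrics(expected, exposition_text) -> set:
    """Metrics the blueprint promises but the running service never emitted."""
    return set(expected) - parse_exposition(exposition_text)

SAMPLE = """# TYPE connectsoft_orderservice_http_requests_total counter
connectsoft_orderservice_http_requests_total{method="GET"} 1027
connectsoft_health_check_status 1
"""
```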
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| DevOps Engineer Agent | Defines pipeline gates and validation scripts |
| Observability Engineer Agent | Maintains validation schemas and naming convention rules |
| Pipeline Agent | Executes validation steps and reports results |
| SLO/SLA Compliance Agent | Verifies SLO readiness gates |
β No service ships to production without proven observability readiness β validated by agents and enforced by pipelines.
π§ Memory Graph Representation¶
The Observability Blueprint is not only stored as files β it is embedded into the AI Software Factory's memory graph, enabling agents to reason about observability context, retrieve relevant telemetry configurations, and trace decisions across the entire system.
π§© Memory Node Structure¶
Each Observability Blueprint creates a memory node with the following schema:
memoryNode:
type: observability-blueprint
id: "obs-bp-orderservice-v3"
serviceName: OrderService
version: v3
state: approved
linkedEntities:
- type: service-blueprint
id: "svc-bp-orderservice-v5"
- type: infrastructure-blueprint
id: "infra-bp-orderservice-v2"
- type: security-blueprint
id: "sec-bp-orderservice-v4"
- type: test-blueprint
id: "test-bp-orderservice-v3"
- type: pipeline-blueprint
id: "pipe-bp-orderservice-v2"
concepts:
- metrics-taxonomy
- alerting
- slo-compliance
- distributed-tracing
- dashboard-as-code
- log-aggregation
- incident-management
- on-call-rotation
embeddings:
model: "text-embedding-ada-002"
dimensions: 1536
sections:
- metricsDefinition
- alertRules
- sloDefinitions
- dashboardLayouts
- logSchema
- otelConfiguration
- onCallRotations
- incidentPlaybooks
metadata:
traceId: "trc_92ab_orderservice_obs"
agentId: "obs-engineer-agent"
createdAt: "2025-08-14T09:30:00Z"
lastModifiedAt: "2025-09-22T14:15:00Z"
lastModifiedBy: "incident-response-agent"
changeReason: "Post-mortem update: added P99 latency SLO"
π Memory Graph Connections¶
graph TD
OBS["π‘ Observability Blueprint"] --> SVC["π§± Service Blueprint"]
OBS --> INFRA["π¦ Infrastructure Blueprint"]
OBS --> SEC["π‘οΈ Security Blueprint"]
OBS --> TEST["π§ͺ Test Blueprint"]
OBS --> PIPE["π Pipeline Blueprint"]
OBS --> INC["π¨ Incident Records"]
OBS --> PM["π Post-Mortem Reports"]
OBS --> SLO["π― SLO Compliance History"]
OBS --> DASH["π Live Dashboards"]
SVC -->|"exposes metrics"| OBS
SEC -->|"security telemetry"| OBS
INFRA -->|"resource metrics"| OBS
TEST -->|"test telemetry"| OBS
PIPE -->|"deployment events"| OBS
INC -->|"updates blueprint"| OBS
PM -->|"action items"| OBS
π§ Agent Interaction with Memory¶
| Agent Action | Memory Operation |
|---|---|
| Generate new blueprint | CREATE node with full embeddings and entity links |
| Update alert rules | MUTATE node, record diff, update version |
| Query SLO status | RETRIEVE by concept slo-compliance + service name |
| Cross-reference incident | LINK incident record to blueprint node |
| Regenerate after post-mortem | CLONE + MUTATE with updated sections, bump version |
| Search related blueprints | SEMANTIC_SEARCH by embedding similarity across blueprint nodes |
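SEMANTIC_SEARCH ranks blueprint nodes by embedding similarity. A toy sketch with 3-dimensional vectors standing in for the 1536-dimensional ada-002 embeddings:

```python
from math import sqrt

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def semantic_search(query_vec, nodes, top_k=2):
    """nodes: list of (node_id, embedding). Returns ids ranked by similarity."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n[1]), reverse=True)
    return [node_id for node_id, _ in ranked[:top_k]]

nodes = [
    ("obs-bp-orderservice-v3", [1.0, 0.0, 0.0]),
    ("obs-bp-paymentservice-v1", [0.0, 1.0, 0.0]),
    ("infra-bp-orderservice-v2", [0.9, 0.1, 0.0]),
]
```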
π Semantic Search Examples¶
Agents can query the memory graph with natural language:
| Query | Returns |
|---|---|
| "What SLOs exist for OrderService?" | SLO definition section from the observability blueprint |
| "Which services have P99 latency alerts?" | All blueprints with latency-category alert rules |
| "Show me the on-call rotation for payment services" | On-call configuration from payment service blueprints |
| "What changed in observability after INC-2025-0847?" | Diff between blueprint versions linked to the incident |
π§ The memory graph transforms the Observability Blueprint from a static document into a living, queryable, agent-accessible knowledge node.
π Incident Playbooks and Automated Response¶
The Observability Blueprint includes incident playbooks β structured, pre-defined response procedures that are triggered automatically or semi-automatically when specific alert conditions are met.
π Playbook Structure¶
incidentPlaybooks:
- name: high-error-rate-response
description: "Automated response for sustained high error rates"
triggerCondition: "alert.name == 'HighErrorRate' AND alert.severity == 'critical'"
steps:
- order: 1
action: gatherContext
description: "Collect recent deployment events, error logs, and trace samples"
automated: true
script: "scripts/gather-incident-context.sh"
timeout: "60s"
- order: 2
action: checkRecentDeploys
description: "Check if a deployment occurred in the last 30 minutes"
automated: true
script: "scripts/check-recent-deploys.sh"
timeout: "30s"
onMatch:
action: suggestRollback
message: "Recent deployment detected β consider rollback"
- order: 3
action: isolateDependencyFailure
description: "Check downstream dependency health"
automated: true
script: "scripts/check-dependency-health.sh"
timeout: "60s"
- order: 4
action: scaleUp
description: "Scale up service replicas if resource-constrained"
automated: false
requiresApproval: true
approvers: [sre-primary]
command: "kubectl scale deployment orderservice --replicas=5"
- order: 5
action: notifyStakeholders
description: "Send status update to stakeholders"
automated: true
channels: [slack, email]
template: "incident-status-update-v1"
postIncident:
autoCreatePostMortem: true
updateBlueprint: true
linkToSloImpact: true
- name: error-budget-exhaustion
description: "Response when error budget drops below critical threshold"
triggerCondition: "error_budget_remaining < 5%"
steps:
- order: 1
action: freezeDeployments
description: "Halt all non-critical deployments to the affected service"
automated: true
duration: "24h"
- order: 2
action: notifyProductOwner
description: "Alert product owner about SLO compliance risk"
automated: true
channels: [email, slack]
- order: 3
action: prioritizeReliability
description: "Shift engineering focus to reliability improvements"
automated: false
requiresApproval: true
approvers: [engineering-manager]
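A minimal sketch of the playbook engine's execution loop: automated steps run in order, while steps marked `requiresApproval` are gated on a decision callback (a stand-in here for the real paging/approval flow):

```python
def run_playbook(steps, approve):
    """Execute ordered playbook steps; return the actions that actually ran."""
    executed = []
    for step in sorted(steps, key=lambda s: s["order"]):
        if step.get("requiresApproval") and not approve(step):
            continue  # skipped: approver declined or timed out
        executed.append(step["action"])
    return executed

# Abridged version of the high-error-rate playbook above.
playbook = [
    {"order": 1, "action": "gatherContext", "automated": True},
    {"order": 2, "action": "checkRecentDeploys", "automated": True},
    {"order": 4, "action": "scaleUp", "automated": False, "requiresApproval": True},
    {"order": 5, "action": "notifyStakeholders", "automated": True},
]
```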
π Playbook Execution Flow¶
sequenceDiagram
participant Alert as π¨ Alert Manager
participant Engine as βοΈ Playbook Engine
participant Script as π Automation Scripts
participant OnCall as π€ On-Call Engineer
participant Slack as π¬ Slack
participant K8s as βΈοΈ Kubernetes
Alert->>Engine: Critical alert fires
Engine->>Script: Step 1: Gather context
Script-->>Engine: Context collected
Engine->>Script: Step 2: Check recent deploys
Script-->>Engine: Deployment found 15m ago
Engine->>Slack: Suggest rollback to on-call
Engine->>OnCall: Step 4: Approve scale-up?
OnCall-->>Engine: Approved
Engine->>K8s: Scale up replicas
Engine->>Slack: Step 5: Status update sent
Engine->>Engine: Create post-mortem ticket
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| Incident Response Agent | Designs playbooks, defines automated steps |
| Alerting and Incident Manager Agent | Links playbooks to alert triggers |
| DevOps Engineer Agent | Implements automation scripts |
| Observability Engineer Agent | Provides context-gathering queries for playbook steps |
π Playbooks are not documentation β they are executable, automated response workflows triggered by observability signals.
ποΈ Health Probes and Readiness Configuration¶
The Observability Blueprint defines health probe configurations for every generated service, ensuring that orchestration platforms can accurately determine service health and readiness.
π Probe Definitions¶
healthProbes:
liveness:
path: "/health/live"
port: 8080
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
checks:
- name: process-alive
type: system
- name: deadlock-detection
type: application
readiness:
path: "/health/ready"
port: 8080
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
checks:
- name: database-connectivity
type: dependency
critical: true
- name: cache-connectivity
type: dependency
critical: false
- name: message-bus-connectivity
type: dependency
critical: true
startup:
path: "/health/startup"
port: 8080
scheme: HTTP
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
successThreshold: 1
checks:
- name: migrations-complete
type: initialization
- name: config-loaded
type: initialization
- name: warmup-complete
type: initialization
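The readiness checks above distinguish critical from non-critical dependencies. A sketch of the aggregation rule — the three-state result is an assumption for illustration, since Kubernetes itself only sees ready/not-ready:

```python
def readiness(check_results: dict, critical: set) -> str:
    """check_results maps check name -> healthy bool.
    A failing critical dependency makes the service unready; a failing
    non-critical one (e.g. the cache) only degrades it."""
    if any(not ok for name, ok in check_results.items() if name in critical):
        return "unready"
    if not all(check_results.values()):
        return "degraded"
    return "ready"

# Critical flags transcribed from the readiness checks above.
CRITICAL = {"database-connectivity", "message-bus-connectivity"}
```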
π Health Metrics¶
Health probe results are also emitted as metrics:
healthMetrics:
- name: connectsoft_health_check_status
type: gauge
description: "Health check status (1=healthy, 0=unhealthy)"
labels: [check_name, check_type, probe_type]
- name: connectsoft_health_check_duration_seconds
type: histogram
description: "Health check execution duration"
labels: [check_name, probe_type]
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
ποΈ Health probes are not just Kubernetes config β they are observable, metric-emitting indicators of service readiness.
π Capacity Planning and Saturation Metrics¶
The Observability Blueprint includes capacity indicators that enable proactive scaling and resource management before services reach saturation.
π USE Method Implementation¶
The blueprint follows the USE Method (Utilization, Saturation, Errors) for resource monitoring:
```yaml
capacityIndicators:
  utilization:
    - name: connectsoft_cpu_utilization_ratio
      description: "CPU utilization as a ratio (0-1)"
      alertThreshold: 0.8
      scalingTrigger: 0.7
    - name: connectsoft_memory_utilization_ratio
      description: "Memory utilization as a ratio (0-1)"
      alertThreshold: 0.85
      scalingTrigger: 0.75
    - name: connectsoft_disk_utilization_ratio
      description: "Disk utilization as a ratio (0-1)"
      alertThreshold: 0.9
  saturation:
    - name: connectsoft_request_queue_depth
      description: "Number of requests waiting to be processed"
      alertThreshold: 100
    - name: connectsoft_thread_pool_saturation_ratio
      description: "Thread pool utilization ratio"
      alertThreshold: 0.9
    - name: connectsoft_connection_pool_saturation_ratio
      description: "Database connection pool saturation"
      alertThreshold: 0.8
  errors:
    - name: connectsoft_oom_kills_total
      description: "Out-of-memory kill events"
      alertOnAny: true
    - name: connectsoft_disk_errors_total
      description: "Disk I/O errors"
      alertOnAny: true

scalingRules:
  hpa:
    minReplicas: 2
    maxReplicas: 20
    targetCpuUtilization: 70
    targetMemoryUtilization: 75
    scaleDownStabilization: "300s"
  keda:
    triggers:
      - type: prometheus
        metadata:
          query: "sum(rate(connectsoft_orderservice_http_requests_total[2m]))"
          threshold: "500"
```
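The `hpa` targets above feed the standard Kubernetes HPA scaling formula, `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the blueprint's replica bounds. A minimal sketch of that calculation; the function name is ours, and the real controller also applies a tolerance band and stabilization windows, omitted here:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_util: float,
                         target_util: float,
                         min_replicas: int = 2,
                         max_replicas: int = 20) -> int:
    """Core HPA formula: scale proportionally to the ratio of observed
    utilization to the target, then clamp to [min, max] replicas."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas at 90% CPU against the blueprint's 70% target -> scale up to 6
hpa_desired_replicas(4, 90, 70)   # 6
# 4 replicas at 30% CPU -> scale down, clamped at minReplicas
hpa_desired_replicas(4, 30, 70)   # 2
```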
π€ Agent Collaboration¶
| Agent | Role |
|---|---|
| Infrastructure Engineer Agent | Defines resource limits, scaling rules, capacity thresholds |
| Observability Engineer Agent | Creates capacity dashboards and saturation alerts |
| DevOps Engineer Agent | Configures HPA/KEDA scaling from blueprint specifications |
π Capacity is not guessed; it is measured, alerted on, and auto-scaled based on blueprint-defined thresholds.
π§βπ€βπ§ Who Consumes the Observability Blueprint¶
The Observability Blueprint is not an isolated artifact. It is actively consumed across the ConnectSoft AI Software Factory by agents, humans, and CI systems to drive monitoring, alerting, and the continuous evolution of observable-by-design practices.
Each consumer interprets the blueprint differently based on its context, but all share a common source of truth.
π Agent Consumers¶
| Agent | Role in Consumption |
|---|---|
| π Observability Engineer Agent | Validates alignment with platform-wide conventions, updates metrics and dashboards |
| π¨ Alerting and Incident Manager Agent | Generates alert rules and routing configs from blueprint data |
| π― SLO/SLA Compliance Agent | Computes error budgets, generates compliance reports |
| π Log Analysis Agent | Configures log pipelines and anomaly detection from schema |
| π CI/CD Pipeline Agent | Injects validation gates and telemetry checks into deployment steps |
| π¦ Infrastructure Engineer Agent | Uses it to configure exporters, collectors, and scaling triggers |
| π Security Architect Agent | Consumes security telemetry definitions for threat correlation |
| π Incident Response Agent | Selects playbooks and containment actions based on alert-blueprint linkage |
π€ Human Stakeholders¶
| Role | Value Derived from Observability Blueprint |
|---|---|
| SRE Lead | Verifies SLOs, reviews alert rules, approves on-call configurations |
| Engineering Manager | Reviews error budget reports, prioritizes reliability work |
| Product Manager | Understands SLA compliance status and customer-facing reliability metrics |
| Developer | Understands what metrics and traces their service emits |
| DevOps Engineer | Deploys dashboards, configures alert channels, manages on-call rotations |
| Compliance Officer | Maps SLO compliance to contractual SLA requirements |
π§ Machine Consumers¶
- Prometheus/Grafana: consumes alert rules and dashboard definitions directly
- OpenTelemetry Collector: configured from OTEL config YAML
- PagerDuty/OpsGenie: receives escalation policies and on-call schedules
- CI/CD Pipelines: validates observability readiness as a deployment gate
- Memory Indexing Layer: links blueprint observability assertions to downstream incidents and SLO reports
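A CI/CD observability gate can be as simple as checking that the parsed blueprint contains the sections its downstream consumers rely on, and failing the deployment if any are missing. A minimal sketch, assuming the blueprint YAML has been loaded into a dict; the `REQUIRED_SECTIONS` list is illustrative, not the factory's actual schema:

```python
# Illustrative required sections; a real gate would read these from the
# blueprint schema rather than hard-coding them.
REQUIRED_SECTIONS = [
    "healthProbes", "healthMetrics", "capacityIndicators", "scalingRules",
]

def observability_gate(blueprint: dict) -> list[str]:
    """Return the missing top-level sections; an empty list means the
    deployment gate passes."""
    return [s for s in REQUIRED_SECTIONS if s not in blueprint]

# A blueprint that forgot its scaling rules fails the gate
blueprint = {"healthProbes": {}, "healthMetrics": [], "capacityIndicators": {}}
missing = observability_gate(blueprint)  # ["scalingRules"]
```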
The Observability Blueprint becomes a boundary contract, providing guarantees that must be respected by services, infrastructure, pipelines, and incident response processes alike.
π Final Summary¶
The Observability Blueprint is one of the most interconnected and operationally critical artifacts in the ConnectSoft AI Software Factory. It transforms observability from a manual, inconsistent practice into a declarative, agent-generated, CI/CD-validated, and incident-aware system.
π Summary Table¶
| Dimension | Details |
|---|---|
| π Format | Markdown + JSON + YAML + Embedded Vector |
| π§ Generated by | Observability Engineer + Alerting Manager + SLO Compliance + Log Analysis Agents |
| π― Purpose | Define complete observability posture for a generated component |
| π Metrics | Counters, gauges, histograms with naming conventions and cardinality budgets |
| π¨ Alerting | Lifecycle-managed rules with severity, routing, and fatigue reduction |
| π― SLOs | Error-budget-backed targets with multi-window burn rate alerts |
| π Dashboards | Declarative dashboard-as-code, versioned and auto-provisioned |
| π Logging | Structured schemas with trace correlation and anomaly detection |
| π Tracing | OpenTelemetry config with span definitions and sampling strategies |
| π On-Call | Rotation schedules, escalation chains, notification channels |
| π Incident Playbooks | Automated response workflows triggered by observability signals |
| ποΈ Health Probes | Liveness, readiness, startup probe configurations with metrics |
| π Capacity | USE method metrics with auto-scaling triggers |
| π Cross-Blueprint | Deep integration with Security, Infrastructure, Test, Pipeline, Service |
| β CI/CD Validation | Observability readiness gates block unmonitored deployments |
| π§ Memory Graph | Embedded, linked, semantically searchable in the agent memory system |
| π Lifecycle | Regenerable, diffable, GitOps-compatible, incident-updatable |
| π Tags | traceId, agentId, serviceId, observabilityProfile, version |
π Key Principles¶
| Principle | Description |
|---|---|
| Observability-First | Every service is born observable, not retrofitted |
| Declarative Over Imperative | All configs are defined in blueprints, not manually created |
| SLO-Driven Operations | Error budgets and burn rates guide operational decisions |
| Alert Actionability | Every alert has a runbook, owner, and clear next step |
| Trace Everything | Distributed traces provide end-to-end request visibility |
| Automate Response | Incident playbooks reduce MTTR through automated containment |
| Version Everything | Dashboards, alerts, SLOs are versioned and diffable |
| Validate Before Deploy | CI/CD gates ensure observability readiness |
π‘ In the ConnectSoft AI Software Factory, observability is not optional, not manual, and not an afterthought. It is a generated, validated, enforced, and continuously evolving first-class architectural concern.