SLO/SLA Compliance Agent Specification
Purpose
The SLO/SLA Compliance Agent is responsible for SLO/SLA validation, reporting, compliance tracking, error budget management, and SLA breach notification. It ensures that every service in the ConnectSoft AI Software Factory meets its reliability commitments and that teams have clear, actionable visibility into service health.
Service Level Objectives (SLOs) and Service Level Agreements (SLAs) define the reliability promises a platform makes. This agent ensures that:
- SLOs are formally defined, tracked, and reported for every generated service
- Error budgets are continuously calculated and monitored with burn rate alerting
- SLA breaches are detected and reported before they become customer-impacting incidents
- Compliance reports are generated automatically for stakeholders, auditors, and customers
- SLO performance data feeds back into release decisions and deployment strategies
- Historical SLO data enables trend analysis and proactive reliability improvement
What Sets It Apart from Other Observability Agents?
| Agent | Primary Role |
|---|---|
| Observability Engineer | Injects telemetry, traces, logs, and metrics into generated code |
| Alerting/Incident Manager | Creates incidents from alerts and routes to on-call teams |
| Log Analysis Agent | Analyzes log patterns and detects anomalies |
| SLO/SLA Compliance Agent | Tracks, reports, and enforces service level objectives and agreements |
| Incident Response Agent | Coordinates active incident response and resolution |
Role in Platform
The SLO/SLA Compliance Agent operates as the reliability measurement layer, continuously evaluating whether services meet their defined objectives and feeding compliance data into release and operational decisions.
Positioning Diagram
flowchart LR
ObsEng[Observability Engineer Agent]
AlertMgr[Alerting/Incident Manager Agent]
SLO[SLO/SLA Compliance Agent]
Release[Release Manager Agent]
CustSuccess[Customer Success Agent]
ObsEng --> SLO
SLO --> AlertMgr
SLO --> Release
SLO --> CustSuccess
The SLO/SLA Compliance Agent translates raw metrics into reliability verdicts that drive release decisions, incident prioritization, and customer communication.
Why It Exists
Without this agent, the factory would suffer from:
- Undefined reliability targets: services deployed without clear performance expectations
- Error budget blindness: teams unaware of how much risk budget remains for releases
- Reactive SLA management: breaches discovered by customers, not by the platform
- Manual compliance reporting: time-consuming manual report generation for stakeholders
- No release-reliability linkage: releases proceed regardless of SLO health
This agent makes reliability measurable, actionable, and continuously enforced.
Triggering Events
| Event | Description |
|---|---|
| `metrics_aggregated` | Periodic metrics aggregation cycle completes, enabling SLO window evaluation |
| `error_budget_consumed` | Error budget consumption exceeds a warning or critical threshold |
| `release_deployed` | A new release is deployed, requiring SLO baseline comparison and impact assessment |
| `slo_definition_updated` | An SLO target or window configuration is modified by an engineer or architect |
| `compliance_report_requested` | A scheduled or manual request for SLA compliance reporting |
| `incident_resolved` | An incident is resolved, requiring SLO impact recalculation |
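The trigger table can be read as a dispatch map from event name to the work it starts. The following is a minimal Python sketch under that reading; `EVENT_ACTIONS`, `plan_actions`, and the action labels are hypothetical names chosen for illustration and are not taken from the agent's actual implementation.

```python
# Hypothetical mapping from triggering event to the work it starts.
# Event names come from the trigger table; action labels are illustrative.
EVENT_ACTIONS = {
    "metrics_aggregated": ["evaluate_slo_windows", "update_error_budgets"],
    "error_budget_consumed": ["assess_budget_severity"],
    "release_deployed": ["compare_slo_baseline", "assess_release_impact"],
    "slo_definition_updated": ["reload_slo_definitions"],
    "compliance_report_requested": ["generate_compliance_report"],
    "incident_resolved": ["recalculate_slo_impact"],
}


def plan_actions(event_name: str) -> list:
    """Return the actions for a known event; reject unknown events loudly."""
    try:
        return list(EVENT_ACTIONS[event_name])
    except KeyError:
        raise ValueError(f"unsupported triggering event: {event_name!r}") from None


print(plan_actions("incident_resolved"))
```

Rejecting unknown events instead of ignoring them is a design choice of this sketch: a misspelled event name then surfaces as an error rather than as a silently skipped evaluation.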
Responsibilities and Deliverables
Core Responsibilities
| Responsibility | Description |
|---|---|
| Define SLOs from Service Blueprints | Generates baseline SLO definitions when new services are scaffolded, based on service type and tier |
| Calculate SLI Metrics | Computes Service Level Indicators (availability, latency, error rate) from raw telemetry data |
| Track Error Budget Consumption | Maintains rolling error budget calculations with burn rate analysis |
| Generate Compliance Reports | Produces SLO/SLA compliance reports for internal teams, management, and external customers |
| Detect SLO Violations | Identifies when SLO targets are breached within evaluation windows |
| Emit Error Budget Warnings | Alerts when error budget burn rate indicates impending exhaustion |
| Block Risky Releases | Signals Release Manager when error budget is exhausted, recommending release freeze |
| Correlate SLO Impact with Incidents | Links SLO degradation to specific incidents for root cause attribution |
| Track SLO Trends Over Time | Maintains historical SLO performance for trend analysis and capacity planning |
| Emit `SLOStatusUpdated` and `SLABreachAlert` | Signals downstream agents about SLO health changes and SLA breaches |
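To make the "Calculate SLI Metrics" responsibility concrete, here is a small Python sketch that derives availability and error rate from aggregated request counts. The `RequestStats` shape and both function names are assumptions made for this example; latency SLIs are left out because they need histogram data rather than simple counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestStats:
    """Aggregated request counts for one evaluation window (hypothetical shape)."""
    total: int
    server_errors: int  # responses with a 5xx status

    def __post_init__(self):
        if self.total < 0 or not 0 <= self.server_errors <= max(self.total, 0):
            raise ValueError("counts must satisfy 0 <= server_errors <= total")


def availability_percent(stats: RequestStats) -> float:
    """Share of requests that did not fail with a 5xx status, in percent."""
    if stats.total == 0:
        return 100.0  # assumption: a window without traffic counts as available
    return 100.0 * (1 - stats.server_errors / stats.total)


def error_rate_percent(stats: RequestStats) -> float:
    """Share of requests that failed with a 5xx status, in percent."""
    return 100.0 - availability_percent(stats)


stats = RequestStats(total=1_000_000, server_errors=800)
print(f"{availability_percent(stats):.2f}% available, "
      f"{error_rate_percent(stats):.2f}% errors")
```

With 800 failures in one million requests the sketch reports 99.92% availability and a 0.08% error rate.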
Output Deliverables
| Output Type | Format | Description |
|---|---|---|
| `slo-report` | `.md`, `.json`, `.yaml` | Periodic SLO performance report with target vs. actual metrics |
| `error-budget-analysis` | `.json`, `.md` | Error budget consumption, burn rate, and projected exhaustion timeline |
| `sla-compliance-report` | `.md`, `.pdf`, `.json` | Customer-facing SLA compliance documentation |
| `slo-definition.yaml` | `.yaml` | Versioned SLO definitions per service |
| `execution-metadata.json` | `.json` | Trace-tagged metadata of the SLO evaluation run |
Example: SLO Definition
service: booking-service
slos:
  - name: availability
    description: "Proportion of successful requests"
    sli:
      type: availability
      query: "1 - (sum(rate(http_requests_total{service='booking-service', status=~'5..'}[30d])) / sum(rate(http_requests_total{service='booking-service'}[30d])))"
    target: 99.9
    window: 30d
    errorBudget: 0.1%
  - name: latency-p99
    description: "99th percentile request latency"
    sli:
      type: latency
      query: "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service='booking-service'}[30d])))"
    target: 500ms
    window: 30d
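The `errorBudget` of an availability SLO follows directly from its `target` and `window`. The sketch below shows that arithmetic in Python; `parse_window` and `error_budget` are hypothetical helper names, and only day-based windows such as `30d` are handled.

```python
from datetime import timedelta


def parse_window(window: str) -> timedelta:
    """Parse a day-based window such as '30d' (the only unit handled here)."""
    if not window.endswith("d"):
        raise ValueError(f"unsupported window: {window!r}")
    return timedelta(days=int(window[:-1]))


def error_budget(target_percent: float, window: str) -> timedelta:
    """Allowed 'bad' time: (100 - target)% of the evaluation window."""
    if not 0 < target_percent < 100:
        raise ValueError("target must be strictly between 0 and 100")
    allowed_fraction = (100 - target_percent) / 100
    return parse_window(window) * allowed_fraction


budget = error_budget(99.9, "30d")
print(f"{budget.total_seconds() / 60:.1f} minutes")
```

For a 99.9% target over 30 days this gives 43.2 minutes, the `errorBudgetTotal` value used in the error budget analysis example.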
Example: Error Budget Analysis
{
"service": "booking-service",
"sloName": "availability",
"window": "30d",
"target": 99.9,
"actual": 99.92,
"errorBudgetTotal": "43.2 minutes",
"errorBudgetConsumed": "34.56 minutes",
"errorBudgetRemaining": "8.64 minutes",
"consumptionRate": "80%",
"burnRate": 1.8,
"projectedExhaustion": "2026-04-02T00:00:00Z",
"status": "Warning",
"recommendation": "Reduce deployment frequency until error budget recovers",
"linkedIncidents": ["INC-2026-0325-0018", "INC-2026-0327-0031"]
}
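The consumption and exhaustion figures in this example can be reproduced with a few lines of arithmetic. The Python sketch below assumes that a burn rate of 1.0 means the budget runs out exactly at the end of the window, and it uses a hypothetical evaluation timestamp of 2026-03-29T16:00:00Z, which the example does not state. Function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def consumed_percent(total_min: float, consumed_min: float) -> float:
    """Share of the error budget already used, in percent."""
    return 100.0 * consumed_min / total_min


def projected_exhaustion(now: datetime, total_min: float, consumed_min: float,
                         burn_rate: float, window_days: int) -> Optional[datetime]:
    """Estimate when the remaining budget reaches zero at the current burn rate."""
    remaining_min = total_min - consumed_min
    if remaining_min <= 0:
        return now  # budget is already exhausted
    if burn_rate <= 0:
        return None  # nothing is being consumed, so no exhaustion date
    # Assumption: burn rate 1.0 spends the whole budget evenly over the window.
    sustainable_per_day = total_min / window_days
    days_left = remaining_min / (burn_rate * sustainable_per_day)
    return now + timedelta(days=days_left)


evaluated_at = datetime(2026, 3, 29, 16, 0, tzinfo=timezone.utc)  # hypothetical
eta = projected_exhaustion(evaluated_at, 43.2, 34.56, 1.8, 30)
print(round(consumed_percent(43.2, 34.56)), eta.isoformat())
```

With 8.64 minutes left and a 1.8x burn rate, the budget lasts roughly 80 more hours, which lands on 2 April 2026 for the assumed evaluation time.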
Example: SLO Compliance Report (Markdown)
### SLO Compliance Report: BookingService (March 2026)
Reporting Window: 2026-03-01 to 2026-03-29
Service: BookingService | Tenant: vetclinic-001
| SLO | Target | Actual | Status | Error Budget Remaining |
| ------------------ | -------- | -------- | ---------- | ---------------------- |
| Availability | 99.9% | 99.92% | ⚠️ Warning | 8.64 min (20%) |
| Latency P99 | 500ms | 420ms | ✅ Healthy | N/A |
| Error Rate | < 0.1% | 0.08% | ✅ Healthy | N/A |
#### Error Budget Burn Rate
- Current burn rate: **1.8x** (consuming faster than sustainable)
- Projected exhaustion: **April 2, 2026**
- Linked incidents: INC-2026-0325-0018, INC-2026-0327-0031
#### Recommendations
- Consider release freeze until error budget recovers above 40%
- Investigate recurring 5xx errors in booking confirmation flow
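A recommendation like the release freeze above can be turned into a simple gate on the remaining error budget. The Python sketch below is one possible shape, not the Release Manager's actual contract: the three-level `GateDecision` and the `release_gate` function are hypothetical, and the 40% default is borrowed from the recommendation in the sample report.

```python
from enum import Enum


class GateDecision(Enum):
    PROCEED = "proceed"
    CAUTION = "caution"  # releases discouraged until the budget recovers
    FREEZE = "freeze"    # budget exhausted, recommend a release freeze


def release_gate(budget_remaining_percent: float,
                 caution_below_percent: float = 40.0) -> GateDecision:
    """Map remaining error budget to a release recommendation."""
    if budget_remaining_percent <= 0:
        return GateDecision.FREEZE
    if budget_remaining_percent < caution_below_percent:
        return GateDecision.CAUTION
    return GateDecision.PROCEED


print(release_gate(20.0).value)
```

With 20% of the budget left, as in the sample report, the sketch returns `caution`; it returns `freeze` only once the budget is fully spent.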
Collaboration Patterns
Direct Agent Collaborations
| Collaborating Agent | Interaction Summary |
|---|---|
| Observability Engineer Agent | Provides the metrics and telemetry data used to compute SLIs |
| Alerting/Incident Manager Agent | Receives error budget warnings to create burn-rate alerts; links incidents to SLO impact |
| Release Manager Agent | Consumes SLO health status as a release gate; receives freeze recommendations |
| Customer Success Agent | Receives SLA compliance reports for customer communication |
| DevOps Engineer Agent | Deploys SLO recording rules and dashboards into monitoring infrastructure |
Events Emitted & Consumed
| Event Name | Role |
|---|---|
| `metrics_aggregated` | Consumed: triggers SLO window evaluation and error budget update |
| `error_budget_consumed` | Consumed: evaluates severity and emits warnings or freeze signals |
| `release_deployed` | Consumed: recalculates SLO baselines post-deployment |
| `SLOStatusUpdated` | Emitted: notifies Studio, Release Manager, and dashboards |
| `SLABreachAlert` | Emitted: notifies Customer Success and triggers incident creation |
| `ErrorBudgetWarning` | Emitted: signals Alerting Agent and Release Manager |
Coordination Flow
sequenceDiagram
participant Obs as Observability Engineer Agent
participant SLO as SLO/SLA Compliance Agent
participant Alert as Alerting/Incident Manager Agent
participant Release as Release Manager Agent
participant CustSuccess as Customer Success Agent
Obs->>SLO: metrics_aggregated
SLO->>SLO: Calculate SLIs and error budgets
SLO->>Alert: ErrorBudgetWarning (if burn rate high)
SLO->>Release: SLOStatusUpdated (release gate signal)
SLO->>CustSuccess: SLA compliance report (if breach detected)
Memory and Knowledge
Memory Components
| Memory Store | Content |
|---|---|
| SLO Definition Registry | All active SLO definitions with version history per service and tenant |
| SLO Performance History | Time-series SLO compliance data for trend analysis and reporting |
| Error Budget Tracking Store | Rolling error budget calculations, burn rates, and projected exhaustion dates |
| SLA Compliance Archive | Historical SLA reports generated for customers and auditors |
| Incident-SLO Correlation DB | Mappings between incidents and their impact on specific SLOs |
Example Memory Entry
{
"service": "booking-service",
"sloName": "availability",
"window": "2026-03",
"target": 99.9,
"actual": 99.92,
"errorBudgetConsumedPercent": 80,
"incidentImpact": [
{ "incidentId": "INC-2026-0325-0018", "downtimeMinutes": 18.5 },
{ "incidentId": "INC-2026-0327-0031", "downtimeMinutes": 16.1 }
],
"traceId": "slo-2026-0329-eval"
}
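A memory entry like this one can be cross-checked: the downtime attributed to incidents should roughly equal the consumed share of the error budget. The Python sketch below performs that check; the function names, the 43.2-minute budget for a 99.9% target over 30 days, and the half-minute tolerance are assumptions made for illustration.

```python
def incident_downtime_minutes(entry: dict) -> float:
    """Total downtime attributed to incidents in a memory entry."""
    return sum(item["downtimeMinutes"] for item in entry["incidentImpact"])


def budget_matches_incidents(entry: dict, budget_total_min: float,
                             tolerance_min: float = 0.5) -> bool:
    """True when consumed budget and incident downtime agree within tolerance."""
    consumed_min = budget_total_min * entry["errorBudgetConsumedPercent"] / 100
    return abs(consumed_min - incident_downtime_minutes(entry)) <= tolerance_min


entry = {
    "errorBudgetConsumedPercent": 80,
    "incidentImpact": [
        {"incidentId": "INC-2026-0325-0018", "downtimeMinutes": 18.5},
        {"incidentId": "INC-2026-0327-0031", "downtimeMinutes": 16.1},
    ],
}
print(budget_matches_incidents(entry, budget_total_min=43.2))
```

Here the two incidents add up to 34.6 minutes against 34.56 minutes of consumed budget, so the entry passes the check.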
Validation Mechanisms
What Is Validated?
| Component | Validation Criteria |
|---|---|
| SLO Definitions | Must have valid SLI queries, realistic targets, and defined evaluation windows |
| SLI Calculations | Computed values must be within expected ranges; anomalous spikes are flagged |
| Error Budget Accuracy | Budget calculations must align with actual incident downtime and metric data |
| Report Completeness | Compliance reports must cover all SLOs for the reporting window with no gaps |
| Breach Detection | SLA breaches must be detected within the evaluation cycle, not delayed |
| Incident Correlation | SLO degradation periods must be linked to corresponding incidents when available |
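The first row of the table, SLO definition validity, lends itself to a mechanical check. The Python sketch below validates a single definition loaded as a dictionary; `validate_slo`, the accepted window pattern, and the reading of "realistic target" as a percentage strictly between 0 and 100 are assumptions of this example.

```python
import re

# Assumption: windows look like "30d", "12h", or "4w".
WINDOW_PATTERN = re.compile(r"^\d+[dhw]$")


def validate_slo(slo: dict) -> list:
    """Return a list of problems found in one SLO definition (empty if valid)."""
    problems = []
    sli = slo.get("sli") or {}
    if not str(sli.get("query", "")).strip():
        problems.append("missing SLI query")
    if not WINDOW_PATTERN.match(str(slo.get("window", ""))):
        problems.append("missing or malformed window")
    target = slo.get("target")
    if sli.get("type") == "availability":
        # A 100% target leaves no error budget, so it is rejected as unrealistic.
        if not isinstance(target, (int, float)) or not 0 < target < 100:
            problems.append("availability target must be between 0 and 100")
    elif target in (None, ""):
        problems.append("missing target")
    return problems


good = {"name": "availability", "window": "30d", "target": 99.9,
        "sli": {"type": "availability", "query": "up"}}
print(validate_slo(good))
```

Returning a list of problems rather than raising on the first one lets a report show every defect in a definition at once.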
Validation Workflow
flowchart TD
Start[Metrics Aggregation Cycle Complete]
LoadSLOs[Load SLO definitions for all services]
ComputeSLIs[Calculate SLI values from metrics]
EvalBudget[Evaluate error budget consumption and burn rate]
CheckThresholds{Budget Warning or Breach?}
EmitWarning[Emit ErrorBudgetWarning]
EmitBreach[Emit SLABreachAlert]
GenerateReport[Generate SLO compliance report]
UpdateMemory[Store results in performance history]
EmitStatus[Emit SLOStatusUpdated]
Start --> LoadSLOs --> ComputeSLIs --> EvalBudget --> CheckThresholds
CheckThresholds -->|Warning| EmitWarning --> GenerateReport
CheckThresholds -->|Breach| EmitBreach --> GenerateReport
CheckThresholds -->|Healthy| GenerateReport
GenerateReport --> UpdateMemory --> EmitStatus
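The branching in this workflow can be sketched as a function from budget consumption to the events emitted in one evaluation cycle. The Python example below is a simplification: it treats full budget consumption as the breach condition and uses a hypothetical 75% warning threshold, neither of which is specified in the workflow itself.

```python
def events_for(consumed_percent: float, warning_at: float = 75.0) -> list:
    """Events emitted for one SLO in one evaluation cycle, in emission order."""
    events = []
    if consumed_percent >= 100.0:
        events.append("SLABreachAlert")      # Breach branch
    elif consumed_percent >= warning_at:
        events.append("ErrorBudgetWarning")  # Warning branch
    # Every branch ends with the report, the history update, and a status event.
    events.append("SLOStatusUpdated")
    return events


print(events_for(80.0))
```

At 80% consumption the sketch emits `ErrorBudgetWarning` followed by `SLOStatusUpdated`, matching the Warning path through the diagram.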
Process Flow
flowchart TD
Start([SLO/SLA Compliance Agent Activated])
LoadDefs[Load SLO definitions from registry]
QueryMetrics[Query metrics backend for SLI data]
ComputeSLIs[Calculate SLI values per service]
CalculateBudgets[Compute error budgets and burn rates]
EvaluateCompliance[Compare actuals against targets]
CorrelateIncidents[Link SLO impacts to known incidents]
GenerateReports[Create SLO and SLA compliance reports]
EmitEvents[Emit SLOStatusUpdated, ErrorBudgetWarning, SLABreachAlert]
UpdateHistory[Store results in performance history]
NotifyStakeholders[Notify Release Manager, Customer Success, Studio]
End([Finish])
Start --> LoadDefs --> QueryMetrics --> ComputeSLIs
ComputeSLIs --> CalculateBudgets --> EvaluateCompliance
EvaluateCompliance --> CorrelateIncidents --> GenerateReports
GenerateReports --> EmitEvents --> UpdateHistory --> NotifyStakeholders --> End
Agent Contract
agentId: slo-sla-compliance
role: "Service Level Objective and Agreement Compliance Tracker"
category: "Observability, Reliability Engineering, Compliance"
description: >
  Tracks and reports SLO/SLA compliance across all services. Manages error
  budgets, detects SLA breaches, and feeds reliability signals into release
  decisions and customer communication workflows.
triggers:
  - metrics_aggregated
  - error_budget_consumed
  - release_deployed
inputs:
  - Prometheus/OTEL metric data
  - SLO definitions per service
  - Incident resolution data
  - Service deployment metadata
  - Customer SLA contracts
outputs:
  - slo-report
  - error-budget-analysis
  - sla-compliance-report
  - slo-definition.yaml
  - execution-metadata.json
  - Event: SLOStatusUpdated
  - Event: SLABreachAlert
  - Event: ErrorBudgetWarning
skills:
  - DefineSLOsFromBlueprint
  - ComputeSLIMetrics
  - CalculateErrorBudgets
  - DetectSLOViolations
  - GenerateComplianceReports
  - CorrelateIncidentsToSLOs
  - EmitBudgetWarnings
  - TrackSLOTrends
memory:
  scope: [traceId, service, sloName, window, tenantId]
  stores:
    - sloDefinitionRegistry
    - sloPerformanceHistory
    - errorBudgetTrackingStore
    - slaComplianceArchive
    - incidentSLOCorrelationDB
validations:
  - SLO definitions are valid and complete
  - SLI calculations are within expected ranges
  - Error budget calculations align with incident data
  - Compliance reports cover all SLOs
  - execution-metadata.json generated
version: "1.0.0"
status: active
Summary
The SLO/SLA Compliance Agent is the reliability compass of the ConnectSoft AI Software Factory. It ensures that:
- Service reliability is continuously measured against formal SLO targets
- Error budgets are tracked in real time with burn rate analysis and exhaustion projections
- SLA breaches are detected proactively and communicated to stakeholders before customer impact
- Compliance reports are generated automatically for teams, management, and customers
- SLO data drives release decisions, preventing risky deployments when reliability is degraded
Without this agent, reliability is aspirational, not measured. With it, every service has a quantified, tracked, and enforced reliability contract.