
📈 SLO/SLA Compliance Agent Specification

🎯 Purpose

The SLO/SLA Compliance Agent is responsible for:

SLO/SLA validation, reporting, compliance tracking, error budget management, and SLA breach notification. Its goal is to ensure that every service in the ConnectSoft AI Software Factory meets its reliability commitments and that teams have clear, actionable visibility into service health.

Service Level Objectives (SLOs) and Service Level Agreements (SLAs) define the reliability promises a platform makes. This agent ensures that:

  • ✅ SLOs are formally defined, tracked, and reported for every generated service
  • 📊 Error budgets are continuously calculated and monitored with burn rate alerting
  • 🔔 SLA breaches are detected and reported before they become customer-impacting incidents
  • 📋 Compliance reports are generated automatically for stakeholders, auditors, and customers
  • 🔁 SLO performance data feeds back into release decisions and deployment strategies
  • 🧠 Historical SLO data enables trend analysis and proactive reliability improvement

🧱 What Sets It Apart from Other Observability Agents?

| Agent                         | Primary Role                                                            |
| ----------------------------- | ----------------------------------------------------------------------- |
| 🛰️ Observability Engineer     | Injects telemetry, traces, logs, and metrics into generated code        |
| 🚨 Alerting/Incident Manager  | Creates incidents from alerts and routes to on-call teams               |
| 📋 Log Analysis Agent         | Analyzes log patterns and detects anomalies                             |
| 📈 SLO/SLA Compliance Agent   | Tracks, reports, and enforces service level objectives and agreements   |
| 🔥 Incident Response Agent    | Coordinates active incident response and resolution                     |

🧭 Role in Platform

The SLO/SLA Compliance Agent operates as the reliability measurement layer, continuously evaluating whether services meet their defined objectives and feeding compliance data into release and operational decisions.

📊 Positioning Diagram

flowchart LR
    ObsEng[Observability Engineer Agent]
    AlertMgr[Alerting/Incident Manager Agent]
    SLO[SLO/SLA Compliance Agent]
    Release[Release Manager Agent]
    CustSuccess[Customer Success Agent]

    ObsEng --> SLO
    SLO --> AlertMgr
    SLO --> Release
    SLO --> CustSuccess

The SLO/SLA Compliance Agent translates raw metrics into reliability verdicts that drive release decisions, incident prioritization, and customer communication.


🧠 Why It Exists

Without this agent, the factory would suffer from:

  • Undefined reliability targets: services deployed without clear performance expectations
  • Error budget blindness: teams unaware of how much risk budget remains for releases
  • Reactive SLA management: breaches discovered by customers, not by the platform
  • Manual compliance reporting: time-consuming manual report generation for stakeholders
  • No release-reliability linkage: releases proceed regardless of SLO health

This agent makes reliability measurable, actionable, and continuously enforced.


📋 Triggering Events

| Event                       | Description                                                                          |
| --------------------------- | ------------------------------------------------------------------------------------ |
| metrics_aggregated          | Periodic metrics aggregation cycle completes, enabling SLO window evaluation         |
| error_budget_consumed       | Error budget consumption exceeds a warning or critical threshold                     |
| release_deployed            | A new release is deployed, requiring SLO baseline comparison and impact assessment   |
| slo_definition_updated      | An SLO target or window configuration is modified by an engineer or architect        |
| compliance_report_requested | A scheduled or manual request for SLA compliance reporting                           |
| incident_resolved           | An incident is resolved, requiring SLO impact recalculation                          |

📋 Responsibilities and Deliverables

✅ Core Responsibilities

| Responsibility                          | Description                                                                                        |
| --------------------------------------- | -------------------------------------------------------------------------------------------------- |
| Define SLOs from Service Blueprints     | Generates baseline SLO definitions when new services are scaffolded, based on service type and tier |
| Calculate SLI Metrics                   | Computes Service Level Indicators (availability, latency, error rate) from raw telemetry data      |
| Track Error Budget Consumption          | Maintains rolling error budget calculations with burn rate analysis                                |
| Generate Compliance Reports             | Produces SLO/SLA compliance reports for internal teams, management, and external customers         |
| Detect SLO Violations                   | Identifies when SLO targets are breached within evaluation windows                                 |
| Emit Error Budget Warnings              | Alerts when error budget burn rate indicates impending exhaustion                                  |
| Block Risky Releases                    | Signals Release Manager when error budget is exhausted, recommending release freeze                |
| Correlate SLO Impact with Incidents     | Links SLO degradation to specific incidents for root cause attribution                             |
| Track SLO Trends Over Time              | Maintains historical SLO performance for trend analysis and capacity planning                      |
| Emit SLOStatusUpdated and SLABreachAlert | Signals downstream agents about SLO health changes and SLA breaches                               |
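
The "Calculate SLI Metrics" responsibility can be illustrated with a minimal Python sketch. It assumes request-based SLIs derived from two counters; `RequestCounts`, `availability_percent`, and `error_rate_percent` are hypothetical names chosen for illustration, not part of the agent's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Aggregated request counters for one service over one evaluation window."""
    total: int
    server_errors: int  # responses with a 5xx status


def error_rate_percent(counts: RequestCounts) -> float:
    """Share of requests that failed with a server error, as a percentage."""
    if counts.total == 0:
        # Assumption: a window without traffic is treated as error-free.
        return 0.0
    return 100.0 * counts.server_errors / counts.total


def availability_percent(counts: RequestCounts) -> float:
    """Share of requests that did not fail with a server error, as a percentage."""
    return 100.0 - error_rate_percent(counts)
```

How a window without traffic should be scored is a policy decision; the sketch treats it as fully available, but a real evaluator might instead skip the window.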

📤 Output Deliverables

| Output Type             | Format              | Description                                                            |
| ----------------------- | ------------------- | ---------------------------------------------------------------------- |
| slo-report              | .md, .json, .yaml   | Periodic SLO performance report with target vs actual metrics          |
| error-budget-analysis   | .json, .md          | Error budget consumption, burn rate, and projected exhaustion timeline |
| sla-compliance-report   | .md, .pdf, .json    | Customer-facing SLA compliance documentation                           |
| slo-definition.yaml     | .yaml               | Versioned SLO definitions per service                                  |
| execution-metadata.json | .json               | Trace-tagged metadata of the SLO evaluation run                        |

📘 Example: SLO Definition

service: booking-service
slos:
  - name: availability
    description: "Proportion of successful requests"
    sli:
      type: availability
      query: "1 - (sum(rate(http_requests_total{service='booking-service', status=~'5..'}[30d])) / sum(rate(http_requests_total{service='booking-service'}[30d])))"
    target: 99.9
    window: 30d
    errorBudget: 0.1%

  - name: latency-p99
    description: "99th percentile request latency"
    sli:
      type: latency
      query: "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service='booking-service'}[30d])))"
    target: 500ms
    window: 30d
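
The `errorBudget` value follows from `target`, and the same arithmetic can express the budget as a time allowance. The Python sketch below is illustrative only: `parse_window_minutes` and `error_budget` are hypothetical helper names, and it assumes a time-based reading of availability (budget = window length × (1 − target)), which the specification does not state explicitly.

```python
def parse_window_minutes(window: str) -> int:
    """Convert a window string such as '30d', '12h', or '45m' into minutes."""
    units = {"m": 1, "h": 60, "d": 1440}
    unit = window[-1]
    if unit not in units:
        raise ValueError(f"unsupported window unit: {window!r}")
    return int(window[:-1]) * units[unit]


def error_budget(target_percent: float, window: str) -> tuple[float, float]:
    """Return (budget as a percentage, budget in minutes) for an SLO target."""
    budget_percent = 100.0 - target_percent
    budget_minutes = parse_window_minutes(window) * budget_percent / 100.0
    # Round away floating-point noise such as 0.0999999999.
    return round(budget_percent, 6), round(budget_minutes, 6)
```

For the `availability` SLO above (target 99.9 over `30d`), this gives a budget of 0.1%, or 43.2 minutes of the 43,200-minute window.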

📘 Example: Error Budget Analysis

{
  "service": "booking-service",
  "sloName": "availability",
  "window": "30d",
  "target": 99.9,
  "actual": 99.92,
  "errorBudgetTotal": "43.2 minutes",
  "errorBudgetConsumed": "34.56 minutes",
  "errorBudgetRemaining": "8.64 minutes",
  "consumptionRate": "80%",
  "burnRate": 1.8,
  "projectedExhaustion": "2026-04-02T00:00:00Z",
  "status": "Warning",
  "recommendation": "Reduce deployment frequency until error budget recovers",
  "linkedIncidents": ["INC-2026-0325-0018", "INC-2026-0327-0031"]
}
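
The specification does not say how `burnRate` and `projectedExhaustion` are derived. The Python sketch below uses one common convention as an assumption: burn rate is the observed error ratio divided by the error ratio the SLO allows, and exhaustion is projected by assuming that rate holds. The function names and the evaluation timestamp in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def burn_rate(observed_error_ratio: float, target_percent: float) -> float:
    """Observed error ratio relative to the ratio the SLO target allows."""
    allowed_error_ratio = 1.0 - target_percent / 100.0
    return observed_error_ratio / allowed_error_ratio


def projected_exhaustion(
    now: datetime,
    remaining_minutes: float,
    total_minutes: float,
    window_days: float,
    rate: float,
) -> Optional[datetime]:
    """Estimate when the remaining budget runs out if the burn rate holds."""
    if remaining_minutes <= 0:
        return now  # budget is already exhausted
    if rate <= 0:
        return None  # nothing is being consumed, so no exhaustion is projected
    # A burn rate of 1.0 spends exactly the whole budget over the whole window.
    sustainable_minutes_per_day = total_minutes / window_days
    days_left = remaining_minutes / (rate * sustainable_minutes_per_day)
    return now + timedelta(days=days_left)


evaluated_at = datetime(2026, 3, 29, 16, 0, tzinfo=timezone.utc)  # hypothetical
eta = projected_exhaustion(evaluated_at, 8.64, 43.2, 30, 1.8)
```

With 8.64 of 43.2 minutes remaining and a burn rate of 1.8, the budget lasts about 80 more hours. From the hypothetical evaluation time of 2026-03-29T16:00Z that lands at approximately 2026-04-02T00:00Z.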

📘 Example: SLO Compliance Report (Markdown)

### 📈 SLO Compliance Report: BookingService (March 2026)

📎 Reporting Window: 2026-03-01 to 2026-03-29
🏷️ Service: BookingService | Tenant: vetclinic-001

| SLO          | Target | Actual | Status     | Error Budget Remaining |
| ------------ | ------ | ------ | ---------- | ---------------------- |
| Availability | 99.9%  | 99.92% | ⚠️ Warning | 8.64 min (20%)         |
| Latency P99  | 500ms  | 420ms  | ✅ Healthy | N/A                    |
| Error Rate   | < 0.1% | 0.08%  | ✅ Healthy | N/A                    |

#### 📊 Error Budget Burn Rate
- Current burn rate: **1.8x** (consuming faster than sustainable)
- Projected exhaustion: **April 2, 2026**
- Linked incidents: INC-2026-0325-0018, INC-2026-0327-0031

#### 🔔 Recommendations
- Consider release freeze until error budget recovers above 40%
- Investigate recurring 5xx errors in booking confirmation flow

🤝 Collaboration Patterns

🔗 Direct Agent Collaborations

| Collaborating Agent                 | Interaction Summary                                                                      |
| ----------------------------------- | ---------------------------------------------------------------------------------------- |
| 🛰️ Observability Engineer Agent     | Provides the metrics and telemetry data used to compute SLIs                             |
| 🚨 Alerting/Incident Manager Agent  | Receives error budget warnings to create burn-rate alerts; links incidents to SLO impact |
| 📦 Release Manager Agent            | Consumes SLO health status as a release gate; receives freeze recommendations            |
| 👤 Customer Success Agent           | Receives SLA compliance reports for customer communication                               |
| 🔧 DevOps Engineer Agent            | Deploys SLO recording rules and dashboards into monitoring infrastructure                |
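
The release-gate interaction with the Release Manager Agent can be sketched as a small decision function. This is a hypothetical Python illustration, not the agent's actual logic: it assumes a freeze starts when the budget is exhausted and is lifted only once the budget recovers above a resume threshold, with the 40% default borrowed from the example report's recommendation.

```python
def freeze_recommended(
    remaining_percent: float,
    currently_frozen: bool,
    resume_above_percent: float = 40.0,  # assumption taken from the example report
) -> bool:
    """Return True if a release freeze should be in effect."""
    if remaining_percent <= 0:
        # Budget exhausted: always recommend a freeze.
        return True
    if currently_frozen:
        # Stay frozen until the budget has recovered above the resume threshold.
        return remaining_percent <= resume_above_percent
    return False
```

Using separate thresholds for entering and leaving the freeze avoids toggling the gate on every evaluation cycle while the budget hovers near zero.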

📬 Events Emitted & Consumed

| Event Name            | Role                                                                     |
| --------------------- | ------------------------------------------------------------------------ |
| metrics_aggregated    | 🔄 Consumed → triggers SLO window evaluation and error budget update      |
| error_budget_consumed | 🔄 Consumed → evaluates severity and emits warnings or freeze signals     |
| release_deployed      | 🔄 Consumed → recalculates SLO baselines post-deployment                  |
| SLOStatusUpdated      | ✅ Emitted → notifies Studio, Release Manager, and dashboards             |
| SLABreachAlert        | ❌ Emitted → notifies Customer Success and triggers incident creation     |
| ErrorBudgetWarning    | ⚠️ Emitted → signals Alerting Agent and Release Manager                   |

🧭 Coordination Flow

sequenceDiagram
    participant Obs as Observability Engineer Agent
    participant SLO as SLO/SLA Compliance Agent
    participant Alert as Alerting/Incident Manager Agent
    participant Release as Release Manager Agent
    participant CustSuccess as Customer Success Agent

    Obs->>SLO: metrics_aggregated
    SLO->>SLO: Calculate SLIs and error budgets
    SLO->>Alert: ErrorBudgetWarning (if burn rate high)
    SLO->>Release: SLOStatusUpdated (release gate signal)
    SLO->>CustSuccess: SLA compliance report (if breach detected)

🧠 Memory and Knowledge

🧩 Memory Components

| Memory Store                   | Content                                                                         |
| ------------------------------ | ------------------------------------------------------------------------------- |
| 📂 SLO Definition Registry     | All active SLO definitions with version history per service and tenant          |
| 📚 SLO Performance History     | Time-series SLO compliance data for trend analysis and reporting                |
| 🧠 Error Budget Tracking Store | Rolling error budget calculations, burn rates, and projected exhaustion dates   |
| 📊 SLA Compliance Archive      | Historical SLA reports generated for customers and auditors                     |
| 🔍 Incident-SLO Correlation DB | Mappings between incidents and their impact on specific SLOs                    |

📘 Example Memory Entry

{
  "service": "booking-service",
  "sloName": "availability",
  "window": "2026-03",
  "target": 99.9,
  "actual": 99.92,
  "errorBudgetConsumedPercent": 80,
  "incidentImpact": [
    { "incidentId": "INC-2026-0325-0018", "downtimeMinutes": 18.5 },
    { "incidentId": "INC-2026-0327-0031", "downtimeMinutes": 16.1 }
  ],
  "traceId": "slo-2026-0329-eval"
}
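
A memory entry like this makes it possible to cross-check the recorded budget consumption against the linked incidents. The Python sketch below is a hypothetical illustration: `incident_downtime_matches_budget` and its tolerance are assumptions, and the total budget in minutes is passed in because the entry itself does not store it.

```python
def incident_downtime_matches_budget(
    entry: dict,
    budget_total_minutes: float,
    tolerance_minutes: float = 0.5,  # hypothetical allowance for rounding
) -> bool:
    """Check that incident downtime roughly accounts for the consumed budget."""
    incident_minutes = sum(i["downtimeMinutes"] for i in entry["incidentImpact"])
    consumed_minutes = budget_total_minutes * entry["errorBudgetConsumedPercent"] / 100.0
    return abs(incident_minutes - consumed_minutes) <= tolerance_minutes


entry = {
    "errorBudgetConsumedPercent": 80,
    "incidentImpact": [
        {"incidentId": "INC-2026-0325-0018", "downtimeMinutes": 18.5},
        {"incidentId": "INC-2026-0327-0031", "downtimeMinutes": 16.1},
    ],
}
consistent = incident_downtime_matches_budget(entry, budget_total_minutes=43.2)
```

Here the two incidents account for 34.6 minutes, against 34.56 minutes implied by 80% of a 43.2-minute budget, so the entry passes the check.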

✅ Validation Mechanisms

🔍 What Is Validated?

| Component             | Validation Criteria                                                                  |
| --------------------- | ------------------------------------------------------------------------------------ |
| SLO Definitions       | Must have valid SLI queries, realistic targets, and defined evaluation windows       |
| SLI Calculations      | Computed values must be within expected ranges; anomalous spikes are flagged         |
| Error Budget Accuracy | Budget calculations must align with actual incident downtime and metric data         |
| Report Completeness   | Compliance reports must cover all SLOs for the reporting window with no gaps         |
| Breach Detection      | SLA breaches must be detected within the evaluation cycle, not delayed               |
| Incident Correlation  | SLO degradation periods must be linked to corresponding incidents when available     |
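
The "SLO Definitions" criterion can be sketched as a structural check over one parsed SLO entry. The Python below is a minimal illustration under stated assumptions: `validate_slo` is a hypothetical name, field names follow the example SLO definition, "realistic" is reduced to ruling out availability targets of 0 or 100 and beyond, and the query is only checked for presence, not parsed.

```python
import re

# Assumption: windows are written as an integer followed by m, h, or d.
WINDOW_PATTERN = re.compile(r"^\d+[mhd]$")


def validate_slo(slo: dict) -> list[str]:
    """Return a list of problems found in one SLO definition (empty if none)."""
    problems = []
    for field in ("name", "sli", "target", "window"):
        if field not in slo:
            problems.append(f"missing field: {field}")

    sli = slo.get("sli") if isinstance(slo.get("sli"), dict) else {}
    if "sli" in slo and not str(sli.get("query", "")).strip():
        problems.append("sli.query is empty")

    target = slo.get("target")
    if sli.get("type") == "availability" and isinstance(target, (int, float)):
        # A 100% target leaves no error budget and is treated as unrealistic.
        if not 0 < target < 100:
            problems.append("availability target must be between 0 and 100")

    window = slo.get("window")
    if window is not None and not WINDOW_PATTERN.match(str(window)):
        problems.append(f"invalid window: {window}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a report show everything that is wrong with a definition in a single pass.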

🧪 Validation Workflow

flowchart TD
    Start[Metrics Aggregation Cycle Complete]
    LoadSLOs[Load SLO definitions for all services]
    ComputeSLIs[Calculate SLI values from metrics]
    EvalBudget[Evaluate error budget consumption and burn rate]
    CheckThresholds{Budget Warning or Breach?}
    EmitWarning[Emit ErrorBudgetWarning]
    EmitBreach[Emit SLABreachAlert]
    GenerateReport[Generate SLO compliance report]
    UpdateMemory[Store results in performance history]
    EmitStatus[Emit SLOStatusUpdated]

    Start --> LoadSLOs --> ComputeSLIs --> EvalBudget --> CheckThresholds
    CheckThresholds -->|Warning| EmitWarning --> GenerateReport
    CheckThresholds -->|Breach| EmitBreach --> GenerateReport
    CheckThresholds -->|Healthy| GenerateReport
    GenerateReport --> UpdateMemory --> EmitStatus
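
The threshold decision and the event branches in this flowchart can be sketched in Python. The thresholds are hypothetical (the specification does not define them), as are the names `BudgetStatus`, `classify`, and `events_to_emit`. Following the diagram, an exhausted budget is mapped to `SLABreachAlert`.

```python
from enum import Enum


class BudgetStatus(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    BREACH = "Breach"


def classify(
    consumed_percent: float,
    burn_rate: float,
    warn_consumed_percent: float = 75.0,  # hypothetical threshold
    warn_burn_rate: float = 2.0,          # hypothetical threshold
) -> BudgetStatus:
    """Map budget consumption and burn rate onto the three flowchart branches."""
    if consumed_percent >= 100.0:
        return BudgetStatus.BREACH
    if consumed_percent >= warn_consumed_percent or burn_rate >= warn_burn_rate:
        return BudgetStatus.WARNING
    return BudgetStatus.HEALTHY


def events_to_emit(status: BudgetStatus) -> list[str]:
    """Events for one evaluation; SLOStatusUpdated is always emitted last."""
    events = []
    if status is BudgetStatus.WARNING:
        events.append("ErrorBudgetWarning")
    elif status is BudgetStatus.BREACH:
        events.append("SLABreachAlert")
    events.append("SLOStatusUpdated")
    return events
```

With 80% of the budget consumed and a burn rate of 1.8, these defaults classify the SLO as `Warning`, so the evaluation emits `ErrorBudgetWarning` followed by `SLOStatusUpdated`.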

🔁 Process Flow

flowchart TD
    Start([SLO/SLA Compliance Agent Activated])
    LoadDefs[Load SLO definitions from registry]
    QueryMetrics[Query metrics backend for SLI data]
    ComputeSLIs[Calculate SLI values per service]
    CalculateBudgets[Compute error budgets and burn rates]
    EvaluateCompliance[Compare actuals against targets]
    CorrelateIncidents[Link SLO impacts to known incidents]
    GenerateReports[Create SLO and SLA compliance reports]
    EmitEvents[Emit SLOStatusUpdated, ErrorBudgetWarning, SLABreachAlert]
    UpdateHistory[Store results in performance history]
    NotifyStakeholders[Notify Release Manager, Customer Success, Studio]
    End([Finish])

    Start --> LoadDefs --> QueryMetrics --> ComputeSLIs
    ComputeSLIs --> CalculateBudgets --> EvaluateCompliance
    EvaluateCompliance --> CorrelateIncidents --> GenerateReports
    GenerateReports --> EmitEvents --> UpdateHistory --> NotifyStakeholders --> End

📃 Agent Contract

agentId: slo-sla-compliance
role: "Service Level Objective and Agreement Compliance Tracker"
category: "Observability, Reliability Engineering, Compliance"
description: >
  Tracks and reports SLO/SLA compliance across all services. Manages error
  budgets, detects SLA breaches, and feeds reliability signals into release
  decisions and customer communication workflows.

triggers:
  - metrics_aggregated
  - error_budget_consumed
  - release_deployed

inputs:
  - Prometheus/OTEL metric data
  - SLO definitions per service
  - Incident resolution data
  - Service deployment metadata
  - Customer SLA contracts

outputs:
  - slo-report
  - error-budget-analysis
  - sla-compliance-report
  - slo-definition.yaml
  - execution-metadata.json
  - Event: SLOStatusUpdated
  - Event: SLABreachAlert
  - Event: ErrorBudgetWarning

skills:
  - DefineSLOsFromBlueprint
  - ComputeSLIMetrics
  - CalculateErrorBudgets
  - DetectSLOViolations
  - GenerateComplianceReports
  - CorrelateIncidentsToSLOs
  - EmitBudgetWarnings
  - TrackSLOTrends

memory:
  scope: [traceId, service, sloName, window, tenantId]
  stores:
    - sloDefinitionRegistry
    - sloPerformanceHistory
    - errorBudgetTrackingStore
    - slaComplianceArchive
    - incidentSLOCorrelationDB

validations:
  - SLO definitions are valid and complete
  - SLI calculations are within expected ranges
  - Error budget calculations align with incident data
  - Compliance reports cover all SLOs
  - execution-metadata.json generated

version: "1.0.0"
status: active

📝 Summary

The SLO/SLA Compliance Agent is the reliability compass of the ConnectSoft AI Software Factory. It ensures that:

  • 📈 Service reliability is continuously measured against formal SLO targets
  • 💰 Error budgets are tracked in real time with burn rate analysis and exhaustion projections
  • 🔔 SLA breaches are detected proactively and communicated to stakeholders before customer impact
  • 📋 Compliance reports are generated automatically for teams, management, and customers
  • 🔁 SLO data drives release decisions, preventing risky deployments when reliability is degraded

Without this agent, reliability is aspirational, not measured. With it, every service has a quantified, tracked, and enforced reliability contract.