Extension Roadmap
Target Architecture — Final-State Design
This page describes how the Observability & Feedback Platform is designed to grow. The current eleven services, eight workers, and nine APIs are the stable core; everything here is additive and follows the platform's extension principles.
The platform is built to extend without disruption: new signal sources, detectors, and feedback channels plug into the existing event backbone and required-dimension model rather than reshaping the core. The north star is autonomous observability — a factory that not only watches itself but increasingly diagnoses, scores, and improves itself.
Extension Principles
- Additive via events. New capabilities subscribe to existing topics or publish new NounVerbPastTense events; they never require breaking changes to the canonical envelope.
- Required dimensions are sacred. Every new signal, service, or worker emits the full set of required telemetry dimensions, preserving end-to-end correlation.
- Single-owner aggregates. New data classes get a single owning service and store, consistent with the existing topology.
- Tenant-isolated by default. Every extension carries tenantId as an isolation boundary.
- Dogfood. New detectors and dashboards are first proven on the platform's own self-telemetry before being offered to generated SaaS.
- Governed. New feedback and cost signals integrate with Governance, Security & Compliance for policy and audit.
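As a rough illustration of the first, second, and fourth principles, the sketch below checks an event before it is published. It is a minimal sketch, not the platform's canonical envelope: the `EventEnvelope` fields, the `projectId` and `traceId` dimension names, and the naming regex are all assumptions made for this example (only `tenantId` is named on this page).

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Assumption: the real required-dimension set is defined elsewhere.
# Only tenantId comes from this page; the other keys are placeholders.
REQUIRED_DIMENSIONS = ("tenantId", "projectId", "traceId")

# Heuristic stand-in for the NounVerbPastTense convention: PascalCase words
# ending in "ed" (AnomalyDetected, IncidentOpened, FeedbackApplied).
# Irregular past tenses such as "Sent" would need an explicit allow-list.
EVENT_NAME_PATTERN = re.compile(r"^(?:[A-Z][a-z0-9]+)+ed$")


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope shape used only for this example."""
    event_type: str
    dimensions: dict[str, str]
    payload: dict[str, Any]
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def validate_envelope(envelope: EventEnvelope) -> list[str]:
    """Return the list of principle violations; an empty list means acceptable."""
    problems = []
    if not EVENT_NAME_PATTERN.match(envelope.event_type):
        problems.append(f"event type {envelope.event_type!r} is not NounVerbPastTense")
    for key in REQUIRED_DIMENSIONS:
        if not envelope.dimensions.get(key):
            problems.append(f"missing required dimension {key!r}")
    return problems
```

A publisher could call `validate_envelope` and refuse to emit when the list is non-empty, which keeps new extensions from silently dropping correlation or tenant isolation.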
Future Services
- AnomalyDetectionService — generalised ML-based anomaly detection across metrics and traces (beyond cost), emitting AnomalyDetected signals into the feedback loop.
- RootCauseService — automated root-cause analysis correlating traces, logs, deploys, and recent generation changes to propose likely causes on IncidentOpened.
- SyntheticMonitoringService — proactive synthetic probes of generated SaaS endpoints feeding the same metric/SLO machinery.
- FeedbackRoutingService — intelligent routing of feedback items to the right Knowledge Platform partition, agent cluster, or human reviewer.
- ExperimentService — A/B and canary outcome measurement, attributing quality/cost deltas to generation variants.
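To make the AnomalyDetectionService idea concrete, here is a deliberately simple stand-in: a rolling z-score detector rather than the ML-based detection the roadmap envisions. The class name, thresholds, and the fields of the emitted dictionary are assumptions for illustration; only the `AnomalyDetected` event name and `tenantId` come from this page.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags values that deviate strongly from a rolling window of recent values."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, metric: str, value: float, tenant_id: str):
        """Return an AnomalyDetected-style dict for an outlier, otherwise None."""
        event = None
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                event = {
                    "eventType": "AnomalyDetected",  # name from the roadmap
                    "tenantId": tenant_id,           # isolation boundary
                    "metric": metric,                # hypothetical payload field
                    "value": value,                  # hypothetical payload field
                    "zScore": round((value - mu) / sigma, 2),
                }
        # Learn only from normal points so a single spike does not widen the baseline.
        if event is None:
            self.values.append(value)
        return event
```

Feeding a steady latency series followed by a sharp spike returns `None` for the steady points and an event dictionary for the spike. A production detector would also need seasonality handling and per-metric state.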
Future Workers
- RootCauseWorker — runs RCA pipelines on incident open and attaches hypotheses.
- FeedbackApplicationWorker — tracks whether a feedback item was applied in a subsequent generation and emits FeedbackApplied to close loop visibility.
- SloForecastWorker — forecasts error-budget burn and raises early-warning signals before breach.
- CostForecastWorker — projects month-end cost per tenant/project and flags trajectory anomalies.
- TraceSamplingWorker — adaptive, tenant-aware sampling to balance fidelity and cost.
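The forecasting workers could start from very simple arithmetic. The sketch below shows one plausible baseline for SloForecastWorker (the standard burn-rate ratio) and for CostForecastWorker (a linear run-rate projection). The function names and parameters are hypothetical, and a real worker would likely use richer models than a straight line.

```python
import calendar
from datetime import date


def error_budget_burn_rate(slo_target: float, good: int, total: int) -> float:
    """Burn rate = observed error ratio / allowed error ratio.

    1.0 spends the budget exactly over the SLO window; above 1.0 the
    budget runs out before the window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total == 0:
        return 0.0
    allowed_error_ratio = 1.0 - slo_target
    observed_error_ratio = (total - good) / total
    return observed_error_ratio / allowed_error_ratio


def days_until_budget_exhausted(
    burn_rate: float, budget_remaining: float, window_days: int
) -> float:
    """Days until the remaining budget fraction (0..1) is gone at this burn rate."""
    if burn_rate <= 0:
        return float("inf")
    return budget_remaining * window_days / burn_rate


def project_month_end_cost(month_to_date: float, today: date) -> float:
    """Linear run-rate projection: average daily spend so far times days in month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date / today.day * days_in_month
```

For example, an SLO target of 0.999 with 20 failures in 10,000 requests gives a burn rate of about 2, and with half the budget left in a 30-day window that projects to roughly 7.5 days before breach. An early-warning signal could fire when that projection drops below a chosen lead time.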
Future APIs
- POST /anomalies/query — query detected anomalies across signal types.
- GET /incidents/{incidentId}/root-cause — retrieve automated RCA hypotheses.
- POST /synthetics/probes — define synthetic monitors.
- GET /quality/artifacts/{artifactId} — artifact-level quality and feedback history.
- POST /experiments — register a generation experiment and its success metrics.
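Since these endpoints are not yet specified, the following is only a sketch of how a client might construct two of the calls. The host name, the request body fields (`tenantId`, `targetUrl`, `intervalSeconds`), and the helper names are assumptions; only the paths and HTTP methods come from the list above. The requests are built but not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Placeholder host; the real base URL is not given in this document.
BASE_URL = "https://observability.example.invalid"


def build_probe_request(tenant_id: str, target_url: str, interval_seconds: int) -> Request:
    """Build (but do not send) a POST /synthetics/probes request."""
    body = {
        "tenantId": tenant_id,               # isolation boundary from the principles
        "targetUrl": target_url,             # hypothetical field
        "intervalSeconds": interval_seconds, # hypothetical field
    }
    return Request(
        url=f"{BASE_URL}/synthetics/probes",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_root_cause_request(incident_id: str) -> Request:
    """Build (but do not send) a GET /incidents/{incidentId}/root-cause request."""
    safe_id = quote(incident_id, safe="")  # keep the id inside one path segment
    return Request(url=f"{BASE_URL}/incidents/{safe_id}/root-cause", method="GET")
```

Passing either object to `urllib.request.urlopen` would perform the call once a real host and authentication scheme are known.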
Marketplace — Dashboard Packs
The platform contributes dashboard packs to the Marketplace: versioned, reusable bundles of DashboardDefinition, AlertRule, and SloDefinition templates tuned for common SaaS shapes (e.g. Booking SaaS Pack, Multi-tenant API Pack, Event-Driven Backend Pack).
- Packs install into a tenant/project and instantiate dashboards, alerts, and SLOs pre-wired to the required dimensions.
- Packs are versioned and governed; updates flow through GitOps.
- Quality-scoring rubrics and cost-anomaly baselines can ship as packs too, so the definition of good is itself reusable across projects.
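One way to picture a pack install is as a pure function from a versioned pack plus a tenant/project scope to a list of concrete resources. The sketch below assumes a hypothetical `DashboardPack` shape and uses `tenantId` and `projectId` as example dimensions; the real pack format and dimension set are not defined on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DashboardPack:
    """Hypothetical pack manifest: a versioned bundle of template names."""
    name: str
    version: str
    dashboards: tuple[str, ...] = ()
    alert_rules: tuple[str, ...] = ()
    slo_definitions: tuple[str, ...] = ()


def install_pack(pack: DashboardPack, tenant_id: str, project_id: str) -> list[dict]:
    """Instantiate every template in the pack, scoped to one tenant/project."""
    scope = {"tenantId": tenant_id, "projectId": project_id}
    kinds = (
        ("DashboardDefinition", pack.dashboards),
        ("AlertRule", pack.alert_rules),
        ("SloDefinition", pack.slo_definitions),
    )
    return [
        {
            "kind": kind,
            "template": template,
            "pack": f"{pack.name}@{pack.version}",  # records which version produced it
            "dimensions": dict(scope),               # pre-wired, copied per resource
        }
        for kind, templates in kinds
        for template in templates
    ]
```

Recording the `name@version` on each instantiated resource is what would let a GitOps-driven update find and upgrade everything a given pack version created.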
Agent Opportunities
The platform is a natural home for autonomous observability agents in the Agent Mesh:
- Triage Agent — consumes IncidentOpened/AlertTriggered, correlates telemetry, and proposes severity, owner, and likely root cause.
- Quality Reviewer Agent — reads QualityScoreComputed and feedback streams and authors structured FeedbackItem records back into the loop.
- Cost Optimizer Agent — acts on CostAnomalyDetected, attributes cost to generation paths, and proposes cheaper templates/patterns to the Knowledge Platform.
- SLO Steward Agent — manages SLO definitions, error-budget policy, and breach response autonomously.
- Self-Healing Agent — for well-understood incident classes, proposes or applies mitigations (with governance approval), closing the loop from detection to remediation.
These agents turn the trust-and-improvement loop from human-driven to increasingly autonomous, while keeping governance and audit at the center.
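As a toy illustration of the Triage Agent's contract, the sketch below turns an IncidentOpened event into a proposal that remains advisory until approved. The event fields (`incidentId`, `service`, `errorRate`), the deploy record shape, the severity thresholds, and the owner naming are all assumptions for this example; a real agent would correlate far more telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageProposal:
    incident_id: str
    severity: str
    owner: str
    likely_cause: str
    requires_approval: bool = True  # proposals stay advisory until governance approves


def triage(event: dict, recent_deploys: list[dict]) -> TriageProposal:
    """Propose severity, owner, and a likely cause for an IncidentOpened event."""
    error_rate = event.get("errorRate", 0.0)
    if error_rate >= 0.5:
        severity = "SEV1"
    elif error_rate >= 0.05:
        severity = "SEV2"
    else:
        severity = "SEV3"

    service = event.get("service", "unknown")
    # Naive correlation: suspect the most recent deploy of the affected service.
    suspects = [d for d in recent_deploys if d.get("service") == service]
    if suspects:
        latest = max(suspects, key=lambda d: d["deployedAt"])  # ISO-8601 strings sort by time
        cause = f"deploy {latest['version']} of {service}"
    else:
        cause = "no correlated deploy found"

    return TriageProposal(event["incidentId"], severity, f"team-{service}", cause)
```

Keeping the output a proposal object, rather than having the agent act directly, is one way to keep governance and audit in the path while the loop becomes more autonomous.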