Extension Roadmap

Target Architecture — Final-State Design

This page describes how the Observability & Feedback Platform is designed to grow. The current eleven services, eight workers, and nine APIs are the stable core; everything here is additive and follows the platform's extension principles.

The platform is built to extend without disruption: new signal sources, detectors, and feedback channels plug into the existing event backbone and required-dimension model rather than reshaping the core. The north star is autonomous observability — a factory that not only watches itself but increasingly diagnoses, scores, and improves itself.

Extension Principles

Additive via events. New capabilities subscribe to existing topics or publish new NounVerbPastTense events; they never require breaking changes to the canonical envelope.
Required dimensions are sacred. Every new signal, service, or worker emits the full set of required telemetry dimensions, preserving end-to-end correlation.
Single-owner aggregates. New data classes get a single owning service and store, consistent with the existing topology.
Tenant-isolated by default. Every extension carries tenantId as an isolation boundary.
Dogfood. New detectors and dashboards are first proven on the platform's own self-telemetry before being offered to generated SaaS.
Governed. New feedback and cost signals integrate with Governance, Security & Compliance for policy and audit.
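
As a rough illustration of how the first, second, and fourth principles could be enforced at publish time, the sketch below builds an event envelope and rejects events whose name is not PascalCase or that lack a required dimension. Only tenantId comes from this page; the other dimension names, the Envelope fields, and make_event are hypothetical, and the platform's canonical envelope takes precedence over anything shown here.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

# Assumption: only tenantId is named on this page; the other dimensions are placeholders.
REQUIRED_DIMENSIONS = ("tenantId", "projectId", "traceId", "serviceName")

# Checks PascalCase with at least two words. A regex cannot verify that the
# last word is a past-tense verb, so that part of the convention is left to review.
EVENT_NAME = re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$")


@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope shape, not the platform's canonical one."""
    event_type: str
    dimensions: dict[str, str]
    payload: dict[str, Any] = field(default_factory=dict)
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def make_event(
    event_type: str,
    dimensions: dict[str, str],
    payload: Optional[dict[str, Any]] = None,
) -> Envelope:
    """Build an envelope, refusing badly named events or missing dimensions."""
    if not EVENT_NAME.match(event_type):
        raise ValueError(f"event name is not PascalCase: {event_type!r}")
    missing = [d for d in REQUIRED_DIMENSIONS if not dimensions.get(d)]
    if missing:
        raise ValueError(f"missing required dimensions: {', '.join(missing)}")
    return Envelope(event_type, dict(dimensions), dict(payload or {}))
```

Validating at construction time means a new service or worker cannot publish an event that would break end-to-end correlation, which is the point of treating the dimensions as mandatory.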

Future Services

AnomalyDetectionService — generalised ML-based anomaly detection across metrics and traces (beyond cost), emitting AnomalyDetected signals into the feedback loop.
RootCauseService — automated root-cause analysis correlating traces, logs, deploys, and recent generation changes to propose likely causes on IncidentOpened.
SyntheticMonitoringService — proactive synthetic probes of generated SaaS endpoints feeding the same metric/SLO machinery.
FeedbackRoutingService — intelligent routing of feedback items to the right Knowledge Platform partition, agent cluster, or human reviewer.
ExperimentService — A/B and canary outcome measurement, attributing quality/cost deltas to generation variants.
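
To make the AnomalyDetectionService contract concrete, here is a minimal sketch of a detector that takes a metric history and a new sample and returns an AnomalyDetected-style signal. It uses a plain z-score as a stand-in for the ML model described above; the function name, the threshold, and the fields of the returned dictionary are assumptions for illustration only.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def detect_anomaly(
    history: Sequence[float],
    latest: float,
    threshold: float = 3.0,  # hypothetical default, not a platform setting
) -> Optional[dict]:
    """Return an AnomalyDetected-style dict when latest deviates from history."""
    if len(history) < 2:
        return None  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return None  # flat history: the z-score is undefined, so stay silent
    z_score = (latest - baseline) / spread
    if abs(z_score) < threshold:
        return None
    return {
        "eventType": "AnomalyDetected",
        "value": latest,
        "baseline": baseline,
        "zScore": round(z_score, 2),
    }
```

A real detector would also attach the required dimensions to the signal before publishing it to the feedback loop.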

Future Workers

RootCauseWorker — runs RCA pipelines on incident open and attaches hypotheses.
FeedbackApplicationWorker — tracks whether a feedback item was applied in a subsequent generation and emits FeedbackApplied so that loop closure is visible.
SloForecastWorker — forecasts error-budget burn and raises early-warning signals before breach.
CostForecastWorker — projects month-end cost per tenant/project and flags trajectory anomalies.
TraceSamplingWorker — adaptive, tenant-aware sampling to balance fidelity and cost.
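
The kind of calculation SloForecastWorker might run can be sketched with a linear burn-rate extrapolation. The sketch assumes constant traffic and a constant error rate over the SLO window, which a production forecaster would not; BurnForecast, forecast_burn, and their parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurnForecast:
    burn_rate: float                      # 1.0 spends the budget exactly over the window
    budget_consumed: float                # fraction of the error budget already spent
    hours_to_exhaustion: Optional[float]  # None when nothing is burning
    will_breach: bool                     # True if the budget runs out before the window ends


def forecast_burn(
    slo_target: float,
    observed_error_rate: float,
    elapsed_hours: float,
    window_hours: float,
) -> BurnForecast:
    """Linearly extrapolate error-budget burn across the SLO window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if window_hours <= 0 or not 0 <= elapsed_hours <= window_hours:
        raise ValueError("elapsed_hours must lie within the window")
    allowed_error_rate = 1.0 - slo_target
    burn_rate = observed_error_rate / allowed_error_rate
    if burn_rate <= 0:
        return BurnForecast(0.0, 0.0, None, False)
    consumed = burn_rate * elapsed_hours / window_hours
    hours_left = max(0.0, (1.0 - consumed) * window_hours / burn_rate)
    will_breach = hours_left < (window_hours - elapsed_hours)
    return BurnForecast(burn_rate, consumed, hours_left, will_breach)
```

With a 99.9% target, an observed error rate of 0.2%, and 100 hours elapsed in a 720-hour window, the burn rate is about 2 and the budget is forecast to run out roughly 260 hours later, well before the window closes, so an early-warning signal would be raised.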

Future APIs

POST /anomalies/query — query detected anomalies across signal types.
GET /incidents/{incidentId}/root-cause — retrieve automated RCA hypotheses.
POST /synthetics/probes — define synthetic monitors.
GET /quality/artifacts/{artifactId} — artifact-level quality and feedback history.
POST /experiments — register a generation experiment and its success metrics.
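
A client call to one of these endpoints might look like the sketch below, which prepares (but does not send) a POST /experiments request using only the Python standard library. The request schema is not specified on this page, so the base URL, the body fields, and build_experiment_request are all assumptions.

```python
import json
from urllib.request import Request

# Hypothetical base URL; .invalid is a reserved TLD that never resolves.
BASE_URL = "https://observability.example.invalid"


def build_experiment_request(
    tenant_id: str,
    name: str,
    variants: list[str],
    success_metrics: list[str],
) -> Request:
    """Prepare a POST /experiments request without sending it."""
    body = {
        "tenantId": tenant_id,              # isolation boundary on every call
        "name": name,
        "variants": variants,               # assumed field: generation variants to compare
        "successMetrics": success_metrics,  # assumed field: metrics that decide the outcome
    }
    return Request(
        url=f"{BASE_URL}/experiments",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to urllib.request.urlopen would send it; authentication and error handling are left out of the sketch.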

Marketplace — Dashboard Packs

The platform contributes dashboard packs to the Marketplace: versioned, reusable bundles of DashboardDefinition, AlertRule, and SloDefinition templates tuned for common SaaS shapes (e.g. Booking SaaS Pack, Multi-tenant API Pack, Event-Driven Backend Pack).

Packs install into a tenant/project and instantiate dashboards, alerts, and SLOs pre-wired to the required dimensions.
Packs are versioned and governed; updates flow through GitOps.
Quality-scoring rubrics and cost-anomaly baselines can ship as packs too, so the definition of "good" is itself reusable across projects.
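
Pack installation can be pictured as template instantiation scoped to a tenant and project. The sketch below is one possible shape under stated assumptions: PackTemplate, DashboardPack, install_pack, the query syntax, and the projectId dimension are hypothetical, and only the three template kinds and tenantId come from this page.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PackTemplate:
    kind: str   # "DashboardDefinition", "AlertRule", or "SloDefinition"
    name: str
    query: str  # uses $tenantId and $projectId placeholders


@dataclass(frozen=True)
class DashboardPack:
    name: str
    version: str
    templates: tuple[PackTemplate, ...]


def install_pack(pack: DashboardPack, tenant_id: str, project_id: str) -> list[dict]:
    """Instantiate every template in the pack for one tenant/project."""
    scope = {"tenantId": tenant_id, "projectId": project_id}
    return [
        {
            "kind": t.kind,
            "name": t.name,
            "pack": f"{pack.name}@{pack.version}",  # records which version was installed
            # substitute() raises KeyError if a template uses an unknown placeholder
            "query": Template(t.query).substitute(scope),
            **scope,
        }
        for t in pack.templates
    ]
```

Recording the pack name and version on each instantiated object is one way to let a later pack update find and upgrade what an earlier version installed.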

Agent Opportunities

The platform is a natural home for autonomous observability agents in the Agent Mesh:

Triage Agent — consumes IncidentOpened/AlertTriggered, correlates telemetry, and proposes severity, owner, and likely root cause.
Quality Reviewer Agent — reads QualityScoreComputed and feedback streams and authors structured FeedbackItem records back into the loop.
Cost Optimizer Agent — acts on CostAnomalyDetected, attributes cost to generation paths, and proposes cheaper templates/patterns to the Knowledge Platform.
SLO Steward Agent — manages SLO definitions, error-budget policy, and breach response autonomously.
Self-Healing Agent — for well-understood incident classes, proposes or applies mitigations (with governance approval), closing the loop from detection to remediation.

These agents turn the trust-and-improvement loop from human-driven to increasingly autonomous, while keeping governance and audit at the center.
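
The governance gate described for the Self-Healing Agent can be sketched as a small decision function: known incident classes map to a mitigation, the mitigation is applied only after approval, and anything else is escalated. The playbook entries, the incident fields other than tenantId, and the approve/apply callbacks are hypothetical stand-ins for the real Governance integration.

```python
from typing import Callable

# Hypothetical playbook for "well-understood" incident classes.
PLAYBOOK = {
    "QueueBacklog": "scale_out_workers",
    "BadDeploy": "rollback_last_deploy",
}


def handle_incident(
    incident: dict,
    approve: Callable[[dict], bool],
    apply: Callable[[str, dict], None],
) -> dict:
    """Propose a mitigation and apply it only with governance approval."""
    action = PLAYBOOK.get(incident["incidentClass"])
    if action is None:
        # Unknown class: never act autonomously, hand over to a human.
        return {"status": "escalated", "action": None}
    proposal = {
        "incidentId": incident["incidentId"],
        "tenantId": incident["tenantId"],
        "action": action,
    }
    if not approve(proposal):
        # The proposal stays on record but nothing is changed.
        return {"status": "proposed", "action": action}
    apply(action, incident)
    return {"status": "applied", "action": action}
```

Keeping approval as an injected callback means the same agent logic can run in propose-only mode (a callback that always returns False) while it is being dogfooded on the platform's own telemetry.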