Extension Roadmap

Target Architecture — Final-State Design

This page describes how the Observability & Feedback Platform is designed to grow. The current eleven services, eight workers, and nine APIs are the stable core; everything here is additive and follows the platform's extension principles.

The platform is built to extend without disruption: new signal sources, detectors, and feedback channels plug into the existing event backbone and required-dimension model rather than reshaping the core. The north star is autonomous observability — a factory that not only watches itself but increasingly diagnoses, scores, and improves itself.

Extension Principles

  • Additive via events. New capabilities subscribe to existing topics or publish new NounVerbPastTense events; they never require breaking changes to the canonical envelope.
  • Required dimensions are sacred. Every new signal, service, or worker emits the full set of required telemetry dimensions, preserving end-to-end correlation.
  • Single-owner aggregates. New data classes get a single owning service and store, consistent with the existing topology.
  • Tenant-isolated by default. Every extension carries tenantId as an isolation boundary.
  • Dogfood. New detectors and dashboards are first proven on the platform's own self-telemetry before being offered to generated SaaS.
  • Governed. New feedback and cost signals integrate with Governance, Security & Compliance for policy and audit.
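
As a rough illustration of the first, second, and fourth principles, an extension could run a check like the one below before publishing an event. This is only a sketch under stated assumptions: the source names tenantId but does not list the full set of required dimensions, so traceId and serviceName are hypothetical placeholders, and the NounVerbPastTense rule is approximated as "PascalCase ending in ed", which would need an allow-list for irregular verbs.

```python
import re

# ASSUMPTION: only tenantId is named in the source; the other two
# dimensions are hypothetical stand-ins for the real required set.
REQUIRED_DIMENSIONS = ("tenantId", "traceId", "serviceName")

# Heuristic for NounVerbPastTense: one or more PascalCase words, ending in "ed".
EVENT_NAME = re.compile(r"^(?:[A-Z][a-z0-9]+)+ed$")


def validate_event(name: str, dimensions: dict) -> list:
    """Return a list of human-readable problems; an empty list means the event passes."""
    problems = []
    if not EVENT_NAME.match(name):
        problems.append(f"event name {name!r} is not NounVerbPastTense")
    for dim in REQUIRED_DIMENSIONS:
        # A missing key and an empty value are both treated as a violation.
        if not dimensions.get(dim):
            problems.append(f"missing required dimension {dim!r}")
    return problems


ok = validate_event(
    "AnomalyDetected",
    {"tenantId": "t1", "traceId": "abc123", "serviceName": "AnomalyDetectionService"},
)
bad = validate_event("anomaly_detected", {"tenantId": "t1"})
```

Here `ok` is empty, while `bad` reports the naming violation plus the two missing dimensions. Returning a list rather than raising on the first failure lets a publisher log every violation at once.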

Future Services

  • AnomalyDetectionService — generalised ML-based anomaly detection across metrics and traces (beyond cost), emitting AnomalyDetected signals into the feedback loop.
  • RootCauseService — automated root-cause analysis correlating traces, logs, deploys, and recent generation changes to propose likely causes on IncidentOpened.
  • SyntheticMonitoringService — proactive synthetic probes of generated SaaS endpoints feeding the same metric/SLO machinery.
  • FeedbackRoutingService — intelligent routing of feedback items to the right Knowledge Platform partition, agent cluster, or human reviewer.
  • ExperimentService — A/B and canary outcome measurement, attributing quality/cost deltas to generation variants.
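
To make the AnomalyDetected signal concrete, the sketch below flags a metric sample whose z-score against recent history exceeds a threshold. It is a simple statistical stand-in, not the ML-based detection the roadmap describes, and the function name, payload fields, and threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev


def detect_anomaly(history, latest, tenant_id, metric, threshold=3.0):
    """Return a hypothetical AnomalyDetected payload, or None if the sample looks normal."""
    if len(history) < 2:
        return None  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    if sigma == 0:
        return None  # flat history: a z-score is undefined
    z = (latest - mu) / sigma
    if abs(z) < threshold:
        return None
    return {
        "type": "AnomalyDetected",
        "tenantId": tenant_id,  # isolation boundary carried on every signal
        "metric": metric,
        "value": latest,
        "zScore": round(z, 2),
    }


latency_ms = [100, 102, 98, 101, 99]
spike = detect_anomaly(latency_ms, 130, "t1", "latency_p95")
normal = detect_anomaly(latency_ms, 101, "t1", "latency_p95")
```

With this history, the 130 ms sample produces an event and the 101 ms sample does not.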

Future Workers

  • RootCauseWorker — runs RCA pipelines on incident open and attaches hypotheses.
  • FeedbackApplicationWorker — tracks whether a feedback item was applied in a subsequent generation and emits FeedbackApplied so that loop closure is visible.
  • SloForecastWorker — forecasts error-budget burn and raises early-warning signals before breach.
  • CostForecastWorker — projects month-end cost per tenant/project and flags trajectory anomalies.
  • TraceSamplingWorker — adaptive, tenant-aware sampling to balance fidelity and cost.
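
The early-warning idea behind SloForecastWorker can be sketched as a linear projection of error-budget burn. This is a minimal illustration that assumes a constant burn rate and a 30-day window; the function name and parameters are hypothetical, and a real forecaster would likely use a richer model.

```python
def forecast_budget_exhaustion(budget_consumed, elapsed_days, window_days=30.0):
    """Project the day on which the error budget runs out.

    budget_consumed is the fraction of the window's budget already spent (0.0 to 1.0).
    Returns the projected day of exhaustion, or None if the budget is expected
    to last the whole window at the current average burn rate.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    if budget_consumed <= 0:
        return None  # nothing burned yet, nothing to forecast
    # At a constant rate, the full budget (1.0) is spent after this many days.
    exhaustion_day = elapsed_days / budget_consumed
    return exhaustion_day if exhaustion_day < window_days else None


early_warning = forecast_budget_exhaustion(budget_consumed=0.5, elapsed_days=10)
healthy = forecast_budget_exhaustion(budget_consumed=0.2, elapsed_days=10)
```

Half the budget spent after 10 days projects exhaustion on day 20, before the window ends, so a warning would be raised. Spending 20% in the same period projects to day 50, which is outside the window, so the function returns None.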

Future APIs

  • POST /anomalies/query — query detected anomalies across signal types.
  • GET /incidents/{incidentId}/root-cause — retrieve automated RCA hypotheses.
  • POST /synthetics/probes — define synthetic monitors.
  • GET /quality/artifacts/{artifactId} — artifact-level quality and feedback history.
  • POST /experiments — register a generation experiment and its success metrics.
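
The sketch below shows how a client might assemble requests for two of these endpoints. Only the methods and paths come from the list above; the request body fields (tenantId, name, variants, successMetrics) are assumptions, since the roadmap does not define a schema.

```python
import json
from urllib.parse import quote


def root_cause_path(incident_id: str) -> str:
    """Build the GET path for RCA hypotheses, escaping the id as a single path segment."""
    return f"/incidents/{quote(incident_id, safe='')}/root-cause"


def experiment_request(tenant_id, name, variants, success_metrics):
    """Return (method, path, json_body) for registering an experiment.

    The body layout is hypothetical and used only for illustration.
    """
    if len(variants) < 2:
        raise ValueError("an experiment needs at least two variants to compare")
    body = {
        "tenantId": tenant_id,
        "name": name,
        "variants": list(variants),
        "successMetrics": list(success_metrics),
    }
    return "POST", "/experiments", json.dumps(body, sort_keys=True)


path = root_cause_path("inc-42")
method, url, payload = experiment_request(
    "t1", "template-v2-canary", ["control", "candidate"], ["qualityScore", "costPerRun"]
)
```

Escaping the identifier with `safe=''` keeps an id that contains a slash from being read as extra path segments.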

Marketplace — Dashboard Packs

The platform contributes dashboard packs to the Marketplace: versioned, reusable bundles of DashboardDefinition, AlertRule, and SloDefinition templates tuned for common SaaS shapes (e.g. Booking SaaS Pack, Multi-tenant API Pack, Event-Driven Backend Pack).

  • Packs install into a tenant/project and instantiate dashboards, alerts, and SLOs pre-wired to the required dimensions.
  • Packs are versioned and governed; updates flow through GitOps.
  • Quality-scoring rubrics and cost-anomaly baselines can ship as packs too, so the definition of good is itself reusable across projects.
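
Pack installation can be pictured as binding versioned templates to a tenant and project. The following is a minimal sketch, not the platform's actual pack format: the DashboardPack fields, the `$tenantId`/`$projectId` placeholders, and the query syntax are all illustrative assumptions.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class DashboardPack:
    """Hypothetical pack manifest: a versioned bundle of query templates."""
    name: str
    version: str
    dashboards: tuple = ()
    alerts: tuple = ()
    slos: tuple = ()


def install_pack(pack: DashboardPack, tenant_id: str, project_id: str) -> dict:
    """Instantiate every template in the pack for one tenant/project."""
    bindings = {"tenantId": tenant_id, "projectId": project_id}

    def render(templates):
        # substitute() raises KeyError on an unknown placeholder, so a
        # malformed template fails at install time rather than silently.
        return [Template(t).substitute(bindings) for t in templates]

    return {
        "pack": f"{pack.name}@{pack.version}",
        "tenantId": tenant_id,
        "projectId": project_id,
        "dashboards": render(pack.dashboards),
        "alerts": render(pack.alerts),
        "slos": render(pack.slos),
    }


booking_pack = DashboardPack(
    name="booking-saas-pack",
    version="1.2.0",
    dashboards=('latency_p95{tenantId="$tenantId",projectId="$projectId"}',),
    alerts=('error_rate{tenantId="$tenantId"} > 0.01',),
)
installed = install_pack(booking_pack, "t1", "p1")
```

Keeping the manifest immutable (`frozen=True`) reflects the idea that a published pack version does not change; an update ships as a new version.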

Agent Opportunities

The platform is a natural home for autonomous observability agents in the Agent Mesh:

  • Triage Agent — consumes IncidentOpened/AlertTriggered, correlates telemetry, and proposes severity, owner, and likely root cause.
  • Quality Reviewer Agent — reads QualityScoreComputed and feedback streams and authors structured FeedbackItem records back into the loop.
  • Cost Optimizer Agent — acts on CostAnomalyDetected, attributes cost to generation paths, and proposes cheaper templates/patterns to the Knowledge Platform.
  • SLO Steward Agent — manages SLO definitions, error-budget policy, and breach response autonomously.
  • Self-Healing Agent — for well-understood incident classes, proposes or applies mitigations (with governance approval), closing the loop from detection to remediation.

These agents turn the trust-and-improvement loop from human-driven to increasingly autonomous, while keeping governance and audit at the center.
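
The governance gate described for the Self-Healing Agent might look like the sketch below: mitigations exist only for known incident classes, and they are applied only with approval. The incident classes, action names, and decision labels are hypothetical examples, not part of the platform.

```python
# ASSUMPTION: illustrative catalogue of well-understood incident classes.
MITIGATIONS = {
    "QueueBacklog": "scale_workers",
    "BadDeploy": "rollback_release",
}


def plan_mitigation(incident_class: str, governance_approved: bool) -> dict:
    """Decide whether to apply, propose, or escalate for an incident."""
    action = MITIGATIONS.get(incident_class)
    if action is None:
        # Unknown class: the agent does not guess, it hands off to a human.
        return {"decision": "escalate", "action": None}
    if governance_approved:
        return {"decision": "apply", "action": action}
    # Known class but no approval yet: surface the proposal for review.
    return {"decision": "propose", "action": action}


applied = plan_mitigation("BadDeploy", governance_approved=True)
proposed = plan_mitigation("QueueBacklog", governance_approved=False)
escalated = plan_mitigation("DiskCorruption", governance_approved=True)
```

Note that approval alone never triggers an action: an unrecognised incident class escalates even when approval is present.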