Extension Roadmap

Target Architecture — Final-State Design

This roadmap describes how the Runtime & Cloud Platform extends beyond its nine core services without breaking its contracts. Every extension preserves the canonical event envelope, tenant isolation via RuntimeTenantBinding, Pulumi-declared infrastructure, and full traceability.

Principles

  • Contract stability — new capabilities are additive; existing events, APIs, and aggregates keep their meaning (breaking changes get a version suffix).
  • Everything as desired state — new infrastructure is Pulumi-declared and drift-reconciled; nothing is operated out of band.
  • Autonomy by default — extensions are control loops first, dashboards second: detect, decide, act, then surface to humans.
  • Tenant-safe — every extension respects the isolation model and quotas in RuntimeTenantBinding.
  • Trace-complete — new signals carry the cross-cutting dimensions and correlate to the same traceId.
  • Cloud-portable shape — Azure-first, but abstractions (ComputeTarget, RuntimeService) are kept provider-neutral so additional clouds can be added behind the same model.
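
As an illustration of the contract-stability and trace-complete principles, the sketch below models an event envelope whose follow-up events inherit the tenant and traceId of the event that caused them. It is a minimal sketch under stated assumptions: the `EventEnvelope` fields, the `derive` helper, and the `.vN` suffix format are hypothetical and are not taken from the platform's actual envelope definition.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical stand-in for the canonical event envelope."""
    event_type: str            # illustrative name, e.g. "runtime.health.degraded"
    tenant_id: str             # tenant the event belongs to
    trace_id: str              # shared by every event in one causal chain
    payload: dict[str, Any]
    version: int = 1

    @property
    def qualified_type(self) -> str:
        # Assumption: version 1 keeps the bare name, and a breaking
        # change is published under a ".vN" suffix.
        if self.version == 1:
            return self.event_type
        return f"{self.event_type}.v{self.version}"


def derive(parent: EventEnvelope, event_type: str,
           payload: dict[str, Any], version: int = 1) -> EventEnvelope:
    """Create a follow-up event that stays in the parent's tenant and trace."""
    return EventEnvelope(event_type, parent.tenant_id, parent.trace_id,
                         payload, version)


root = EventEnvelope("runtime.health.degraded", "tenant-a", "trace-123",
                     {"service": "api"})
follow_up = derive(root, "runtime.cost.anomaly-detected", {"delta": 0.4},
                   version=2)
print(follow_up.qualified_type, follow_up.trace_id)
```

Deriving new events from a parent, rather than constructing them from scratch, is one way to make it hard for an extension to drop the tenant or trace dimensions by accident.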

Future Services

| Candidate Service | Purpose |
| --- | --- |
| RuntimeCostOptimizationService | Continuously right-size workloads and recommend/apply cost-saving scaling and SKU changes per tenant. |
| RuntimeChaosService | Inject controlled failures to validate resilience and the rollback/remediation loops. |
| RuntimeTrafficManagementService | Progressive delivery, traffic shaping, and multi-region routing (canary/blue-green at the edge). |
| RuntimeComplianceService | Continuously assert runtime compliance posture (residency, encryption, isolation) against Governance policy. |
| RuntimeBackupRestoreService | Tenant-aware backup, point-in-time restore, and DR orchestration across data stores. |
| RuntimeCarbonService | Surface and optimize energy/carbon footprint of generated runtimes. |

Future Workers

| Candidate Worker | Trigger | Purpose |
| --- | --- | --- |
| CostAnomalyWorker | Timer + billing signals | Detect cost anomalies and propose remediation. |
| CapacityForecastWorker | Timer | Forecast capacity from usage trends and pre-provision. |
| DisasterRecoveryWorker | Failover signal | Promote a paired region and re-bind tenants. |
| ComplianceDriftWorker | Timer + policy change | Detect compliance drift distinct from config drift. |
| CertificateRotationWorker | Expiry schedule | Rotate TLS certificates alongside secret rotation. |
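
To make the detection step of a worker such as CostAnomalyWorker concrete, the sketch below flags a cost reading that sits far above its trailing history. This is an assumption-laden illustration, not the platform's algorithm: the roadmap does not specify a detection method, so a simple z-score test is swapped in, and the function name, threshold, and sample figures are all hypothetical.

```python
from statistics import mean, stdev


def detect_cost_anomaly(history: list[float], latest: float,
                        threshold: float = 3.0) -> bool:
    """Return True when `latest` is more than `threshold` standard
    deviations above the mean of `history` (a plain z-score test)."""
    if len(history) < 2:
        return False               # too little history to judge
    mu = mean(history)
    sigma = stdev(history)         # sample standard deviation
    if sigma == 0:
        return latest > mu         # flat history: any increase stands out
    return (latest - mu) / sigma > threshold


# Hypothetical daily cost readings for one environment.
daily_costs = [100.0, 102.0, 98.0, 101.0, 99.0]
print(detect_cost_anomaly(daily_costs, 140.0))
print(detect_cost_anomaly(daily_costs, 101.0))
```

A production worker would likely account for seasonality and planned scale-ups; the point here is only the shape of the loop: read billing signals on a timer, score the latest value, and propose remediation when the score crosses a threshold.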

Future APIs

  • POST /runtime/traffic-policies — declare progressive-delivery and routing policies.
  • POST /runtime/restore-points and POST /runtime/restores — backup and restore orchestration.
  • GET /runtime/cost/{environmentId} — per-environment cost and optimization recommendations.
  • GET /runtime/compliance/{environmentId} — runtime compliance posture.
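
As a rough idea of how a client might call one of these endpoints, the sketch below builds (but does not send) a request for POST /runtime/traffic-policies. Only the path comes from the list above. The host, the JSON field names (`environmentId`, `strategy`, `canaryPercent`), and the `X-Trace-Id` header are assumptions made for illustration, since the request schema is not defined in this roadmap.

```python
import json
from urllib.request import Request

# Placeholder host; the real base URL is not specified in this document.
BASE_URL = "https://runtime.example.invalid"


def build_traffic_policy_request(environment_id: str, canary_percent: int,
                                 trace_id: str) -> Request:
    """Build a POST request declaring a canary policy (hypothetical schema)."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    body = {
        "environmentId": environment_id,   # assumed field name
        "strategy": "canary",              # assumed field name and value
        "canaryPercent": canary_percent,   # assumed field name
    }
    return Request(
        url=f"{BASE_URL}/runtime/traffic-policies",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Trace-Id": trace_id,        # assumed correlation header
        },
    )


req = build_traffic_policy_request("env-42", 10, "trace-123")
print(req.get_method(), req.full_url)
```

Passing the traceId on the request keeps the resulting policy change correlated with whatever signal prompted it, in line with the trace-complete principle.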

Marketplace Opportunities

The Marketplace Platform can offer runtime building blocks as reusable, governed assets:

  • Runtime topology blueprints — opinionated Pulumi stacks (e.g. "regulated multi-region SaaS", "cost-lean single-region") publishable and reusable across tenants.
  • Scaling policy packs — curated ScalingPolicy presets per workload archetype.
  • Health & drift policy packs — reusable health gates and drift-remediation policies.
  • Runtime add-ons — pre-integrated runtime capabilities (CDN, WAF, cache tiers) installable into an environment.

Agent Opportunities

The platform is a natural home for autonomous Agent Mesh operators that reason over runtime signals and act through the existing APIs:

  • Runtime SRE Agent — diagnoses incidents from health/drift/scaling signals and proposes or executes governed remediations.
  • Capacity Planner Agent — tunes scaling policies and node pools from forecast and SLO attainment.
  • Cost Optimizer Agent — proposes right-sizing and SKU changes, gated by Governance.
  • Resilience Agent — schedules chaos experiments and validates rollback/DR readiness.
  • Drift Remediation Agent — classifies drift findings and selects the safest corrective deployment.

Each agent consumes the platform's events as grounding (via the Knowledge Platform), proposes actions as governed commands, and feeds outcomes back to Observability & Feedback — extending the factory's self-improvement loop all the way into production operations.
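
The propose, govern, and feed-back cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions: `ProposedCommand`, `Outcome`, `run_governed`, and the sample policy are hypothetical names, and the Governance check is reduced to a plain callback rather than any real platform API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedCommand:
    """Hypothetical shape of an action an agent proposes."""
    action: str
    environment_id: str
    tenant_id: str
    trace_id: str


@dataclass(frozen=True)
class Outcome:
    """Record handed back to observability after the decision."""
    command: ProposedCommand
    approved: bool
    executed: bool


def run_governed(command: ProposedCommand,
                 approve: Callable[[ProposedCommand], bool],
                 execute: Callable[[ProposedCommand], None]) -> Outcome:
    """Execute the command only if the governance callback approves it."""
    approved = approve(command)
    if approved:
        execute(command)
    return Outcome(command, approved, executed=approved)


def policy(cmd: ProposedCommand) -> bool:
    # Illustrative policy: never change SKUs in the "prod" environment.
    return not (cmd.action == "change-sku" and cmd.environment_id == "prod")


applied: list[str] = []
resize = ProposedCommand("change-sku", "staging", "tenant-a", "trace-123")
outcome = run_governed(resize, policy, lambda c: applied.append(c.action))
print(outcome.approved, applied)
```

Returning an `Outcome` for denied commands as well as approved ones means rejected proposals are still recorded, which gives the feedback loop data on what agents attempted and not only on what they changed.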