
Observability & Feedback Platform Overview

Target Architecture — Final-State Design

This page describes the final-state target architecture of the Observability & Feedback Platform. Capabilities already realised in the codebase (Serilog, OpenTelemetry, Application Insights, MassTransit on Azure Service Bus) are marked separately; everything else is the designed end state the factory converges to.

The Observability & Feedback Platform is the trust and improvement loop of the ConnectSoft AI Software Factory. It is not a monitoring dashboard bolted onto the side of the system. It observes everything the factory and its generated SaaS products do at runtime, turns raw telemetry into signals, incidents, cost insights, quality scores, and feedback items, and feeds that learning back into the Knowledge Platform so that each subsequent generation can be more reliable, cheaper, and higher quality.

Where the Agent Mesh supplies reasoning and the Control Plane supplies orchestration, this platform supplies evidence and trust. Every artifact the factory generates, every agent task it runs, and every SaaS product it deploys emits telemetry stamped with the traceId carried in the canonical event envelope. That common thread lets the platform stitch a single, continuous story: business intent → blueprint → agent task → context package → generated artifact → commit → deployment → runtime signal → feedback → improved generation.

Purpose

The platform exists to make the factory trustworthy and self-improving at multi-tenant scale:

  • Observe everything, with one trace. Traces, logs, metrics, SLOs, and cost are correlated by a single traceId so any runtime behaviour can be traced back to the exact generation decision that produced it.
  • Turn telemetry into signals. Raw spans, logs, and counters become aggregated metrics, alert triggers, incidents, cost anomalies, and SLO breaches that the factory can act on.
  • Close the improvement loop. Runtime signals, incidents, and human/agent feedback become durable FeedbackItem and QualityScore records, fed into the Knowledge Platform to improve future generation.
  • Govern cost and quality. Per-tenant, per-project cost telemetry and quality scoring give the factory the economic and quality feedback needed to optimise autonomously.
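
As an illustration of the "close the improvement loop" step, the sketch below distils a runtime signal into a feedback record while preserving the traceId lineage. It is a language-neutral sketch in Python, not the platform's .NET implementation. The `RuntimeSignal` shape, every field name on `FeedbackItem`, and the `to_feedback_item` helper are assumptions made for this example; the page names the FeedbackItem record but does not define its schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class RuntimeSignal:
    """A derived signal such as an SLO breach or cost anomaly (hypothetical shape)."""
    trace_id: str
    tenant_id: str
    project_id: str
    kind: str      # e.g. "slo_breach", "cost_anomaly", "incident"
    summary: str


@dataclass(frozen=True)
class FeedbackItem:
    """Durable feedback record; all field names are illustrative assumptions."""
    feedback_id: str
    trace_id: str
    tenant_id: str
    project_id: str
    source: str    # "runtime", "human", or "agent"
    category: str
    text: str
    created_at: datetime


def to_feedback_item(signal: RuntimeSignal) -> FeedbackItem:
    """Copy the correlation keys from the signal so the feedback stays traceable."""
    return FeedbackItem(
        feedback_id=str(uuid4()),
        trace_id=signal.trace_id,
        tenant_id=signal.tenant_id,
        project_id=signal.project_id,
        source="runtime",
        category=signal.kind,
        text=signal.summary,
        created_at=datetime.now(timezone.utc),
    )


signal = RuntimeSignal(
    trace_id="trace-123",
    tenant_id="tenant-a",
    project_id="project-1",
    kind="slo_breach",
    summary="Checkout API p95 latency exceeded its objective",
)
item = to_feedback_item(signal)
```

Because the record keeps the same traceId, tenantId, and projectId as the signal, a consumer such as the Knowledge Platform could, under these assumptions, look up the generation decision behind the feedback without a separate mapping table.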

Role in the AI Software Factory

flowchart LR
    Runtime["Runtime &amp; Cloud<br/>(generated SaaS + agents)"] -->|"OTEL traces / logs / metrics"| OBS["Observability &amp; Feedback"]
    OBS -->|"signals, incidents, cost"| OBS2["Signals &amp; Quality"]
    OBS2 -->|"FeedbackItemCreated"| KP["Knowledge Platform"]
    KP -->|"better context + patterns"| AM["Agent Mesh"]
    AM -->|"improved artifacts"| Runtime
    OBS -->|"alerts + incidents"| CP["Control Plane"]

This platform is the right-hand side of the factory loop. It reads runtime behaviour out of everything the factory ships, distils it into signals and feedback, and writes that learning back into the Knowledge Platform where the Agent Mesh can use it to generate better software. The loop is what turns a one-shot generator into a learning system.

Core Responsibilities

| Responsibility | Description |
| --- | --- |
| Distributed tracing | Ingest and correlate OpenTelemetry traces across factory services, agents, and generated SaaS products, anchored by traceId. |
| Log querying | Provide governed, tenant-scoped search over structured Serilog logs stored in Log Analytics. |
| Metric aggregation | Roll up raw counters and histograms into time-series metric series for dashboards, SLOs, and anomaly detection. |
| Dashboards | Define and serve reusable, multi-tenant dashboards (App Insights workbooks, optional Grafana). |
| Alerting | Evaluate alert rules against metrics and signals and raise triggers. |
| Incident management | Open, analyse, escalate, and resolve incidents with full trace lineage. |
| Feedback capture | Create durable feedback items from runtime signals, humans, and agents. |
| Quality scoring | Compute per-project and per-artifact quality scores from feedback, incidents, and SLO adherence. |
| Cost telemetry | Track and attribute model, compute, and infrastructure cost per tenant and project, detecting anomalies. |
| SLO management | Define service-level objectives, track error budgets, and detect breaches. |
| Telemetry correlation | Stitch traces, logs, metrics, and feedback into a single correlated view per traceId. |
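
The SLO management responsibility mentions error budgets and breach detection. The sketch below shows the usual request-based arithmetic, assuming the objective is expressed as a fraction of successful requests over a window. The `SloWindow` type and the `budget_consumed` and `is_breached` helpers are hypothetical names for this example and do not describe the platform's actual SLO model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    target: float         # objective as a fraction, e.g. 0.99 for 99 %
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the objective


def budget_consumed(window: SloWindow) -> float:
    """Fraction of the error budget used so far; values above 1.0 mean a breach."""
    if not 0.0 < window.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if window.total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed_error_rate = window.failed_requests / window.total_requests
    allowed_error_rate = 1.0 - window.target  # the error budget as a rate
    return observed_error_rate / allowed_error_rate


def is_breached(window: SloWindow) -> bool:
    """An SLO is breached once more than the whole budget has been consumed."""
    return budget_consumed(window) > 1.0


healthy = SloWindow(target=0.99, total_requests=10_000, failed_requests=40)
burning = SloWindow(target=0.99, total_requests=10_000, failed_requests=150)
```

With a 99 % target over 10,000 requests the budget is about 100 failures, so `healthy` has used roughly 40 % of its budget and `burning` has used roughly 150 % and counts as breached.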

Key Capabilities

  • Single-trace correlation — the TelemetryCorrelationService joins traces, logs, metrics, incidents, and feedback on traceId so one query answers "what happened across the whole lifecycle of this artifact".
  • Required telemetry dimensions — every signal carries traceId, executionId, tenantId, projectId, moduleId, agentId, skillId, artifactId, workflowId, environment, and version, making all telemetry sliceable by factory concept.
  • Runtime feedback loop — runtime signals and incidents are automatically distilled into FeedbackItem records that flow to the Knowledge Platform.
  • Cost and quality governance — per-tenant cost attribution and quality scoring give the factory economic and quality feedback to self-optimise.
  • Multi-tenant isolation: tenantId is an isolation boundary in every store, query, dashboard, and alert.
  • Self-observing — the platform instruments itself with the same OTEL/Serilog stack it provides to the rest of the factory (see Observability).
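
The capabilities above combine three rules: every signal carries the required dimensions, correlation joins on traceId, and tenantId bounds every query. The sketch below shows one way those rules could fit together over in-memory records. The dimension names come from the list above. The `signalType` key, the dictionary record shape, and the `missing_dimensions` and `correlate` functions are assumptions for illustration and are not the TelemetryCorrelationService API.

```python
from collections import defaultdict
from typing import Any

# Dimension names as listed under "Required telemetry dimensions".
REQUIRED_DIMENSIONS = (
    "traceId", "executionId", "tenantId", "projectId", "moduleId", "agentId",
    "skillId", "artifactId", "workflowId", "environment", "version",
)


def missing_dimensions(record: dict[str, Any]) -> list[str]:
    """Return the required dimensions that are absent or empty on a record."""
    return [name for name in REQUIRED_DIMENSIONS if not record.get(name)]


def correlate(
    records: list[dict[str, Any]], tenant_id: str, trace_id: str
) -> dict[str, list[dict[str, Any]]]:
    """Group one tenant's records for a single trace by their signal type."""
    view: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        # The tenant check comes first so another tenant's data never enters the view.
        if record.get("tenantId") != tenant_id:
            continue
        if record.get("traceId") == trace_id:
            # "signalType" is a hypothetical discriminator: trace, log, metric, ...
            view[record.get("signalType", "unknown")].append(record)
    return dict(view)


base = {name: "x" for name in REQUIRED_DIMENSIONS}
records = [
    {**base, "tenantId": "tenant-a", "traceId": "t-1", "signalType": "trace"},
    {**base, "tenantId": "tenant-a", "traceId": "t-1", "signalType": "log"},
    {**base, "tenantId": "tenant-b", "traceId": "t-1", "signalType": "log"},
    {**base, "tenantId": "tenant-a", "traceId": "t-2", "signalType": "metric"},
]
view = correlate(records, tenant_id="tenant-a", trace_id="t-1")
```

Here `view` holds one trace record and one log record for tenant-a. The tenant-b log shares the traceId but is excluded, which is the isolation property the list describes.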

High-Level Component Diagram

flowchart TB
    subgraph Contexts["Bounded Contexts"]
        Tracing["Tracing"]
        Logs["Logs"]
        Metrics["Metrics &amp; SLO"]
        Dash["Dashboards &amp; Alerts"]
        Inc["Incidents"]
        FQ["Feedback &amp; Quality"]
        Cost["Cost"]
    end

    subgraph Stores["Persistence"]
        AI[("Application Insights<br/>traces / metrics")]
        LA[("Log Analytics<br/>logs")]
        SQL[("Azure SQL / PostgreSQL<br/>incidents, feedback, quality, cost")]
        Blob[("Azure Blob<br/>exports / snapshots")]
    end

    Tracing --> AI
    Logs --> LA
    Metrics --> AI
    Dash --> AI
    Inc --> SQL
    FQ --> SQL
    Cost --> SQL
    FQ --> Blob
    Cost --> Blob

Integration with Other Platforms

flowchart LR
    RC["Runtime &amp; Cloud"] -->|telemetry| OBS["Observability &amp; Feedback"]
    AM["Agent Mesh"] -->|agent task telemetry| OBS
    OBS -->|feedback, quality, lineage| KP["Knowledge Platform"]
    KP -->|context-to-outcome attribution| OBS
    OBS -->|alerts, incidents| CP["Control Plane"]
    CP -->|lifecycle events| OBS
    OBS -->|cost + quality signals| GOV["Governance, Security &amp; Compliance"]

| Platform | Observability & Feedback receives | Observability & Feedback provides |
| --- | --- | --- |
| Runtime & Cloud | OTEL traces, logs, metrics from generated SaaS | Dashboards, alerts, incidents, SLOs |
| Agent Mesh | Agent task and skill telemetry | Quality scores, feedback on generation outcomes |
| Knowledge Platform | Context-to-outcome attribution requests | Feedback items, quality scores, runtime lineage |
| Control Plane | Workflow lifecycle events | Alert triggers, incidents, SLO status for gating |
| Governance, Security & Compliance | Cost and quality policies | Cost anomalies, quality evidence, audit trails |
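
The cost anomalies that flow to Governance depend on two steps: attributing cost per tenant and project, then comparing new cost against a baseline. The sketch below is one simple way to do both, assuming cost entries arrive as `(tenantId, projectId, amount)` tuples and using a z-score against recent history as the anomaly rule. The page does not specify the platform's detection method, so the rule, the threshold of 3.0, and both function names are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def attribute_cost(entries):
    """Sum cost per (tenantId, projectId); each entry is (tenant, project, amount)."""
    totals = defaultdict(float)
    for tenant_id, project_id, amount in entries:
        totals[(tenant_id, project_id)] += amount
    return dict(totals)


def is_cost_anomaly(history, today, threshold=3.0):
    """Flag today's cost when it sits more than `threshold` std devs above the mean."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return today > baseline  # flat history: any increase stands out
    return (today - baseline) / spread > threshold


entries = [
    ("tenant-a", "project-1", 1.5),
    ("tenant-a", "project-1", 2.5),
    ("tenant-b", "project-2", 4.0),
]
totals = attribute_cost(entries)

daily_history = [10.0, 11.0, 9.0, 10.0, 10.0]
spike = is_cost_anomaly(daily_history, today=20.0)
normal = is_cost_anomaly(daily_history, today=10.5)
```

Keying the totals by tenant and project keeps attribution aligned with the tenantId isolation boundary. A production detector would likely need seasonality handling, which this sketch leaves out.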

Implemented Foundations

Implemented

The observability substrate this platform builds on is already realised in the codebase via Serilog, OpenTelemetry, and Application Insights:

  • ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry, ConnectSoft.Extensions.Logging.Serilog, ConnectSoft.Extensions.Diagnostics.Metrics
  • OTEL traces/metrics/logs exported to Application Insights / Log Analytics, with optional Grafana/Prometheus dashboards
  • MassTransit on Azure Service Bus for feedback and signal events

See the existing implementation references: Observability-Driven Design, Platform Foundations — Observability, and Factory Runtime — Observability. The final-state platform builds on these foundations with the full microservice, worker, and feedback topology described across this section.

Final-State Summary

The Observability & Feedback Platform is the trust and improvement loop of the AI Software Factory. In its final state it comprises 11 microservices, 8 background workers, 11 aggregate roots, and 9 public APIs organised into seven bounded contexts, persisting across Application Insights, Log Analytics, Azure SQL/PostgreSQL, and Azure Blob. It transforms runtime behaviour into durable, traceable feedback: every trace, incident, cost anomaly, and SLO breach becomes a signal that improves the next generation, closing the loop between what the factory builds and how well it runs.