Observability & Feedback Platform Overview

Target Architecture — Final-State Design

This page describes the final-state target architecture of the Observability & Feedback Platform. Capabilities already realised in the codebase (Serilog, OpenTelemetry, Application Insights, MassTransit on Azure Service Bus) are marked separately; everything else is the designed end state the factory converges to.

The Observability & Feedback Platform is the trust and improvement loop of the ConnectSoft AI Software Factory. It is not a monitoring dashboard bolted onto the side of the system. It observes everything the factory and its generated SaaS products do at runtime and turns raw telemetry into signals, incidents, cost insights, quality scores, and feedback items. It then feeds that learning back into the Knowledge Platform so that every subsequent generation is more reliable, cheaper, and higher quality.

Where the Agent Mesh supplies reasoning and the Control Plane supplies orchestration, this platform supplies evidence and trust. Every artifact the factory generates, every agent task it runs, and every SaaS product it deploys emits telemetry stamped with the traceId from the canonical event envelope. That common thread lets the platform stitch together a single, continuous story: business intent → blueprint → agent task → context package → generated artifact → commit → deployment → runtime signal → feedback → improved generation.

Purpose

The platform exists to make the factory trustworthy and self-improving at multi-tenant scale:

Observe everything, with one trace. Traces, logs, metrics, SLOs, and cost are correlated by a single traceId so any runtime behaviour can be traced back to the exact generation decision that produced it.
Turn telemetry into signals. Raw spans, logs, and counters become aggregated metrics, alert triggers, incidents, cost anomalies, and SLO breaches that the factory can act on.
Close the improvement loop. Runtime signals, incidents, and human/agent feedback become durable FeedbackItem and QualityScore records, fed into the Knowledge Platform to improve future generation.
Govern cost and quality. Per-tenant, per-project cost telemetry and quality scoring give the factory the economic and quality feedback needed to optimise autonomously.
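
The "one trace" idea above can be illustrated with a minimal sketch that groups mixed telemetry records by their shared traceId. The `TelemetryRecord` type, its fields, and the sample payloads are hypothetical and exist only for illustration; the platform's real stores and schemas are not described on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryRecord:
    """Hypothetical, simplified telemetry record (not the platform's schema)."""
    trace_id: str  # the shared traceId
    kind: str      # e.g. "trace", "log", "metric", "incident", "feedback"
    payload: str   # simplified body


def correlate_by_trace(records):
    """Group records by traceId, then by kind, preserving input order."""
    view = defaultdict(lambda: defaultdict(list))
    for record in records:
        view[record.trace_id][record.kind].append(record.payload)
    # Convert nested defaultdicts into plain dicts for easier inspection.
    return {trace_id: dict(kinds) for trace_id, kinds in view.items()}


records = [
    TelemetryRecord("t-1", "trace", "span: generate-artifact"),
    TelemetryRecord("t-2", "log", "deployment started"),
    TelemetryRecord("t-1", "log", "artifact committed"),
    TelemetryRecord("t-1", "incident", "latency SLO breached"),
]
story = correlate_by_trace(records)
print(story["t-1"])
```

In this sketch, everything known about `t-1` (its span, log line, and incident) ends up under one key, which is the property that lets a runtime signal be followed back to the generation step that produced it.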

Role in the AI Software Factory

flowchart LR
    Runtime["Runtime &amp; Cloud<br/>(generated SaaS + agents)"] -->|"OTEL traces / logs / metrics"| OBS["Observability &amp; Feedback"]
    OBS -->|"signals, incidents, cost"| OBS2["Signals &amp; Quality"]
    OBS2 -->|"FeedbackItemCreated"| KP["Knowledge Platform"]
    KP -->|"better context + patterns"| AM["Agent Mesh"]
    AM -->|"improved artifacts"| Runtime
    OBS -->|"alerts + incidents"| CP["Control Plane"]

This platform is the right-hand side of the factory loop. It reads runtime behaviour out of everything the factory ships, distils it into signals and feedback, and writes that learning back into the Knowledge Platform where the Agent Mesh can use it to generate better software. The loop is what turns a one-shot generator into a learning system.
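
As a rough sketch of the "distil and write back" step described above, the following turns a runtime signal into a feedback item and wraps it in an event. `FeedbackItem` and `FeedbackItemCreated` are names used on this page; every field name, the `RuntimeSignal` type, and the event shape are assumptions made for illustration, not the platform's actual contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSignal:
    """Hypothetical runtime signal; field names are assumptions."""
    trace_id: str
    tenant_id: str
    project_id: str
    kind: str      # e.g. "incident", "slo_breach", "cost_anomaly"
    summary: str


@dataclass(frozen=True)
class FeedbackItem:
    """Hypothetical shape of a durable feedback record."""
    trace_id: str
    tenant_id: str
    project_id: str
    source: str
    summary: str


def to_feedback_item(signal):
    """Distil a runtime signal into a feedback item, keeping its lineage."""
    return FeedbackItem(
        trace_id=signal.trace_id,
        tenant_id=signal.tenant_id,
        project_id=signal.project_id,
        source=f"runtime:{signal.kind}",
        summary=signal.summary,
    )


def feedback_item_created_event(item):
    """Wrap a feedback item in an assumed FeedbackItemCreated event payload."""
    return {
        "type": "FeedbackItemCreated",
        "traceId": item.trace_id,
        "tenantId": item.tenant_id,
        "projectId": item.project_id,
        "source": item.source,
        "summary": item.summary,
    }


signal = RuntimeSignal("t-42", "tenant-a", "project-x", "slo_breach",
                       "p95 latency above objective for checkout API")
event = feedback_item_created_event(to_feedback_item(signal))
print(event)
```

The point of the sketch is that traceId, tenantId, and projectId travel unchanged from signal to event, so the Knowledge Platform could attribute the feedback to the generation that caused it.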

Core Responsibilities

Responsibility	Description
Distributed tracing	Ingest and correlate OpenTelemetry traces across factory services, agents, and generated SaaS products, anchored by `traceId`.
Log querying	Provide governed, tenant-scoped search over structured Serilog logs stored in Log Analytics.
Metric aggregation	Roll up raw counters and histograms into time series for dashboards, SLOs, and anomaly detection.
Dashboards	Define and serve reusable, multi-tenant dashboards (App Insights workbooks, optional Grafana).
Alerting	Evaluate alert rules against metrics and signals and raise triggers.
Incident management	Open, analyse, escalate, and resolve incidents with full trace lineage.
Feedback capture	Create durable feedback items from runtime signals, humans, and agents.
Quality scoring	Compute per-project and per-artifact quality scores from feedback, incidents, and SLO adherence.
Cost telemetry	Track and attribute model, compute, and infrastructure cost per tenant and project, detecting anomalies.
SLO management	Define service-level objectives, track error budgets, and detect breaches.
Telemetry correlation	Stitch traces, logs, metrics, and feedback into a single correlated view per `traceId`.
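
To make the SLO management row concrete, here is a minimal sketch of error-budget tracking using the common event-based definition: the budget is the number of bad events the objective tolerates in a window. The `SloWindow` type and both function names are hypothetical; the page does not specify how the platform computes budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Hypothetical SLO evaluation window (names are assumptions)."""
    target: float      # objective as a fraction, e.g. 0.99
    total_events: int  # all events observed in the window
    bad_events: int    # events that violated the objective


def error_budget_remaining(window):
    """Return the unspent share of the error budget (negative if overspent)."""
    allowed_bad = (1.0 - window.target) * window.total_events
    if allowed_bad == 0:
        # No budget at all: any bad event exhausts it.
        return 0.0 if window.bad_events else 1.0
    return 1.0 - window.bad_events / allowed_bad


def is_breached(window):
    """An SLO is treated as breached once the budget is overspent."""
    return error_budget_remaining(window) < 0.0


healthy = SloWindow(target=0.99, total_events=10_000, bad_events=50)
overspent = SloWindow(target=0.99, total_events=10_000, bad_events=120)
print(round(error_budget_remaining(healthy), 3), is_breached(healthy))
print(round(error_budget_remaining(overspent), 3), is_breached(overspent))
```

With a 99% objective over 10,000 events, roughly 100 bad events are tolerated, so 50 bad events leave about half the budget and 120 overspend it.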

Key Capabilities

Single-trace correlation — the TelemetryCorrelationService joins traces, logs, metrics, incidents, and feedback on traceId so one query answers "what happened across the whole lifecycle of this artifact".
Required telemetry dimensions — every signal carries traceId, executionId, tenantId, projectId, moduleId, agentId, skillId, artifactId, workflowId, environment, and version, making all telemetry sliceable by factory concept.
Runtime feedback loop — runtime signals and incidents are automatically distilled into FeedbackItem records that flow to the Knowledge Platform.
Cost and quality governance — per-tenant cost attribution and quality scoring give the factory economic and quality feedback to self-optimise.
Multi-tenant isolation — tenantId is an isolation boundary in every store, query, dashboard, and alert.
Self-observing — the platform instruments itself with the same OTEL/Serilog stack it provides to the rest of the factory (see Observability).
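
The required telemetry dimensions listed above lend themselves to a simple ingestion-time check. The sketch below uses the dimension names exactly as given on this page; the `missing_dimensions` helper and the dict-based signal shape are assumptions for illustration only.

```python
# Dimension names as listed on this page.
REQUIRED_DIMENSIONS = (
    "traceId", "executionId", "tenantId", "projectId", "moduleId", "agentId",
    "skillId", "artifactId", "workflowId", "environment", "version",
)


def missing_dimensions(signal):
    """Return required dimensions that are absent or blank, in canonical order."""
    missing = []
    for name in REQUIRED_DIMENSIONS:
        value = signal.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Hypothetical sample signal with every dimension populated.
complete = {name: f"{name}-001" for name in REQUIRED_DIMENSIONS}

# Same signal with tenantId dropped and version left blank.
incomplete = dict(complete)
del incomplete["tenantId"]
incomplete["version"] = "   "

print(missing_dimensions(complete))
print(missing_dimensions(incomplete))
```

Rejecting or quarantining signals with a non-empty result is one way to keep every record sliceable by factory concept and to avoid storing telemetry that lacks the tenantId isolation boundary.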

High-Level Component Diagram

flowchart TB
    subgraph Contexts["Bounded Contexts"]
        Tracing["Tracing"]
        Logs["Logs"]
        Metrics["Metrics &amp; SLO"]
        Dash["Dashboards &amp; Alerts"]
        Inc["Incidents"]
        FQ["Feedback &amp; Quality"]
        Cost["Cost"]
    end

    subgraph Stores["Persistence"]
        AI[("Application Insights<br/>traces / metrics")]
        LA[("Log Analytics<br/>logs")]
        SQL[("Azure SQL / PostgreSQL<br/>incidents, feedback, quality, cost")]
        Blob[("Azure Blob<br/>exports / snapshots")]
    end

    Tracing --> AI
    Logs --> LA
    Metrics --> AI
    Dash --> AI
    Inc --> SQL
    FQ --> SQL
    Cost --> SQL
    FQ --> Blob
    Cost --> Blob

Integration with Other Platforms

flowchart LR
    RC["Runtime &amp; Cloud"] -->|telemetry| OBS["Observability &amp; Feedback"]
    AM["Agent Mesh"] -->|agent task telemetry| OBS
    OBS -->|feedback, quality, lineage| KP["Knowledge Platform"]
    KP -->|context-to-outcome attribution| OBS
    OBS -->|alerts, incidents| CP["Control Plane"]
    CP -->|lifecycle events| OBS
    OBS -->|cost + quality signals| GOV["Governance, Security &amp; Compliance"]

Platform	Observability & Feedback receives	Observability & Feedback provides
Runtime & Cloud	OTEL traces, logs, metrics from generated SaaS	Dashboards, alerts, incidents, SLOs
Agent Mesh	Agent task and skill telemetry	Quality scores, feedback on generation outcomes
Knowledge Platform	Context-to-outcome attribution requests	Feedback items, quality scores, runtime lineage
Control Plane	Workflow lifecycle events	Alert triggers, incidents, SLO status for gating
Governance, Security & Compliance	Cost and quality policies	Cost anomalies, quality evidence, audit trails

Implemented Foundations

Implemented

The observability substrate this platform builds on is already realised in the codebase via Serilog, OpenTelemetry, and Application Insights:

ConnectSoft.Extensions.Observability, ConnectSoft.Extensions.Telemetry, ConnectSoft.Extensions.Logging.Serilog, ConnectSoft.Extensions.Diagnostics.Metrics
OTEL traces/metrics/logs exported to Application Insights / Log Analytics, with optional Grafana/Prometheus dashboards
MassTransit on Azure Service Bus for feedback and signal events

See the existing implementation references: Observability-Driven Design, Platform Foundations — Observability, and Factory Runtime — Observability. The final-state platform builds on these foundations with the full microservice, worker, and feedback topology described across this section.

Final-State Summary

The Observability & Feedback Platform is the trust and improvement loop of the AI Software Factory. In its final state it comprises 11 microservices, 8 background workers, 11 aggregate roots, and 9 public APIs organised into seven bounded contexts, persisting across Application Insights, Log Analytics, Azure SQL/PostgreSQL, and Azure Blob. It transforms runtime behaviour into durable, traceable feedback: every trace, incident, cost anomaly, and SLO breach becomes a signal that improves the next generation, closing the loop between what the factory builds and how well it runs.
