Observability & Feedback Platform Overview
Target Architecture — Final-State Design
This page describes the final-state target architecture of the Observability & Feedback Platform. Capabilities already realised in the codebase (Serilog, OpenTelemetry, Application Insights, MassTransit on Azure Service Bus) are marked separately; everything else is the designed end state the factory converges to.
The Observability & Feedback Platform is the trust and improvement loop of the ConnectSoft AI Software Factory. It is not a monitoring dashboard bolted onto the side of the system. It is the AI-native software factory platform that observes everything the factory and its generated SaaS products do at runtime, turns raw telemetry into meaningful signals, incidents, cost insights, quality scores, and feedback items, and feeds that learning back into the Knowledge Platform so that every subsequent generation is more reliable, cheaper, and higher quality.
Where the Agent Mesh supplies reasoning and the Control Plane supplies orchestration, this platform supplies evidence and trust. Every artifact the factory generates, every agent task it runs, and every SaaS product it deploys emits telemetry stamped with the canonical event envelope traceId. That common thread lets the platform stitch a single, continuous story: business intent → blueprint → agent task → context package → generated artifact → commit → deployment → runtime signal → feedback → improved generation.
Purpose
The platform exists to make the factory trustworthy and self-improving at multi-tenant scale:
- Observe everything, with one trace. Traces, logs, metrics, SLOs, and cost are correlated by a single `traceId` so any runtime behaviour can be traced back to the exact generation decision that produced it.
- Turn telemetry into signals. Raw spans, logs, and counters become aggregated metrics, alert triggers, incidents, cost anomalies, and SLO breaches that the factory can act on.
- Close the improvement loop. Runtime signals, incidents, and human/agent feedback become durable FeedbackItem and QualityScore records, fed into the Knowledge Platform to improve future generation.
- Govern cost and quality. Per-tenant, per-project cost telemetry and quality scoring give the factory the economic and quality feedback needed to optimise autonomously.
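As a rough illustration of the "close the improvement loop" step, the sketch below turns a runtime signal into a feedback record that keeps the originating `traceId`. It is written in Python for brevity and is not taken from the codebase. Only the names FeedbackItem, `traceId`, `tenantId`, and `projectId` come from this page; `RuntimeSignal`, `to_feedback_item`, the severity levels, and all other field names are assumptions made for the example.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RuntimeSignal:
    """A distilled runtime signal (hypothetical shape)."""
    trace_id: str
    tenant_id: str
    project_id: str
    kind: str       # assumed values: "slo_breach", "cost_anomaly", "incident"
    severity: str   # assumed values: "info", "warning", "critical"
    summary: str


@dataclass(frozen=True)
class FeedbackItem:
    """A durable feedback record (hypothetical shape)."""
    feedback_id: str
    trace_id: str
    tenant_id: str
    project_id: str
    source: str
    category: str
    summary: str
    created_at: datetime


# Assumption: only these severities are worth persisting as feedback.
ACTIONABLE_SEVERITIES = {"warning", "critical"}


def to_feedback_item(signal: RuntimeSignal) -> Optional[FeedbackItem]:
    """Map an actionable signal to a FeedbackItem; return None otherwise."""
    if signal.severity not in ACTIONABLE_SEVERITIES:
        return None
    return FeedbackItem(
        feedback_id=str(uuid.uuid4()),
        trace_id=signal.trace_id,      # preserves lineage back to generation
        tenant_id=signal.tenant_id,    # keeps the record tenant-scoped
        project_id=signal.project_id,
        source="runtime",
        category=signal.kind,
        summary=signal.summary,
        created_at=datetime.now(timezone.utc),
    )
```

Copying `trace_id` and `tenant_id` unchanged is the point of the sketch: a downstream consumer can then relate the feedback to the generation decision that produced the behaviour without leaving the tenant boundary.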
Role in the AI Software Factory
flowchart LR
Runtime["Runtime & Cloud<br/>(generated SaaS + agents)"] -->|"OTEL traces / logs / metrics"| OBS["Observability & Feedback"]
OBS -->|"signals, incidents, cost"| OBS2["Signals & Quality"]
OBS2 -->|"FeedbackItemCreated"| KP["Knowledge Platform"]
KP -->|"better context + patterns"| AM["Agent Mesh"]
AM -->|"improved artifacts"| Runtime
OBS -->|"alerts + incidents"| CP["Control Plane"]
This platform is the right-hand side of the factory loop. It reads runtime behaviour out of everything the factory ships, distils it into signals and feedback, and writes that learning back into the Knowledge Platform where the Agent Mesh can use it to generate better software. The loop is what turns a one-shot generator into a learning system.
Core Responsibilities
| Responsibility | Description |
|---|---|
| Distributed tracing | Ingest and correlate OpenTelemetry traces across factory services, agents, and generated SaaS products, anchored by traceId. |
| Log querying | Provide governed, tenant-scoped search over structured Serilog logs stored in Log Analytics. |
| Metric aggregation | Roll up raw counters and histograms into time-series metrics for dashboards, SLOs, and anomaly detection. |
| Dashboards | Define and serve reusable, multi-tenant dashboards (App Insights workbooks, optional Grafana). |
| Alerting | Evaluate alert rules against metrics and signals and raise triggers. |
| Incident management | Open, analyse, escalate, and resolve incidents with full trace lineage. |
| Feedback capture | Create durable feedback items from runtime signals, humans, and agents. |
| Quality scoring | Compute per-project and per-artifact quality scores from feedback, incidents, and SLO adherence. |
| Cost telemetry | Track and attribute model, compute, and infrastructure cost per tenant and project, detecting anomalies. |
| SLO management | Define service-level objectives, track error budgets, and detect breaches. |
| Telemetry correlation | Stitch traces, logs, metrics, and feedback into a single correlated view per traceId. |
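The SLO management row mentions error budgets and breach detection. A minimal sketch of that arithmetic follows, assuming a request-based availability SLO over a fixed window. The `error_budget` function and the `SloStatus` type are hypothetical names introduced for the example, not the platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    allowed_failures: float   # failures the SLO tolerates in the window
    budget_consumed: float    # fraction of the budget used (1.0 = fully spent)
    budget_remaining: float   # fraction left, floored at 0.0
    breached: bool            # True once failures exceed the allowance


def error_budget(slo_target: float, total: int, failed: int) -> SloStatus:
    """Compute error-budget status for a request-based SLO.

    slo_target is the required success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total < 0 or failed < 0 or failed > total:
        raise ValueError("need 0 <= failed <= total")

    allowed = (1.0 - slo_target) * total
    if allowed > 0:
        consumed = failed / allowed
    else:
        consumed = 0.0  # empty window: nothing allowed and nothing failed
    return SloStatus(
        allowed_failures=allowed,
        budget_consumed=consumed,
        budget_remaining=max(0.0, 1.0 - consumed),
        breached=failed > allowed,
    )
```

For example, with a 0.999 target and 1,000,000 requests the budget is roughly 1,000 failed requests, so 400 failures consume about 40% of it and 1,500 failures count as a breach. A production evaluator would likely use rolling windows and burn-rate alerts rather than a single fixed window.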
Key Capabilities
- Single-trace correlation: the TelemetryCorrelationService joins traces, logs, metrics, incidents, and feedback on `traceId` so one query answers "what happened across the whole lifecycle of this artifact".
- Required telemetry dimensions: every signal carries `traceId`, `executionId`, `tenantId`, `projectId`, `moduleId`, `agentId`, `skillId`, `artifactId`, `workflowId`, `environment`, and `version`, making all telemetry sliceable by factory concept.
- Runtime feedback loop: runtime signals and incidents are automatically distilled into FeedbackItem records that flow to the Knowledge Platform.
- Cost and quality governance: per-tenant cost attribution and quality scoring give the factory economic and quality feedback to self-optimise.
- Multi-tenant isolation: `tenantId` is an isolation boundary in every store, query, dashboard, and alert.
- Self-observing: the platform instruments itself with the same OTEL/Serilog stack it provides to the rest of the factory (see Observability).
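The capabilities above combine three ideas: a fixed set of required dimensions, a join on `traceId`, and `tenantId` as an isolation boundary. The sketch below shows one way these could fit together over in-memory records. The dimension names are the ones listed on this page; the `kind` field, the function names, and the choice to silently skip incomplete records are assumptions for illustration, not the behaviour of the TelemetryCorrelationService.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Dimension names as listed on this page.
REQUIRED_DIMENSIONS = (
    "traceId", "executionId", "tenantId", "projectId", "moduleId",
    "agentId", "skillId", "artifactId", "workflowId", "environment", "version",
)

Record = Mapping[str, Any]


def missing_dimensions(record: Record) -> List[str]:
    """Return the required dimensions that are absent or empty."""
    return [d for d in REQUIRED_DIMENSIONS if not record.get(d)]


def correlate(records: Iterable[Record],
              tenant_id: str) -> Dict[str, Dict[str, List[Record]]]:
    """Group one tenant's records by traceId, then by record kind.

    'kind' is a hypothetical discriminator such as "trace", "log",
    "metric", "incident", or "feedback".
    """
    view: Dict[str, Dict[str, List[Record]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for record in records:
        if record.get("tenantId") != tenant_id:
            continue  # isolation: never surface another tenant's data
        if missing_dimensions(record):
            continue  # assumption: incomplete records are handled elsewhere
        view[record["traceId"]][record.get("kind", "unknown")].append(record)
    # Convert nested defaultdicts to plain dicts for callers.
    return {trace: dict(kinds) for trace, kinds in view.items()}
```

Calling `correlate(records, "acme")` yields a mapping from each `traceId` to its traces, logs, metrics, incidents, and feedback for that tenant only. Applying the tenant filter before the join, rather than after, keeps cross-tenant records out of the intermediate view entirely.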
High-Level Component Diagram
flowchart TB
subgraph Contexts["Bounded Contexts"]
Tracing["Tracing"]
Logs["Logs"]
Metrics["Metrics & SLO"]
Dash["Dashboards & Alerts"]
Inc["Incidents"]
FQ["Feedback & Quality"]
Cost["Cost"]
end
subgraph Stores["Persistence"]
AI[("Application Insights<br/>traces / metrics")]
LA[("Log Analytics<br/>logs")]
SQL[("Azure SQL / PostgreSQL<br/>incidents, feedback, quality, cost")]
Blob[("Azure Blob<br/>exports / snapshots")]
end
Tracing --> AI
Logs --> LA
Metrics --> AI
Dash --> AI
Inc --> SQL
FQ --> SQL
Cost --> SQL
FQ --> Blob
Cost --> Blob
Integration with Other Platforms
flowchart LR
RC["Runtime & Cloud"] -->|telemetry| OBS["Observability & Feedback"]
AM["Agent Mesh"] -->|agent task telemetry| OBS
OBS -->|feedback, quality, lineage| KP["Knowledge Platform"]
KP -->|context-to-outcome attribution| OBS
OBS -->|alerts, incidents| CP["Control Plane"]
CP -->|lifecycle events| OBS
OBS -->|cost + quality signals| GOV["Governance, Security & Compliance"]
| Platform | Observability & Feedback receives | Observability & Feedback provides |
|---|---|---|
| Runtime & Cloud | OTEL traces, logs, metrics from generated SaaS | Dashboards, alerts, incidents, SLOs |
| Agent Mesh | Agent task and skill telemetry | Quality scores, feedback on generation outcomes |
| Knowledge Platform | Context-to-outcome attribution requests | Feedback items, quality scores, runtime lineage |
| Control Plane | Workflow lifecycle events | Alert triggers, incidents, SLO status for gating |
| Governance, Security & Compliance | Cost and quality policies | Cost anomalies, quality evidence, audit trails |
Implemented Foundations
Implemented
The observability substrate this platform builds on is already realised in the codebase via Serilog, OpenTelemetry, and Application Insights:
- `ConnectSoft.Extensions.Observability`, `ConnectSoft.Extensions.Telemetry`, `ConnectSoft.Extensions.Logging.Serilog`, `ConnectSoft.Extensions.Diagnostics.Metrics`
- OTEL traces/metrics/logs exported to Application Insights / Log Analytics, with optional Grafana/Prometheus dashboards
- MassTransit on Azure Service Bus for feedback and signal events
See the existing implementation references: Observability-Driven Design, Platform Foundations — Observability, and Factory Runtime — Observability. The final-state platform builds on these foundations with the full microservice, worker, and feedback topology described across this section.
Final-State Summary
The Observability & Feedback Platform is the trust and improvement loop of the AI Software Factory. In its final state it comprises 11 microservices, 8 background workers, 11 aggregate roots, and 9 public APIs organised into 7 bounded contexts, persisting across Application Insights, Log Analytics, Azure SQL/PostgreSQL, and Azure Blob. It transforms runtime behaviour into durable, traceable feedback: every trace, incident, cost anomaly, and SLO breach becomes a signal that improves the next generation, closing the loop between what the factory builds and how well it runs.