Workflows
Target Architecture — Final-State Design
This page describes the final-state workflows of the Observability & Feedback Platform. Each workflow is driven by the canonical event envelope and correlated by traceId.
The platform's workflows realise the trust and improvement loop: observe runtime behaviour, detect and react to problems, and feed the learning back into the factory. Three workflows are central — the runtime feedback loop, the incident lifecycle, and the feedback creation flow — each with explicit failure handling and escalation.
Runtime Feedback Loop
The end-to-end loop from runtime telemetry to improved generation. Runtime and agents emit OTLP telemetry; the platform distils signals, incidents, and feedback; the Knowledge Platform absorbs the learning; and the Agent Mesh generates better artifacts next time.
sequenceDiagram
participant RT as Runtime & Cloud
participant TS as TraceService
participant MA as MetricAggregationService
participant SLO as SloService
participant INC as IncidentService
participant FB as FeedbackService
participant KP as Knowledge Platform
participant AM as Agent Mesh
RT->>TS: OTLP traces / logs / metrics (traceId)
TS->>MA: TraceRecorded
MA->>MA: AggregateMetric -> MetricAggregated
MA->>SLO: metric series
SLO->>SLO: EvaluateSlo
SLO-->>INC: SloBreached
INC->>INC: OpenIncident -> IncidentOpened
INC->>INC: investigate / mitigate
INC-->>FB: IncidentResolved (root cause)
FB->>FB: CreateFeedbackItem -> FeedbackItemCreated
FB-->>KP: FeedbackItemCreated (attributed to artifactId)
KP-->>AM: improved context + patterns
AM-->>RT: regenerated / improved artifacts
The loop is closed by attribution: each FeedbackItem carries the artifactId, agentId, skillId, and traceId that produced the observed behaviour, so the Knowledge Platform can attribute outcomes back to the exact context and prompt that generated the artifact.
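As a minimal sketch of that attribution, the snippet below models a feedback item and indexes a batch by artifact. Only the four attribution fields (artifactId, agentId, skillId, traceId) come from the text above; the `feedback_id` and `category` fields, the `group_by_artifact` helper, and all sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackItem:
    """Assumed shape; only the four attribution fields are from the document."""
    feedback_id: str  # hypothetical identifier
    artifact_id: str
    agent_id: str
    skill_id: str
    trace_id: str
    category: str     # hypothetical placement of the category on the item


def group_by_artifact(items):
    """Index feedback by the artifact that produced the observed behaviour."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.artifact_id].append(item)
    return dict(grouped)


items = [
    FeedbackItem("fb-1", "art-42", "agent-a", "skill-x", "trace-1", "reliability"),
    FeedbackItem("fb-2", "art-42", "agent-a", "skill-y", "trace-2", "cost"),
    FeedbackItem("fb-3", "art-7", "agent-b", "skill-x", "trace-3", "performance"),
]
by_artifact = group_by_artifact(items)
```

Grouping on `artifact_id` is one possible consumer-side view; the same items could equally be indexed by `agent_id` or `skill_id` to compare outcomes across agents or skills.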
Incident Lifecycle
The Incident aggregate moves through a constrained state machine. Transitions are appended to the incident event log and emit domain events; only valid transitions are permitted (enforced as an aggregate invariant).
stateDiagram-v2
[*] --> Open : OpenIncident
Open --> Acknowledged : acknowledge
Acknowledged --> Investigating : begin analysis
Investigating --> Mitigated : apply mitigation
Investigating --> Escalated : escalation timer / severity
Escalated --> Investigating : on-call engaged
Mitigated --> Resolved : confirm + root cause
Investigating --> Resolved : resolve (root cause)
Resolved --> [*]
Open --> Escalated : critical severity
- `Open → Resolved` requires a recorded `rootCause` (aggregate invariant).
- `Escalated` is reachable from `Open` (for critical severity) or `Investigating` (via the escalation timer); see Escalation.
- A `Resolved` incident is immutable except for appended post-mortem notes, and emits `IncidentResolved`, which triggers feedback creation.
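The state machine and its root-cause invariant could be enforced roughly as follows. This is a sketch under assumptions: the transition table is transcribed from the diagram above, while the `Incident` class, the `InvalidTransition` exception, and the tuple-based event log are hypothetical simplifications (a real aggregate would append named domain events).

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed transitions, transcribed from the state diagram.
TRANSITIONS = {
    "Open": {"Acknowledged", "Escalated"},
    "Acknowledged": {"Investigating"},
    "Investigating": {"Mitigated", "Escalated", "Resolved"},
    "Escalated": {"Investigating"},
    "Mitigated": {"Resolved"},
    "Resolved": set(),  # terminal state
}


class InvalidTransition(Exception):
    """Raised when a transition would violate the aggregate invariants."""


@dataclass
class Incident:
    incident_id: str
    state: str = "Open"
    root_cause: Optional[str] = None
    events: list = field(default_factory=list)  # append-only log of (from, to)

    def transition(self, target, root_cause=None):
        if target not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state} -> {target} is not permitted")
        if target == "Resolved":
            if not root_cause:
                raise InvalidTransition("Resolved requires a recorded root cause")
            self.root_cause = root_cause
        self.events.append((self.state, target))
        self.state = target


incident = Incident("inc-1")
incident.transition("Acknowledged")
incident.transition("Investigating")
incident.transition("Mitigated")
incident.transition("Resolved", root_cause="connection pool exhaustion")
```

Keeping the table as data makes the permitted transitions easy to review against the diagram, and the empty set for `Resolved` rejects any further state change after resolution.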
Feedback Creation Flow
How a resolved incident, runtime signal, or cost anomaly becomes a durable feedback item routed to the Knowledge Platform.
flowchart TB
Resolved["IncidentResolved"] --> FbW["FeedbackCreationWorker"]
Signal["Runtime signal (TraceRecorded)"] --> FbW
Cost["CostAnomalyDetected"] --> FbW
FbW -->|"idempotency key from sourceId"| Dedup{Already created?}
Dedup -->|yes| Skip["Skip (idempotent)"]
Dedup -->|no| Create["CreateFeedbackItem"]
Create --> Attribute["Attribute artifactId / agentId / skillId / traceId"]
Attribute --> Emit["FeedbackItemCreated"]
Emit --> KP["Knowledge Platform"]
Emit --> QualW["QualityScoreWorker (recompute)"]
Feedback items are categorised (reliability, performance, cost, maintainability, correctness) and carry a sentiment, so the Knowledge Platform and QA Center can route and weight them appropriately.
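The deduplicate, create, attribute, and emit steps in the flowchart can be sketched in memory as below. The category names come from the text; the `FeedbackCreationSketch` class, its dictionary-backed store, the `"negative"` sentiment value, and the sample identifiers are assumptions for illustration, not the platform's actual worker.

```python
CATEGORIES = {"reliability", "performance", "cost", "maintainability", "correctness"}


class FeedbackCreationSketch:
    """In-memory stand-in for FeedbackCreationWorker (storage and bus assumed away)."""

    def __init__(self):
        self.created = {}  # idempotency key (sourceId) -> feedback item
        self.emitted = []  # events that would be published

    def handle(self, source_id, category, sentiment, attribution):
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if source_id in self.created:
            return None  # already created: skip (idempotent)
        item = {
            "sourceId": source_id,
            "category": category,
            "sentiment": sentiment,
            **attribution,  # artifactId / agentId / skillId / traceId
        }
        self.created[source_id] = item
        self.emitted.append(("FeedbackItemCreated", item))
        return item


worker = FeedbackCreationSketch()
attribution = {
    "artifactId": "art-42",
    "agentId": "agent-a",
    "skillId": "skill-x",
    "traceId": "trace-1",
}
first = worker.handle("inc-1", "reliability", "negative", attribution)
redelivered = worker.handle("inc-1", "reliability", "negative", attribution)
```

The second call models a re-delivered source event: it returns `None` and emits nothing, so downstream consumers see exactly one `FeedbackItemCreated` for the source.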
Failure Handling
- Ingestion failures. `TraceIngestionWorker` and `MetricAggregationWorker` retry with exponential backoff; unprocessable telemetry dead-letters with the full envelope preserved for replay. Scheduled aggregation simply re-runs the window (deterministic, idempotent).
- Detection gaps. If `MetricAggregationService` falls behind, alert and SLO evaluation operate on the latest available window and mark results as stale rather than producing false negatives.
- Incident creation idempotency. `IncidentAnalysisWorker` deduplicates on `(source, sourceId)` so a flapping alert or repeated SLO breach does not open duplicate incidents; instead it appends events to the open incident.
- Feedback idempotency. `FeedbackCreationWorker` keys on `sourceId`, so re-delivered `IncidentResolved` / `CostAnomalyDetected` events never create duplicate feedback.
- Downstream outage. If the Knowledge Platform is unavailable, `FeedbackItemCreated` events remain on their Service Bus topic with subscription backlog; they are delivered on recovery (at-least-once with consumer idempotency).
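The ingestion-failure behaviour (retry with exponential backoff, then dead-letter the full envelope) might look like the sketch below. The `process_with_retry` function, its parameters, the attempt limit, and the delay values are all hypothetical; the sketch also assumes that telemetry counts as unprocessable once retries are exhausted, which the text does not state.

```python
import time


def process_with_retry(envelope, handler, dead_letters,
                       max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call handler(envelope), backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return handler(envelope)
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Retries exhausted: keep the full envelope so it can be replayed.
                dead_letters.append({"envelope": envelope, "error": repr(exc)})
                return None
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


# Demo: a handler that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}


def flaky(envelope):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


def always_fails(envelope):
    raise ValueError("unprocessable")


delays, dead = [], []
result = process_with_retry({"traceId": "t-1"}, flaky, dead, sleep=delays.append)

poison_delays, poison_dead = [], []
poison = process_with_retry({"traceId": "t-2"}, always_fails, poison_dead,
                            sleep=poison_delays.append)
```

Injecting `sleep` keeps the sketch testable without real waiting. A production worker would typically also add jitter and distinguish retryable errors from permanent ones.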
Escalation
- Severity-based. `critical` incidents skip straight to `Escalated` on open and page the on-call runtime operator; `high` / `warning` follow the acknowledge → investigate path.
- Timer-based. An incident that stays `Open` / `Investigating` beyond its severity SLA escalates automatically (the `IncidentAnalysisWorker` schedules escalation steps).
- Routing. Escalation actions map to roles (`runtime.operator`, `cost.analyst`, `quality.reviewer`) and notification channels configured per tenant. Escalation steps are recorded on the incident for audit.
- Governance hand-off. Cost anomalies above a tenant threshold and repeated SLO breaches are surfaced to the Governance, Security & Compliance platform for policy review.
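The severity and timer rules, together with per-tenant routing, can be sketched as follows. The role names and severities come from the list above; the SLA durations, the channel names, the `tenant-a` configuration, and both helper functions are hypothetical placeholders, since the document does not specify them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA per severity; the document gives no concrete durations.
SEVERITY_SLA = {
    "critical": timedelta(minutes=5),
    "high": timedelta(minutes=15),
    "warning": timedelta(hours=1),
}

# Hypothetical per-tenant routing: role -> notification channels.
TENANT_ROUTING = {
    "tenant-a": {
        "runtime.operator": ["pager"],
        "cost.analyst": ["email"],
        "quality.reviewer": ["chat"],
    },
}


def should_escalate(severity, state, state_since, now):
    """Apply the severity-based rule first, then the timer-based rule."""
    if state not in {"Open", "Investigating"}:
        return False
    if severity == "critical" and state == "Open":
        return True  # critical incidents escalate immediately on open
    return now - state_since > SEVERITY_SLA[severity]


def record_escalation(incident, tenant_id, role, now):
    """Resolve channels for the role and record the step on the incident for audit."""
    step = {
        "role": role,
        "channels": TENANT_ROUTING[tenant_id][role],
        "at": now.isoformat(),
    }
    incident.setdefault("escalationSteps", []).append(step)
    return step


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = {"incidentId": "inc-1", "severity": "high", "state": "Investigating"}
if should_escalate("high", "Investigating", now - timedelta(minutes=20), now):
    step = record_escalation(incident, "tenant-a", "runtime.operator", now)
```

Here the high-severity incident has spent 20 minutes in `Investigating` against an assumed 15-minute SLA, so one escalation step is recorded for `runtime.operator`.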