Workflows

Target Architecture — Final-State Design

This page describes the final-state workflows of the Observability & Feedback Platform. Each workflow is driven by the canonical event envelope and correlated by traceId.

The platform's workflows realise the trust and improvement loop: observe runtime behaviour, detect and react to problems, and feed the learning back into the factory. Three workflows are central — the runtime feedback loop, the incident lifecycle, and the feedback creation flow — each with explicit failure handling and escalation.
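As a rough illustration of correlation by traceId, the sketch below groups events that share a trace. The `EventEnvelope` class and its `event_type` and `payload` fields are assumptions made for this example; the canonical envelope is defined elsewhere and may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope; only traceId is taken from this page."""
    event_type: str
    trace_id: str
    payload: dict = field(default_factory=dict)


def correlate(events):
    """Group event types by traceId, preserving arrival order."""
    by_trace = defaultdict(list)
    for event in events:
        by_trace[event.trace_id].append(event.event_type)
    return dict(by_trace)


events = [
    EventEnvelope("TraceRecorded", "t-1"),
    EventEnvelope("SloBreached", "t-1"),
    EventEnvelope("TraceRecorded", "t-2"),
]
timeline = correlate(events)
```

Here `timeline["t-1"]` holds the ordered event types observed for that trace, which is the view each workflow relies on when it follows one request across services.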

Runtime Feedback Loop

The end-to-end loop from runtime telemetry to improved generation. Runtime and agents emit OTLP telemetry; the platform distils signals, incidents, and feedback; the Knowledge Platform absorbs the learning; and the Agent Mesh generates better artifacts next time.

sequenceDiagram
    participant RT as Runtime & Cloud
    participant TS as TraceService
    participant MA as MetricAggregationService
    participant SLO as SloService
    participant INC as IncidentService
    participant FB as FeedbackService
    participant KP as Knowledge Platform
    participant AM as Agent Mesh

    RT->>TS: OTLP traces / logs / metrics (traceId)
    TS->>MA: TraceRecorded
    MA->>MA: AggregateMetric -> MetricAggregated
    MA->>SLO: metric series
    SLO->>SLO: EvaluateSlo
    SLO-->>INC: SloBreached
    INC->>INC: OpenIncident -> IncidentOpened
    INC->>INC: investigate / mitigate
    INC-->>FB: IncidentResolved (root cause)
    FB->>FB: CreateFeedbackItem -> FeedbackItemCreated
    FB-->>KP: FeedbackItemCreated (attributed to artifactId)
    KP-->>AM: improved context + patterns
    AM-->>RT: regenerated / improved artifacts

The loop is closed by attribution: each FeedbackItem carries the artifactId, agentId, skillId, and traceId that produced the observed behaviour, so the Knowledge Platform can attribute outcomes back to the exact context and prompt that generated the artifact.
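A minimal sketch of attribution, assuming a flat record shape: the four attribution fields come from the text above, while `feedback_id`, `summary`, and the `group_by_artifact` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackItem:
    """Hypothetical shape; only the four attribution fields come from the text."""
    feedback_id: str
    artifact_id: str
    agent_id: str
    skill_id: str
    trace_id: str
    summary: str


def group_by_artifact(items):
    """Index feedback by the artifact that produced the observed behaviour."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.artifact_id].append(item)
    return dict(grouped)


items = [
    FeedbackItem("fb-1", "art-7", "agent-a", "skill-x", "t-1", "timeout under load"),
    FeedbackItem("fb-2", "art-7", "agent-a", "skill-x", "t-2", "retry storm"),
    FeedbackItem("fb-3", "art-9", "agent-b", "skill-y", "t-3", "cost spike"),
]
by_artifact = group_by_artifact(items)
```

Grouping on `artifact_id` is one plausible way for a consumer to see every observed outcome for a given generated artifact; the same index could be keyed on `agent_id` or `skill_id`.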

Incident Lifecycle

The Incident aggregate moves through a constrained state machine. Transitions are appended to the incident event log and emit domain events; only valid transitions are permitted (enforced as an aggregate invariant).

stateDiagram-v2
    [*] --> Open : OpenIncident
    Open --> Acknowledged : acknowledge
    Acknowledged --> Investigating : begin analysis
    Investigating --> Mitigated : apply mitigation
    Investigating --> Escalated : escalation timer / severity
    Escalated --> Investigating : on-call engaged
    Mitigated --> Resolved : confirm + root cause
    Investigating --> Resolved : resolve (root cause)
    Resolved --> [*]
    Open --> Escalated : critical severity

  • Reaching Resolved, from either Mitigated or Investigating, requires a recorded rootCause (aggregate invariant).
  • Escalated is reachable from Open (for critical severity) or Investigating (via the escalation timer); see Escalation.
  • A Resolved incident is immutable except for appended post-mortem notes, and emits IncidentResolved, which triggers feedback creation.
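The state machine and its invariants can be sketched as a transition table plus a guarded aggregate. This is an illustrative sketch, not the platform's implementation: the `Incident` class, `InvalidTransition`, and the in-memory `events` list are assumptions, and the transition table is copied from the state diagram above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "Open"
    ACKNOWLEDGED = "Acknowledged"
    INVESTIGATING = "Investigating"
    MITIGATED = "Mitigated"
    ESCALATED = "Escalated"
    RESOLVED = "Resolved"


# Allowed transitions, taken from the state diagram.
TRANSITIONS = {
    State.OPEN: {State.ACKNOWLEDGED, State.ESCALATED},
    State.ACKNOWLEDGED: {State.INVESTIGATING},
    State.INVESTIGATING: {State.MITIGATED, State.ESCALATED, State.RESOLVED},
    State.ESCALATED: {State.INVESTIGATING},
    State.MITIGATED: {State.RESOLVED},
    State.RESOLVED: set(),
}


class InvalidTransition(Exception):
    """Raised when a transition would violate an aggregate invariant."""


@dataclass
class Incident:
    incident_id: str
    state: State = State.OPEN
    root_cause: Optional[str] = None
    events: list = field(default_factory=list)  # stand-in for the incident event log
    notes: list = field(default_factory=list)

    def transition(self, target, root_cause=None):
        # Validate everything before mutating any state.
        if target not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {target.value}")
        if target is State.RESOLVED and not root_cause:
            raise InvalidTransition("Resolved requires a recorded rootCause")
        self.events.append(f"{self.state.value}->{target.value}")
        self.state = target
        if target is State.RESOLVED:
            self.root_cause = root_cause
            self.events.append("IncidentResolved")

    def add_post_mortem_note(self, note):
        # The only mutation allowed once the incident is Resolved.
        if self.state is not State.RESOLVED:
            raise InvalidTransition("post-mortem notes require a Resolved incident")
        self.notes.append(note)
```

Because `TRANSITIONS[State.RESOLVED]` is empty, every further transition on a resolved incident is rejected, which models immutability without a separate flag.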

Feedback Creation Flow

How a resolved incident, runtime signal, or cost anomaly becomes a durable feedback item routed to the Knowledge Platform.

flowchart TB
    Resolved["IncidentResolved"] --> FbW["FeedbackCreationWorker"]
    Signal["Runtime signal (TraceRecorded)"] --> FbW
    Cost["CostAnomalyDetected"] --> FbW
    FbW -->|"idempotency key from sourceId"| Dedup{Already created?}
    Dedup -->|yes| Skip["Skip (idempotent)"]
    Dedup -->|no| Create["CreateFeedbackItem"]
    Create --> Attribute["Attribute artifactId / agentId / skillId / traceId"]
    Attribute --> Emit["FeedbackItemCreated"]
    Emit --> KP["Knowledge Platform"]
    Emit --> QualW["QualityScoreWorker (recompute)"]

Feedback items are categorised (reliability, performance, cost, maintainability, correctness) and carry a sentiment, so the Knowledge Platform and QA Center can route and weight them appropriately.
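The dedup-then-create path in the flowchart might be sketched as follows. The `SourceEvent` shape, the in-memory dictionary standing in for a durable idempotency store, and the `emitted` list standing in for the event bus are all assumptions for illustration; the category names are the ones listed above.

```python
from dataclasses import dataclass

CATEGORIES = {"reliability", "performance", "cost", "maintainability", "correctness"}


@dataclass(frozen=True)
class SourceEvent:
    """Hypothetical input: an IncidentResolved, TraceRecorded or CostAnomalyDetected event."""
    source_id: str
    event_type: str
    artifact_id: str
    agent_id: str
    skill_id: str
    trace_id: str
    category: str
    sentiment: str


class FeedbackCreationWorker:
    def __init__(self):
        self._created = {}   # source_id -> feedback item (stand-in for a durable store)
        self.emitted = []    # stand-in for publishing FeedbackItemCreated

    def handle(self, event):
        if event.category not in CATEGORIES:
            raise ValueError(f"unknown category: {event.category}")
        # Idempotency key derived from sourceId: re-delivery is a no-op.
        if event.source_id in self._created:
            return None
        item = {
            "sourceId": event.source_id,
            "category": event.category,
            "sentiment": event.sentiment,
            "artifactId": event.artifact_id,
            "agentId": event.agent_id,
            "skillId": event.skill_id,
            "traceId": event.trace_id,
        }
        self._created[event.source_id] = item
        self.emitted.append(("FeedbackItemCreated", item))
        return item


worker = FeedbackCreationWorker()
resolved = SourceEvent("inc-1", "IncidentResolved", "art-7", "agent-a",
                       "skill-x", "t-1", "reliability", "negative")
first = worker.handle(resolved)
second = worker.handle(resolved)  # re-delivered event is skipped
```

In a real deployment the existence check and the write would need to be atomic (for example a unique constraint on the key), since two workers could otherwise both pass the check.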

Failure Handling

  • Ingestion failures. TraceIngestionWorker and MetricAggregationWorker retry with exponential backoff; unprocessable telemetry is dead-lettered with the full envelope preserved for replay. Scheduled aggregation simply re-runs the window (deterministic, idempotent).
  • Detection gaps. If MetricAggregationService falls behind, alert and SLO evaluation operate on the latest available window and mark results as stale rather than producing false negatives.
  • Incident creation idempotency. IncidentAnalysisWorker deduplicates on (source, sourceId), so a flapping alert or repeated SLO breach does not open duplicate incidents; instead it appends events to the open incident.
  • Feedback idempotency. FeedbackCreationWorker keys on sourceId, so re-delivered IncidentResolved/CostAnomalyDetected events never create duplicate feedback.
  • Downstream outage. If the Knowledge Platform is unavailable, FeedbackItemCreated events remain in the subscription backlog of their Service Bus topic and are delivered on recovery (at-least-once with consumer idempotency).
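The retry and dead-letter behaviour for ingestion can be sketched as below. This is a simplified model under stated assumptions: the function name, the attempt limit, the base delay, and the list standing in for a dead-letter queue are hypothetical, and the sketch treats retry exhaustion as the signal that telemetry is unprocessable.

```python
import time


def process_with_retry(envelope, handler, dead_letters,
                       max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call handler(envelope), retrying with exponential backoff.

    After max_attempts failures the full envelope is appended to
    dead_letters so that it can be replayed later.
    """
    for attempt in range(max_attempts):
        try:
            return handler(envelope)
        except Exception as exc:
            if attempt == max_attempts - 1:
                dead_letters.append({"envelope": envelope, "error": repr(exc)})
                return None
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** attempt)


def always_fails(envelope):
    raise ValueError("malformed span")


delays = []
dead_letters = []
envelope = {"traceId": "t-1", "body": "not-otlp"}
result = process_with_retry(envelope, always_fails, dead_letters, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Production code would typically add jitter to the delay and distinguish transient errors from permanently invalid input, so that the latter is dead-lettered immediately.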

Escalation

  • Severity-based. critical incidents skip straight to Escalated on open and page the on-call runtime operator; high/warning follow the acknowledge → investigate path.
  • Timer-based. An incident that stays Open/Investigating beyond its severity SLA escalates automatically (the IncidentAnalysisWorker schedules escalation steps).
  • Routing. Escalation actions map to roles (runtime.operator, cost.analyst, quality.reviewer) and notification channels configured per tenant. Escalation steps are recorded on the incident for audit.
  • Governance hand-off. Cost anomalies above a tenant threshold and repeated SLO breaches are surfaced to the Governance, Security & Compliance platform for policy review.
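The severity-based and timer-based rules above could be combined into a single decision function. The SLA durations below are placeholder values, and `should_escalate` and `entered_state_at` are hypothetical names; actual SLAs are configured per severity and are not given on this page.

```python
from datetime import datetime, timedelta, timezone

# Placeholder SLAs for illustration only; real values are configured elsewhere.
SLA_BY_SEVERITY = {
    "critical": timedelta(minutes=5),
    "high": timedelta(minutes=30),
    "warning": timedelta(hours=4),
}


def should_escalate(severity, state, entered_state_at, now):
    """Return True when an incident should move to Escalated."""
    # Severity-based: critical incidents escalate as soon as they open.
    if state == "Open" and severity == "critical":
        return True
    # Timer-based: Open or Investigating for longer than the severity SLA.
    if state in ("Open", "Investigating"):
        return now - entered_state_at > SLA_BY_SEVERITY[severity]
    return False


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
critical_on_open = should_escalate("critical", "Open", t0, t0)
high_within_sla = should_escalate("high", "Open", t0, t0 + timedelta(minutes=10))
high_past_sla = should_escalate("high", "Investigating", t0, t0 + timedelta(minutes=31))
```

A worker that schedules escalation steps could evaluate this function whenever a timer fires, then record the resulting step on the incident for audit.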