docs/architecture/realtime-errors-observability.md

Realtime, Errors, And Observability

This slice covers WebSocket updates, support-reference errors, CloudWatch/Grafana observability, and cost visibility.

Realtime Updates

flowchart LR
    subgraph Browser["Browser"]
        NAV["DashboardSideNavigation"]
        WSMGR["WebSocketManager"]
        DETAIL["Current detail page"]
    end

    subgraph WSAPI["WebSocket API Gateway"]
        CONNECT["$connect"]
        DISCONNECT["$disconnect"]
        ROUTES["Notifier posts"]
    end

    subgraph Lambdas["WebSocket Lambdas"]
        WSC["ws-test-history-connect"]
        WSD["ws-test-history-disconnect"]
        WST["ws-test-history-notify"]
        WSDOSE["ws-dose-history-notify"]
        WSPROG["ws-program-history-notify"]
    end

    subgraph Streams["DynamoDB Streams"]
        TESTH["env-test-history stream"]
        DOSEH["env-dose-history stream"]
        PROGH["env-program-history stream"]
    end

    subgraph Data["DynamoDB"]
        CONN["env-websocket-connections"]
    end

    WSMGR --> CONNECT --> WSC --> CONN
    WSMGR --> DISCONNECT --> WSD --> CONN
    TESTH --> WST --> CONN
    DOSEH --> WSDOSE --> CONN
    PROGH --> WSPROG --> CONN
    WST --> ROUTES --> WSMGR
    WSDOSE --> ROUTES --> WSMGR
    WSPROG --> ROUTES --> WSMGR
    WSMGR --> NAV
    WSMGR --> DETAIL
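
The notifier step in the diagram above can be sketched as two small functions: one maps a DynamoDB stream record to a browser-facing message, and one fans that message out to the stored connection ids. This is a minimal sketch, not the project's implementation. The message fields (`entity`, `action`, `item`), the `GoneError` class, and the injected `post` callable are all assumptions made for illustration. In a real Lambda, `post` would wrap the API Gateway Management API call for posting to a connection.

```python
from typing import Any, Callable, Iterable, Optional


class GoneError(Exception):
    """Hypothetical signal that a connection id is no longer open."""


def unmarshal(image: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Decode a flat DynamoDB image; only scalar types are handled here."""
    plain: dict[str, Any] = {}
    for name, typed in image.items():
        (type_tag, value), = typed.items()
        if type_tag == "N":
            # Stream records carry numbers as strings.
            try:
                plain[name] = int(value)
            except ValueError:
                plain[name] = float(value)
        else:
            plain[name] = value  # S, BOOL, and others passed through unchanged
    return plain


def stream_record_to_message(record: dict[str, Any], entity: str) -> Optional[dict[str, Any]]:
    """Map one stream record to a message, or None for unknown event names."""
    event_name = record.get("eventName")
    images = record.get("dynamodb", {})
    if event_name == "REMOVE":
        # A delete has no NewImage, so the item must come from OldImage.
        return {"entity": entity, "action": "deleted",
                "item": unmarshal(images.get("OldImage", {}))}
    if event_name in ("INSERT", "MODIFY"):
        action = "created" if event_name == "INSERT" else "updated"
        return {"entity": entity, "action": action,
                "item": unmarshal(images.get("NewImage", {}))}
    return None


def fan_out(message: dict[str, Any], connection_ids: Iterable[str],
            post: Callable[[str, dict[str, Any]], None]) -> list[str]:
    """Post to every connection and return the ids that turned out stale."""
    stale: list[str] = []
    for connection_id in connection_ids:
        try:
            post(connection_id, message)
        except GoneError:
            stale.append(connection_id)
    return stale
```

Returning the stale ids rather than deleting them inline keeps the sketch free of table access; a caller could then remove those rows from env-websocket-connections.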

Support Reference Error Flow

sequenceDiagram
    participant UI as Browser UI
    participant API as REST API Gateway
    participant Lambda as REST Lambda
    participant Logs as CloudWatch Logs
    participant ErrorTable as env-app-error
    participant Admin as Admin errors widget

    UI->>API: Request
    API->>Lambda: Invoke
    alt unexpected error
        Lambda->>Lambda: Generate support reference id
        Lambda->>Logs: Log reference id, request id, user id, tank id, exception
        Lambda->>ErrorTable: Persist error details
        Lambda-->>UI: Friendly error with reference id
        UI-->>UI: Show sticky dismissible error banner
    else success
        Lambda-->>UI: Response
    end
    Admin->>ErrorTable: List recent errors
    Admin->>ErrorTable: Acknowledge/delete selected error
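
The unexpected-error branch of the sequence above could look roughly like the wrapper below. It is a hedged sketch under several assumptions: the `ERR-` reference format, the response body shape, the location of `userId` and `tankId` on the event, and the injected `persist_error` callable (standing in for a write to env-app-error) are all hypothetical. The only platform detail relied on is the `aws_request_id` attribute of the Lambda context object.

```python
import json
import logging
import uuid
from typing import Any, Callable

logger = logging.getLogger("rest-lambda")

Handler = Callable[[dict[str, Any], Any], dict[str, Any]]


def new_support_reference() -> str:
    """Short, user-quotable id; the ERR- prefix is an assumed format."""
    return "ERR-" + uuid.uuid4().hex[:8].upper()


def with_support_reference(handler: Handler,
                           persist_error: Callable[[dict[str, Any]], None]) -> Handler:
    """Wrap a handler so unexpected errors become friendly responses."""
    def wrapped(event: dict[str, Any], context: Any) -> dict[str, Any]:
        try:
            return handler(event, context)
        except Exception as exc:
            reference_id = new_support_reference()
            details = {
                "referenceId": reference_id,
                "requestId": getattr(context, "aws_request_id", None),
                "userId": event.get("userId"),   # assumed event field
                "tankId": event.get("tankId"),   # assumed event field
                "exception": repr(exc),
            }
            # Log first so the reference id is searchable even if the write fails.
            logger.exception("unexpected error %s", json.dumps(details))
            try:
                persist_error(details)
            except Exception:
                logger.exception("could not persist error %s", reference_id)
            # The response carries the reference id but never the exception text.
            return {
                "statusCode": 500,
                "body": json.dumps({
                    "message": "Something went wrong. Quote the reference id when contacting support.",
                    "referenceId": reference_id,
                }),
            }
    return wrapped
```

Logging before persisting means the reference id can still be found in CloudWatch Logs when the table write itself is the thing that failed.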

Operations And Cost Observability

flowchart TB
    subgraph Runtime["Runtime Resources"]
        API["REST API Gateway"]
        WS["WebSocket API Gateway"]
        LAMBDA["Lambda functions"]
        SQS["SQS queues"]
        DDB["DynamoDB tables"]
        S3["S3 buckets"]
        CF["CloudFront"]
    end

    subgraph Telemetry["AWS Telemetry"]
        LOGS["CloudWatch Logs"]
        METRICS["CloudWatch Metrics"]
        ALARMS["CloudWatch Alarms"]
        SNS["SNS ops alerts"]
        DASH["CloudWatch dashboard: env-reef-a-matic-operations"]
        GRAFANA["Amazon Managed Grafana: env-reefamatic-ops"]
    end

    subgraph Cost["Cost Sources"]
        CE["AWS Cost Explorer"]
        TAGS["Environment + Module cost tags"]
        USAGE["env-usage-event"]
        ADMINCOST["Admin Usage & Costs tab"]
    end

    API --> LOGS
    WS --> LOGS
    LAMBDA --> LOGS
    LAMBDA --> METRICS
    SQS --> METRICS
    DDB --> METRICS
    S3 --> METRICS
    CF --> METRICS
    METRICS --> ALARMS --> SNS
    LOGS --> DASH
    METRICS --> DASH
    LOGS --> GRAFANA
    METRICS --> GRAFANA
    TAGS --> CE
    CE --> ADMINCOST
    USAGE --> ADMINCOST
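
For the Cost Explorer path in the diagram, the request and the roll-up for the admin tab might be shaped as follows. This is a sketch under assumptions: the function names, the daily granularity, the `UnblendedCost` metric, and the `untagged` label are illustrative choices, not confirmed by this document. The parameter layout follows the Cost Explorer `GetCostAndUsage` request, where tag group keys come back as `TagKey$TagValue`.

```python
from datetime import date
from decimal import Decimal
from typing import Any


def build_cost_query(environment: str, start: date, end: date) -> dict[str, Any]:
    """Build GetCostAndUsage parameters; the End date is exclusive."""
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",            # assumed granularity
        "Metrics": ["UnblendedCost"],      # assumed metric
        "Filter": {"Tags": {"Key": "Environment", "Values": [environment]}},
        "GroupBy": [
            {"Type": "DIMENSION", "Key": "SERVICE"},
            {"Type": "TAG", "Key": "Module"},
        ],
    }


def total_by_service_and_module(response: dict[str, Any]) -> dict[tuple[str, str], Decimal]:
    """Sum cost across all returned periods per (service, module) pair."""
    totals: dict[tuple[str, str], Decimal] = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service, tag = group["Keys"]
            # Tag keys look like "Module$pdf"; an empty value means untagged.
            module = tag.split("$", 1)[1] if "$" in tag else tag
            key = (service, module or "untagged")
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[key] = totals.get(key, Decimal("0")) + amount
    return totals
```

With boto3, the query dictionary could be passed as `client("ce").get_cost_and_usage(**query)`. `Decimal` is used instead of `float` so that summed currency amounts do not pick up rounding noise.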

Notes

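  • The test, dose, and program history streams use the NEW_AND_OLD_IMAGES view type, so a delete record still carries the removed item and the resulting message can remove the matching left-nav item.
  • The app shell applies WebSocket payloads directly to its state and also keeps a short fallback refresh, which covers eventual consistency while the async fan-out completes.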
  • Admin live-login status reads active WebSocket connection rows.
  • Cost Explorer is filtered by Environment and grouped by AWS service and Module cost tag.
  • Per-PDF cost attribution is persisted as usage events with a correlation id.
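
The first two notes describe the client applying payloads directly, including deletes. The logic can be sketched as a pure function over the left-nav item list. This shows only the idea, written in Python for brevity; the browser code would live in the app's own language. The message shape (`action` plus `item`) and the `id` key are assumptions, and the fallback refresh is not shown.

```python
from typing import Any


def apply_message(items: list[dict[str, Any]], message: dict[str, Any],
                  key: str = "id") -> list[dict[str, Any]]:
    """Return a new item list with one message applied; input is not mutated."""
    item = message["item"]
    if message["action"] == "deleted":
        # Deletes only need the key, which the old image provides.
        return [existing for existing in items if existing[key] != item[key]]
    if any(existing[key] == item[key] for existing in items):
        # Update in place so the nav order stays stable.
        return [item if existing[key] == item[key] else existing for existing in items]
    return items + [item]
```

Because applying the same created or updated message twice yields the same list, a payload that arrives both over the socket and through a later refresh does not produce duplicates.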