Production Reliability Dashboard
Generated 2026-03-21 22:47 from Slack #_alerts_prod and AWS SNS alert emails for 2026-03-14 07:00 to 2026-03-21 07:00.
Start with the ranked charts to spot concentration, click one item to inspect the drill-down, then use the daily trends and lower tables as evidence. The board is designed to move from overview to investigation to raw supporting detail.
[Charts: Top Slack Alert Families; Top Impacted Services / Resources; AWS Email Alarm Families; Drill-Down; Slack Alerts by Day; AWS Alert Emails by Day]
Slack Alert Families
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| KubeJobFailed | warning | 27 | 2026-03-19 07:46 | Recent (72h) | 0 | admission-end-session-29561780 (14), admission-end-session-29556020 (13), admission-end-session-29557460 (13), admission-end-session-29554580 (10), admission-end-session-29558900 (9) | None | |
| TraefikServiceHighErrorRate | critical | 17 | 2026-03-20 21:47 | Recent (72h) | 2 | uni-api-svc-4000 (12), accommodations-api-svc-4100 (5) | Observability storage, General investigation | Ionut Ciolan: why do those stack traces appear? They don't seem useful. |
| KubeHpaMaxedOut | warning | 9 | 2026-03-20 11:11 | Recent (72h) | 0 | docgen2-api (6), subscriptions-api (3) | None | |
| TraefikServiceHighLatency | warning | 5 | 2026-03-20 16:35 | Recent (72h) | 0 | ai-api-svc-3900 (4), web-80 (1) | None | |
| NodeSystemSaturation | warning | 3 | 2026-03-19 12:27 | Recent (72h) | 0 | grafana (3) | None | |
| NodeCPUHighUsage | warning | 1 | 2026-03-18 14:53 | Seen this week | 0 | grafana (1) | None | |
Status is a heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent (72h)” means only that the alert family fired in production recently, not that it is definitely unresolved.
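A minimal sketch of how such a recency label could be derived from the Last Seen timestamp and the report generation time. The function name `classify_status` and the exact cut-offs (same calendar day for “Seen today”, 72 hours for “Recent (72h)”) are assumptions for illustration; the dashboard's real rules are not documented here.

```python
from datetime import datetime, timedelta


def classify_status(last_seen: datetime, generated_at: datetime) -> str:
    """Label an alert family by recency only (assumed thresholds).

    The label says when the alert last fired, not whether it is resolved.
    """
    if last_seen.date() == generated_at.date():
        return "Seen today"
    if generated_at - last_seen <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


# Generation time taken from the report header.
generated_at = datetime(2026, 3, 21, 22, 47)

print(classify_status(datetime(2026, 3, 19, 12, 27), generated_at))  # Recent (72h)
print(classify_status(datetime(2026, 3, 18, 14, 53), generated_at))  # Seen this week
```

With these assumed cut-offs, the two grafana rows in the table above receive the same labels the dashboard shows.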
Slack Impacted Service / Resource View
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
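The exclusion rule can be sketched as a small filter. This assumes the resource names have already been extracted from the alert text (that extraction step is not shown), and the names `OBSERVABILITY_COMPONENTS` and `impacted_targets` are hypothetical.

```python
# Components treated as observability tooling rather than impacted workloads.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def impacted_targets(named: list[str]) -> list[str]:
    """Drop observability components only when a more specific target exists."""
    specific = [n for n in named if n.lower() not in OBSERVABILITY_COMPONENTS]
    # If nothing else is named, the observability component is the target.
    return specific if specific else list(named)


print(impacted_targets(["grafana", "uni-api-svc-4000"]))  # ['uni-api-svc-4000']
print(impacted_targets(["grafana"]))                      # ['grafana']
```

Under this rule an alert that names only grafana is still attributed to grafana, which is consistent with grafana appearing as its own row in the table below.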
| Impacted Service / Resource | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|
| admission-end-session-29561780 | 14 | 2026-03-19 07:46 | Recent (72h) | KubeJobFailed (14) | None | |
| admission-end-session-29556020 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| admission-end-session-29557460 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| uni-api-svc-4000 | 12 | 2026-03-20 21:47 | Recent (72h) | TraefikServiceHighErrorRate (12) | Observability storage | errors.group-already-exists |
| admission-end-session-29554580 | 10 | 2026-03-15 23:07 | Seen this week | KubeJobFailed (10) | None | |
| admission-end-session-29558900 | 9 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (9) | None | |
| docgen2-api | 6 | 2026-03-19 13:11 | Recent (72h) | KubeHpaMaxedOut (6) | None | |
| accommodations-api-svc-4100 | 5 | 2026-03-18 13:24 | Seen this week | TraefikServiceHighErrorRate (5) | General investigation | Ionut Ciolan: why do those stack traces appear? They don't seem useful. |
| ai-api-svc-3900 | 4 | 2026-03-20 16:35 | Recent (72h) | TraefikServiceHighLatency (4) | None | |
| grafana | 4 | 2026-03-19 12:27 | Recent (72h) | NodeSystemSaturation (3), NodeCPUHighUsage (1) | None | |
| admission-end-session-29560340 | 3 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (3) | None | |
| subscriptions-api | 3 | 2026-03-20 11:11 | Recent (72h) | KubeHpaMaxedOut (3) | None | |
| web-80 | 1 | 2026-03-17 04:13 | Seen this week | TraefikServiceHighLatency (1) | None | |
AWS Email Alarm Families
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-master-memory-low | 63 | 32 | 31 | 62 | 2026-03-15 12:38 | 2026-03-18 07:09 | ALARM | Still alarming |
| adservio-rds-mysql-master-write-latency-high | 2 | 1 | 1 | 1 | 2026-03-16 23:08 | 2026-03-16 23:13 | OK | Latest OK |
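“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and remains a reliability concern. No alarm in this window carries that status: adservio-rds-mysql-master-memory-low flipped state 62 times and its latest email is still ALARM.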
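A minimal sketch of how the State Flips and Status columns could be computed from the ordered sequence of ALARM/OK emails for one alarm. The function names and the `flap_threshold` value are assumptions; the dashboard's actual cut-off for “flapping” is not stated.

```python
def count_flips(states: list[str]) -> int:
    """Count transitions between consecutive, differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Derive a status label; flap_threshold is an assumed cut-off."""
    if not states:
        raise ValueError("at least one alarm email is required")
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# 63 emails with 62 flips means every consecutive pair differs, so the
# memory-low alarm alternated strictly, starting and ending on ALARM.
memory_low = ["ALARM", "OK"] * 31 + ["ALARM"]
write_latency = ["ALARM", "OK"]

print(len(memory_low), count_flips(memory_low), alarm_status(memory_low))
# 63 62 Still alarming
print(count_flips(write_latency), alarm_status(write_latency))
# 1 Latest OK
```

Checking the latest state before the flip count keeps an alarm that is currently firing from being downgraded to a flapping label.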
Discussion-Derived Signal
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-18 12:48 | TraefikServiceHighErrorRate | critical | accommodations-api-svc-4100 | General investigation | Ionut Ciolan: why do those stack traces appear? They don't seem useful. |
| 2026-03-17 10:45 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | Observability storage | errors.group-already-exists |