Production Reliability Dashboard
Generated 2026-03-22 15:58 from Slack #_alerts_prod and AWS SNS alert emails for 2026-03-14 07:00 to 2026-03-21 07:00.
How to use
Pick an alert family, service/resource, or AWS alarm. Inside an alert family, click a target to scope trends, notes, discussion signal, and latest matching alerts. Open Global Evidence Explorer only when you need report-wide pivots.
Slack messages: 64 (alert and ops posts in channel history)
Slack discussions: 2 (threads with human follow-up we could mine for signal)
AWS alert emails: 65 (Inbox/Trash/Spam matches from SNS sender)
Latest observed event: 2026-03-20 21:47 (most recent alert timestamp seen in either source)
Top Slack Alert Families
Top Impacted Services / Resources
AWS Email Alarm Families
Investigation
Choose one item above to start a scoped investigation, then narrow to a target when an alert family exposes multiple workloads.
Global Evidence Explorer
Report-wide charts and tables stay here, separate from the active investigation scope.
Slack Alerts by Day
AWS Alert Emails by Day
Slack Alert Families
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| KubeJobFailed | warning | 27 | 2026-03-19 07:46 | Seen this week | 0 | admission-end-session-29561780 (14); admission-end-session-29556020 (13); admission-end-session-29557460 (13); admission-end-session-29554580 (10); admission-end-session-29558900 (9) | None | |
| TraefikServiceHighErrorRate | critical | 17 | 2026-03-20 21:47 | Recent (72h) | 2 | uni-api-svc-4000 (12); accommodations-api-svc-4100 (5) | Observability storage; General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| KubeHpaMaxedOut | warning | 9 | 2026-03-20 11:11 | Recent (72h) | 0 | docgen2-api (6); subscriptions-api (3) | None | |
| TraefikServiceHighLatency | warning | 5 | 2026-03-20 16:35 | Recent (72h) | 0 | ai-api-svc-3900 (4); web-80 (1) | None | |
| NodeSystemSaturation | warning | 3 | 2026-03-19 12:27 | Seen this week | 0 | grafana (3) | None | |
| NodeCPUHighUsage | warning | 1 | 2026-03-18 14:53 | Seen this week | 0 | grafana (1) | None | |
Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
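The status labels can be pictured as a function of how old the latest sighting is at report time. The sketch below is not the dashboard's actual logic: the function name `classify_status` and the cutoffs (same calendar day, then 72 hours) are assumptions chosen to match the labels visible in the tables, using the report generation time from the header.

```python
from datetime import datetime, timedelta


def classify_status(last_seen: datetime, generated_at: datetime) -> str:
    """Map a last-seen timestamp to a heuristic status label.

    Assumed rules (hypothetical, inferred from the labels in this report):
    same calendar day -> "Seen today"; within 72 hours -> "Recent (72h)";
    anything older inside the report window -> "Seen this week".
    """
    if last_seen.date() == generated_at.date():
        return "Seen today"
    if generated_at - last_seen <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


# Report generation time taken from the header of this dashboard.
generated_at = datetime(2026, 3, 22, 15, 58)

print(classify_status(datetime(2026, 3, 20, 21, 47), generated_at))  # Recent (72h)
print(classify_status(datetime(2026, 3, 19, 12, 27), generated_at))  # Seen this week
```

None of these labels says anything about resolution. They only describe recency of the last alert post.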
Slack Impacted Service / Resource View
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
| Impacted Service / Resource | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|
| admission-end-session-29561780 | 14 | 2026-03-19 07:46 | Seen this week | KubeJobFailed (14) | None | |
| admission-end-session-29556020 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| admission-end-session-29557460 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| uni-api-svc-4000 | 12 | 2026-03-20 21:47 | Recent (72h) | TraefikServiceHighErrorRate (12) | Observability storage | errors.group-already-exists |
| admission-end-session-29554580 | 10 | 2026-03-15 23:07 | Seen this week | KubeJobFailed (10) | None | |
| admission-end-session-29558900 | 9 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (9) | None | |
| docgen2-api | 6 | 2026-03-19 13:11 | Seen this week | KubeHpaMaxedOut (6) | None | |
| accommodations-api-svc-4100 | 5 | 2026-03-18 13:24 | Seen this week | TraefikServiceHighErrorRate (5) | General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| ai-api-svc-3900 | 4 | 2026-03-20 16:35 | Recent (72h) | TraefikServiceHighLatency (4) | None | |
| grafana | 4 | 2026-03-19 12:27 | Seen this week | NodeSystemSaturation (3); NodeCPUHighUsage (1) | None | |
| admission-end-session-29560340 | 3 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (3) | None | |
| subscriptions-api | 3 | 2026-03-20 11:11 | Recent (72h) | KubeHpaMaxedOut (3) | None | |
| web-80 | 1 | 2026-03-17 04:13 | Seen this week | TraefikServiceHighLatency (1) | None | |
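The attribution rule for this view (observability components are dropped only when a more specific target is also named) can be sketched as follows. The function name `impacted_targets` and the idea that targets arrive as an already-extracted list of names are assumptions for illustration; how the report parses targets out of the alert text is not shown here.

```python
# Components treated as observability infrastructure, per the view description.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def impacted_targets(named_targets: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    Observability components are excluded only when at least one more
    specific workload or resource is also named in the alert.
    """
    specific = [
        t for t in named_targets if t.lower() not in OBSERVABILITY_COMPONENTS
    ]
    # Fall back to the original names when nothing more specific exists,
    # which is why grafana still appears as a row in the table above.
    return specific if specific else list(named_targets)


print(impacted_targets(["grafana", "uni-api-svc-4000"]))  # ['uni-api-svc-4000']
print(impacted_targets(["grafana"]))                      # ['grafana']
```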
AWS Email Alarm Families
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-master-memory-low | 63 | 32 | 31 | 62 | 2026-03-15 12:38 | 2026-03-18 07:09 | ALARM | Still alarming |
| adservio-rds-mysql-master-write-latency-high | 2 | 1 | 1 | 1 | 2026-03-16 23:08 | 2026-03-16 23:13 | OK | Latest OK |
“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
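One way to read the State Flips and Status columns is as a count of transitions in the ordered sequence of email states. The sketch below is an illustration under assumptions: the function names and the `flap_threshold` value are hypothetical, and the example sequence is reconstructed from the counts in the table (32 ALARM, 31 OK, 62 flips), not from the raw emails.

```python
def count_state_flips(states: list[str]) -> int:
    """Count transitions between consecutive, differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Derive a status label from a time-ordered list of ALARM/OK states.

    flap_threshold is an assumed cutoff, not a value stated in the report.
    """
    if not states:
        return "No data"
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# Strictly alternating sequence consistent with the memory-low row:
# 32 ALARM + 31 OK emails, starting and ending in ALARM.
memory_low = ["ALARM", "OK"] * 31 + ["ALARM"]
print(count_state_flips(memory_low), alarm_status(memory_low))  # 62 Still alarming

# Sequence consistent with the write-latency row: one ALARM, then one OK.
write_latency = ["ALARM", "OK"]
print(count_state_flips(write_latency), alarm_status(write_latency))  # 1 Latest OK
```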
Global Discussion-Derived Signal
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-18 12:48 | TraefikServiceHighErrorRate | critical | accommodations-api-svc-4100 | General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| 2026-03-17 10:45 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | Observability storage | errors.group-already-exists |