Production Reliability Dashboard

Generated 2026-03-22 15:58 from Slack #_alerts_prod and AWS SNS alert emails for 2026-03-14 07:00 to 2026-03-21 07:00.

How to use: Pick an alert family, service/resource, or AWS alarm. Inside an alert family, click a target to scope trends, notes, discussion signal, and latest matching alerts. Open Global Evidence Explorer only when you need report-wide pivots.
Slack messages: 64 (alert and ops posts in channel history)
Slack discussions: 2 (threads with human follow-up we could mine for signal)
AWS alert emails: 65 (Inbox/Trash/Spam matches from SNS sender)
Latest observed event: 2026-03-20 21:47 (most recent alert timestamp seen in either source)

Top Slack Alert Families

Top Impacted Services / Resources

AWS Email Alarm Families

Investigation

Choose one item above to start a scoped investigation, then narrow to a target when an alert family exposes multiple workloads.
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Slack Alerts by Day

AWS Alert Emails by Day

Evidence

Slack Alert Families

| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KubeJobFailed | warning | 27 | 2026-03-19 07:46 | Seen this week | 0 | admission-end-session-29561780 (14), admission-end-session-29556020 (13), admission-end-session-29557460 (13), admission-end-session-29554580 (10), admission-end-session-29558900 (9) | None | |
| TraefikServiceHighErrorRate | critical | 17 | 2026-03-20 21:47 | Recent (72h) | 2 | uni-api-svc-4000 (12), accommodations-api-svc-4100 (5) | Observability storage, General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| KubeHpaMaxedOut | warning | 9 | 2026-03-20 11:11 | Recent (72h) | 0 | docgen2-api (6), subscriptions-api (3) | None | |
| TraefikServiceHighLatency | warning | 5 | 2026-03-20 16:35 | Recent (72h) | 0 | ai-api-svc-3900 (4), web-80 (1) | None | |
| NodeSystemSaturation | warning | 3 | 2026-03-19 12:27 | Seen this week | 0 | grafana (3) | None | |
| NodeCPUHighUsage | warning | 1 | 2026-03-18 14:53 | Seen this week | 0 | grafana (1) | None | |

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
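
The recency labels can be sketched as a simple age bucket. This is an illustration under stated assumptions, not the dashboard's actual code: the function name `recency_status`, the 24-hour and 7-day cutoffs, the fallback label "Older", and the choice to measure age from the report generation time are all hypothetical. Only the 72-hour window comes from the labels themselves.

```python
from datetime import datetime, timedelta


def recency_status(last_seen: datetime, generated_at: datetime) -> str:
    """Bucket an alert family by how long ago it last appeared.

    The result says nothing about resolution; it only reflects recency.
    """
    age = generated_at - last_seen
    if age <= timedelta(hours=24):   # assumed cutoff for "Seen today"
        return "Seen today"
    if age <= timedelta(hours=72):   # window named in the "Recent (72h)" label
        return "Recent (72h)"
    if age <= timedelta(days=7):     # assumed cutoff for "Seen this week"
        return "Seen this week"
    return "Older"                   # hypothetical fallback label


generated = datetime(2026, 3, 22, 15, 58)
print(recency_status(datetime(2026, 3, 20, 21, 47), generated))
print(recency_status(datetime(2026, 3, 19, 7, 46), generated))
```

Measured from the generation time of this report, these assumed cutoffs place a last-seen time of 2026-03-20 21:47 in the 72-hour bucket and 2026-03-19 07:46 in the weekly bucket.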

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
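
The exclusion rule described above might look like the following sketch. It assumes the target names have already been extracted from the alert text; the function name `impacted_targets` and the case-insensitive matching are hypothetical choices, not the report's implementation.

```python
# Components treated as observability infrastructure rather than workloads.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def impacted_targets(named_targets: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when at least one more
    specific target is also named; otherwise they are kept as-is.
    """
    specific = [
        t for t in named_targets if t.lower() not in OBSERVABILITY_COMPONENTS
    ]
    return specific if specific else list(named_targets)


print(impacted_targets(["grafana", "uni-api-svc-4000"]))  # specific target wins
print(impacted_targets(["grafana"]))                      # nothing more specific
```

Under this rule an alert that names only grafana is still attributed to grafana, which is how an observability component can appear as a row in the table below.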

| Impacted Service / Resource | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
| --- | --- | --- | --- | --- | --- | --- |
| admission-end-session-29561780 | 14 | 2026-03-19 07:46 | Seen this week | KubeJobFailed (14) | None | |
| admission-end-session-29556020 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| admission-end-session-29557460 | 13 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (13) | None | |
| uni-api-svc-4000 | 12 | 2026-03-20 21:47 | Recent (72h) | TraefikServiceHighErrorRate (12) | Observability storage | errors.group-already-exists |
| admission-end-session-29554580 | 10 | 2026-03-15 23:07 | Seen this week | KubeJobFailed (10) | None | |
| admission-end-session-29558900 | 9 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (9) | None | |
| docgen2-api | 6 | 2026-03-19 13:11 | Seen this week | KubeHpaMaxedOut (6) | None | |
| accommodations-api-svc-4100 | 5 | 2026-03-18 13:24 | Seen this week | TraefikServiceHighErrorRate (5) | General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| ai-api-svc-3900 | 4 | 2026-03-20 16:35 | Recent (72h) | TraefikServiceHighLatency (4) | None | |
| grafana | 4 | 2026-03-19 12:27 | Seen this week | NodeSystemSaturation (3), NodeCPUHighUsage (1) | None | |
| admission-end-session-29560340 | 3 | 2026-03-16 10:52 | Seen this week | KubeJobFailed (3) | None | |
| subscriptions-api | 3 | 2026-03-20 11:11 | Recent (72h) | KubeHpaMaxedOut (3) | None | |
| web-80 | 1 | 2026-03-17 04:13 | Seen this week | TraefikServiceHighLatency (1) | None | |

AWS Email Alarm Families

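| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adservio-rds-mysql-master-memory-low | 63 | 32 | 31 | 62 | 2026-03-15 12:38 | 2026-03-18 07:09 | ALARM | Still alarming |
| adservio-rds-mysql-master-write-latency-high | 2 | 1 | 1 | 1 | 2026-03-16 23:08 | 2026-03-16 23:13 | OK | Latest OK |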

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
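
One way to derive the State Flips count and the status label from a chronological list of email states is sketched below. The function names, the `flap_threshold` default of 3, and the "No data" label are assumptions made for illustration; the report does not state how many flips count as flapping.

```python
def count_state_flips(states: list[str]) -> int:
    """Count transitions between consecutive differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Label an alarm from its chronological ALARM/OK email states.

    flap_threshold is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        return "No data"  # hypothetical label for an alarm with no emails
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# 32 ALARM and 31 OK emails that strictly alternate, starting with ALARM.
alternating = ["ALARM", "OK"] * 31 + ["ALARM"]
print(count_state_flips(alternating), alarm_status(alternating))
print(alarm_status(["ALARM", "OK"]))
```

A strictly alternating sequence of 63 emails yields 62 flips, and a single ALARM followed by an OK yields one flip, so under this sketch the second case stays "Latest OK" rather than being marked as flapping.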

Global Discussion-Derived Signal

| Thread Date | Alert | Severity | Services | Signal | Key Notes |
| --- | --- | --- | --- | --- | --- |
| 2026-03-18 12:48 | TraefikServiceHighErrorRate | critical | accommodations-api-svc-4100 | General investigation | Ionut Ciolan: why do those stack traces show up? They don't seem useful |
| 2026-03-17 10:45 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | Observability storage | errors.group-already-exists |