Production Reliability Dashboard

Generated 2026-04-01 12:18 from Slack #_alerts_prod and AWS SNS alert emails for 2026-03-21 07:00 to 2026-03-28 07:00.

How to use: Pick an alert family, service/resource, or AWS alarm. Inside an alert family, click a target to scope trends, notes, discussion signal, and latest matching alerts. Open Global Evidence Explorer only when you need report-wide pivots.
Slack messages: 66 (alert and ops posts in channel history)
Slack discussions: 3 (threads with human follow-up we could mine for signal)
AWS alert emails: 50 (Inbox/Trash/Spam matches from SNS sender)
Latest observed event: 2026-03-28 05:12 (most recent alert timestamp seen in either source)
Executive Summary

What Needs Attention

Bottom line: critical application-level alerts are present, and the catalog database alarms still look active.

Signal Over Noise

  • TraefikServiceHighErrorRate is the highest-severity application issue in this window, touching web-80 (1), uni-api-svc-4000 (1); the freshest signal was on web-80 at 2026-03-27 08:23.
  • AWS pressure is concentrated in 4 still-alarming catalog-related alarms, led by adservio-rds-mysql-catalog2-memory-low with 43 email events.

Recommended Actions

  • Treat TraefikServiceHighErrorRate as the primary application investigation. Reproduce the failing path on web-80 and uni-api-svc-4000, compare it against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.
  • Run a focused database-capacity investigation on the catalog instances now. Persistent memory-low and swap-high alarms are usually a system-pressure problem, not something to leave as background noise.

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Slack Alerts by Day]

[Chart: AWS Alert Emails by Day]

Evidence

Slack Alert Families

| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TraefikServiceHighErrorRate | Critical | 2 | 2026-03-27 08:23 | Seen this week | 1 | uni-api-svc-4000 (1), web-80 (1) | General investigation |
| KubeJobFailed | Warning | 32 | 2026-03-28 05:12 | Seen this week | 1 | accommodations-sync-users-29572740 (21), download-album-29576545 (6), accommodations-sync-users-29571300 (5) | General investigation |
| TraefikServiceHighLatency | Warning | 16 | 2026-03-27 17:21 | Seen this week | 1 | ai-api-svc-3900 (11), web-80 (5), subscriptions-api-svc-3400 (4), rooms-api-svc-3700 (1) | General investigation |
| NodeSystemSaturation | Warning | 6 | 2026-03-27 11:29 | Seen this week | 0 | grafana (6) | None |
| KubeHpaMaxedOut | Warning | 5 | 2026-03-27 13:05 | Seen this week | 0 | docgen2-api (5) | None |
| NodeCPUHighUsage | Warning | 3 | 2026-03-27 11:29 | Seen this week | 0 | grafana (3) | None |
| KubeNodeEviction | Warning | 2 | 2026-03-26 08:38 | Seen this week | 0 | grafana (2) | None |

Latest Thread Note, by alert (translated from Romanian):

  • TraefikServiceHighErrorRate: Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi | Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… | OK, I'll track it down and fix it; I found another problem like this yesterday, and I'll follow up on this one too
  • KubeJobFailed: production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… | Ionut Ciolan, it ran out of memory. Is the max memory in php.ini set to 128 MB by any chance? Can you raise it so it stops crashing? | It would be better to refactor the code a bit. It looks like inc…
  • TraefikServiceHighLatency: there was a slow-running query on catalog

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
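
A minimal sketch of how such a status label could be derived from the last-seen timestamp alone. The function name `status_label` and all three cutoffs are assumptions, not taken from the report generator; the 7-day cutoff measured from the report generation time merely happens to be consistent with the rows shown in this report.

```python
from datetime import datetime, timedelta

# Hypothetical cutoffs (assumptions): the report does not state its thresholds.
CUTOFFS = [
    (timedelta(days=1), "Seen today"),
    (timedelta(days=3), "Recent"),
    (timedelta(days=7), "Seen this week"),
]


def status_label(last_seen: datetime, generated_at: datetime) -> str:
    """Map the age of the latest alert to a heuristic status label."""
    age = generated_at - last_seen
    for limit, label in CUTOFFS:
        if age < limit:
            return label
    # Older than every cutoff: the family has not appeared recently.
    return "No recent signal"


generated_at = datetime(2026, 4, 1, 12, 18)
print(status_label(datetime(2026, 3, 26, 8, 38), generated_at))
print(status_label(datetime(2026, 3, 25, 10, 58), generated_at))
```

Because the label depends only on recency, it says nothing about whether the underlying problem was fixed, which is why it should be read as "still appearing" rather than "unresolved".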

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
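
The attribution rule above can be sketched as follows. This is an illustration under assumptions: the function name `impacted_targets` and the idea that target names have already been extracted from the alert text are hypothetical, and only the three observability components named above are treated specially.

```python
# Components the report treats as observability tooling.
OBSERVABILITY = {"grafana", "loki", "tempo"}


def impacted_targets(names: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when at least one more
    specific target is also named; otherwise they are kept as-is.
    """
    specific = [n for n in names if n.lower() not in OBSERVABILITY]
    return specific if specific else list(names)


print(impacted_targets(["grafana", "web-80"]))
print(impacted_targets(["grafana"]))
```

Under this rule an alert that names only grafana is still attributed to grafana, which matches the grafana row in the table below.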

| Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal |
| --- | --- | --- | --- | --- | --- | --- |
| web-80 | Critical | 6 | 2026-03-27 13:19 | Seen this week | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (5) | General investigation |
| uni-api-svc-4000 | Critical | 1 | 2026-03-25 10:58 | No recent signal | TraefikServiceHighErrorRate (1) | General investigation |
| accommodations-sync-users-29572740 | Warning | 21 | 2026-03-28 04:14 | Seen this week | KubeJobFailed (21) | None |
| ai-api-svc-3900 | Warning | 11 | 2026-03-27 17:21 | Seen this week | TraefikServiceHighLatency (11) | None |
| grafana | Warning | 11 | 2026-03-27 11:29 | Seen this week | NodeSystemSaturation (6), NodeCPUHighUsage (3), KubeNodeEviction (2) | None |
| download-album-29576545 | Warning | 6 | 2026-03-28 05:12 | Seen this week | KubeJobFailed (6) | None |
| docgen2-api | Warning | 5 | 2026-03-27 13:05 | Seen this week | KubeHpaMaxedOut (5) | None |
| subscriptions-api-svc-3400 | Warning | 4 | 2026-03-27 08:35 | Seen this week | TraefikServiceHighLatency (4) | None |
| rooms-api-svc-3700 | Warning | 1 | 2026-03-27 08:35 | Seen this week | TraefikServiceHighLatency (1) | None |
| accommodations-sync-users-29571300 | Warning | 5 | 2026-03-24 10:51 | No recent signal | KubeJobFailed (5) | General investigation |

Latest Thread Note, by service/resource (translated from Romanian):

  • web-80: there was a slow-running query on catalog
  • uni-api-svc-4000: Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi | Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… | OK, I'll track it down and fix it; I found another problem like this yesterday, and I'll follow up on this one too
  • accommodations-sync-users-29571300: production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… | Ionut Ciolan, it ran out of memory. Is the max memory in php.ini set to 128 MB by any chance? Can you raise it so it stops crashing? | It would be better to refactor the code a bit. It looks like inc…

AWS Email Alarm Families

| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adservio-rds-mysql-catalog2-memory-low | 43 | 22 | 21 | 42 | 2026-03-24 02:38 | 2026-03-25 10:11 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-swap-high | 2 | 1 | 1 | 1 | 2026-03-27 08:26 | 2026-03-27 12:14 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-memory-low | 2 | 1 | 1 | 1 | 2026-03-27 08:23 | 2026-03-27 08:41 | ALARM | Still alarming |
| adservio-rds-mysql-catalog2-swap-high | 1 | 1 | 0 | 0 | 2026-03-27 13:43 | 2026-03-27 13:43 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-03-27 08:31 | 2026-03-27 08:41 | OK | Latest OK |

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
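
As an illustration of how state flips and the resulting status could be computed from the time-ordered states of one alarm's emails, here is a minimal sketch. The function names, the `flapping_threshold` value, and the "No data" label are assumptions; the report does not say how many flips count as flapping.

```python
def count_flips(states: list[str]) -> int:
    """Count transitions between consecutive, differing alarm states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flapping_threshold: int = 3) -> str:
    """Derive a status label from time-ordered ALARM/OK states.

    flapping_threshold is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        return "No data"  # hypothetical label for an alarm with no emails
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_flips(states) >= flapping_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# 22 ALARM and 21 OK emails that strictly alternate give 42 flips.
alternating = ["ALARM", "OK"] * 21 + ["ALARM"]
print(count_flips(alternating), alarm_status(alternating))
```

A strictly alternating sequence of 43 emails yields 42 flips, the same relationship seen in the adservio-rds-mysql-catalog2-memory-low row (43 emails, 42 state flips).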

Global Discussion-Derived Signal

| Thread Date | Alert | Severity | Services | Signal |
| --- | --- | --- | --- | --- |
| 2026-03-26 11:19 | TraefikServiceHighLatency | Warning | web-80 | General investigation |
| 2026-03-25 10:58 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | General investigation |
| 2026-03-24 10:51 | KubeJobFailed | Warning | accommodations-sync-users-29571300 | General investigation |

Key Notes, by thread date (translated from Romanian):

  • 2026-03-26 11:19: there was a slow-running query on catalog
  • 2026-03-25 10:58: Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi | Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… | OK, I'll track it down and fix it; I found another problem like this yesterday, and I'll follow up on this one too
  • 2026-03-24 10:51: production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… | Ionut Ciolan, it ran out of memory. Is the max memory in php.ini set to 128 MB by any chance? Can you raise it so it stops crashing? | It would be better to refactor the code a bit. It looks like inc…