Production Reliability Dashboard
Generated 2026-03-29 13:07 from Slack #_alerts_prod and AWS SNS alert emails for 2026-03-21 07:00 to 2026-03-28 07:00.
What Needs Attention
Bottom line: critical application-level alerts fired in this window, and the catalog database alarms still look active.
Signal Over Noise
- TraefikServiceHighErrorRate is the highest-severity application issue in this window, touching web-80 (1), uni-api-svc-4000 (1); the freshest signal was on web-80 at 2026-03-27 08:23.
- grafana is the main warning-level hotspot with 37 alert mentions, driven by KubeJobFailed (26), NodeSystemSaturation (6), NodeCPUHighUsage (3). Discussion hints: Thread summary · Raul Popovici: production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… | Ionu….
- AWS pressure is concentrated in 4 catalog-related alarms that are still alarming, led by adservio-rds-mysql-catalog2-memory-low with 43 email events.
Recommended Actions
- Treat TraefikServiceHighErrorRate as the primary application investigation. Reproduce the failing path on web-80 and uni-api-svc-4000, compare it against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.
- Reduce the operational drag on grafana by separating repeated symptom alerts from the underlying workload failure. Either eliminate the recurrent fault or retune the alert once the failure mode is understood. The discussion suggests memory exhaustion, so runtime limits and workload shape should be verified before treating this as generic noise.
- Run a focused database-capacity investigation on the catalog instances now. Persistent memory-low and swap-high alarms are usually a system-pressure problem, not something to leave as background noise.
Global Evidence Explorer
Report-wide charts and tables stay here, separate from the active investigation scope.
[Charts not reproduced in this export: Slack Alerts by Day; AWS Alert Emails by Day]
Slack Alert Families
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| TraefikServiceHighErrorRate | Critical | 2 | 2026-03-27 08:23 | Recent (72h) | 1 | uni-api-svc-4000 (1), web-80 (1) | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it and fix it; I found another problem like this yesterday, and I'll follow up on this one too |
| KubeJobFailed | Warning | 32 | 2026-03-28 05:12 | Recent (72h) | 1 | grafana (26), download-album-29576545 (6) | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |
| TraefikServiceHighLatency | Warning | 16 | 2026-03-27 17:21 | Recent (72h) | 1 | ai-api-svc-3900 (11), web-80 (5), subscriptions-api-svc-3400 (4), rooms-api-svc-3700 (1) | General investigation | There was a slow-running query against catalog |
| NodeSystemSaturation | Warning | 6 | 2026-03-27 11:29 | Recent (72h) | 0 | grafana (6) | None | |
| KubeHpaMaxedOut | Warning | 5 | 2026-03-27 13:05 | Recent (72h) | 0 | docgen2-api (5) | None | |
| NodeCPUHighUsage | Warning | 3 | 2026-03-27 11:29 | Recent (72h) | 0 | grafana (3) | None | |
| KubeNodeEviction | Warning | 2 | 2026-03-26 08:38 | Seen this week | 0 | grafana (2) | None | |
Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
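The Status labels can be read as a simple age bucket on the Last Seen timestamp. The sketch below is a guess at that heuristic, not the dashboard's actual code: it assumes the reference point is the report generation time (2026-03-29 13:07), that "Seen today" means the same calendar day, and that anything older than 72 hours inside the window is "Seen this week". The function name `classify_status` is hypothetical. Under those assumptions it reproduces the labels shown in the tables above.

```python
from datetime import datetime, timedelta

# Assumption: statuses are computed relative to the report generation time.
GENERATED_AT = datetime(2026, 3, 29, 13, 7)


def classify_status(last_seen: datetime, now: datetime = GENERATED_AT) -> str:
    """Bucket an alert family by how long ago it last fired (assumed thresholds)."""
    if last_seen.date() == now.date():
        # Assumed meaning of "Seen today": same calendar day as the report.
        return "Seen today"
    if now - last_seen <= timedelta(hours=72):
        return "Recent (72h)"
    # Everything older, but still inside the 7-day reporting window.
    return "Seen this week"


if __name__ == "__main__":
    for name, ts in [
        ("TraefikServiceHighErrorRate", datetime(2026, 3, 27, 8, 23)),
        ("KubeNodeEviction", datetime(2026, 3, 26, 8, 38)),
    ]:
        print(name, "->", classify_status(ts))
```

None of these labels says anything about resolution; they only describe recency, which is why the note above calls the status heuristic.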
Slack Impacted Service / Resource View
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
| Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|
| web-80 | Critical | 6 | 2026-03-27 13:19 | Recent (72h) | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (5) | General investigation | There was a slow-running query against catalog |
| uni-api-svc-4000 | Critical | 1 | 2026-03-25 10:58 | Seen this week | TraefikServiceHighErrorRate (1) | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it and fix it; I found another problem like this yesterday, and I'll follow up on this one too |
| grafana | Warning | 37 | 2026-03-28 04:14 | Recent (72h) | KubeJobFailed (26), NodeSystemSaturation (6), NodeCPUHighUsage (3), KubeNodeEviction (2) | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |
| ai-api-svc-3900 | Warning | 11 | 2026-03-27 17:21 | Recent (72h) | TraefikServiceHighLatency (11) | None | |
| download-album-29576545 | Warning | 6 | 2026-03-28 05:12 | Recent (72h) | KubeJobFailed (6) | None | |
| docgen2-api | Warning | 5 | 2026-03-27 13:05 | Recent (72h) | KubeHpaMaxedOut (5) | None | |
| subscriptions-api-svc-3400 | Warning | 4 | 2026-03-27 08:35 | Recent (72h) | TraefikServiceHighLatency (4) | None | |
| rooms-api-svc-3700 | Warning | 1 | 2026-03-27 08:35 | Recent (72h) | TraefikServiceHighLatency (1) | None | |
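The attribution rule described for this view (observability components are dropped only when a more specific target is also named) can be sketched as follows. This is an illustration of the stated rule, not the report generator's code; the function name `attribute_targets` and the input shape (a list of target names already extracted from the alert text) are assumptions.

```python
# Components the report treats as observability tooling rather than workloads.
OBSERVABILITY = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert should be counted against.

    Observability components are excluded only when at least one more
    specific target is named in the same alert; otherwise they are kept,
    so the alert is never left unattributed.
    """
    specific = [t for t in named_targets if t.lower() not in OBSERVABILITY]
    return specific if specific else list(named_targets)


if __name__ == "__main__":
    print(attribute_targets(["grafana", "web-80"]))  # specific target wins
    print(attribute_targets(["grafana"]))            # nothing more specific
```

The fallback branch explains why grafana still appears as a row here: for those alerts it was the only target named.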
AWS Email Alarm Families
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-catalog2-memory-low | 43 | 22 | 21 | 42 | 2026-03-24 02:38 | 2026-03-25 10:11 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-swap-high | 2 | 1 | 1 | 1 | 2026-03-27 08:26 | 2026-03-27 12:14 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-memory-low | 2 | 1 | 1 | 1 | 2026-03-27 08:23 | 2026-03-27 08:41 | ALARM | Still alarming |
| adservio-rds-mysql-catalog2-swap-high | 1 | 1 | 0 | 0 | 2026-03-27 13:43 | 2026-03-27 13:43 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-03-27 08:31 | 2026-03-27 08:41 | OK | Latest OK |
“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
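For reading the State Flips column, a minimal sketch of how such a count could be derived from an ordered list of email states is shown here. The report does not include the raw email sequence, so the strictly alternating sequence below is hypothetical; it is simply one sequence consistent with the adservio-rds-mysql-catalog2-memory-low row (43 emails, 22 ALARM, 21 OK, 42 flips, latest ALARM). The function name `summarize_states` is also an assumption.

```python
def summarize_states(states):
    """Summarize an ordered list of alarm email states ("ALARM" / "OK")."""
    # A flip is any adjacent pair of emails whose states differ.
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return {
        "emails": len(states),
        "alarm": states.count("ALARM"),
        "ok": states.count("OK"),
        "flips": flips,
        "latest": states[-1] if states else None,
    }


# Hypothetical sequence: 43 emails that strictly alternate, starting in ALARM.
alternating = ["ALARM" if i % 2 == 0 else "OK" for i in range(43)]
summary = summarize_states(alternating)

if __name__ == "__main__":
    print(summary)
```

A flip count one less than the email count means every email reversed the previous one, which is the pattern that makes a flapping alarm a capacity signal rather than a one-off event.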
Global Discussion-Derived Signal
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-26 11:19 | TraefikServiceHighLatency | Warning | web-80 | General investigation | There was a slow-running query against catalog |
| 2026-03-25 10:58 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it and fix it; I found another problem like this yesterday, and I'll follow up on this one too |
| 2026-03-24 10:51 | KubeJobFailed | Warning | grafana | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |
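The byte counts quoted in the KubeJobFailed thread can be checked against the 128 MB limit the thread asks about. The arithmetic below only converts the two numbers from the error message; it does not confirm where the limit is configured, which still needs to be verified on the workload.

```python
# Values copied from the error message in the KubeJobFailed thread.
limit_bytes = 134217728     # "Allowed memory size of 134217728 bytes exhausted"
requested_bytes = 200704    # "tried to allocate 200704 bytes"

limit_mib = limit_bytes / 1024**2      # bytes -> MiB
requested_kib = requested_bytes / 1024  # bytes -> KiB

if __name__ == "__main__":
    print(f"limit: {limit_mib:g} MiB, failed allocation: {requested_kib:g} KiB")
```

The limit works out to exactly 128 MiB, and the allocation that failed was small (196 KiB), which suggests the job was already at the ceiling rather than making one oversized request.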