Production Reliability Dashboard
Generated 2026-03-17 16:22 from Slack #_alerts_prod and AWS SNS alert emails for the trailing 30 days ending March 15, 2026.
Start with the ranked charts to spot concentration, click one item to inspect the drill-down, then use the daily trends and lower tables as evidence. The board is designed to move from overview to investigation to raw supporting detail.
Chart panels: Top Slack Alert Families, Top Impacted Services / Resources, AWS Email Alarm Families, Drill-Down, Slack Alerts by Day, AWS Alert Emails by Day.
Slack Alert Families
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| KubeJobFailed | warning | 140 | 2026-03-15 14:57 | Recent (72h) | 3 | admission-end-session-29537300 (27), admission-end-session-29535860 (27), admission-end-session-29538740 (23), admission-end-session-29554580 (22), admission-end-session-29540180 (16) | DB / maintenance, General investigation, Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence; Marian will investigate |
| TraefikServiceHighLatency | warning | 140 | 2026-03-13 16:39 | Seen this week | 0 | ws2-api-svc-4300 (32), uni-api-svc-4000 (29), accommodations-api-svc-4100 (27), web-80 (15), ai-api-svc-3900 (14) | None | |
| KubePodCrashLooping | warning | 58 | 2026-03-13 17:01 | Seen this week | 0 | grafana (52), notifications-event-manager (3), unclassified (1), tempo (1), service-av (1) | None | |
| KubeDeploymentReplicasMismatch | warning | 57 | 2026-03-12 17:53 | Seen this week | 1 | grafana (54), notifications-event-manager (2), unclassified (1) | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| TraefikServiceHighErrorRate | critical | 57 | 2026-03-13 17:02 | Likely noise / resolved | 6 | uni-api-svc-4000 (26), ai-api-svc-3900 (12), accommodations-api-svc-4100 (7), library-api-svc-3200 (4), docgen2-api-svc-3600 (4) | Release / migration issue, App bug / schema mismatch, Alert tuning / noise | TODO: silence; it should not be critical for grafana |
| KubeHpaMaxedOut | warning | 48 | 2026-03-13 13:08 | Seen this week | 1 | subscriptions-api (27), grafana (10), docgen2-api (6), notifications-event-manager (5), unclassified (1) | Scaling config | Răzvan Ionică, can you take a look at this? It is the same KEDA config on Azure, not sure why the alert appears; did you find the cause?; if so, then yes. I think this comes from the Christmas financial crunch, when we scaled everything down to the minimum. |
| CPUThrottlingHigh | warning | 28 | 2026-03-12 09:00 | Likely noise / resolved | 2 | service-av (21), subscriptions-api (3), grafana (3), docgen2-api (1) | Alert tuning / noise | I removed the CPU limits everywhere, so it should not appear; I fixed it by hand and will take a look right away |
| 5xx Response Rate Alert | warning | 19 | 2026-02-20 20:46 | No recent signal | 9 | app-gateway-ingress-production-we (19) | DB / maintenance, Attack / traffic anomaly, Dependency failure, General investigation | attack |
| KubeNodeEviction | warning | 12 | 2026-03-10 12:27 | No recent signal | 0 | grafana (11), unclassified (1) | None | |
| KubeCPUOvercommit | warning | 11 | 2026-02-25 08:55 | No recent signal | 0 | grafana (11) | None | |
| Watchdog | warning | 11 | 2026-02-24 09:44 | Likely noise / resolved | 1 | unclassified (9), grafana (2) | Alert tuning / noise | I silenced it; it is a debug alert |
| KubePersistentVolumeFillingUp | warning | 9 | 2026-03-13 14:08 | Seen this week | 1 | grafana (9), loki (9) | Release / migration issue | deploying now to reduce the logs; in any case it is set up to delete old logs as it approaches 100 GB |
| Increased Latency Alert | warning | 7 | 2026-02-19 03:18 | No recent signal | 4 | app-gateway-ingress-production-we (7) | DB / maintenance, Attack / traffic anomaly | attack |
| NodeHighNumberConntrackEntriesUsed | warning | 7 | 2026-03-13 09:11 | Seen this week | 0 | grafana (7) | None | |
| TargetDown | warning | 7 | 2026-03-13 16:56 | Likely noise / resolved | 2 | unclassified (4), grafana (2), etcd (1), loki (1) | Alert tuning / noise, General investigation | a wrong config was applied; it is being fixed now |
| Anomaly Detected: Unusual Request Ratio | warning | 3 | 2026-02-20 20:49 | No recent signal | 0 | app-gateway-ingress-production-we (3) | None | |
| KubeClientErrors | warning | 3 | 2026-02-26 03:27 | No recent signal | 0 | unclassified (3) | None | |
| KubeJobNotCompleted | warning | 2 | 2026-03-07 08:05 | No recent signal | 0 | grafana (2) | None | |
| KubeletTooManyPods | warning | 2 | 2026-02-15 18:28 | No recent signal | 0 | unclassified (2) | None | |
| KubeAggregatedAPIDown | warning | 1 | 2026-02-15 16:24 | No recent signal | 0 | grafana (1) | None | |
| KubeContainerWaiting | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), subscriptions-api (1) | None | |
| KubeDeploymentRolloutStuck | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | subscriptions-api (1), uni-api (1) | None | |
| KubePodNotReady | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), notifications-event-manager (1), uni-api (1) | None | |
| KubeStatefulSetReplicasMismatch | warning | 1 | 2026-03-13 17:00 | Seen this week | 0 | grafana (1), loki (1) | None | |
| NodeCPUHighUsage | warning | 1 | 2026-03-12 18:18 | Seen this week | 0 | grafana (1) | None | |
| NodeMemoryHighUtilization | warning | 1 | 2026-03-10 12:24 | No recent signal | 1 | grafana (1) | Observability storage | it comes from loki, which consumed too much memory |
| NodeSystemSaturation | warning | 1 | 2026-03-12 18:17 | Seen this week | 0 | grafana (1) | None | |
| etcdInsufficientMembers | critical | 1 | 2026-02-25 10:12 | Likely noise / resolved | 1 | etcd (1) | Alert tuning / noise | it's OK; this alert should not be enabled on tuiasi |
| etcdMembersDown | warning | 1 | 2026-02-25 10:14 | No recent signal | 0 | etcd (1) | None | |
Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
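
The status labels in the table above are consistent with an age-based rule evaluated at the report generation time (2026-03-17 16:22), overridden when a thread carries the "Alert tuning / noise" signal. The following is a minimal sketch of such a rule, not the dashboard's actual code: the function name `classify_status`, the 24-hour cutoff for "Seen today", and the 7-day cutoff for "Seen this week" are assumptions inferred from the rows.

```python
from datetime import datetime, timedelta

# Signal label that appears on every "Likely noise / resolved" row in the tables.
NOISE_SIGNAL = "Alert tuning / noise"


def classify_status(last_seen, now, signals=()):
    """Return a heuristic status label for one alert family.

    The thresholds are assumptions inferred from the table, not confirmed values.
    """
    if NOISE_SIGNAL in signals:
        return "Likely noise / resolved"
    age = now - last_seen
    if age <= timedelta(hours=24):   # assumed cutoff for "Seen today"
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):     # assumed cutoff for "Seen this week"
        return "Seen this week"
    return "No recent signal"


generated_at = datetime(2026, 3, 17, 16, 22)
print(classify_status(datetime(2026, 3, 15, 14, 57), generated_at))
```

With these cutoffs, KubeJobFailed (last seen 2026-03-15 14:57) is labeled "Recent (72h)" and KubeNodeEviction (last seen 2026-03-10 12:27) falls just outside the 7-day window, matching the table.
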
Slack Impacted Service / Resource View
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
| Impacted Service / Resource | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|
| grafana | 173 | 2026-03-13 17:01 | Seen this week | KubeDeploymentReplicasMismatch (54), KubePodCrashLooping (52), KubeCPUOvercommit (11), KubeNodeEviction (11), KubeHpaMaxedOut (10) | Observability storage, Release / migration issue, General investigation | a wrong config was applied; it is being fixed now |
| uni-api-svc-4000 | 55 | 2026-03-11 10:04 | Seen this week | TraefikServiceHighLatency (29), TraefikServiceHighErrorRate (26) | App bug / schema mismatch | `d.nrCrt` missing from `disciplina`; `d1_0.an` missing from `disciplina`; Andrei Alexandru, it looks like some columns are still missing |
| accommodations-api-svc-4100 | 34 | 2026-03-11 16:31 | Seen this week | TraefikServiceHighLatency (27), TraefikServiceHighErrorRate (7) | None | |
| subscriptions-api | 32 | 2026-03-12 18:24 | Likely noise / resolved | KubeHpaMaxedOut (27), CPUThrottlingHigh (3), KubeDeploymentRolloutStuck (1), KubeContainerWaiting (1) | Scaling config, Alert tuning / noise | I removed the CPU limits everywhere, so it should not appear; I fixed it by hand and will take a look right away |
| ws2-api-svc-4300 | 32 | 2026-02-24 06:49 | No recent signal | TraefikServiceHighLatency (32) | None | |
| app-gateway-ingress-production-we | 29 | 2026-02-20 20:49 | No recent signal | 5xx Response Rate Alert (19), Increased Latency Alert (7), Anomaly Detected: Unusual Request Ratio (3) | DB / maintenance, Attack / traffic anomaly, Dependency failure, General investigation | attack |
| admission-end-session-29535860 | 27 | 2026-03-02 23:07 | No recent signal | KubeJobFailed (27) | None | |
| admission-end-session-29537300 | 27 | 2026-03-03 23:07 | No recent signal | KubeJobFailed (27) | None | |
| ai-api-svc-3900 | 26 | 2026-03-13 16:39 | Seen this week | TraefikServiceHighLatency (14), TraefikServiceHighErrorRate (12) | Release / migration issue | the migrations had not been run on ai |
| admission-end-session-29538740 | 23 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (23) | None | |
| admission-end-session-29554580 | 22 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (22) | Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence; Marian will investigate |
| service-av | 22 | 2026-02-20 08:26 | Likely noise / resolved | CPUThrottlingHigh (21), KubePodCrashLooping (1) | Alert tuning / noise | I increased the limit on av; it should no longer appear |
| unclassified | 22 | 2026-02-26 03:27 | Likely noise / resolved | Watchdog (9), TargetDown (4), KubeClientErrors (3), KubeletTooManyPods (2), KubeDeploymentReplicasMismatch (1) | Alert tuning / noise | these are on the new env |
| web-80 | 18 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (15), TraefikServiceHighErrorRate (3) | Release / migration issue | I did something wrong during the deploy; it works fine |
| admission-end-session-29540180 | 16 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (16) | None | |
| admission-end-session-29547380 | 16 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (16) | None | |
| admission-end-session-29556020 | 16 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (16) | None | |
| publish-results-29531525 | 15 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (15) | General investigation | error from the old ones; Marian fixed them |
| attendance-register-missed-attendance-29531530 | 13 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (13) | General investigation | error from the old ones; Marian fixed them |
| admission-end-session-29531540 | 12 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (12) | General investigation | error from the old ones; Marian fixed them |
| loki | 12 | 2026-03-13 17:01 | Seen this week | KubePersistentVolumeFillingUp (9), KubePodCrashLooping (1), KubeStatefulSetReplicasMismatch (1), TargetDown (1) | Release / migration issue, General investigation | a wrong config was applied; it is being fixed now |
| admission-end-session-29551700 | 11 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (11) | None | |
| notifications-event-manager | 11 | 2026-03-12 18:19 | Seen this week | KubeHpaMaxedOut (5), KubePodCrashLooping (3), KubeDeploymentReplicasMismatch (2), KubePodNotReady (1) | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| admission-end-session-29557460 | 10 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (10) | None | |
| attendance-register-missed-attendance-29518570 | 10 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (10) | DB / maintenance | jobs failed because of DB maintenance |
| core-grafana-80 | 10 | 2026-03-13 17:02 | Likely noise / resolved | TraefikServiceHighLatency (9), TraefikServiceHighErrorRate (1) | Alert tuning / noise | TODO: silence; it should not be critical for grafana |
| docgen2-api-svc-3600 | 10 | 2026-03-10 15:03 | No recent signal | TraefikServiceHighLatency (6), TraefikServiceHighErrorRate (4) | Release / migration issue | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", "status": 500; here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not needed for this fix |
| admission-end-session-29541620 | 9 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (9) | None | |
| admission-end-session-29544500 | 9 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (9) | None | |
| admission-end-session-29548820 | 9 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (9) | None | |
| docgen2-api | 9 | 2026-03-13 13:08 | Seen this week | KubeHpaMaxedOut (6), KubePodNotReady (1), KubeContainerWaiting (1), CPUThrottlingHigh (1) | None | |
| download-album-29518575 | 8 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (8) | DB / maintenance | jobs failed because of DB maintenance |
| download-album-29524325 | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None | |
| publish-results-29524325 | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None | |
| billing-api-svc-3100 | 7 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (7) | None | |
| download-album-29532525 | 6 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (6) | General investigation | error from the old ones; Marian fixed them |
| download-album-29552865 | 6 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (6) | None | |
| library-api-svc-3200 | 6 | 2026-03-05 08:10 | No recent signal | TraefikServiceHighErrorRate (4), TraefikServiceHighLatency (2) | None | |
| admission-end-session-29534420 | 5 | 2026-02-26 19:01 | No recent signal | KubeJobFailed (5) | None | |
| admission-end-session-29532980 | 4 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (4) | General investigation | error from the old ones; Marian fixed them |
| admission-end-session-29558900 | 4 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (4) | None | |
| websocket-3800 | 4 | 2026-02-22 22:17 | No recent signal | TraefikServiceHighLatency (4) | None | |
| admission-end-session-29553140 | 3 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (3) | None | |
| etcd | 3 | 2026-02-25 10:14 | Likely noise / resolved | TargetDown (1), etcdMembersDown (1), etcdInsufficientMembers (1) | Alert tuning / noise | it's OK; this alert should not be enabled on tuiasi |
| subscriptions-api-svc-3400 | 3 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (3) | None | |
| admission-end-session-29543060 | 2 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (2) | None | |
| admission-end-session-29545940 | 2 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (2) | None | |
| admission-end-session-29550260 | 2 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (2) | None | |
| publish-results-29527205 | 2 | 2026-02-22 16:12 | No recent signal | KubeJobFailed (2) | None | |
| uni-api | 2 | 2026-02-17 18:25 | No recent signal | KubeDeploymentRolloutStuck (1), KubePodNotReady (1) | None | |
| tempo | 1 | 2026-02-16 21:29 | No recent signal | KubePodCrashLooping (1) | None | |
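
The attribution rule stated for this view (observability components are dropped only when a more specific target is also named) can be sketched as follows. This is a minimal illustration, assuming each alert has already been parsed into a list of target names. The function name `attribute_targets` is hypothetical, and using `unclassified` as the fallback for alerts that name no target is an assumption based on the label appearing in the table.

```python
# Components the dashboard treats as observability infrastructure.
OBSERVABILITY = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets one alert should be counted against.

    Observability components are excluded only when at least one more
    specific target is present; otherwise they are kept as named.
    """
    targets = list(dict.fromkeys(named_targets))  # de-duplicate, keep order
    specific = [t for t in targets if t.lower() not in OBSERVABILITY]
    if specific:
        return specific
    # Assumed fallback: alerts naming no target at all go to "unclassified".
    return targets or ["unclassified"]


print(attribute_targets(["grafana", "subscriptions-api"]))
print(attribute_targets(["grafana", "loki"]))
```

Under this rule an alert naming both `grafana` and `subscriptions-api` counts only against `subscriptions-api`, while an alert naming only `grafana` and `loki` counts against both.
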
AWS Email Alarm Families
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-master-memory-low | 56 | 25 | 31 | 43 | 2026-03-04 15:19 | 2026-03-15 17:31 | OK | Flapping, latest OK |
| adservio-root-account-usage | 5 | 2 | 3 | 4 | 2026-02-25 11:53 | 2026-02-26 11:30 | OK | Latest OK |
| adservio-rds-mysql-catalog-memory-low | 1 | 1 | 0 | 0 | 2026-03-10 22:15 | 2026-03-10 22:15 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-swap-high | 1 | 1 | 0 | 0 | 2026-03-11 11:38 | 2026-03-11 11:38 | ALARM | Still alarming |
| adservio-rds-postgres-billing-cpu-high | 1 | 1 | 0 | 0 | 2026-03-01 09:12 | 2026-03-01 09:12 | ALARM | Still alarming |
“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
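
The State Flips column and the Status labels can be derived from the time-ordered sequence of ALARM/OK emails for each alarm. The sketch below assumes a flip is any change between consecutive states. The function names and the `flap_threshold=5` default are assumptions, chosen only so that 4 flips reads as "Latest OK" and 43 flips as "Flapping, latest OK", as in the table; the dashboard's real threshold is not stated.

```python
def count_flips(states):
    """Count changes between consecutive states in a time-ordered list."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold=5):
    """Summarize one alarm's email history as a status label.

    flap_threshold is an assumed value, not taken from the dashboard.
    """
    if not states:
        raise ValueError("at least one state is required")
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


history = ["OK", "ALARM", "OK", "ALARM", "OK"]
print(count_flips(history), alarm_status(history))
```

The example history has the same shape as `adservio-root-account-usage` (5 emails, 2 ALARM, 3 OK, 4 flips), and a single ALARM email with no follow-up yields "Still alarming" with 0 flips.
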
Discussion-Derived Signal
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-13 17:02 | TraefikServiceHighErrorRate | critical | core-grafana-80 | Alert tuning / noise | TODO: silence; it should not be critical for grafana |
| 2026-03-13 16:56 | TargetDown | warning | grafana, loki | General investigation | a wrong config was applied; it is being fixed now |
| 2026-03-12 13:48 | KubeDeploymentReplicasMismatch | warning | notifications-event-manager | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| 2026-03-12 10:52 | KubeJobFailed | warning | admission-end-session-29554580 | Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence; Marian will investigate |
| 2026-03-12 08:25 | CPUThrottlingHigh | warning | subscriptions-api | Alert tuning / noise | I removed the CPU limits everywhere, so it should not appear; I fixed it by hand and will take a look right away |
| 2026-03-11 08:01 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | App bug / schema mismatch | `d.nrCrt` missing from `disciplina`; `d1_0.an` missing from `disciplina`; Andrei Alexandru, it looks like some columns are still missing |
| 2026-03-10 12:24 | NodeMemoryHighUtilization | warning | grafana | Observability storage | it comes from loki, which consumed too much memory |
| 2026-03-10 10:49 | TraefikServiceHighErrorRate | critical | docgen2-api-svc-3600 | Release / migration issue | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", "status": 500; here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not needed for this fix |
| 2026-03-10 09:12 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | App bug / schema mismatch | Got error 'missing ) at offset 911' from regexp; we have a fix in main for this point as well. I added extra escaping for the characters that go into that regex |
| 2026-03-06 14:16 | KubePersistentVolumeFillingUp | warning | grafana, loki | Release / migration issue | deploying now to reduce the logs; in any case it is set up to delete old logs as it approaches 100 GB |
| 2026-02-25 16:23 | TraefikServiceHighErrorRate | critical | ai-api-svc-3900 | Release / migration issue | the migrations had not been run on ai |
| 2026-02-25 10:55 | KubeJobFailed | warning | admission-end-session-29531540, admission-end-session-29532980, attendance-register-missed-attendance-29531530, download-album-29532525 | General investigation | error from the old ones; Marian fixed them |
| 2026-02-25 10:12 | etcdInsufficientMembers | critical | etcd | Alert tuning / noise | it's OK; this alert should not be enabled on tuiasi |
| 2026-02-23 14:28 | TraefikServiceHighErrorRate | critical | web-80 | Release / migration issue | I did something wrong during the deploy; it works fine |
| 2026-02-23 07:16 | KubeHpaMaxedOut | warning | subscriptions-api | Scaling config | Răzvan Ionică, can you take a look at this? It is the same KEDA config on Azure, not sure why the alert appears; did you find the cause?; if so, then yes. I think this comes from the Christmas financial crunch, when we scaled everything down to the minimum. |
| 2026-02-20 08:26 | CPUThrottlingHigh | warning | service-av | Alert tuning / noise | I increased the limit on av; it should no longer appear |
| 2026-02-17 17:22 | TargetDown | warning | unclassified | Alert tuning / noise | these are on the new env |
| 2026-02-16 06:32 | KubeJobFailed | warning | attendance-register-missed-attendance-29518570, download-album-29518575 | DB / maintenance | jobs failed because of DB maintenance |
| 2026-02-16 06:11 | Watchdog | warning | unclassified | Alert tuning / noise | I silenced it; it is a debug alert |
| 2026-02-16 03:26 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-16 03:23 | Increased Latency Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-16 03:21 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-15 18:01 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Dependency failure | the redis and rabbitmq node went down here |
| 2026-02-15 03:36 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | General investigation | there are 13 requests here |
| 2026-02-15 03:33 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 03:11 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 03:08 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:31 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:26 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:21 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:13 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:11 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |