Production Reliability Dashboard

Generated 2026-03-17 16:22 from Slack #_alerts_prod and AWS SNS alert emails for the trailing 30 days ending March 15, 2026.

Slack messages: 654 (alert and ops posts in channel history)
Slack discussions: 32 (threads with human follow-up we could mine for signal)
AWS alert emails: 64 (Inbox/Trash/Spam matches from SNS sender)
Latest observed event: 2026-03-15 17:31 (most recent alert timestamp seen in either source)

Flow

Start with the ranked charts to spot concentration, click one item to inspect the drill-down, then use the daily trends and lower tables as evidence. The board is designed to move from overview to investigation to raw supporting detail.

Ranked charts: Top Slack Alert Families; Top Impacted Services / Resources; AWS Email Alarm Families

Drill-Down

Click an alert family, service/resource, or AWS alarm to inspect its trend, related entities, and recent evidence.

Daily trend charts: Slack Alerts by Day; AWS Alert Emails by Day

Evidence

Slack Alert Families

| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| KubeJobFailed | warning | 140 | 2026-03-15 14:57 | Recent (72h) | 3 | admission-end-session-29537300 (27), admission-end-session-29535860 (27), admission-end-session-29538740 (23), admission-end-session-29554580 (22), admission-end-session-29540180 (16) | DB / maintenance; General investigation; Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence logic; Marian will investigate |
| TraefikServiceHighLatency | warning | 140 | 2026-03-13 16:39 | Seen this week | 0 | ws2-api-svc-4300 (32), uni-api-svc-4000 (29), accommodations-api-svc-4100 (27), web-80 (15), ai-api-svc-3900 (14) | None | |
| KubePodCrashLooping | warning | 58 | 2026-03-13 17:01 | Seen this week | 0 | grafana (52), notifications-event-manager (3), unclassified (1), tempo (1), service-av (1) | None | |
| KubeDeploymentReplicasMismatch | warning | 57 | 2026-03-12 17:53 | Seen this week | 1 | grafana (54), notifications-event-manager (2), unclassified (1) | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| TraefikServiceHighErrorRate | critical | 57 | 2026-03-13 17:02 | Likely noise / resolved | 6 | uni-api-svc-4000 (26), ai-api-svc-3900 (12), accommodations-api-svc-4100 (7), library-api-svc-3200 (4), docgen2-api-svc-3600 (4) | Release / migration issue; App bug / schema mismatch; Alert tuning / noise | TODO: silence; this should not be critical for grafana |
| KubeHpaMaxedOut | warning | 48 | 2026-03-13 13:08 | Seen this week | 1 | subscriptions-api (27), grafana (10), docgen2-api (6), notifications-event-manager (5), unclassified (1) | Scaling config | Răzvan Ionică, can you take a look at this? It is the same KEDA config as on Azure, not sure why the alert fires; did you find the cause?; if so, yes. I think this comes from the Christmas financial crunch, when we scaled everything down to the minimum. |
| CPUThrottlingHigh | warning | 28 | 2026-03-12 09:00 | Likely noise / resolved | 2 | service-av (21), subscriptions-api (3), grafana (3), docgen2-api (1) | Alert tuning / noise | I removed the CPU limits everywhere, so it should not show up; I fixed it by hand, I will take a look right away |
| 5xx Response Rate Alert | warning | 19 | 2026-02-20 20:46 | No recent signal | 9 | app-gateway-ingress-production-we (19) | DB / maintenance; Attack / traffic anomaly; Dependency failure; General investigation | attack |
| KubeNodeEviction | warning | 12 | 2026-03-10 12:27 | No recent signal | 0 | grafana (11), unclassified (1) | None | |
| KubeCPUOvercommit | warning | 11 | 2026-02-25 08:55 | No recent signal | 0 | grafana (11) | None | |
| Watchdog | warning | 11 | 2026-02-24 09:44 | Likely noise / resolved | 1 | unclassified (9), grafana (2) | Alert tuning / noise | I silenced it; it is a debug alert |
| KubePersistentVolumeFillingUp | warning | 9 | 2026-03-13 14:08 | Seen this week | 1 | grafana (9), loki (9) | Release / migration issue | Deploying now to reduce the logs; it is set up to delete old logs anyway once it gets close to 100 GB |
| Increased Latency Alert | warning | 7 | 2026-02-19 03:18 | No recent signal | 4 | app-gateway-ingress-production-we (7) | DB / maintenance; Attack / traffic anomaly | attack |
| NodeHighNumberConntrackEntriesUsed | warning | 7 | 2026-03-13 09:11 | Seen this week | 0 | grafana (7) | None | |
| TargetDown | warning | 7 | 2026-03-13 16:56 | Likely noise / resolved | 2 | unclassified (4), grafana (2), etcd (1), loki (1) | Alert tuning / noise; General investigation | A wrong config was applied; it is being fixed now |
| Anomaly Detected: Unusual Request Ratio | warning | 3 | 2026-02-20 20:49 | No recent signal | 0 | app-gateway-ingress-production-we (3) | None | |
| KubeClientErrors | warning | 3 | 2026-02-26 03:27 | No recent signal | 0 | unclassified (3) | None | |
| KubeJobNotCompleted | warning | 2 | 2026-03-07 08:05 | No recent signal | 0 | grafana (2) | None | |
| KubeletTooManyPods | warning | 2 | 2026-02-15 18:28 | No recent signal | 0 | unclassified (2) | None | |
| KubeAggregatedAPIDown | warning | 1 | 2026-02-15 16:24 | No recent signal | 0 | grafana (1) | None | |
| KubeContainerWaiting | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), subscriptions-api (1) | None | |
| KubeDeploymentRolloutStuck | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | subscriptions-api (1), uni-api (1) | None | |
| KubePodNotReady | warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), notifications-event-manager (1), uni-api (1) | None | |
| KubeStatefulSetReplicasMismatch | warning | 1 | 2026-03-13 17:00 | Seen this week | 0 | grafana (1), loki (1) | None | |
| NodeCPUHighUsage | warning | 1 | 2026-03-12 18:18 | Seen this week | 0 | grafana (1) | None | |
| NodeMemoryHighUtilization | warning | 1 | 2026-03-10 12:24 | No recent signal | 1 | grafana (1) | Observability storage | It comes from loki; it used too much memory |
| NodeSystemSaturation | warning | 1 | 2026-03-12 18:17 | Seen this week | 0 | grafana (1) | None | |
| etcdInsufficientMembers | critical | 1 | 2026-02-25 10:12 | Likely noise / resolved | 1 | etcd (1) | Alert tuning / noise | It is OK; this alert should not be enabled on tuiasi |
| etcdMembersDown | warning | 1 | 2026-02-25 10:14 | No recent signal | 0 | etcd (1) | None | |

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
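
A minimal sketch of how such a status heuristic could work. The dashboard does not state its actual rules: the 24-hour, 72-hour, and 7-day cut-offs, the use of the generation time as the reference point, and the rule that an "Alert tuning / noise" signal maps to "Likely noise / resolved" are assumptions chosen to agree with the rows shown here. The function name `classify_status` is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: any thread tagged with this signal marks the family as noise.
NOISE_SIGNAL = "Alert tuning / noise"


def classify_status(last_seen, now, signals=()):
    """Return a heuristic status label from the last-seen time and thread signals."""
    if NOISE_SIGNAL in signals:
        return "Likely noise / resolved"
    age = now - last_seen
    # Assumed cut-offs; the real thresholds are not given in the report.
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "No recent signal"


# Reference point: the report's generation time.
generated = datetime(2026, 3, 17, 16, 22)
print(classify_status(datetime(2026, 3, 15, 14, 57), generated))
print(classify_status(datetime(2026, 3, 13, 16, 39), generated))
print(classify_status(datetime(2026, 3, 10, 12, 27), generated))
```

Because the label depends only on recency and thread tags, it says nothing about whether the underlying problem was fixed, which is why the status should be read as a hint rather than a resolution state.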

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
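
The attribution rule above can be sketched roughly as follows. This is an illustration under assumptions, not the dashboard's code: the function name `attribute_targets`, the case-insensitive match, and the fallback to `unclassified` when no target is named are all hypothetical.

```python
# Components treated as observability tooling rather than impacted workloads.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(names):
    """Pick the impacted targets for one alert from the names found in its text."""
    if not names:
        # Assumption: alerts that name no workload land in an "unclassified" bucket.
        return ["unclassified"]
    specific = [n for n in names if n.lower() not in OBSERVABILITY_COMPONENTS]
    # Drop observability components only when something more specific is present.
    return specific if specific else list(names)


print(attribute_targets(["grafana", "subscriptions-api"]))
print(attribute_targets(["grafana", "loki"]))
print(attribute_targets([]))
```

Under this sketch an alert that names only observability components is still counted against them, which would explain rows such as grafana and loki appearing in the table.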

| Impacted Service / Resource | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|
| grafana | 173 | 2026-03-13 17:01 | Seen this week | KubeDeploymentReplicasMismatch (54), KubePodCrashLooping (52), KubeCPUOvercommit (11), KubeNodeEviction (11), KubeHpaMaxedOut (10) | Observability storage; Release / migration issue; General investigation | A wrong config was applied; it is being fixed now |
| uni-api-svc-4000 | 55 | 2026-03-11 10:04 | Seen this week | TraefikServiceHighLatency (29), TraefikServiceHighErrorRate (26) | App bug / schema mismatch | `d.nrCrt` missing from `disciplina`; `d1_0.an` missing from `disciplina`; Andrei Alexandru, it looks like some columns are still missing |
| accommodations-api-svc-4100 | 34 | 2026-03-11 16:31 | Seen this week | TraefikServiceHighLatency (27), TraefikServiceHighErrorRate (7) | None | |
| subscriptions-api | 32 | 2026-03-12 18:24 | Likely noise / resolved | KubeHpaMaxedOut (27), CPUThrottlingHigh (3), KubeDeploymentRolloutStuck (1), KubeContainerWaiting (1) | Scaling config; Alert tuning / noise | I removed the CPU limits everywhere, so it should not show up; I fixed it by hand, I will take a look right away |
| ws2-api-svc-4300 | 32 | 2026-02-24 06:49 | No recent signal | TraefikServiceHighLatency (32) | None | |
| app-gateway-ingress-production-we | 29 | 2026-02-20 20:49 | No recent signal | 5xx Response Rate Alert (19), Increased Latency Alert (7), Anomaly Detected: Unusual Request Ratio (3) | DB / maintenance; Attack / traffic anomaly; Dependency failure; General investigation | attack |
| admission-end-session-29535860 | 27 | 2026-03-02 23:07 | No recent signal | KubeJobFailed (27) | None | |
| admission-end-session-29537300 | 27 | 2026-03-03 23:07 | No recent signal | KubeJobFailed (27) | None | |
| ai-api-svc-3900 | 26 | 2026-03-13 16:39 | Seen this week | TraefikServiceHighLatency (14), TraefikServiceHighErrorRate (12) | Release / migration issue | The migrations had not been run on ai |
| admission-end-session-29538740 | 23 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (23) | None | |
| admission-end-session-29554580 | 22 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (22) | Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence logic; Marian will investigate |
| service-av | 22 | 2026-02-20 08:26 | Likely noise / resolved | CPUThrottlingHigh (21), KubePodCrashLooping (1) | Alert tuning / noise | I increased the limit on av; it should not show up anymore |
| unclassified | 22 | 2026-02-26 03:27 | Likely noise / resolved | Watchdog (9), TargetDown (4), KubeClientErrors (3), KubeletTooManyPods (2), KubeDeploymentReplicasMismatch (1) | Alert tuning / noise | These are on the new env |
| web-80 | 18 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (15), TraefikServiceHighErrorRate (3) | Release / migration issue | I did something wrong during the deploy; it works fine |
| admission-end-session-29540180 | 16 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (16) | None | |
| admission-end-session-29547380 | 16 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (16) | None | |
| admission-end-session-29556020 | 16 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (16) | None | |
| publish-results-29531525 | 15 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (15) | General investigation | Error from the old ones; Marian fixed them |
| attendance-register-missed-attendance-29531530 | 13 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (13) | General investigation | Error from the old ones; Marian fixed them |
| admission-end-session-29531540 | 12 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (12) | General investigation | Error from the old ones; Marian fixed them |
| loki | 12 | 2026-03-13 17:01 | Seen this week | KubePersistentVolumeFillingUp (9), KubePodCrashLooping (1), KubeStatefulSetReplicasMismatch (1), TargetDown (1) | Release / migration issue; General investigation | A wrong config was applied; it is being fixed now |
| admission-end-session-29551700 | 11 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (11) | None | |
| notifications-event-manager | 11 | 2026-03-12 18:19 | Seen this week | KubeHpaMaxedOut (5), KubePodCrashLooping (3), KubeDeploymentReplicasMismatch (2), KubePodNotReady (1) | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| admission-end-session-29557460 | 10 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (10) | None | |
| attendance-register-missed-attendance-29518570 | 10 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (10) | DB / maintenance | Jobs failed because of DB maintenance |
| core-grafana-80 | 10 | 2026-03-13 17:02 | Likely noise / resolved | TraefikServiceHighLatency (9), TraefikServiceHighErrorRate (1) | Alert tuning / noise | TODO: silence; this should not be critical for grafana |
| docgen2-api-svc-3600 | 10 | 2026-03-10 15:03 | No recent signal | TraefikServiceHighLatency (6), TraefikServiceHighErrorRate (4) | Release / migration issue | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", "status": 500; here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not needed for this fix |
| admission-end-session-29541620 | 9 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (9) | None | |
| admission-end-session-29544500 | 9 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (9) | None | |
| admission-end-session-29548820 | 9 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (9) | None | |
| docgen2-api | 9 | 2026-03-13 13:08 | Seen this week | KubeHpaMaxedOut (6), KubePodNotReady (1), KubeContainerWaiting (1), CPUThrottlingHigh (1) | None | |
| download-album-29518575 | 8 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (8) | DB / maintenance | Jobs failed because of DB maintenance |
| download-album-29524325 | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None | |
| publish-results-29524325 | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None | |
| billing-api-svc-3100 | 7 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (7) | None | |
| download-album-29532525 | 6 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (6) | General investigation | Error from the old ones; Marian fixed them |
| download-album-29552865 | 6 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (6) | None | |
| library-api-svc-3200 | 6 | 2026-03-05 08:10 | No recent signal | TraefikServiceHighErrorRate (4), TraefikServiceHighLatency (2) | None | |
| admission-end-session-29534420 | 5 | 2026-02-26 19:01 | No recent signal | KubeJobFailed (5) | None | |
| admission-end-session-29532980 | 4 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (4) | General investigation | Error from the old ones; Marian fixed them |
| admission-end-session-29558900 | 4 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (4) | None | |
| websocket-3800 | 4 | 2026-02-22 22:17 | No recent signal | TraefikServiceHighLatency (4) | None | |
| admission-end-session-29553140 | 3 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (3) | None | |
| etcd | 3 | 2026-02-25 10:14 | Likely noise / resolved | TargetDown (1), etcdMembersDown (1), etcdInsufficientMembers (1) | Alert tuning / noise | It is OK; this alert should not be enabled on tuiasi |
| subscriptions-api-svc-3400 | 3 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (3) | None | |
| admission-end-session-29543060 | 2 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (2) | None | |
| admission-end-session-29545940 | 2 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (2) | None | |
| admission-end-session-29550260 | 2 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (2) | None | |
| publish-results-29527205 | 2 | 2026-02-22 16:12 | No recent signal | KubeJobFailed (2) | None | |
| uni-api | 2 | 2026-02-17 18:25 | No recent signal | KubeDeploymentRolloutStuck (1), KubePodNotReady (1) | None | |
| tempo | 1 | 2026-02-16 21:29 | No recent signal | KubePodCrashLooping (1) | None | |

AWS Email Alarm Families

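| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-master-memory-low | 56 | 25 | 31 | 43 | 2026-03-04 15:19 | 2026-03-15 17:31 | OK | Flapping, latest OK |
| adservio-root-account-usage | 5 | 2 | 3 | 4 | 2026-02-25 11:53 | 2026-02-26 11:30 | OK | Latest OK |
| adservio-rds-mysql-catalog-memory-low | 1 | 1 | 0 | 0 | 2026-03-10 22:15 | 2026-03-10 22:15 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-swap-high | 1 | 1 | 0 | 0 | 2026-03-11 11:38 | 2026-03-11 11:38 | ALARM | Still alarming |
| adservio-rds-postgres-billing-cpu-high | 1 | 1 | 0 | 0 | 2026-03-01 09:12 | 2026-03-01 09:12 | ALARM | Still alarming |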
adservio-rds-postgres-billing-cpu-high11002026-03-01 09:122026-03-01 09:12ALARMStill alarming

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
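
One way the State Flips column and the status labels could be derived from the chronological sequence of email states is sketched below. The report does not say where it draws the line between "Latest OK" and "Flapping, latest OK", so `flap_threshold` and its default of 5 are hypothetical, as are the function names.

```python
def count_state_flips(states):
    """Count transitions between consecutive states, e.g. ALARM -> OK."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold=5):
    """Label an alarm from its chronological list of 'ALARM' / 'OK' email states."""
    if not states:
        raise ValueError("at least one alert email is required")
    if states[-1] == "ALARM":
        return "Still alarming"
    # Assumption: enough toggles make an alarm "flapping" even if it ended on OK.
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


print(alarm_status(["ALARM"]))
print(alarm_status(["OK", "ALARM", "OK", "ALARM", "OK"]))
print(alarm_status(["ALARM", "OK"] * 3))
```

Counting flips rather than raw emails separates an alarm that fired once and stayed in ALARM from one that keeps toggling, which is the distinction the status column is trying to surface.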

Discussion-Derived Signal

| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-13 17:02 | TraefikServiceHighErrorRate | critical | core-grafana-80 | Alert tuning / noise | TODO: silence; this should not be critical for grafana |
| 2026-03-13 16:56 | TargetDown | warning | grafana, loki | General investigation | A wrong config was applied; it is being fixed now |
| 2026-03-12 13:48 | KubeDeploymentReplicasMismatch | warning | notifications-event-manager | Batch code bug | The arguments array must contain 2 items, 1 given in Notificari.php |
| 2026-03-12 10:52 | KubeJobFailed | warning | admission-end-session-29554580 | Observability storage | Call to a member function format() on bool in FormItem.php; Marian, something is broken in the recurrence logic; Marian will investigate |
| 2026-03-12 08:25 | CPUThrottlingHigh | warning | subscriptions-api | Alert tuning / noise | I removed the CPU limits everywhere, so it should not show up; I fixed it by hand, I will take a look right away |
| 2026-03-11 08:01 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | App bug / schema mismatch | `d.nrCrt` missing from `disciplina`; `d1_0.an` missing from `disciplina`; Andrei Alexandru, it looks like some columns are still missing |
| 2026-03-10 12:24 | NodeMemoryHighUtilization | warning | grafana | Observability storage | It comes from loki; it used too much memory |
| 2026-03-10 10:49 | TraefikServiceHighErrorRate | critical | docgen2-api-svc-3600 | Release / migration issue | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", "status": 500; here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not needed for this fix |
| 2026-03-10 09:12 | TraefikServiceHighErrorRate | critical | uni-api-svc-4000 | App bug / schema mismatch | Got error 'missing ) at offset 911' from regexp; we have a fix for this in main as well. I added extra escaping for the characters that go into that regex |
| 2026-03-06 14:16 | KubePersistentVolumeFillingUp | warning | grafana, loki | Release / migration issue | Deploying now to reduce the logs; it is set up to delete old logs anyway once it gets close to 100 GB |
| 2026-02-25 16:23 | TraefikServiceHighErrorRate | critical | ai-api-svc-3900 | Release / migration issue | The migrations had not been run on ai |
| 2026-02-25 10:55 | KubeJobFailed | warning | admission-end-session-29531540, admission-end-session-29532980, attendance-register-missed-attendance-29531530, download-album-29532525 | General investigation | Error from the old ones; Marian fixed them |
| 2026-02-25 10:12 | etcdInsufficientMembers | critical | etcd | Alert tuning / noise | It is OK; this alert should not be enabled on tuiasi |
| 2026-02-23 14:28 | TraefikServiceHighErrorRate | critical | web-80 | Release / migration issue | I did something wrong during the deploy; it works fine |
| 2026-02-23 07:16 | KubeHpaMaxedOut | warning | subscriptions-api | Scaling config | Răzvan Ionică, can you take a look at this? It is the same KEDA config as on Azure, not sure why the alert fires; did you find the cause?; if so, yes. I think this comes from the Christmas financial crunch, when we scaled everything down to the minimum. |
| 2026-02-20 08:26 | CPUThrottlingHigh | warning | service-av | Alert tuning / noise | I increased the limit on av; it should not show up anymore |
| 2026-02-17 17:22 | TargetDown | warning | unclassified | Alert tuning / noise | These are on the new env |
| 2026-02-16 06:32 | KubeJobFailed | warning | attendance-register-missed-attendance-29518570, download-album-29518575 | DB / maintenance | Jobs failed because of DB maintenance |
| 2026-02-16 06:11 | Watchdog | warning | unclassified | Alert tuning / noise | I silenced it; it is a debug alert |
| 2026-02-16 03:26 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-16 03:23 | Increased Latency Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-16 03:21 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Attack / traffic anomaly | attack |
| 2026-02-15 18:01 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | Dependency failure | The redis and rabbitmq node went down here |
| 2026-02-15 03:36 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | General investigation | There are 13 requests here |
| 2026-02-15 03:33 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 03:11 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 03:08 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:31 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:26 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:21 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:13 | Increased Latency Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |
| 2026-02-15 02:11 | 5xx Response Rate Alert | warning | app-gateway-ingress-production-we | DB / maintenance | db restart |