Production Reliability Dashboard
Generated 2026-04-01 17:24 from Pingdom synthetic checks, Slack #_alerts_prod, and AWS SNS alert emails for 2026-03-16 00:00 to 2026-03-23 00:00.
What Needs Attention
Bottom line: critical application-level alerts (TraefikServiceHighErrorRate) fired in this window, and one catalog database alarm (adservio-rds-mysql-master-memory-low) still looks active.
Signal Over Noise
- TraefikServiceHighErrorRate is the highest-severity application issue in this window, touching uni-api-svc-4000 (12), accommodations-api-svc-4100 (5); the freshest signal was on uni-api-svc-4000 at 2026-03-20 21:47. Discussion hints: Thread summary · Raul Popovici: SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… | Andr….
- AWS pressure is concentrated in one still-alarming catalog-related alarm, adservio-rds-mysql-master-memory-low, with 28 email events.
- docgen2-api is also showing capacity pressure through KubeHpaMaxedOut (6), which makes it a secondary scaling watch item.
Recommended Actions
- Treat TraefikServiceHighErrorRate as the primary application investigation. Reproduce the failing path on uni-api-svc-4000 and accommodations-api-svc-4100, compare against the latest deploy or config change, and do not close it until both error rate and latency flatten. The thread points toward a data-integrity or write-path failure, so recent schema or persistence changes should be checked first.
- Run a focused database-capacity investigation on the catalog instances now. Persistent memory-low and swap-high alarms usually indicate genuine system pressure and should not be left as background noise; in this window only the memory-low alarm fired.
- Check whether docgen2-api needs a short-term scaling adjustment or a queue and load change before the next traffic bump.
Global Evidence Explorer
Report-wide charts and tables stay here, separate from the active investigation scope.
Charts (not reproduced in this text): Pingdom Events by Day, Slack Alerts by Day, AWS Alert Emails by Day.
Pingdom Checks
| Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence |
|---|---|---|---|---|---|---|
| https://www.adservio.ro/api/v2/status | No recent customer-visible issue | 2 | 2m | 2026-03-23 00:00 | adservio-ro-api-v2-status | adservio-rds-mysql-master-memory-low, adservio-rds-mysql-master-write-latency-high |
| Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-03-23 00:00 | adservio-ro | Pingdom-only evidence so far |
Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.
Slack Alert Families
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| TraefikServiceHighErrorRate | Critical | 17 | 2026-03-20 21:47 | No recent signal | 3 | uni-api-svc-4000 (12), accommodations-api-svc-4100 (5) | General investigation, Observability storage | SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… \| Andrei Alexandru |
| KubeJobFailed | Warning | 17 | 2026-03-19 07:46 | No recent signal | 0 | admission-end-session-29561780 (14), admission-end-session-29556020 (3), admission-end-session-29557460 (3), admission-end-session-29558900 (3), admission-end-session-29560340 (3) | None | |
| KubeHpaMaxedOut | Warning | 9 | 2026-03-20 11:11 | No recent signal | 0 | docgen2-api (6), subscriptions-api (3) | None | |
| TraefikServiceHighLatency | Warning | 5 | 2026-03-20 16:35 | No recent signal | 0 | ai-api-svc-3900 (4), web-80 (1) | None | |
| NodeSystemSaturation | Warning | 3 | 2026-03-19 12:27 | No recent signal | 0 | grafana (3) | None | |
| NodeCPUHighUsage | Warning | 1 | 2026-03-18 14:53 | No recent signal | 0 | grafana (1) | None | |
Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
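The status heuristic described above can be sketched as a function of the last-seen timestamp alone. This is a minimal illustration, not the dashboard's actual logic: the function name `heuristic_status` and the three-day `RECENT_WINDOW` cutoff are assumptions, since the report does not state where it draws the line between "Recent" and "No recent signal".

```python
from datetime import datetime, timedelta

# Assumption: alerts last seen within this window count as "Recent".
# The report does not state its real cutoff.
RECENT_WINDOW = timedelta(days=3)


def heuristic_status(last_seen: datetime, now: datetime) -> str:
    """Label an alert family using only when it was last seen.

    Slack rarely posts explicit resolutions, so the label says how
    recently the alert appeared, not whether it is resolved.
    """
    if last_seen.date() == now.date():
        return "Seen today"
    if now - last_seen <= RECENT_WINDOW:
        return "Recent"
    return "No recent signal"


generated = datetime(2026, 4, 1, 17, 24)  # report generation time
print(heuristic_status(datetime(2026, 3, 20, 21, 47), generated))
# prints "No recent signal"
```

Because the report was generated more than a week after the window closed, every row in the table above would land in "No recent signal" under any cutoff shorter than about nine days.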
Slack Impacted Service / Resource View
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
| Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|
| uni-api-svc-4000 | Critical | 12 | 2026-03-20 21:47 | No recent signal | TraefikServiceHighErrorRate (12) | Observability storage, General investigation | SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… \| Andrei Alexandru |
| accommodations-api-svc-4100 | Critical | 5 | 2026-03-18 13:24 | No recent signal | TraefikServiceHighErrorRate (5) | General investigation | Ionut Ciolan \| why do those stack traces show up? they don't seem useful |
| admission-end-session-29561780 | Warning | 14 | 2026-03-19 07:46 | No recent signal | KubeJobFailed (14) | None | |
| docgen2-api | Warning | 6 | 2026-03-19 13:11 | No recent signal | KubeHpaMaxedOut (6) | None | |
| ai-api-svc-3900 | Warning | 4 | 2026-03-20 16:35 | No recent signal | TraefikServiceHighLatency (4) | None | |
| grafana | Warning | 4 | 2026-03-19 12:27 | No recent signal | NodeSystemSaturation (3), NodeCPUHighUsage (1) | None | |
| subscriptions-api | Warning | 3 | 2026-03-20 11:11 | No recent signal | KubeHpaMaxedOut (3) | None | |
| admission-end-session-29556020 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None | |
| admission-end-session-29557460 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None | |
| admission-end-session-29558900 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None | |
| admission-end-session-29560340 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None | |
| update-recurenta-29554565 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None | |
| web-80 | Warning | 1 | 2026-03-17 04:13 | No recent signal | TraefikServiceHighLatency (1) | None | |
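The attribution rule stated above this table (observability components are dropped only when a more specific target is also named) can be sketched as follows. The function name `impacted_targets` and the assumption that target names have already been extracted from the alert text are hypothetical; the report does not describe how names are parsed.

```python
# Components the report treats as observability tooling.
OBSERVABILITY = {"grafana", "loki", "tempo"}


def impacted_targets(named: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    `named` is the list of workloads or resources found in the alert
    text (extraction is assumed to have happened upstream).
    Observability components are excluded only when at least one more
    specific target remains; otherwise they are kept as-is.
    """
    specific = [n for n in named if n.lower() not in OBSERVABILITY]
    return specific if specific else list(named)


print(impacted_targets(["grafana", "uni-api-svc-4000"]))
# prints ['uni-api-svc-4000']
print(impacted_targets(["grafana"]))
# prints ['grafana']
```

The fallback in the second call is why grafana still appears as its own row for NodeSystemSaturation and NodeCPUHighUsage: no more specific target was named in those alerts.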
AWS Email Alarm Families
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-master-memory-low | 28 | 14 | 14 | 27 | 2026-03-16 00:55 | 2026-03-18 07:09 | ALARM | Still alarming |
| adservio-rds-mysql-master-write-latency-high | 2 | 1 | 1 | 1 | 2026-03-16 23:08 | 2026-03-16 23:13 | OK | Latest OK |
“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
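A rough sketch of how the State Flips and Status columns could be derived from the ordered sequence of alarm emails. The function names and the `flap_threshold` of 3 are assumptions for illustration; the report does not say how many toggles it takes to call an alarm flapping.

```python
def count_state_flips(states: list[str]) -> int:
    """Count transitions between consecutive emails with different states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Classify an alarm from its chronological email states.

    `states` holds "ALARM" or "OK" per email, oldest first.
    `flap_threshold` is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        raise ValueError("at least one alarm email is required")
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# 28 strictly alternating emails ending in ALARM, matching the
# adservio-rds-mysql-master-memory-low row (14 ALARM, 14 OK, 27 flips).
memory_low = ["OK", "ALARM"] * 14
print(count_state_flips(memory_low), alarm_status(memory_low))
# prints "27 Still alarming"

# One ALARM followed by one OK, matching the write-latency-high row.
print(alarm_status(["ALARM", "OK"]))
# prints "Latest OK"
```

Checking the latest state first means an alarm that is currently in ALARM is never downgraded to a flapping label, which matches how the memory-low row is reported despite its 27 flips.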
Global Discussion-Derived Signal
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-20 21:47 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | General investigation | SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… \| Andrei Alexandru |
| 2026-03-18 12:48 | TraefikServiceHighErrorRate | Critical | accommodations-api-svc-4100 | General investigation | Ionut Ciolan \| why do those stack traces show up? they don't seem useful |
| 2026-03-17 10:45 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | Observability storage | errors.group-already-exists |