Pingdom-First Observability

Production Reliability Dashboard

Generated 2026-04-01 17:24 from Pingdom synthetic checks, Slack #_alerts_prod, and AWS SNS alert emails for 2026-03-16 00:00 to 2026-03-23 00:00.

How to use: Start from Pingdom when you want to know what users would have felt. Move into Slack when you need human investigation context, and use AWS to confirm whether the external symptom lines up with backend or infrastructure pressure.
Customer-visible failures: 2 (Pingdom events in the reporting window)
Impacted services: 15 (services touched by Slack or Pingdom evidence)
AWS alarms still active: 1 (alarms still in ALARM at the end of the window)
Latest observed event: 2026-03-23 00:00 (most recent signal seen in any source)
Executive Summary

What Needs Attention

Bottom line: critical application-level alerts fired during the window, and one catalog database alarm still appears active.

Signal Over Noise

  • TraefikServiceHighErrorRate is the highest-severity application issue in this window, touching uni-api-svc-4000 (12) and accommodations-api-svc-4100 (5); the freshest signal was on uni-api-svc-4000 at 2026-03-20 21:47. Discussion hints (thread summary): Raul Popovici: SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… | Andr…
  • AWS pressure is concentrated in one still-alarming catalog-related alarm, adservio-rds-mysql-master-memory-low, with 28 email events.
  • docgen2-api is also showing capacity pressure through KubeHpaMaxedOut (6), which makes it a secondary scaling watch item.

Recommended Actions

  • Treat TraefikServiceHighErrorRate as the primary application investigation. Reproduce the failing path on uni-api-svc-4000 and accommodations-api-svc-4100, compare it against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten. The thread points toward a data-integrity or write-path failure, so check recent schema or persistence changes first.
  • Run a focused database-capacity investigation on the catalog instances now. Persistent memory-low and swap-high alarms are usually a system-pressure problem, not something to leave as background noise.
  • Check whether docgen2-api needs a short-term scaling adjustment or a queue and load change before the next traffic bump.

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Charts: Pingdom Events by Day; Slack Alerts by Day; AWS Alert Emails by Day]

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
https://www.adservio.ro/api/v2/status | No recent customer-visible issue | 2 | 2m | 2026-03-23 00:00 | adservio-ro-api-v2-status | -
Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-03-23 00:00 | adservio-ro | -

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal
TraefikServiceHighErrorRate | Critical | 17 | 2026-03-20 21:47 | No recent signal | 3 | uni-api-svc-4000 (12), accommodations-api-svc-4100 (5) | General investigation, Observability storage
  Latest thread note: SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… | Andrei Alexandru
KubeJobFailed | Warning | 17 | 2026-03-19 07:46 | No recent signal | 0 | admission-end-session-29561780 (14), admission-end-session-29556020 (3), admission-end-session-29557460 (3), admission-end-session-29558900 (3), admission-end-session-29560340 (3) | None
KubeHpaMaxedOut | Warning | 9 | 2026-03-20 11:11 | No recent signal | 0 | docgen2-api (6), subscriptions-api (3) | None
TraefikServiceHighLatency | Warning | 5 | 2026-03-20 16:35 | No recent signal | 0 | ai-api-svc-3900 (4), web-80 (1) | None
NodeSystemSaturation | Warning | 3 | 2026-03-19 12:27 | No recent signal | 0 | grafana (3) | None
NodeCPUHighUsage | Warning | 1 | 2026-03-18 14:53 | No recent signal | 0 | grafana (1) | None

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
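
A recency heuristic of this kind could be sketched as follows. This is only an illustration: the 24-hour and 72-hour cutoffs, the function name `recency_status`, and the choice of the report generation time as the reference point are assumptions, since the report does not state how the labels are actually computed.

```python
from datetime import datetime, timedelta

# Hypothetical cutoffs; the report does not state the real ones.
SEEN_TODAY = timedelta(hours=24)
RECENT = timedelta(hours=72)


def recency_status(last_seen: datetime, now: datetime) -> str:
    """Label an alert family by how long ago it last fired."""
    age = now - last_seen
    if age <= SEEN_TODAY:
        return "Seen today"
    if age <= RECENT:
        return "Recent"
    return "No recent signal"


# Assumed reference point: the report generation time.
generated = datetime(2026, 4, 1, 17, 24)
print(recency_status(datetime(2026, 3, 20, 21, 47), generated))
```

Under these assumed cutoffs, an alert last seen on 2026-03-20 is labelled "No recent signal" relative to the 2026-04-01 generation time, which is in line with the Status column above. The label describes recency only, not resolution.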

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
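
The exclusion rule described above might look roughly like the sketch below. The function name `impacted_targets` and the input shape (a list of target names already extracted from the alert text) are hypothetical; only the rule itself comes from the report.

```python
# Components the report treats as observability tooling.
OBSERVABILITY = {"grafana", "loki", "tempo"}


def impacted_targets(named: list[str]) -> list[str]:
    """Drop observability components when a more specific target is named.

    If the alert names only observability components, keep them, so the
    alert is still attributed to something.
    """
    specific = [t for t in named if t.lower() not in OBSERVABILITY]
    return specific if specific else list(named)


print(impacted_targets(["grafana", "uni-api-svc-4000"]))
print(impacted_targets(["grafana"]))
```

The fallback in the last line of the function is why grafana can still appear as an impacted resource in the table below: for NodeSystemSaturation and NodeCPUHighUsage it is the only target named.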

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal
uni-api-svc-4000 | Critical | 12 | 2026-03-20 21:47 | No recent signal | TraefikServiceHighErrorRate (12) | Observability storage, General investigation
  Latest thread note: SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… | Andrei Alexandru
accommodations-api-svc-4100 | Critical | 5 | 2026-03-18 13:24 | No recent signal | TraefikServiceHighErrorRate (5) | General investigation
  Latest thread note: Ionut Ciolan | why do those stack traces show up? They don't seem useful
admission-end-session-29561780 | Warning | 14 | 2026-03-19 07:46 | No recent signal | KubeJobFailed (14) | None
docgen2-api | Warning | 6 | 2026-03-19 13:11 | No recent signal | KubeHpaMaxedOut (6) | None
ai-api-svc-3900 | Warning | 4 | 2026-03-20 16:35 | No recent signal | TraefikServiceHighLatency (4) | None
grafana | Warning | 4 | 2026-03-19 12:27 | No recent signal | NodeSystemSaturation (3), NodeCPUHighUsage (1) | None
subscriptions-api | Warning | 3 | 2026-03-20 11:11 | No recent signal | KubeHpaMaxedOut (3) | None
admission-end-session-29556020 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None
admission-end-session-29557460 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None
admission-end-session-29558900 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None
admission-end-session-29560340 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None
update-recurenta-29554565 | Warning | 3 | 2026-03-16 10:52 | No recent signal | KubeJobFailed (3) | None
web-80 | Warning | 1 | 2026-03-17 04:13 | No recent signal | TraefikServiceHighLatency (1) | None

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-master-memory-low | 28 | 14 | 14 | 27 | 2026-03-16 00:55 | 2026-03-18 07:09 | ALARM | Still alarming
adservio-rds-mysql-master-write-latency-high | 2 | 1 | 1 | 1 | 2026-03-16 23:08 | 2026-03-16 23:13 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
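
As a rough illustration of how the State Flips and Status columns could be derived from the ordered email states, consider the sketch below. The function names and the `flap_threshold` of 3 are assumptions, and the example sequences are reconstructed from the counts in the table (the report gives totals, not the actual ordering).

```python
def count_flips(states: list[str]) -> int:
    """Count transitions between consecutive differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Classify an alarm family from its time-ordered email states.

    flap_threshold is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        return "No data"
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# Assumed orderings consistent with the table's counts.
memory_low = ["OK", "ALARM"] * 14      # 28 emails, ends in ALARM
write_latency = ["ALARM", "OK"]        # 2 emails, ends in OK

print(count_flips(memory_low), alarm_status(memory_low))
print(count_flips(write_latency), alarm_status(write_latency))
```

With these assumed sequences the sketch gives 27 flips and "Still alarming" for the memory-low alarm, and 1 flip and "Latest OK" for the write-latency alarm, matching the table. An alarm that toggled several times but ended on OK would fall into "Flapping, latest OK".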

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal
2026-03-20 21:47 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | General investigation
  Key notes: SQL Integrity Constraint Violation: (conn=3570102) Cannot add or update a child row: a foreign key constraint fails (`ums_uni_catalog`.`dis… | Andrei Alexandru
2026-03-18 12:48 | TraefikServiceHighErrorRate | Critical | accommodations-api-svc-4100 | General investigation
  Key notes: Ionut Ciolan | why do those stack traces show up? They don't seem useful
2026-03-17 10:45 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | Observability storage
  Key notes: errors.group-already-exists