๐จAlerting and Monitoring
Each machine runs a Prometheus and Alertmanager instance which monitors the services running on the same machine.
If Prometheus and/or Alertmanager are down, or the entire machine is down/unresponsive then HealthChecks.io is used as a dead-mans-hand alert. Periodic heartbeat pings are sent by a cron script (~/healthchecks.sh
) every 5 minutes, with a 10 minute grace period.
All services are integrated with PagerDuty for alerts.
๐ฅPrometheus๐ก๏ธHealthChecks.io๐PagerDutyLast updated