This document describes using Prometheus to monitor an infrastructure of more than 600 servers handling high traffic loads. Prometheus scrapes metrics from exporters, services, and custom sources, stores them in its own time-series database, and feeds Grafana dashboards for visualization. The document gives examples of configuring Prometheus to scrape metrics from Google Compute Engine, Kubernetes, Hystrix, and custom sources instrumented with the Prometheus client library. It also discusses using Alertmanager for alerting.
2. Currently
• 600+ servers
• External load balancers see an average of 5k+ requests per second
• Internal amplification of 8x to 12x
• Self-managed deployments:
    • Elasticsearch (dynamic scaling)
    • PostgreSQL
    • Cassandra
    • Kafka
    • Redis
    • RabbitMQ
    • And more
• Uptime of 99.95%
• Ability to handle AZ failures
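To put the figures above in perspective, here is a small back-of-the-envelope sketch. It assumes that "amplification" means each external request fans out into 8 to 12 internal requests, and that the uptime target is measured over a 30-day or 365-day window; neither assumption is stated in the slides. Under those assumptions, 5k external requests per second become 40,000 to 60,000 internal requests per second, and a 99.95% target leaves about 21.6 minutes of downtime per 30 days.

```python
def internal_rps(external_rps, amplification):
    """Internal requests per second generated by external traffic."""
    return external_rps * amplification


def downtime_budget_minutes(uptime_percent, period_days):
    """Minutes of downtime allowed over a period at a given uptime target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # 5000 is the lower bound from the slide ("5k+"); 8 and 12 are the
    # stated amplification range.
    low, high = internal_rps(5000, 8), internal_rps(5000, 12)
    print(f"internal load: {low} to {high} requests per second")
    # The measurement windows below are assumptions, not from the slides.
    print(f"30-day downtime budget: {downtime_budget_minutes(99.95, 30):.1f} min")
    print(f"365-day downtime budget: {downtime_budget_minutes(99.95, 365):.1f} min")
```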
9. Data source: Prometheus (cont.)
• GCE SD configurations to populate hosts
• Instance metadata used to cluster nodes
• Multiple instances with different retention
• Separate dedicated instances for APM, node metrics, ICMP, Kubernetes
• Grafana connects to all of these
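The GCE service discovery bullets can be sketched as a scrape job. The following is a minimal illustration in Python, not the deck's actual configuration: it builds the structure of a job that uses `gce_sd_configs` and copies a GCE instance metadata item into a `cluster` label. The job name, project, zones, port, and the `cluster` metadata key are hypothetical placeholders; the `__meta_gce_*` label names are the ones Prometheus attaches during GCE service discovery.

```python
import json


def gce_scrape_job(job_name, project, zones, port, metadata_key="cluster"):
    """Build a Prometheus scrape job dict that discovers targets via GCE SD.

    All argument values are illustrative; metadata_key is an assumed GCE
    instance metadata item used to group nodes into clusters.
    """
    return {
        "job_name": job_name,
        # One gce_sd_configs entry per zone, since each entry takes one zone.
        "gce_sd_configs": [
            {"project": project, "zone": zone, "port": port} for zone in zones
        ],
        "relabel_configs": [
            # Copy the instance metadata value into a regular "cluster" label.
            {
                "source_labels": [f"__meta_gce_metadata_{metadata_key}"],
                "target_label": "cluster",
            },
            # Use the instance name instead of ip:port as the instance label.
            {
                "source_labels": ["__meta_gce_instance_name"],
                "target_label": "instance",
            },
        ],
    }


if __name__ == "__main__":
    job = gce_scrape_job(
        job_name="node",             # hypothetical job name
        project="example-project",   # hypothetical GCP project ID
        zones=["us-central1-a", "us-central1-b"],
        port=9100,                   # node_exporter's default port
    )
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

The printed structure mirrors the `scrape_configs` section of `prometheus.yml`; in a real deployment it would normally be written directly in YAML.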
12. Metrics source: Hystrix
• Hystrix is great for real-time monitoring.
• Helps in quickly identifying failures.
• We capture Hystrix data in Prometheus.
• Helps in debugging and retrospectives.
[Diagram labels: app, statsd-exporter, prometheus, turbine]
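As a rough sketch of how Hystrix data could reach Prometheus through statsd-exporter, the snippet below formats and sends StatsD datagrams over UDP. It assumes the application emits StatsD lines that statsd-exporter translates into metrics for Prometheus to scrape; the slides do not show the exact wiring. The metric names are hypothetical, and Python is used only to illustrate the wire format, since a Hystrix application would typically do this from the JVM.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported StatsD metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(lines, host="127.0.0.1", port=9125):
    """Send StatsD lines over UDP; 9125 is statsd_exporter's default UDP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names for a single Hystrix command.
    lines = [
        statsd_line("hystrix.GetUser.countSuccess", 1, "c"),      # counter
        statsd_line("hystrix.GetUser.latencyExecute", 42, "ms"),  # timer
    ]
    send_statsd(lines)
```

Prometheus would then scrape the exporter's own HTTP endpoint rather than the application, so the application never needs to expose a metrics port for this data.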