Monitoring Infrastructure with Prometheus
- Mohd Sahnawaz, Sr. SRE
Singapore Kubernetes User Group, 4th July 2018
Currently
- 600+ servers
- External load balancers see 5k+ requests per second on average
- Internal request amplification of 8x to 12x
- Self-managed deployments:
  - Elasticsearch (dynamic scaling)
  - PostgreSQL
  - Cassandra
  - Kafka
  - Redis
  - RabbitMQ
  - And more
- Uptime of 99.95%
- Ability to handle AZ failures
Architecture
Monitoring
- Dashboard: Grafana
- Data store: Prometheus
- Metrics source:
  - Exporters:
    - Node
    - Postgres
    - JVM
    - Elasticsearch
    - HAProxy
    - Hystrix
    - StatsD
  - Write your own
Dashboard: Grafana
Dashboard: Grafana (cont.)
- K8S deployment view
- Hystrix metric view
Data source: Prometheus
- GCE SD configurations to populate hosts
- Instance metadata is used to group nodes into clusters
Data source: Prometheus (cont.)
- Multiple instances with different retention
- Separate dedicated instances for APM, node metrics, ICMP, and Kubernetes
- Grafana connects to all of these
Metrics source: GCE with exporters
- GCE SD configurations to populate hosts

scrape_configs:
  - job_name: node
    scrape_interval: 15s
    scrape_timeout: 15s
    gce_sd_configs:
      - project: <project_name>
        zone: <zone_name>
        port: 9100
        # Skip instances whose name contains "stage" or "test".
        filter: "(name ne .*stage.*)(name ne .*test.*)"
    # Target labels: host, zone, cluster, env, component
    relabel_configs:
      - source_labels: [__meta_gce_instance_name]
        target_label: host
      # Keep only the part of the zone value after the last "/".
      - source_labels: [__meta_gce_zone]
        separator: '/'
        regex: '(.*)/(.*)'
        replacement: '${2}'
        target_label: zone
      # Concatenate the two metadata values (presumably only one is set per instance).
      - source_labels: [__meta_gce_metadata_cluster, __meta_gce_metadata_cluster_name]
        separator: ';'
        regex: '(.*);(.*)'
        replacement: '${1}${2}'
        target_label: cluster
      - source_labels: [__meta_gce_metadata_env]
        target_label: env
      - source_labels: [__meta_gce_metadata_component]
        target_label: component
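To make the zone and cluster rules concrete, here is a minimal Go sketch of how a "replace" relabel rule behaves: source label values are joined with the separator, matched against a fully anchored regex, and the replacement is expanded. The `relabel` helper, the zone URL, and the metadata values are hypothetical and only illustrate the mechanism; this is not Prometheus's own code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabel mimics the default "replace" relabel action: the source label
// values are joined with sep, matched against a fully anchored regex, and
// the expanded replacement is returned. ok is false when nothing matches.
func relabel(values []string, sep, pattern, replacement string) (string, bool) {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	joined := strings.Join(values, sep)
	m := re.FindStringSubmatchIndex(joined)
	if m == nil {
		return "", false
	}
	return string(re.ExpandString(nil, replacement, joined, m)), true
}

func main() {
	// Hypothetical zone value in URL form; the greedy first group swallows
	// everything up to the last "/", leaving the short zone name in ${2}.
	zoneURL := "https://www.googleapis.com/compute/v1/projects/demo/zones/asia-southeast1-a"
	zone, _ := relabel([]string{zoneURL}, "/", "(.*)/(.*)", "${2}")
	fmt.Println("zone:", zone)

	// Hypothetical instance where only the second metadata key is set, so
	// the joined value is ";search" and ${1}${2} yields "search".
	cluster, _ := relabel([]string{"", "search"}, ";", "(.*);(.*)", "${1}${2}")
	fmt.Println("cluster:", cluster)
}
```

With a single source label, as in the zone rule, the `separator` setting has no effect because there is nothing to join.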
Metrics source: K8S API
- K8S SD configurations to populate pods

- job_name: 'pod-metrics'
  scrape_interval: 15s
  scrape_timeout: 5s
  kubernetes_sd_configs:
    - api_server: '<api-path>'
      role: node
      basic_auth:
        username: <user>
        password: <access_token>
      tls_config:
        # Disables TLS certificate verification for the API server.
        insecure_skip_verify: true
    - api_server: '<api-path>'
      role: pod
      basic_auth:
        username: <user>
        password: <access_token>
      tls_config:
        insecure_skip_verify: true
  scheme: http
  # Target labels
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: host
    # No regex is given, so the default (.*) keeps every target.
    - source_labels: [__address__]
      action: keep
    - source_labels: [__address__]
      action: replace
      target_label: address
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: kubernetes_container
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: cluster
    # Keep only targets whose container port is 9284.
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: 9284
      action: keep
    - source_labels: [__meta_kubernetes_pod_node_name]
      action: replace
      target_label: kubernetes_node_name
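The port-based `keep` rule in this job can be sketched in Go as a filter over discovered targets. The `target` type, the `keepByPort` helper, and the pod names are hypothetical; the sketch only assumes that relabel regexes are fully anchored, so `9284` does not match `19284`.

```go
package main

import (
	"fmt"
	"regexp"
)

// target is a hypothetical stand-in for one discovered pod container port.
type target struct {
	pod  string
	port string
}

// keepByPort mimics a "keep" relabel rule on the container port label:
// a target survives only if the anchored regex matches its port value.
func keepByPort(targets []target, pattern string) []target {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	var kept []target
	for _, t := range targets {
		if re.MatchString(t.port) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	discovered := []target{
		{pod: "search-api-1", port: "9284"},
		{pod: "search-api-1", port: "8080"},
		{pod: "legacy-job-7", port: "19284"},
	}
	for _, t := range keepByPort(discovered, "9284") {
		fmt.Printf("scrape %s on port %s\n", t.pod, t.port)
	}
}
```

Only the first target is kept, which is why exposing metrics on a dedicated port is enough for a pod to be picked up by this job.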
Metrics source: Hystrix
- Hystrix is great for real-time monitoring.
- It helps in quickly identifying failures.
- We capture Hystrix data in Prometheus.
- This helps in debugging and retrospectives.
Diagram components: app, statsd-exporter, prometheus, turbine
Metrics source: Hystrix
- Custom metrics with Hystrix

scrape_configs:
  - job_name: 'hystrix-stats'
    scrape_interval: '15s'
    file_sd_configs:
      - files:
          - 'hystrix-stats*.yml'
        refresh_interval: 5m
    metric_relabel_configs:
      # Text before "_hystrix" becomes the hcluster label.
      - source_labels: [__name__]
        regex: '^(.*)_hystrix.*'
        target_label: hcluster
        replacement: '${1}'
      # Strip that prefix so the metric name starts with "hystrix".
      - source_labels: [__name__]
        regex: '^.*_(hystrix.*)'
        target_label: __name__
        replacement: '${1}'
      # The first segment after "hystrix_" becomes the command label.
      - source_labels: [__name__]
        regex: '^hystrix_([^_]+)_.*'
        target_label: command
        replacement: '${1}'
      # Drop the command segment from the metric name.
      - source_labels: [__name__]
        regex: '^hystrix_[^_]+_(.*)'
        target_label: __name__
        replacement: 'hystrix_${1}'
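The four rules run in order, and each one sees the metric name as left by the previous rule. A minimal Go sketch of that chain follows; the `rule` type, the `apply` helper, and the sample metric name `search_hystrix_GetListing_errorCount` are hypothetical, and the code only imitates the relabeling behavior rather than reusing Prometheus internals.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule holds the three fields each metric_relabel_configs entry sets.
type rule struct {
	pattern     string
	targetLabel string
	replacement string
}

// The four rules from the slide, in the same order.
var rules = []rule{
	{`^(.*)_hystrix.*`, "hcluster", "${1}"},
	{`^.*_(hystrix.*)`, "__name__", "${1}"},
	{`^hystrix_([^_]+)_.*`, "command", "${1}"},
	{`^hystrix_[^_]+_(.*)`, "__name__", "hystrix_${1}"},
}

// apply runs every rule against the current __name__ value. A rule that
// does not match leaves the labels untouched, as a replace action would.
func apply(name string) map[string]string {
	labels := map[string]string{"__name__": name}
	for _, r := range rules {
		re := regexp.MustCompile("^(?:" + r.pattern + ")$")
		src := labels["__name__"]
		m := re.FindStringSubmatchIndex(src)
		if m == nil {
			continue
		}
		labels[r.targetLabel] = string(re.ExpandString(nil, r.replacement, src, m))
	}
	return labels
}

func main() {
	// Hypothetical metric name with a service prefix and a command segment.
	labels := apply("search_hystrix_GetListing_errorCount")
	fmt.Printf("name=%s hcluster=%s command=%s\n",
		labels["__name__"], labels["hcluster"], labels["command"])
}
```

For the sample name, the result is the metric `hystrix_errorCount` with `hcluster="search"` and `command="GetListing"`, so one dashboard query can cover every service and command.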
Metrics source: Custom
- Official client: https://github.com/prometheus/client_golang
- Supports 4 metric types (counter, gauge, histogram, summary)
- Built-in support for common gRPC metrics
- Exposed at http://ip:port/metrics
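To show what Prometheus actually reads from a `/metrics` endpoint, here is a dependency-free Go sketch that serves one counter in the text exposition format. It deliberately does not use client_golang, which a real service would normally rely on; the metric name `app_requests_total`, the help text, and the `render` helper are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter; counters only ever increase.
var requestsTotal int64

// render formats one counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
func render(name, help string, value int64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s %d\n",
		name, help, name, name, value)
}

// metricsHandler is what would be mounted at /metrics.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, render("app_requests_total",
		"Total requests handled.", atomic.LoadInt64(&requestsTotal)))
}

func main() {
	atomic.AddInt64(&requestsTotal, 3)

	// Call the handler in-process instead of opening a real port.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest("GET", "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

In a service, the same handler would be registered with `http.HandleFunc("/metrics", metricsHandler)` and the port added to the scrape configuration; the official client adds registration, the other metric types, and labels on top of this format.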
Alerting: Alertmanager
Diagram components: Prometheus, Alertmanager, Slack, VictorOps
Alerting process
Alert rule
Thank you!
Q & A
We're hiring, visit careers.carousell.com
