
About me
• Front-line cloud computing practitioner
• Author of the e-book 《Prometheus 实战》 (Prometheus in Action)
• Alertmanager code contributor
• GitHub: songjiayang, Weibo: small_fish__

Topics:
• Prometheus
• Alertmanager
• v1.x vs v2.x

Prometheus is:
• A monitoring and alerting system
• Metric-based
• Time-series oriented
• Open source

Prometheus is not:
• A tracing system
• A log analysis tool
• An audit system
• ….

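Why choose Prometheus:
• Why not?
• Modern (written in Go)
• No dependencies; easy to install and easy to get started with
• Many plugins and exporters
• Supported by Grafana out of the box
• Supported by K8s out of the box; a very good fit for cloud and microservices
• Active community; it is not just a tool but an ecosystem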

Installing Prometheus:
• Release package: download the matching version from https://github.com/prometheus/prometheus/releases and unpack it.
• System package manager, for example: brew install prometheus
• Docker image: docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus

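Prometheus configuration:
• global: global settings
• alerting: addresses of the Alertmanagers that receive alerts
• rule_files: alerting rule files
• scrape_configs: scrape (data pull) settings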

Configuration example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

Physical host monitoring:
• node_exporter: collects runtime status information from physical machines
• On Windows, use wmi_exporter
• Configuration:

scrape_configs:
  - job_name: 'node'
    static_configs:

Optimization:
• Collect less data:

node_exporter --no-collector.arp --no-collector.bcache

• Scrape less often:

scrape_configs:
  - job_name: 'node'
    scrape_interval: 30s # overrides the global 15s from the example configuration
    static_configs:

CPU usage:
100 - (avg by (instance) (irate(node_cpu{mode="idle"}[5m])) * 100)
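
To make the arithmetic behind this query concrete, here is a minimal Python sketch; it is not Prometheus code. It assumes irate can be approximated as the per-second increase between the last two samples of a counter, it ignores counter resets, and the function names and sample values are made up for illustration.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_per_core):
    """100 - avg(idle rate across cores) * 100, mirroring the query above."""
    rates = [irate(samples) for samples in idle_samples_per_core]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle-time counters for two cores, scraped 15 seconds apart.
cores = [
    [(0, 1000.0), (15, 1011.25)],  # idle 75% of the interval
    [(0, 2000.0), (15, 2007.5)],   # idle 50% of the interval
]
print(cpu_usage_percent(cores))  # average idle is 62.5%, so usage is 37.5
```

Because irate only looks at the last two samples, the result reacts quickly to spikes; the [5m] window in the query only bounds how far back those two samples may lie.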

Memory usage:
100 - ((node_memory_MemFree + node_memory_Cached + node_memory_Buffers) / node_memory_MemTotal) * 100
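
The memory query treats free, cached, and buffer memory as available and reports the rest as used. A small Python sketch of the same formula follows; the function name and the byte values are illustrative assumptions, not output from a real host.

```python
def memory_usage_percent(mem_total, mem_free, cached, buffers):
    """Mirror the query: 100 - ((free + cached + buffers) / total) * 100."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    available = mem_free + cached + buffers
    return 100 - (available / mem_total) * 100


GIB = 1024 ** 3
# Hypothetical host: 16 GiB total, 2 GiB free, 5 GiB cached, 1 GiB buffers.
usage = memory_usage_percent(16 * GIB, 2 * GIB, 5 * GIB, 1 * GIB)
print(f"{usage:.1f}%")  # 8 of 16 GiB count as available, so this prints 50.0%
```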

Prometheus resources:
• Prometheus Demo
• More query examples

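Introducing Alertmanager:
Prometheus collects the data; Alertmanager manages and sends the alerts. Only the two together give us effective monitoring of our services.
• Receives alerts
• Grouping
• Noise reduction
• Rich notification channels (Email, Slack, WeChat, Webhooks…)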

Download and install:
• Release package: download the matching version from https://github.com/prometheus/alertmanager/releases and unpack it.

Basic usage:
• Prometheus configuration
• Alertmanager configuration

Prometheus configuration:
• Edit prometheus.yml:

rule_files:
  - "rules/node.yml"

• Add the file rules/node.yml:

groups:
- name: node
  rules:
  - alert: InstanceDown
    expr: up{job="node"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"

• Add the Alertmanager address:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
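
The `for: 5m` clause in the InstanceDown rule means the alert does not fire on the first failed evaluation: it first becomes pending and only fires once the expression has held continuously for five minutes. The Python sketch below models that life cycle under simplifying assumptions (one target, evaluations given as a list of hypothetical `(timestamp, up_value)` pairs); it is an illustration, not the Prometheus rule engine.

```python
def alert_state(up_samples, for_seconds):
    """Return the alert state after the last evaluation.

    up_samples: list of (timestamp_seconds, up_value), oldest first.
    The alert condition is up == 0, as in the InstanceDown rule.
    """
    active_since = None
    state = "inactive"
    for timestamp, up in up_samples:
        if up == 0:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held_for = timestamp - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            active_since = None  # any recovery restarts the clock
            state = "inactive"
    return state


# Target down at every evaluation, one evaluation per minute.
four_minutes_down = [(t, 0) for t in range(0, 241, 60)]
five_minutes_down = [(t, 0) for t in range(0, 301, 60)]
print(alert_state(four_minutes_down, 300))  # pending
print(alert_state(five_minutes_down, 300))  # firing
```

A short blip that recovers within five minutes therefore never reaches Alertmanager, which is the point of setting `for`.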

Alertmanager configuration (grouping and routing):

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'wechat'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: ''
    to_party: '1'
    agent_id: '1000002'
    api_secret: ''
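
With `group_by: ['alertname']`, alerts that share the same alertname are batched into one notification. The following Python sketch shows the grouping idea only; the alert label sets are invented, the function name is hypothetical, and timing settings such as group_wait and group_interval are left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts.
alerts = [
    {"alertname": "InstanceDown", "instance": "node1:9100"},
    {"alertname": "InstanceDown", "instance": "node2:9100"},
    {"alertname": "HighCPU", "instance": "node1:9100"},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Here the two InstanceDown alerts end up in one group and HighCPU in another, so the receiver gets two notifications rather than three.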

Alertmanager configuration (noise reduction):

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'wechat'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: ''
    to_party: '1'
    agent_id: '1000002'
    api_secret: ''
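
The inhibit rule in this configuration mutes a warning alert while a critical alert with the same alertname is firing. A minimal Python sketch of that matching logic is shown below; the alert label sets and the function name are hypothetical, and only exact-match matchers are modelled.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by some other firing alert."""

    def matches(labels, matcher):
        return all(labels.get(k) == v for k, v in matcher.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # an alert does not inhibit itself
        same_labels = all(source.get(name) == target.get(name) for name in equal)
        if matches(source, source_match) and same_labels:
            return True
    return False


# Values taken from the inhibit_rules section of the configuration.
rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname"],
)
critical = {"alertname": "InstanceDown", "severity": "critical"}
warning = {"alertname": "InstanceDown", "severity": "warning"}
unrelated = {"alertname": "DiskFull", "severity": "warning"}
firing = [critical, warning, unrelated]

print(is_inhibited(warning, firing, **rule))    # True: same alertname is critical
print(is_inhibited(unrelated, firing, **rule))  # False: no critical DiskFull alert
```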

Alertmanager configuration (notification channels):

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'wechat'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: ''
    to_party: '1'
    agent_id: '1000002'
    api_secret: ''

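Alertmanager notification high availability:
Edit prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
      - localhost:9094

Note: we run multiple Alertmanager replicas to receive alerts. Alertmanager is usually configured to ignore resolved notifications so that it only sends firing alerts.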

Choosing v1.x or v2.x?

v2.x performance improvements:
• Memory and CPU usage is already down ~3X
• Disk writes down by ~10X

Usage notes: system tuning
# SSD tuning
echo 0 > /sys/block/sdX/queue/rotational
echo deadline > /sys/block/sdX/queue/scheduler

# /etc/sysctl.d/local.conf
vm.swappiness=1

# /etc/security/limits.d/00prometheus
prometheus - nofile 10000000

# On Intel CPUs, make sure scaling_governor and the CPU frequency are consistent
intel_pstate=disable

Usage notes: 2.x configuration
The new TSDB storage engine needs only a one-line setting:

--storage.tsdb.retention

Upgrading: from 1.6x to 1.8x
# About 2/3 of total memory
-storage.local.target-heap-size

# Set to 5m to reduce SSD reads and writes
-storage.local.checkpoint-interval

# With a large number of metrics but a low scrape frequency, this can be set to 10k or more
-storage.local.num-fingerprint-mutexes

# On SSD this can be set very high
-storage.local.checkpoint-dirty-series-limit
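
As a worked example of the "about 2/3 of total memory" guideline, the Python sketch below turns a machine's total memory into a value for -storage.local.target-heap-size. It assumes the flag takes a plain byte count; the helper name is hypothetical and the 128 GiB figure is only an example input.

```python
def target_heap_size_flag(total_memory_bytes):
    """Build the 1.x heap-size flag using two thirds of total memory."""
    if total_memory_bytes <= 0:
        raise ValueError("total_memory_bytes must be positive")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    heap_bytes = total_memory_bytes * 2 // 3
    return f"-storage.local.target-heap-size={heap_bytes}"


GIB = 1024 ** 3
print(target_heap_size_flag(128 * GIB))
# -storage.local.target-heap-size=91625968981
```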

Reference data:
Machine:
• 32 cores
• 128GB RAM
• RAID 10 SSD
• Prometheus v2.1

Performance:
• 2.3M metrics
• 57k samples/s
• 30s scrape interval
• No data lag

Recommended resources:
• Official website
• GitHub source: Prometheus, Alertmanager
• https://www.robustperception.io/blog/ (blog)
• https://kausal.co (blog)
• https://www.weave.works (blog)
• Prometheus: Up & Running (book)
• Monitoring with Prometheus (book)
• Various talk slides (ppt)
• Online demo

Prometheus 101
