2. Agenda
- Monitoring as a Service (Monasca)
- Monasca Architecture
- Monasca in Helion OpenStack
- Helion Monitoring Console
3. Monitoring Challenge
Why monitoring is hard at cloud scale
[Diagram: nested monitoring scope — Regions (A, B, C, X) containing Zones, Machines, Instances, and Containers]
- Container and scaling issues: multi-region deployments, public/private cloud scaling
- Cloud resource churn: VMs and containers are continually created and destroyed
- Huge volumes of metric data: monitoring data grows into the hundreds of terabytes
- Dynamic environments: the underlying infrastructure changes constantly
10. MONASCA
Metrics Example
POST /v2.0/metrics
{
  "name": "http_status",
  "dimensions": {
    "hostname": "hlm001-cp1-c1-m2-mgmt",
    "cluster": "c1",
    "control_plane": "ccp",
    "service": "compute"
  },
  "timestamp": 0, /* milliseconds */
  "value": 0.0,
  "value_meta": {
    "status_code": "500",
    "msg": "Internal server error"
  }
}
- Simple, concise, multi-dimensional, flexible description
- Name (string)
- Dimensions: dictionary of user-defined (key, value) pairs that are used to uniquely identify a metric
- value_meta: optional dictionary of user-defined (key, value) pairs that can be used to describe a measurement; normally used for errors and messages
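The metric body above can be assembled programmatically. The sketch below is ours, not part of Monasca itself (the helper name `build_metric` is an assumption); in practice the serialized body would be sent as an authenticated POST to `/v2.0/metrics`.

```python
import json

def build_metric(name, value, timestamp_ms, dimensions=None, value_meta=None):
    """Build a request body matching the POST /v2.0/metrics example above."""
    metric = {
        "name": name,
        "dimensions": dimensions or {},   # (key, value) pairs identifying the metric
        "timestamp": timestamp_ms,        # milliseconds since the epoch
        "value": float(value),
    }
    if value_meta:
        metric["value_meta"] = value_meta  # optional context, e.g. errors/messages
    return metric

payload = build_metric(
    "http_status", 0.0, 0,
    dimensions={"hostname": "hlm001-cp1-c1-m2-mgmt", "service": "compute"},
    value_meta={"status_code": "500", "msg": "Internal server error"},
)
body = json.dumps(payload)
```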
11. MONASCA
Alarm Definition
- An alarm definition is a template: alarms are created for metrics whose name and dimensions match the definition
- A single alarm definition can therefore produce many alarms
GET, POST /v2.0/alarm-definitions
GET, PUT, PATCH, DELETE /v2.0/alarm-definitions/{alarm-definition-id}
- Compound alarm expressions over multiple metrics are supported:
avg(cpu.user_perc{}) > 85
or avg(memory.system_perc{}) > 45
or avg(disk.read_ops{device=vda}, 120) > 100
- Alarm states: OK, ALARM, and UNDETERMINED
- Actions associated with alarms for state transitions
- Severity support: LOW, MEDIUM, HIGH, CRITICAL
- Thresholds can be adjusted dynamically at run time
Example:
POST /v2.0/alarm-definitions
{
  "name": "CPU percent greater than 85",
  "description": "The average CPU percent is greater than 85",
  "expression": "(avg(cpu.user_perc{region=uswest}) > 85)",
  "match_by": [
    "hostname"
  ],
  "severity": "LOW",
  "ok_actions": [
    "c60ec47e-5038-4bf1-9f95-4046c6e9a759"
  ],
  "alarm_actions": [
    "c60ec47e-5038-4bf1-9f95-4046c6e9a759"
  ],
  "undetermined_actions": [
    "c60ec47e-5038-4bf1-9f95-4046c6e9a759"
  ]
}
12. MONASCA
Alarms
- Alarms are created by the Threshold Engine when incoming metrics match an alarm definition
GET /v2.0/alarms
GET, PUT, PATCH, DELETE /v2.0/alarms/{alarm-id}
Query Parameters:
- alarm_definition_id (string, optional) - Alarm definition ID to filter by.
- metric_name (string(255), optional) - Name of metric to filter by.
- metric_dimensions ({string(255): string(255)}, optional) - Dimensions of metrics to filter by, specified as a comma-separated list of (key, value) pairs as `key1:value1,key2:value2, ...`
- state (string, optional) - State of alarm to filter by, either `OK`, `ALARM`, or `UNDETERMINED`.
- state_updated_start_time (string, optional) - The start time in ISO 8601 combined date and time format in UTC.
Example:
List alarms
GET /v2.0/alarms?metric_name=cpu.user_perc&metric_dimensions=hostname:devstack&state=ALARM
List alarm
GET /v2.0/alarms/{alarm-id}
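The dimension filter uses the `key:value` comma-separated encoding described above, which is easy to get wrong by hand. A small sketch of building such a query string with the standard library (the helper name `alarms_query` is ours):

```python
from urllib.parse import urlencode

def alarms_query(base="/v2.0/alarms", **filters):
    """Build an alarm-list URL; metric_dimensions is a dict encoded
    as the comma-separated key:value pairs the API expects."""
    params = {}
    if "metric_dimensions" in filters:
        dims = filters.pop("metric_dimensions")
        params["metric_dimensions"] = ",".join(f"{k}:{v}" for k, v in dims.items())
    params.update(filters)
    # keep ':' and ',' literal so dimension pairs stay readable
    return f"{base}?{urlencode(params, safe=':,')}"

url = alarms_query(metric_name="cpu.user_perc",
                   metric_dimensions={"hostname": "devstack"},
                   state="ALARM")
```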
13. MONASCA
Alarm History
- Returns the history of alarm state transitions (OK, ALARM, UNDETERMINED)
GET /v2.0/alarms/state-history
GET /v2.0/alarms/{alarm-id}/state-history
14. MONASCA
Notification
- A notification method is one of Email, PagerDuty, or WebHook
- Notification methods are associated with alarm definitions and invoked on alarm state transitions
Examples:
POST /v2.0/notification-methods
{
"name":"Name of notification method",
"type":"EMAIL",
"address":¡°sang-wook.byun@hpe.com"
}
POST /v2.0/notification-methods
{
"name":"Name of notification method",
"type":¡±WEBHOOK",
"address":¡±http://example.com/XXX"
}
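A client might sanity-check such a body before POSTing it. This is our own illustrative validation, not part of the Monasca API; the accepted type strings are taken from the examples above plus PAGERDUTY, which is an assumption.

```python
VALID_TYPES = {"EMAIL", "WEBHOOK", "PAGERDUTY"}  # assumed set of method types

def validate_notification_method(method):
    """Minimal client-side check of a POST /v2.0/notification-methods body."""
    for field in ("name", "type", "address"):
        if not method.get(field):
            raise ValueError(f"missing field: {field}")
    if method["type"].upper() not in VALID_TYPES:
        raise ValueError(f"unsupported type: {method['type']}")
    return method
```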
18. Helion OpenStack Cloud Monitoring
Monasca coverage in Helion (legend: Fully supported / Partially supported / Not Applicable)
[Coverage matrix: rows are Helion OpenStack Core Services (Nova, Neutron, Neutron L3 agent, Cinder, Glance, Swift, Ceilometer, Horizon, Heat, Keystone) and Shared Services (Ops Console, Logging, Monasca, BURA, OVS, Hlinux, MySQL, RabbitMQ, Apache, Logstash, Beaver, Elasticsearch, Kafka, HAProxy, Storm); columns are Service up?, API up?, Host up?, Perf, and Resource Utilization; spanning the Control Plane, Cloud IaaS (Compute, Network, Storage), and Cloud PaaS (Application) layers]
19. Monasca
What Helion monitors
- System (cpu, memory, network, file system, ...)
- Service (MySQL, Kafka, nova, cinder, ...)
- Application
Built-in statsd daemon
Python monasca-statsd library: adds support for dimensions
- VM system metrics
- Active checks
HTTP status checks and response times
System up/down check (ping and ssh)
- Runs any Nagios plugin or check_mk
- Extensible/pluggable: additional services can be easily added
- Host alive check on all systems using ping check
- HTTP Status and response time on all OpenStack
service endpoints
- Process checks on all relevant processes
- System Metrics: CPU, disk, IO, load, memory, process,
network, NTP
- Services:
Elasticsearch, HAProxy, JVM, Kafka, MySQL, RabbitMQ,
Zookeeper
- OpenStack Services
Swift and Monasca specific metrics
- VM Metrics
CPU, IO, Memory, Network and Host Alive
See http://monasca-agent.readthedocs.org/en/latest/Plugins/
Monitoring solutions have been around for decades, but in many respects they fail to address the requirements of monitoring large-scale public and private clouds. Traditionally, performance, scalability and data retention have been limited to hundreds of systems. In a large-scale cloud service, thousands of physical servers and hundreds of thousands of virtual machines (VMs) and containers need to be monitored, resulting in hundreds of terabytes of monitoring data. The original monitoring source data needs to be stored in an online, queryable, lossless form with data retention periods greater than thirteen months. Such long retention periods are necessary for SLAs, business continuity, and analytics.
Inventory elasticity is important because cloud infrastructure is constantly evolving, with VMs and services continually being created and destroyed; monitoring systems must be dynamic enough to understand the difference between a VM that was purposely destroyed and a VM that is in a failed state. Self-service models that empower teams to easily add new resources and monitor them independently of the monitoring team's involvement are necessary. Most solutions assume a static infrastructure that requires new services to be registered with the server prior to being monitored, which makes the monitoring team/server the bottleneck. Extensibility is critical, but is often limited.
Run-time configurability is necessary to be able to tune the system over time by allowing alarms to be dynamically adjusted, which many systems do not support. Generalization of alarm definitions/templates is necessary to describe and manage alarms in a one-to-many relationship, in order to avoid having to manually declare each alarm even when they share many common attributes and differ in only one, such as hostname. Spammy alerts and alert fatigue are a common shortcoming of every thresholding system; many operations teams receive thousands of alerts on a weekly basis. Improvements in run-time configurability and generalized alarm definitions can help address spammy alerts. Anomaly detection based on non-parametric statistics and machine learning is required as a more fundamental change.
Monasca is a highly performant, scalable, fault-tolerant, and extensible monitoring platform built on a microservices, message-bus architecture. It uses a REST API for high-speed metrics processing and querying and has a streaming alarm engine and notification engine. All of the major components are linked using Kafka. Every component in the system is built with High Availability (HA) in mind and can be scaled either horizontally or vertically to allow for monitoring of very large systems.
The Monasca API is the gateway for all interaction with Monasca. In a typical scenario metrics are collected by the Monasca Agent running on a system and sent to the Monasca API. The API then publishes the metrics to the Kafka queue. From there the Monasca Persister (which reads metrics and alarm state transitions from Kafka and writes them to the metrics DB) consumes the metrics and writes them to the metrics database. The Monasca Threshold Engine also consumes the metrics and uses them to evaluate alarms.
At this point the metrics are in the system and can be queried using the Monasca API, either directly or through one of the other components, such as the Horizon plugin or the Monasca CLI.
When the Threshold Engine evaluates the metrics against the alarms it can create alarm state transition events. These are published back to Kafka and are read by both the Persister and the Notification Engine. The Persister writes the alarm transitions to the DB for future retrieval. The Notification Engine will send a notification of the configured type for appropriate state transitions.
In addition to the components discussed above we also have a configuration database used for storing information such as alarm definitions and notification methods. This database can be either MySQL or PostgreSQL.
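The fan-out described above, where one Kafka topic is read independently by the Persister and the Threshold Engine, can be sketched with a toy append-only log. This is a teaching stand-in, not Kafka's actual API; the `Topic` class and method names are our invention.

```python
from collections import defaultdict

class Topic:
    """Toy stand-in for a Kafka topic: an append-only log where each
    consumer (Persister, Threshold Engine, ...) keeps its own offset."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def consume(self, consumer):
        """Return all messages this consumer has not yet seen."""
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.log)
        return self.log[start:]

metrics = Topic()
metrics.publish({"name": "cpu.user_perc", "value": 91.0})

persisted = metrics.consume("persister")          # written to the metrics DB
evaluated = metrics.consume("threshold-engine")   # evaluated against alarms
```

Both consumers receive the same message without interfering with each other, which is what lets the Persister and Threshold Engine scale independently.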
Advantages of message bus architecture
Enables a micro-services foundation
Load-balancing, scalability, system maintenance (new deploys)
Handle different loads
Extensibility: Easily add new components/services:
HP Operations Manager i (OMi) BSM Connector for HP Helion Monasca
Consumes alarm state transition messages from Kafka
Multi-site replication of data
And there is more...
Pagination (limiting how much data a single query returns) is supported via offset and limit query parameters
The Agent Forwarder buffers metrics for a short time to increase the size of the http request body (number of metrics) sent to the Monasca API.
The Monasca API caches auth tokens in-memory to reduce the round-trip authorization requests to Keystone
If network connectivity between the Agent and API is lost, the Agent will buffer metrics and send them when connectivity is restored
Metrics are submitted using an "agent" role, which only allows metrics to be POSTed to the metrics endpoint
Multi-site replication for metrics can be done by running two persisters simultaneously, sending to different metrics databases
System can handle failure of any component or node
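The offset/limit pagination mentioned above can be drained client-side with a simple loop. The sketch is ours: `fetch_page` stands in for an authenticated GET against a paginated endpoint such as `/v2.0/alarms`, and the stop-on-short-page heuristic is an assumption about the server's behavior.

```python
def fetch_all(fetch_page, limit=50):
    """Drain a paginated endpoint using offset/limit query parameters.

    fetch_page(offset, limit) is a caller-supplied stand-in for the
    actual HTTP GET; it returns a list of up to `limit` results."""
    results, offset = [], 0
    while True:
        page = fetch_page(offset, limit)
        results.extend(page)
        if len(page) < limit:   # short page -> no more data on the server
            break
        offset += limit
    return results

# Fake backend with 120 records, to show the loop terminates correctly.
data = list(range(120))
def fake_page(offset, limit):
    return data[offset:offset + limit]
```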
Monasca-statsd daemon
: a statsd engine capable of handling dimensions associated with metrics submitted by a client that supports them. Also supports metrics from the standard statsd client. (udp/8125)
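Since the daemon also accepts plain statsd traffic on udp/8125, even a stdlib-only client can emit a counter. This sketch uses the standard statsd wire format (`name:value|c`); sending dimensioned metrics requires the monasca-statsd client library, whose extended format is not shown here.

```python
import socket

def send_statsd_counter(name, value=1, host="127.0.0.1", port=8125):
    """Emit a counter in the plain statsd wire format (name:value|c) over UDP."""
    payload = f"{name}:{value}|c".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```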
The Helion platform provides a turnkey monitoring system that is ready to use immediately after cloud installation. This saves operators time and money by eliminating the need for a separate monitoring infrastructure and for managing complex network configurations. All aspects of the monitoring system are certified with HP Linux for Helion and Helion OpenStack and are supported by HP. This saves operators setup time and lowers costs because operators do not need to stand up separate infrastructure or certify additional plug-ins with the Helion environment.
HP Helion OpenStack monitoring ships with many integration points and can easily snap into existing data center management tooling and infrastructure. It ships with supported connectors with HPSW OMi and technical preview connectors for Ops A, Splunk, and ArcSight.
The Helion OpenStack 2.0 documentation contains documented triage and resolution steps for common issues. This knowledge is based on years of OpenStack software operations experience from operating HP's public cloud services.
Reduces time to production
Simplifies the start-up experience
We are monitoring all of the OpenStack core service availability and performance metrics. We are collecting log events from all OpenStack core services and most of the shared services.
For a complete listing of alarms and monitored services please see the Helion documentation Monasca/alarms.