The following figure takes a more principle-oriented view:
The subsystems in the figure above are explained in the documentation on the Ganglia main site; they are briefly quoted and annotated here:
The ganglia monitoring daemon (gmond) is a lightweight service that is installed on every
machine you'd like to monitor. This daemon uses a simple listen/announce protocol via XDR to
collect monitoring state and then shares this information via XML over TCP. Gmond is portable
and collects dozens of system metrics: CPU, memory, disk, network and process data.
Notes:
gmond
gmond is the core module. It plays three roles:
(1) Any gmond node can receive metrics data from other nodes (via multicast or unicast, depending on configuration), so gmond is a data collector. The metrics it collects may come from other nodes' gmond daemons (CPU, memory, disk, network, etc.) or be sent directly by applications on other nodes (covered in a later section).
(2) gmond can announce the collected metrics to gmetad (in fact, gmetad periodically pulls the data; see the data_source setting in gmetad.conf).
(3) gmond also collects the basic metrics of its own node (by default CPU, memory, disk, network, etc.) and sends them to the gmond responsible for collecting data. Users should also be able to add their own locally collected items.
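The three roles above are wired up in gmond.conf. A minimal sketch of a unicast setup, in which every node sends its metrics to one collector node (the cluster name "my-cluster", the host name "collector01", and port 8649 are example values, not taken from this document):

```
/* Hypothetical gmond.conf fragment (ganglia 3.1.x syntax). */
cluster {
  name = "my-cluster"    /* shown as the cluster name in the web UI */
}
udp_send_channel {
  host = collector01     /* the gmond that aggregates this cluster */
  port = 8649
}
udp_recv_channel {       /* only needed on the collector node */
  port = 8649
}
tcp_accept_channel {     /* gmetad pulls the XML state from here */
  port = 8649
}
```

With multicast (the default), udp_send_channel and udp_recv_channel would instead name a multicast group, and every node would both send and receive.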
The ganglia meta daemon (gmetad) is a service that collects data from other gmetad (multi-level cascading) and gmond sources and stores their state to disk in indexed round-robin databases (RRD). Gmetad provides a simple query mechanism for collecting specific information about groups of machines. Gmetad supports hierarchical delegation for creating manageable monitoring domains.
Notes:
gmetad is the core module that serves us directly: it periodically pulls data from the gmond sources configured as data_source entries, stores it in a local database (RRD), and the Web Server provided by httpd (PHP code) then queries the RRD data and renders the graphical interface.
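The pull behavior described above is driven by data_source lines in gmetad.conf. A minimal sketch (the cluster name, host names, and the 60-second polling interval are example values):

```
# Hypothetical gmetad.conf fragment: each data_source line names a
# cluster and one or more gmond nodes to pull its XML state from.
data_source "my-cluster" 60 collector01:8649 collector02:8649

# RRD files are written under this directory (the usual default):
rrd_rootdir "/var/lib/ganglia/rrds"
```

Listing more than one host per data_source gives gmetad fallback sources: it polls the first reachable one.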
The ganglia metric tool is a commandline application that you can use to inject custom made
metrics about hosts that are being monitored by ganglia. It has the ability to spoof messages
as coming from a different host in case you want to capture and report metrics from a device
where you don't have gmond running (like a network or other embedded device).
Notes:
gmetric is a command-line tool that, given the necessary OPTIONS, sends data directly to the gmond node responsible for collecting data (or broadcasts it to all gmond nodes). This is very convenient for ad-hoc temporary tests, or for writing a shell script that emits collected data to ganglia; it is a quick tool. The later section on Ganglia shell-script integration uses exactly this tool in its example of collecting iostat data and sending it to ganglia.
As you can see, the sender of data to Ganglia need not be a gmond node: an application can assemble ganglia protocol packets directly via the gmetric programming API, or send data through the gmetric command-line tool.
Run /usr/local/ganglia/bin/gmetric --help to see this tool's behavior and parameters.
Note: in ganglia-3.1.7 and earlier, gmetric cannot specify the group of the metric it sends, so the web interface shows its group name as "no_group metrics", as in the figure below. The current 3.2.0 release of ganglia completes this feature: the group name can be specified with "--group=STRING".
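As a sketch of the kind of ad-hoc script described above (the metric name root_disk_used is an assumption for illustration; the flags follow gmetric's 3.1.x options):

```shell
#!/bin/sh
# Sketch: report the used percentage of the root filesystem to Ganglia.
# The gmetric path and the metric name are illustrative assumptions.
GMETRIC=/usr/local/ganglia/bin/gmetric

# df -P guarantees POSIX output; column 5 is "Use%", e.g. "42%".
USED=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

# Send the value if gmetric is installed; on 3.2.0 you could also
# append --group=disk (3.1.7 and earlier lack --group).
if [ -x "$GMETRIC" ]; then
  "$GMETRIC" --name=root_disk_used --value="$USED" \
             --type=uint16 --units=percent
fi
echo "root_disk_used=${USED}%"
```

Dropped into cron, a script like this turns any one-line measurement into a graphed Ganglia metric without touching gmond's own collectors.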
The ganglia stat tool is a commandline application that you can use to query a gmond for
metric information directly.
Notes:
gstat
I have never used this tool before; run /usr/local/ganglia/bin/gstat --help to see its behavior and parameters. It only seems meaningful when gexec is enabled, and gexec takes effect only if we pass the --enable-gexec option to ./configure when compiling and installing. gexec is a supplement to ganglia that can distribute work and resources across multiple nodes, overcoming the resource limits of a single node; the details are not yet clear to me.
The ganglia web frontend expresses the data stored by gmetad in a graphical web interface
using PHP.
web
Notes:
This is the PHP and page code of the display interface. In later chapters we will look at how to modify these pages, simply by following the existing patterns, to display the content we want.
For now it is enough to understand gmond, gmetad, and the web frontend; we can look into gmetric and gstat when we actually need them.
The figures above all come from the Internet; the goal is only to explain the matter as quickly as possible. This document does not go deep into Ganglia's details either: it starts from the simplest use of Ganglia so that newcomers can get Ganglia up and running quickly, serving as an introduction. Further details, deep integration, and customization require product development engineers and integration-test engineers to study and develop in depth. Please look for detailed material at http://ganglia.info.
2 Ganglia Installation
Installing Ganglia is very simple, but there are a few small tricks; they are written down here to help everyone install and start using it faster. So that you can use Ganglia regularly during development, I suggest also installing it on the virtual machines you use for debugging.
The installation steps below were all performed on x86_64 CentOS Linux or RedHat Linux systems; adjust accordingly if your OS version differs. We take ganglia-3.1.7 as the example; other versions verified to work include 3.1.2 and 3.2.0. Because 3.2.0 is the latest version with quite a few internal changes, and some of our integration tools and interface PHP code have not yet been adjusted to match, this document does not use 3.2.0 as the example. According to our tests and experiments, however, the external gmond protocol interface of 3.2.0 is consistent with 3.1.x, and the interface code works with essentially no major adjustment, so what follows basically applies to 3.2.0 as well, with only slight differences in the interface PHP code. In addition, 3.2.0 has some improvements over 3.1.x that are quite valuable to us, such as the --group option of the gmetric tool mentioned above, and the ability of the web interface to query and display data and graphs for a historical time range (start and end times), as in the figure below.
2.1 Installation Preparation
Unpack the installation package to /usr/local/src/ganglia-3.1.7 (my habit is to build all source packages there), and skim the INSTALL file; you will find that Ganglia depends on the libraries and tools listed below. We must check, one by one, that they are available.
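A quick way to do this one-by-one check is a small script. This is only a sketch assuming dependencies commonly listed for ganglia 3.1.x builds (a compiler toolchain, plus rrdtool and PHP for gmetad and the web frontend); take the authoritative list from the INSTALL file of the version you unpacked:

```shell
#!/bin/sh
# Sketch: check that build-time tools are on PATH before configuring.
# The tool list here is an assumption; consult INSTALL for your version.
MISSING=0
for tool in gcc make php rrdtool; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    MISSING=$((MISSING + 1))
  fi
done
echo "missing tools: $MISSING"
```

Libraries such as libconfuse and apr are checked by ./configure itself, so running this first only saves a failed configure round-trip.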