The document discusses fans of Running Gump and their love of his running. It notes that Google is good at running and that Running Gump is stylish and wonderful when he runs. His runs are plentiful, and he has become idolized and mythologized, gaining many followers and fans. While his new ideas are eagerly anticipated, the reasons for and methods behind his success are less well known. Ultimately, his popularity may stem from love and from keeping his feet on the ground.
- The document discusses the vision for a new big data database (BigDataBase) with high scalability and the ability to store and analyze petabytes of data in real-time.
- An initial trial using HBase as the storage engine for a customized SQL interface showed potential but had limitations in features, models, and performance.
- The document proposes wrapping HBase in a middleware to add it as a pluggable storage engine to MySQL/PostgreSQL, enabling SQL queries over HBase's distributed data storage.
- It also considers designing a new SQL server from scratch that interfaces with HBase through the middleware, implementing additional database features like indexing, ACID compliance, and partitioning for big data workloads (a sketch of the storage-engine translation follows below).
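As a rough illustration of the middleware idea, here is a minimal sketch of how a SQL layer might forward a primary-key lookup to HBase, assuming the HBase 1.x client API; the HBaseStorageShim class, table name, and row key are hypothetical illustrations, not the document's actual design:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical middleware shim: the SQL layer calls readRow() for a
// primary-key lookup and the shim forwards it to HBase as a Get.
public class HBaseStorageShim implements AutoCloseable {
    private final Connection connection;

    public HBaseStorageShim(Configuration conf) throws IOException {
        this.connection = ConnectionFactory.createConnection(conf);
    }

    // Translate "SELECT * FROM <table> WHERE pk = <rowKey>" into an HBase Get.
    public Result readRow(String tableName, String rowKey) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            return table.get(new Get(Bytes.toBytes(rowKey)));
        }
    }

    @Override
    public void close() throws IOException {
        connection.close();
    }

    public static void main(String[] args) throws IOException {
        // Assumes hbase-site.xml is on the classpath; "users"/"user-42" are placeholders.
        try (HBaseStorageShim shim = new HBaseStorageShim(HBaseConfiguration.create())) {
            Result row = shim.readRow("users", "user-42");
            System.out.println("Row empty? " + row.isEmpty());
        }
    }
}
```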
Horizon is a distributed SQL database that allows users to query and analyze big data stored in HBase using a familiar SQL interface. It uses the H2 database engine and customizes HBase's data model to provide features like indexing, partitioning, and SQL support. Horizon aims to make big data more accessible while maintaining HBase's scalability. It will integrate with Hadoop ecosystems and provide high performance data loading, scanning, and analysis tools. Horizon's architecture distributes the SQL engine across servers and uses HBase as the distributed storage layer.
The document contains career advice articles on various topics:
1) Engineers should provide feedback and work with product owners, not just implement orders.
2) People should self-promote good work to get noticed and advance their careers.
3) Technical skills are important but interacting well with others is key to career progression.
4) Minor work issues should not be overblown and one should maintain perspective outside of work.
This document provides an introduction and overview of HBase coprocessors. It discusses the motivations for using coprocessors, such as performing distributed and parallel computations directly on data stored in HBase without data movement. It describes the coprocessor architecture and compares the HBase coprocessor model to Google's Bigtable coprocessor model. It also details the different types of coprocessors (observers and endpoints), explains how they are implemented and used, and provides example code for both.
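For flavor, a minimal observer sketch in the spirit of the examples the slides describe, assuming the pre-2.0 coprocessor API (BaseRegionObserver was removed in HBase 2.0); the auditing behavior is an illustrative assumption:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// An observer coprocessor: hooks run inside the region server, next to the
// data, so no rows are shipped to the client for this check.
public class AuditingObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability)
            throws IOException {
        // Runs before every Put in the region; e.g., log or validate the row key.
        System.out.println("About to write row: " + Bytes.toString(put.getRow()));
    }
}
```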
The document provides an evaluation report of DaStor, a Cassandra-based data storage and query system. It summarizes the testbed hardware configuration including 9 nodes with 112 cores and 144GB RAM. It also describes the DaStor configuration, data schema for call detail records (CDR), storage architecture with indexing scheme, and benchmark results showing a throughput of around 80,000 write operations per second for the cluster.
HiveServer2 was redesigned and reimplemented to address limitations of the original HiveServer1, such as lack of concurrency support, incomplete security implementations, and instability. HiveServer2 uses a multithreaded architecture in which each client connection creates a new execution context, including a session and operations. This allows HiveServer2 to associate a Hive execution context, such as the session and Driver, with the thread serving each client request. The new Thrift interface in HiveServer2 also enables better support for common database features around authentication, authorization, and auditing than the original Thrift API in HiveServer1.
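As a usage sketch (not from the document), a client can reach HiveServer2 over its Thrift-based JDBC driver; the URL, credentials, and query below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveServer2Client {
    public static void main(String[] args) throws SQLException {
        // Each JDBC connection maps to its own session (and execution context)
        // inside HiveServer2, which is what enables concurrent clients.
        // Requires the hive-jdbc driver on the classpath; host/port are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```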
HBase is an open-source, distributed, versioned, key-value database modeled after Google's Bigtable. It is designed to store large volumes of sparse data across commodity hardware. HBase uses Hadoop for storage and provides real-time read and write capabilities. It scales horizontally and is highly fault tolerant through its master-slave architecture and use of Zookeeper for coordination. Data in HBase is stored in tables and indexed by row keys for fast lookup, with columns grouped into families and versions stored by timestamps.
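To make the data model concrete, here is a minimal client sketch assuming the HBase 1.x Java API; the "users" table and "info:name" column are made-up examples:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataModelDemo {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // assumes table exists
            // A cell is addressed by (row key, column family, qualifier, timestamp).
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Reads are indexed by row key for fast lookup.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```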
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database (Edureka!)
NoSQL covers a wide range of database technologies that were developed in response to the surging volume of stored data. Relational databases cannot cope with this volume and face agility challenges, which is where NoSQL databases come into play and why their features have made them popular. The session covers the following topics to help you choose the right NoSQL database:
Traditional databases
Challenges with traditional databases
CAP Theorem
NoSQL to the rescue
A BASE system
Choose the right NoSQL database
The document discusses Google's engineering culture and infrastructure. It provides an overview of Google's practices around code review, team programming using tools like Gerrit, and the engineering pipeline. It also shares personal stories from software engineers and principles for balancing process with creativity.
Simple practices in performance monitoring and evaluation (Schubert Zhang)
This document discusses concepts and approaches for performance monitoring and evaluation. It defines key metrics like throughput, latency, concurrency and provides examples for measuring API and system performance. Specific metrics are outlined for services like call centers. Benchmarking quality of services and setting performance SLAs are also covered. The document provides code examples for implementing metrics collection and visualization using tools like JMX, Ganglia and Zabbix. It demonstrates measuring performance for a demo web application.
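As a minimal illustration of the throughput/latency bookkeeping such metrics collection involves (plain Java, not the deck's actual code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal metrics collector: counts operations and accumulates latency so
// callers can derive throughput (ops/sec) and mean latency per interval.
public class SimpleMeter {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalLatencyNanos = new AtomicLong();

    public void record(long latencyNanos) {
        count.incrementAndGet();
        totalLatencyNanos.addAndGet(latencyNanos);
    }

    // Snapshot and reset, e.g. called once per reporting interval.
    public String snapshot(long intervalMillis) {
        long n = count.getAndSet(0);
        long total = totalLatencyNanos.getAndSet(0);
        double throughput = n * 1000.0 / intervalMillis;
        double meanLatencyMs = n == 0 ? 0.0 : total / 1e6 / n;
        return String.format("throughput=%.1f ops/s, meanLatency=%.2f ms",
                throughput, meanLatencyMs);
    }

    public static void main(String[] args) {
        SimpleMeter meter = new SimpleMeter();
        for (int i = 0; i < 1000; i++) {
            long start = System.nanoTime();
            Math.sqrt(i); // stand-in for a measured API call
            meter.record(System.nanoTime() - start);
        }
        System.out.println(meter.snapshot(1000));
    }
}
```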
This document discusses big data and cloud computing. It introduces cloud storage and computing models. It then discusses how big data requires distributed systems that can scale out across many commodity machines to handle large volumes and varieties of data with high velocity. The document outlines some famous cloud products and their technologies. Finally, it provides an overview of the company's focus on enterprise big data management leveraging cloud technologies, and lists some of its cloud products and services including data storage, object storage, MapReduce and compute cloud services.
This document provides an overview of Google's Megastore database system. It discusses three key aspects: the data model and schema language for structuring data, transactions for maintaining consistency, and replication across datacenters for high availability. The data model takes a relational approach and uses the concept of entity groups to partition data at a fine-grained level for scalability. Transactions provide ACID semantics within entity groups. Replication uses Paxos consensus for strong consistency across datacenters.
Hanborq optimizations on hadoop map reduce 20120221a (Schubert Zhang)
Hanborq has developed optimizations to improve the performance of Hadoop MapReduce in three key areas:
1. The runtime environment uses a worker pool and improved scheduling to reduce job completion times from tens of seconds to near real-time.
2. The processing engine utilizes techniques like sendfile for zero-copy data transfer and Netty batch fetching to reduce network overhead and CPU usage during shuffling.
3. Sort avoidance algorithms are implemented to minimize expensive sorting operations through techniques such as early reduce and hash aggregation (a toy sketch of hash aggregation follows below).
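A toy illustration of the hash-aggregation idea, grouping in one pass with a hash table instead of sorting all records first; this is plain Java rather than Hanborq's actual MapReduce code:

```java
import java.util.HashMap;
import java.util.Map;

public class HashAggregation {
    // Sum values per key with a hash table instead of sorting the whole
    // input first: one pass, no comparison-based sort of the records.
    public static Map<String, Long> aggregate(String[][] records) {
        Map<String, Long> sums = new HashMap<>();
        for (String[] record : records) {
            sums.merge(record[0], Long.parseLong(record[1]), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"a", "1"}, {"b", "2"}, {"a", "3"}, {"c", "4"}, {"b", "5"}
        };
        System.out.println(aggregate(records)); // {a=4, b=7, c=4}
    }
}
```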
Cassandra Compression and Performance Evaluation (Schubert Zhang)
Even though we have abandoned Cassandra in all our products, we would like to share our work here.
Why did we abandon Cassandra in our products? Because:
(1) There are serious flaws in Cassandra's implementation, especially in its local storage engine layer, i.e. SSTable and indexing.
(2) Combining Bigtable and Dynamo is a mistake. Dynamo's hash-ring architecture is an obsolete technology for scaling, and its consistency and replication policies are also unusable for big data storage.
The document discusses different types of structured storage systems and how no single solution is appropriate for all applications. It outlines features-first, scale-first, simple structure, and batch-analytic stores and provides examples of products that fall into each category. The document also discusses Amazon AWS cloud structured storage solutions and argues that there is no one-size-fits-all approach due to contradictory requirements around data processing and access.
The document summarizes and compares several distributed file systems, including Google File System (GFS), Kosmos File System (KFS), Hadoop Distributed File System (HDFS), GlusterFS, and Red Hat Global File System (GFS). GFS, KFS and HDFS are based on the GFS architecture of a single metadata server and multiple chunkservers. GlusterFS uses a decentralized architecture without a metadata server. Red Hat GFS requires a SAN for high performance and scalability. Each system has advantages and limitations for different use cases.
This document describes the setup and architecture of a Red Hat Storage Cluster using Global File System (GFS), Clustered Logical Volume Manager (CLVM), and Global Network Block Device (GNBD). GFS allows nodes to share block-level storage over the network as if it were locally attached. GNBD exports block devices over TCP/IP to GFS nodes. CLVM provides cluster-wide logical volume management on top of shared block devices. The cluster uses components like CMAN, DLM, and fencing for distributed coordination and locking across nodes.
Parallel NFS (pNFS) is a standard defined in NFSv4.1 that separates file metadata and data to allow parallel and distributed access to file data. It defines protocols for clients to communicate with a metadata server to get file layouts describing the data locations and protocols to access multiple data servers directly in parallel. However, it does not define protocols between metadata and data servers, allowing flexibility in implementation. pNFS supports various storage layout types including file, block, and object storage and can provide high performance parallel I/O while maintaining NFS semantics.
Case Study - How Rackspace Query Terabytes Of Data (Schubert Zhang)
Rackspace generates hundreds of gigabytes of email log data daily from over 600 servers. They struggled to efficiently store and query this data using various relational database approaches. They implemented a Hadoop/MapReduce system where logs are streamed to HDFS in real-time and indexed using Lucene/Solr. MapReduce jobs run every 10 minutes to build indexes, which are compressed and stored in HDFS. Solr instances then merge and serve indexes, enabling fast log searching within hours. This scalable system meets their needs to troubleshoot issues and gain insights from massive and growing log data.
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs (Schubert Zhang)
HFile mimics Google's SSTable. It is now available in Hadoop HBase-0.20.0, whereas previous releases of HBase temporarily used an alternative file format, MapFile, a common file format in Hadoop's IO package. I think HFile should also become a common file format when it matures and should be moved into Hadoop's common IO package in the future.
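To illustrate the block-index idea in the abstract (this is not HFile's actual on-disk layout), a sketch in which sorted keys are grouped into fixed-size blocks and a reader binary-searches an index of each block's first key:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy block index over sorted keys: find which block may contain a key by
// binary-searching the first key of each block, as a block-indexed file does.
public class BlockIndexSketch {
    private final List<String> firstKeys = new ArrayList<>();

    public BlockIndexSketch(List<String> sortedKeys, int blockSize) {
        for (int i = 0; i < sortedKeys.size(); i += blockSize) {
            firstKeys.add(sortedKeys.get(i)); // one index entry per block
        }
    }

    // Returns the block number whose key range covers the probe key.
    public int findBlock(String key) {
        int pos = Collections.binarySearch(firstKeys, key);
        // If not an exact match, binarySearch returns -(insertionPoint) - 1;
        // the covering block is the one just before the insertion point.
        return pos >= 0 ? pos : Math.max(0, -pos - 2);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "c", "e", "g", "i", "k", "m", "o");
        BlockIndexSketch index = new BlockIndexSketch(keys, 3); // blocks: [a..e], [g..k], [m..o]
        System.out.println(index.findBlock("h")); // prints 1
    }
}
```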
The document evaluates the performance of HBase version 0.20.0 on a small cluster. It describes the testbed setup including hardware specifications and Hadoop/HBase configuration parameters. A series of experiments are run to test random reads, random writes, sequential reads, sequential writes, and scans. The results show significant performance improvements over previous versions, getting closer to the performance levels of Google BigTable as reported in their paper.