Jonathan Gray gave an introduction to HBase at the NYC Hadoop Meetup. He began with an overview of HBase and why it was created to handle large datasets beyond what Hadoop could support alone. He then described what HBase is: a distributed, column-oriented database management system. Gray explained how HBase works with its master and region server nodes and how it partitions data across tables and regions. He highlighted some key features of HBase and examples of companies using it in production. Gray concluded with what is planned for HBase's future and contrasted it with relational databases through examples.
This document provides an overview of HBase, an open-source, distributed, large-scale database modeled after Google's BigTable. It describes what HBase is, why it was created, and key features such as support for unstructured data and version management. It explains HBase's architecture, including its write-ahead log (HLog files), HFile storage, ZooKeeper coordination, Masters, and RegionServers. It provides examples of how tables and data are stored and of HBase in use by companies.
HBase is an open-source, distributed, versioned, key-value database modeled after Google's Bigtable. It is designed to store large volumes of sparse data across commodity hardware. HBase uses Hadoop for storage and provides real-time read and write capabilities. It scales horizontally and is highly fault tolerant through its master-slave architecture and use of Zookeeper for coordination. Data in HBase is stored in tables and indexed by row keys for fast lookup, with columns grouped into families and versions stored by timestamps.
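The logical model described above — rows indexed by key, columns grouped into families, and cell versions keyed by timestamp — can be sketched as a nested map. This is a hypothetical, in-memory illustration of the concept, not the real HBase client API:

```python
# Minimal in-memory sketch of HBase's logical data model:
# table -> row key -> column family -> qualifier -> {timestamp: value}.
# Illustrative only; class and method names are assumptions, not HBase APIs.
import time

class ToyTable:
    def __init__(self, families):
        self.families = set(families)   # column families are fixed at creation
        self.rows = {}                  # row key -> family -> qualifier -> versions

    def put(self, row, family, qualifier, value, ts=None):
        assert family in self.families, "unknown column family"
        ts = ts if ts is not None else int(time.time() * 1000)
        cell = (self.rows.setdefault(row, {})
                         .setdefault(family, {})
                         .setdefault(qualifier, {}))
        cell[ts] = value

    def get(self, row, family, qualifier):
        # Return the newest version, mirroring HBase's default read behavior.
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

t = ToyTable(families=["info"])
t.put("row1", "info", "name", "alice", ts=1)
t.put("row1", "info", "name", "bob", ts=2)   # newer version wins on read
print(t.get("row1", "info", "name"))          # -> bob
```

The nesting mirrors the sparse layout: absent columns simply have no entry, which is why HBase handles sparse data cheaply.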
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 - larsgeorge
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components like HMaster, regionservers, Zookeeper. It explains how HBase stores and retrieves data, the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying and deleting data.
HBase is an open-source, non-relational, distributed database built on top of Hadoop and HDFS. It provides BigTable-like capabilities for Hadoop, including fast random reads and writes. HBase stores data in tables comprised of rows, columns, and versions. It is designed to handle large volumes of sparse or unstructured data across clusters of commodity hardware. HBase uses a master-slave architecture with RegionServers storing and serving data and a single active MasterServer managing the cluster metadata and load balancing.
HBase is a non-relational, distributed database that runs on top of HDFS. It uses ZooKeeper for coordination between servers. HBase has a master server that manages tables and region servers that store the distributed data. Data is stored in tables as rows and columns and can be manipulated through shell operations such as put, get, and scan, with tables administered via disable and drop.
Hw09 Practical HBase: Getting The Most From Your HBase Install - Cloudera, Inc.
The document summarizes two presentations about using HBase as a database. It discusses the speakers' experiences using HBase at StumbleUpon and Streamy to replace MySQL and other relational databases. Key points include how HBase provides scalability, flexibility, and cost benefits over SQL databases for large datasets.
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Hadoop World 2011: Apache HBase Road Map - Jonathan Gray - Facebook - Cloudera, Inc.
This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.
HBaseCon 2013: Full-Text Indexing for Apache HBase - Cloudera, Inc.
This document discusses full-text indexing for HBase tables. It describes how Lucene indices are organized based on HBase regions. Index building is implemented using coprocessors to update indices on data changes. Index splitting is optimized to avoid blocking updates during region splits. Search performance of indexing 10 billion records was tested, showing search times of around 1 second.
HBase is a distributed, column-oriented database that runs on top of Hadoop and HDFS, providing Bigtable-like capabilities for massive tables of structured and unstructured data. It is modeled after Google's Bigtable and provides a distributed, scalable, versioned storage system with strong consistency for random read/write access to billions of rows and millions of columns. HBase is well-suited for handling large datasets and providing real-time read/write access across clusters of commodity servers.
Most developers are familiar with the topic of "database design". In the relational world, normalization is the name of the game. How do things change when you're working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
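One common key-design technique for spreading load evenly, as mentioned above, is salting: prefixing a sequential key with a hash-derived bucket so consecutive writes land in different regions rather than hot-spotting one. A minimal sketch, where the bucket count and key format are illustrative assumptions rather than fixed HBase values:

```python
# Sketch of row-key salting: prefix a monotonically increasing key with a
# hash-derived bucket so consecutive writes scatter across the key space.
import hashlib

NUM_BUCKETS = 16  # illustrative; real deployments pick this per table

def salted_key(raw_key: str) -> str:
    # Derive a stable bucket from the raw key, then prepend it.
    bucket = int(hashlib.md5(raw_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}-{raw_key}"

# Sequential timestamps now map to scattered key prefixes.
keys = [salted_key(f"event-{ts}") for ts in range(1000, 1004)]
print(keys)
```

The trade-off is that range scans over the original key order now require one scan per bucket, which is why salting suits write-heavy workloads more than scan-heavy ones.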
Hadoop World 2011: Building Realtime Big Data Services at Facebook with Hadoo... - Cloudera, Inc.
Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.
HBase is an open source, distributed, sorted key-value store modeled after Google's BigTable. It uses HDFS for storage and provides random read/write access to large datasets. Data is stored in tables with rows sorted by key and columns grouped into column families. The master coordinates region servers that host regions, the distributed units of data. Clients locate data regions and directly communicate with region servers to read and write data.
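The region-location step described above — a client finding which region holds a row by comparing the row key against sorted region start keys — can be sketched as follows. This is a simplified stand-in for the real META-table lookup; the region boundaries and server names are made up for illustration:

```python
# Simplified sketch of locating a region: regions partition the sorted key
# space, and a row belongs to the region with the greatest start key <= row key.
import bisect

region_start_keys = ["", "g", "p"]    # three regions: [-, g), [g, p), [p, +)
region_servers = ["rs1", "rs2", "rs3"]

def locate(row_key: str) -> str:
    # bisect_right finds the insertion point; the owning region is just before it.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[idx]

print(locate("apple"))   # -> rs1
print(locate("kiwi"))    # -> rs2
print(locate("zebra"))   # -> rs3
```

Once located, the client caches the mapping and talks to the region server directly, which is why the master is not on the read/write path.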
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
Hadoop World 2011: Advanced HBase Schema Design - Cloudera, Inc.
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera - Cloudera, Inc.
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
Big Data Fundamentals in the Emerging New Data World - Jongwook Woo
I talk about the fundamentals of Big Data, including Hadoop, data-intensive computing, and the NoSQL databases that have drawn attention for computing and storing Big Data, which is often larger than a petabyte. I also introduce case studies that use Hadoop and NoSQL databases.
HBase is a NoSQL database that stores data in HDFS in a distributed, scalable, reliable way for big data. It is column-oriented and optimized for random read/write access to big data in real-time. HBase is not a relational database and relies on HDFS. Common use cases include flexible schemas, high read/write rates, and real-time analytics. Apache Phoenix provides a SQL interface for HBase, allowing SQL queries, joins, and familiar constructs to manage data in HBase tables.
In this session you will learn:
HBase Introduction
Row & Column storage
Characteristics of a huge DB
What is HBase?
HBase Data-Model
HBase vs RDBMS
HBase architecture
HBase in operation
Loading Data into HBase
HBase shell commands
HBase operations through Java
HBase operations through MR
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
This document discusses integrating Apache Hive and HBase. It provides an overview of Hive and HBase, describes use cases for querying HBase data using Hive SQL, and outlines features and improvements for Hive and HBase integration. Key points include mapping Hive schemas and data types to HBase tables and columns, pushing filters and other operations down to HBase, and using a storage handler to interface between Hive and HBase. The integration allows analysts to query both structured Hive and unstructured HBase data using a single SQL interface.
Intro to HBase Internals & Schema Design (for HBase users) - alexbaranau
This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers best practices for schema design, such as using row keys efficiently and avoiding redundancy. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B... - HBaseCon
Speakers: Chris Huang and Scott Miao (Trend Micro)
Trend Micro collects a great deal of threat knowledge data for clients, containing many different threat (web) entities. Most threat entities are observed along with relations, such as malicious behaviors or interaction chains among them. So, we built a graph model on HBase to store all the known threat entities and their relationships, allowing clients to query threat relationships via any given threat entity. This presentation covers the problems we try to solve, the design decisions we made and how we made them, how we designed the graph model, and the graph computation tasks involved.
Chicago Data Summit: Apache HBase: An Introduction - Cloudera, Inc.
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
The document discusses data management for analytics. It describes how traditional relational databases do not scale well for big data due to strict structure and synchronization requirements. It then summarizes NoSQL databases as more scalable alternatives that trade strict structure for flexibility and relax synchronization. Specific NoSQL databases discussed include key-value stores, document databases, wide-column stores, and columnar databases. Distributed file systems like HDFS are also covered.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... - HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
HBase is an open-source, distributed, column-oriented database that runs on top of Hadoop. It provides real-time read and write access to large amounts of data across clusters of commodity hardware. HBase scales to billions of rows and millions of columns and is used by companies like Twitter, Adobe, and Yahoo to store large datasets. It uses a master-slave architecture with a single HBaseMaster and multiple RegionServers and stores data in Hadoop's HDFS for high availability.
HBase is an open-source, distributed, versioned, non-relational database built on top of Hadoop. It is modeled after Google's BigTable and provides random real-time read/write access to large datasets stored on HDFS. HBase scales to billions of rows and millions of columns and is used by companies like Twitter, Adobe, and Yahoo to store large datasets. It uses a master-slave architecture with an HBase master managing region servers that store the data.
HBase is a distributed column-oriented database built on top of Hadoop that provides random real-time read/write access to big data stored in Hadoop. It uses a master server to assign regions to region servers and Zookeeper to track servers and coordinate tasks. HBase allows users to perform CRUD operations on tables through its shell interface using commands like create, put, get, and scan.
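The scan operation mentioned above — reading a contiguous slice of rows in key order — can be sketched in memory as follows. This is an illustrative model of scan semantics, not the real client API; the table contents are made up:

```python
# Sketch of scan semantics: rows are kept sorted by key, and a scan returns
# the half-open range [start_row, stop_row) in key order.
table = {
    "row1": {"info:name": "alice"},
    "row2": {"info:name": "bob"},
    "row3": {"info:name": "carol"},
}

def scan(table, start_row="", stop_row=None):
    for key in sorted(table):          # rows come back in row-key order
        if key < start_row:
            continue
        if stop_row is not None and key >= stop_row:
            break                      # stop row is exclusive, as in HBase
        yield key, table[key]

print([k for k, _ in scan(table, "row1", "row3")])  # -> ['row1', 'row2']
```

Because rows are physically sorted by key, this kind of range read is cheap, which is why key design determines which scans are efficient.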
HBase is a distributed column-oriented database built on top of Hadoop that provides quick random access to large amounts of structured data. It leverages the fault tolerance of HDFS and allows for real-time read/write access to data stored in HDFS. HBase sits above HDFS and provides APIs for reading and writing data randomly. It is a scalable, schema-less database modeled after Google's Bigtable.
Do you need to deal with massive data, where hundreds of gigabytes growing into terabytes or even petabytes are part of your day-to-day? Do you need to perform thousands of operations per second on multiple terabytes of data? Come and learn about Apache HBase, a NoSQL database that runs on top of HDFS and is highly available, fault tolerant, and scalable. HBase has been widely used at companies such as Facebook and Twitter. This talk gives an introduction, showing what HBase is and when to use it, its architecture, and examples of real solutions from large companies such as Facebook, Twitter, and Trend Micro.
Hive was initially developed by Facebook to manage large amounts of data stored in HDFS. It uses a SQL-like query language called HiveQL to analyze structured and semi-structured data. Hive compiles HiveQL queries into MapReduce jobs that are executed on a Hadoop cluster. It provides mechanisms for partitioning, bucketing, and sorting data to optimize query performance.
The document provides information on various components of the Hadoop ecosystem including Pig, Zookeeper, HBase, Spark, and Hive. It discusses how HBase offers random access to data stored in HDFS, allowing for faster lookups than HDFS alone. It describes the architecture of HBase including its use of Zookeeper, storage of data in regions on region servers, and secondary indexing capabilities. Finally, it summarizes Hive and how it allows SQL-like queries on large datasets stored in HDFS or other distributed storage systems using MapReduce or Spark jobs.
What is HBase?
● Benefits
● Why HBase?
● High Level Architecture
● Terminology
● When to use HBase?
● Trivia
#database #no-sql #distributed #bigdata
The document discusses NoSQL databases, describing their characteristics like being non-relational, scalable, and schema-free. It covers different types of NoSQL databases like key-value stores, wide column stores, document stores, and graph databases. The document also discusses where NoSQL databases are particularly useful compared to relational databases and gives examples of companies using NoSQL.
HBase is a distributed, column-oriented database that is modeled after Google's Bigtable. It runs on top of HDFS and provides real-time read/write access to large datasets. HBase tables are split into regions that can be distributed across multiple servers. It uses a log-structured merge tree to store data on disk for efficient read/write operations. HBase is well suited for handling large volumes of randomly accessible data.
Hive is a data warehouse system for querying large datasets using SQL. Version 0.6 added views, multiple databases, dynamic partitioning, and storage handlers. Version 0.7 will focus on concurrency control, statistics collection, indexing, and performance improvements. Hive has become a top-level Apache project and aims to improve security, testing, and integration with other Hadoop components in the future.
HBase is a distributed, column-oriented database built on top of HDFS that can handle large datasets across a cluster. It uses a map-reduce model where data is stored as multidimensional sorted maps across nodes. Data is first written to a write-ahead log and memory, then flushed to disk files and compacted for efficiency. Client applications access HBase programmatically through APIs rather than SQL. Map-reduce jobs on HBase use input, mapper, reducer, and output classes to process table data in parallel across regions.
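The write path summarized above — append to a write-ahead log, buffer in memory, flush to immutable sorted files, then compact — can be sketched as a toy model. Class names, the flush threshold, and the file representation are illustrative assumptions, not HBase internals:

```python
# Toy sketch of the LSM-style write path: WAL append, in-memory memstore,
# flush to immutable sorted files, and compaction merging files into one.
class ToyStore:
    def __init__(self, flush_threshold=2):
        self.wal = []          # write-ahead log of (key, value) appends
        self.memstore = {}     # in-memory buffer, newest value per key
        self.hfiles = []       # flushed, immutable sorted lists
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))      # durability first
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Memstore contents become an immutable, sorted on-"disk" file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:           # newest data is in memory
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # then newest file wins
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all files; later files overwrite earlier versions of a key.
        merged = {}
        for hfile in self.hfiles:
            merged.update(dict(hfile))
        self.hfiles = [sorted(merged.items())]

s = ToyStore()
s.put("a", 1); s.put("b", 2)    # triggers a flush
s.put("a", 3); s.put("c", 4)    # triggers a second flush
s.compact()
print(s.get("a"), len(s.hfiles))  # -> 3 1
```

Reads consult the memstore first and files newest-to-oldest, which is exactly why compaction matters: fewer files means fewer places a read has to look.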
CCS334 BIG DATA ANALYTICS UNIT 5 PPT ELECTIVE PAPER - KrishnaVeni451953
HBase is an open source, column-oriented database built on top of Hadoop that allows for the storage and retrieval of large amounts of sparse data. It provides random real-time read/write access to this data stored in Hadoop and scales horizontally. HBase features include automatic failover, integration with MapReduce, and storing data as multidimensional sorted maps indexed by row, column, and timestamp. The architecture consists of a master server (HMaster), region servers (HRegionServer), regions (HRegions), and Zookeeper for coordination.
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
Data Explosion
- TBs of data generated every day
- Solution: HDFS to store data and the Hadoop MapReduce framework to parallelize processing of data
What is the catch?
- Hadoop MapReduce is Java intensive
- Thinking in the MapReduce paradigm can get tricky
Big Data and NoSQL for Database and BI Pros - Andrew Brust
This document discusses how Microsoft business intelligence (BI) tools can integrate with big data technologies like Hadoop. It describes how Excel, PowerPivot, and SQL Server Analysis Services can connect to Hadoop data stored in HDFS or Hive via an ODBC driver. It also explains how SQL Server Parallel Data Warehouse uses PolyBase to directly query Hadoop, bypassing MapReduce. The document provides an overview of Hadoop concepts like MapReduce, HDFS, and Hive, as well as ETL tools like Sqoop that can move data between Hadoop and other data sources.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
2. About Me
• Jonathan Gray
- HBase Committer
- HBase User since early 2008
- Migrated large PostgreSQL instance to HBase
• In production @ streamy.com since June 2008
- Core contributor to performance improvements in HBase 0.20
- Currently consulting around HBase
• As well as Hadoop/MR and Lucene/Katta
3. Overview
• Why HBase?
• What is HBase?
• How does HBase work?
• HBase Today and Tomorrow
• HBase vs. RDBMS Example
• HBase and "NoSQL"
4. Why HBase?
• Same reasons we need Hadoop
- Datasets growing into Terabytes and Petabytes
- Scaling out is cheaper than scaling up
• Continue to grow just by adding commodity nodes
• But sometimes Hadoop is not enough
- Need to support random reads and random writes
Traditional databases are expensive to scale and difficult to distribute
5. What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System
Project Goal
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers
6. HBase is not…
• A Traditional SQL Database
- No joins, no query engine, no types, no SQL
- Transactions and secondary indexing possible but these are add-ons, not part of core HBase
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
- Denormalized data
- Wide and sparsely populated tables
Just say "no" to your inner DBA
7. How does HBase work?
• Two types of HBase nodes: Master and RegionServer
• Master (one at a time)
- Manages cluster operations
- Assignment, load balancing, splitting
- Not part of the read/write path
- Highly available with ZooKeeper and standbys
• RegionServer (one or more)
- Hosts tables; performs reads, buffers writes
- Clients talk directly to them for reads/writes
8. HBase Tables
• An HBase cluster is made up of any number of user-defined tables
• Table schema only defines its column families
- Each family consists of any number of columns
- Each column consists of any number of versions
- Columns only exist when inserted; NULLs are free
- Everything except table/family names are byte[]
- Rows in a table are sorted and stored sequentially
- Columns in a family are sorted and stored sequentially
(Table, Row, Family, Column, Timestamp) → Value
9. HBase Table as Data Structures
• A table maps rows to its families
- SortedMap(Row → List(ColumnFamilies))
• A family maps column names to versioned values
- SortedMap(Column → SortedMap(VersionedValues))
• A column maps timestamps to values
- SortedMap(Timestamp → Value)
An HBase table is a three-dimensional sorted map (row, column, and timestamp)
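The nested-map view above can be sketched with plain Java TreeMaps. This is a toy in-memory model only (the class and method names are illustrative, not HBase API types; real HBase stores byte[] keys and values in HFiles on HDFS), but it shows the row → family → column → timestamp lookup path:

```java
import java.util.TreeMap;

// Toy model of the HBase data model: a table is a sorted map of
// row -> family -> column -> timestamp -> value.
public class TableModel {
    // Nested sorted maps, one level per dimension of the model
    private static final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> TABLE =
            new TreeMap<>();

    // Insert a single versioned cell
    static void put(String row, String family, String column, long ts, String value) {
        TABLE.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }

    // Lookup walks the same four dimensions: row, family, column, timestamp
    static String get(String row, String family, String column, long ts) {
        return TABLE.get(row).get(family).get(column).get(ts);
    }

    public static void main(String[] args) {
        put("com.example/", "content", "type", 1L, "text/html");
        System.out.println(get("com.example/", "content", "type", 1L)); // prints text/html
    }
}
```

Because every level is a sorted map, rows and columns come back in sorted order for free, which is what makes range scans efficient.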
10. HBase Regions
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
- Empty table: (Table, NULL, NULL)
- Two-region table: (Table, NULL, "MidKey") and (Table, "MidKey", NULL)
• A region only lives on one RegionServer at a time
• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
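Because regions partition the sorted row space by [startKey, endKey), finding the region for a row is a floor lookup on start keys: the region whose startKey is the greatest key less than or equal to the row. A minimal sketch with hypothetical names (the real client does this against cached .META. entries):

```java
import java.util.TreeMap;

// Sketch of region location for the two-region table above.
// An empty startKey ("") stands in for NULL (the first region).
public class RegionLocator {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public RegionLocator() {
        // (Table, NULL, "MidKey") and (Table, "MidKey", NULL)
        regionsByStartKey.put("", "region-1");
        regionsByStartKey.put("MidKey", "region-2");
    }

    // The region for a row is the one with the greatest startKey <= row
    public String regionFor(String row) {
        return regionsByStartKey.floorEntry(row).getValue();
    }

    public static void main(String[] args) {
        RegionLocator loc = new RegionLocator();
        System.out.println(loc.regionFor("Apple")); // prints region-1
        System.out.println(loc.regionFor("Zebra")); // prints region-2
    }
}
```

When a region splits, only a new startKey entry is added; no data needs to be rewritten for clients to route correctly.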
11. More HBase Architecture
• Region information and locations stored in special tables called catalog tables
- -ROOT- table contains location of .META. table
- .META. table contains schema/locations of user regions
• Location of -ROOT- is stored in ZooKeeper
- This is the "bootstrap" location
• ZooKeeper is used for coordination / monitoring
- Leader election to decide who is master
- Ephemeral nodes to detect RegionServer node failures
14. HBase Key Features
• Automatic partitioning of data
- As data grows, it is automatically split up
• Transparent distribution of data
- Load is automatically balanced across nodes
• Tables are ordered by row, rows by column
- Designed for efficient scanning (not just gets)
- Composite keys allow ORDER BY / GROUP BY
• Server-side filters
• No SPOF because of ZooKeeper integration
15. HBase Key Features (cont)
• Fast adding/removing of nodes while online
- Moving locations of data doesn't move data
• Supports creating/modifying tables online
- Both table-level and family-level configuration parameters
• Close ties with Hadoop MapReduce
- TableInputFormat/TableOutputFormat
- HFileOutputFormat
16. Connecting to HBase
• Native Java Client/API
- Get, Scan, Put, Delete classes
- HTable for read/write, HBaseAdmin for admin stuff
• Non-Java Clients
- Thrift server (Ruby, C++, PHP, etc)
- REST server (Stargate contrib)
• HBase Shell
- JRuby shell supports put, delete, get, scan
- Also supports administrative tasks
• TableInputFormat/TableOutputFormat
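A minimal sketch of the 0.20-era native Java client named above (this fragment assumes a running cluster and the HBase jars on the classpath; the table, row, and family names are illustrative):

```java
// Connect to the "crawl" table (cluster settings come from hbase-site.xml)
HBaseConfiguration conf = new HBaseConfiguration();
HTable table = new HTable(conf, "crawl");

// Write: one Put per row; family, qualifier, and value are all byte[]
Put put = new Put(Bytes.toBytes("com.example/"));
put.add(Bytes.toBytes("content"), Bytes.toBytes("type"), Bytes.toBytes("text/html"));
table.put(put);

// Read a single row
Get get = new Get(Bytes.toBytes("com.example/"));
Result result = table.get(get);
byte[] type = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("type"));

// Scan a row range; rows come back in sorted key order
Scan scan = new Scan(Bytes.toBytes("com."), Bytes.toBytes("com/"));
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
    // process each row in the range
}
scanner.close();
```

Note there is no query language here: the client addresses data directly by row key, family, and qualifier, which is why row-key design matters so much in HBase.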
17. HBase Add-ons
• MapReduce / Cascading / Hive / Pig
- Support for HBase as a data source or sink
• Transactional HBase
- Distributed transactions using OCC
• Indexed HBase
- Utilizes Transactional HBase for secondary indexing
• IHBase
- New contrib for in-memory secondary indexes
• HBql
- SQL syntax on top of HBase
18. HBase Today
• Latest stable release is HBase 0.20.3
- Major improvement over HBase 0.19
- Focus on performance improvement
- Add ZooKeeper, remove SPOF
- Expansion of in-memory and caching capabilities
- Compatible with Hadoop 0.20.x
- Recommend upgrading from earlier 0.20.x HBase releases as 0.20.3 includes some important fixes
• Improves logging, shell, faster cluster ops, stability
19. HBase in Production
• Streamy
• StumbleUpon
• Adobe
• Meetup
• Ning
• Openplaces
• Powerset
• SocialMedia.com
• TrendMicro
20. The Future of HBase
• Next release is HBase 0.21.0
- Release date will be ~1 month after Hadoop 0.21
• Data durability is fixed in this release
- HDFS append/sync finally works in Hadoop 0.21
- This is implemented and working on TRUNK
- Have added group commit and knobs to adjust
• Other cool features
- Inter-cluster replication
- Master Rewrite
- Parallel Puts
- Co-processors
21. HBase Web Crawl Example
• Store web crawl data
- Table crawl with family content
- Row is URL with Columns
• content:data stores raw crawled data
• content:language stores http language header
• content:type stores http content-type header
- If processing raw data for hyperlinks and images, add families links and images
• links:<url> column for each hyperlink
• images:<url> column for each image
22. Web Crawl Example in RDBMS
• How would this look in a traditional DB?
- Table crawl with columns url, data, language, and type
- Table links with columns url and link
- Table images with columns url and image
• How will this scale?
- 10M documents w/ avg 10 links and 10 images
- 210M total rows versus 10M total rows
- Index bloat with links/images tables
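The row-count comparison above is simple arithmetic: the normalized design needs one row per document plus one per link plus one per image, while HBase folds links and images into columns of the document's row. A quick check:

```java
public class RowCountCheck {
    public static void main(String[] args) {
        long docs = 10_000_000L;          // 10M crawled documents
        long linksPerDoc = 10;            // average links per document
        long imagesPerDoc = 10;           // average images per document

        // RDBMS: one row per document, per link, and per image
        long rdbmsRows = docs + docs * linksPerDoc + docs * imagesPerDoc;

        // HBase: links/images become columns in the document's own row
        long hbaseRows = docs;

        System.out.println(rdbmsRows); // prints 210000000
        System.out.println(hbaseRows); // prints 10000000
    }
}
```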
23. What is "NoSQL"?
• Has little to do with not being SQL
- SQL is just a query language standard
- HBql is an attempt to add SQL syntax to HBase
- Millions are trained in SQL; resistance is futile!
• Popularity of Hive and Pig over raw MapReduce
• Has more to do with anti-RDBMS architecture
- Dropping the relational aspects
- Loosening ACID and transactional elements
24. NoSQL Types and Projects
• Column-oriented
- HBase, Cassandra, Hypertable
• Key/Value
- BerkeleyDB, Tokyo, Memcache, Redis, SimpleDB
• Document
- CouchDB, MongoDB
• Other differentiators as well…
- Strong vs. Eventual consistency
- Database replication vs. Filesystem replication