Hive provides an SQL-like interface to query data stored in Hadoop's HDFS distributed file system and processed using MapReduce. It allows users without MapReduce programming experience to write queries that Hive then compiles into a series of MapReduce jobs. The document discusses Hive's components, data model, query planning and optimization techniques, and performance compared to other frameworks like Pig.
Hadoop and Hive are used at Facebook for large scale data processing and analytics using commodity hardware and open source software. Hive provides an SQL-like interface to query large datasets stored in Hadoop and translates queries into MapReduce jobs. It is used for daily/weekly data aggregations, ad-hoc analysis, data mining, and other tasks using datasets exceeding petabytes in size stored on Hadoop clusters.
The document discusses some of the challenges of managing Hadoop clusters in the cloud, including setting up infrastructure components like the Hive metastore and determining optimal cluster sizing. It then presents some solutions offered by Qubole's data platform, like auto-scaling clusters and running periodic jobs. The document also covers techniques for improving query performance, such as using HDFS as a cache layer and storing data in columnar format for faster access compared to JSON or CSV files stored in S3.
Facebook Retrospective - Big data-world-europe-2012 (Joydeep Sen Sarma)
This document provides a retrospective on data infrastructure at Facebook from 2007-2011 written by the ex-Facebook data infrastructure lead. It summarizes the goals of building a universal data logging and computing platform, the state and growth of the Hadoop cluster from 10TB to 50PB, and key components like Hive, Scribe, and reporting tools that helped various teams access and analyze data. It also discusses challenges around query performance, unnecessary duplication, and a lack of APIs that were missed opportunities. The overall message is that building useful services around the software was more important than the software itself.
What do you talk about to a hall full of database gurus? Instead of science - my talk focused on the art. What made Hadoop successful? What can we learn from it? What principles work well in building software for large scale services? What are some interesting unsolved problems in a world overrun by open-source (and VC investments :-))
Qubole is a big data as a service platform that allows users to run analytics jobs on AWS infrastructure. It integrates tightly with various AWS services like EC2, S3, Redshift, and Kinesis. Qubole handles cluster provisioning and management, provides tools for interactive querying using Presto, and allows customers to access data across different AWS data platforms through a single interface. Some key benefits of Qubole include simplified management of AWS resources, optimized performance through techniques like auto-scaling and caching, and unified analytics platform for tools like Hive, Spark and Presto.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
Big data and Hadoop are introduced as ways to handle the increasing volume, variety, and velocity of data. Hadoop evolved as a solution to process large amounts of unstructured and semi-structured data across distributed systems in a cost-effective way using commodity hardware. It provides scalable and parallel processing via MapReduce and HDFS distributed file system that stores data across clusters and provides redundancy and failover. Key Hadoop projects include HDFS, MapReduce, HBase, Hive, Pig and Zookeeper.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
This document provides an overview and introduction to Hadoop, HDFS, and MapReduce. It covers the basic concepts of HDFS, including how files are stored in blocks across data nodes, and the role of the name node and data nodes. It also explains the MapReduce programming model, including the mapper, reducer, and how jobs are split into parallel tasks. The document discusses using Hadoop from the command line and writing MapReduce jobs in Java. It also mentions some other projects in the Hadoop ecosystem like Pig, Hive, HBase and Zookeeper.
Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will even cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. It is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
Hadoop @ eBay: Past, Present, and Future (Ryan Hennig)
An overview of eBay's experience with Hadoop in the Past and Present, as well as directions for the Future. Given by Ryan Hennig at the Big Data Meetup at eBay in Netanya, Israel on Dec 2, 2013
HBaseCon2017: Community-Driven Graphs with JanusGraph (HBaseCon)
Graphs are well-suited for many use cases to express and process complex relationships among entities in enterprise and social contexts. Fueled by the growing interest in graphs, there are various graph databases and processing systems that dot the graph landscape. JanusGraph is a community-driven project that continues the legacy of Titan, a pioneer of open source graph databases. JanusGraph is a scalable graph database optimized for large-scale transactional and analytical graph processing. In the session, we will introduce JanusGraph, which features full integration with the Apache TinkerPop graph stack. We will discuss JanusGraph's optimized storage model that relies on HBase for fast graph traversal and processing.
by Jason Plurad and Jing Chen He of IBM
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a programming model called MapReduce where developers write mapping and reducing functions that are automatically parallelized and executed on a large cluster. Hadoop also includes HDFS, a distributed file system that stores data across nodes providing high bandwidth. Major companies like Yahoo, Google and IBM use Hadoop to process large amounts of data from users and applications.
Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)
The future of HBase, via a variety of viewpoints.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses challenges in big data by providing reliability, scalability, and fault tolerance. Hadoop allows distributed processing of large datasets across clusters using MapReduce and can scale from single servers to thousands of machines, each offering local computation and storage. It is widely used for applications such as log analysis, data warehousing, and web indexing.
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase (Cloudera, Inc.)
Valta is a resource management layer over Apache HBase that aims to address issues with shared workloads on a single HBase cluster. It introduces resource limits for HBase clients to prevent ill-behaved clients from monopolizing cluster resources. This is an initial step, and more work is needed to address request scheduling across HBase, HDFS, and lower layers to meet service level objectives. The document outlines ideas for full-stack request scheduling, auto-tuning systems based on high-level SLOs, and using multiple read replicas to improve latency.
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
eBay has one of the largest Hadoop clusters in the industry with many petabytes of data. This talk will give an overview of how Hadoop and HBase have been used within eBay, the lessons we have learned from supporting large-scale production clusters, as well as how we plan to use and improve Hadoop and HBase moving forward. Specific use cases, production issues and platform improvement work will be discussed.
This document provides an overview of Apache Hadoop and HBase. It begins with an introduction to why big data is important and how Hadoop addresses storing and processing large amounts of data across commodity servers. The core components of Hadoop, HDFS for storage and MapReduce for distributed processing, are described. An example MapReduce job is outlined. The document then introduces the Hadoop ecosystem, including Apache HBase for random read/write access to data stored in Hadoop. Real-world use cases of Hadoop at companies like Yahoo, Facebook and Twitter are briefly mentioned before addressing questions.
HBaseConAsia2018: Track2-5: JanusGraph - Distributed graph database with HBase (Michael Stack)
This document provides an introduction to JanusGraph, an open source distributed graph database that can be used with Apache HBase for storage. It begins with background on graph databases and their structures, such as vertices, edges, properties, and different storage models. It then discusses JanusGraph's architecture, support for the TinkerPop graph computing framework, and schema and data modeling capabilities. Details are given on partitioning graphs across servers and using different indexing approaches. The document concludes by explaining why HBase is a good storage backend for JanusGraph and providing examples of how the data model would be structured within HBase.
This document discusses using Sqoop to transfer data between relational databases and Hadoop. It begins by providing context on big data and Hadoop. It then introduces Sqoop as a tool for efficiently importing and exporting large amounts of structured data between databases and Hadoop. The document explains that Sqoop allows importing data from databases into HDFS for analysis and exporting summarized data back to databases. It also outlines how Sqoop works, including providing a pluggable connector mechanism and allowing scheduling of jobs.
Lars George and Jon Hsieh presented archetypes for common Apache HBase application patterns. They defined archetypes as common architecture patterns extracted from multiple use cases to be repeatable. The presentation covered "good" archetypes that are well-suited to HBase's capabilities, such as storing simple entities, messaging data, and metrics. "Bad" archetypes that are not optimal fits for HBase included using it as a large blob store, naively porting a relational database schema, and as an analytic archive requiring frequent full scans. A discussion of access patterns and tradeoffs concluded the overview of HBase application archetypes.
This document provides an introduction to the Pig analytics platform for Hadoop. It begins with an overview of big data and Hadoop, then discusses the basics of Pig including its data model, language called Pig Latin, and components. Key points made are that Pig provides a high-level language for expressing data analysis processes, compiles queries into MapReduce programs for execution, and allows for easier programming than lower-level systems like Java MapReduce. The document also compares Pig to SQL and Hive, and demonstrates visualizing Pig jobs with the Twitter Ambrose tool.
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
This document provides an overview and introduction to BigData using Hadoop and Pig. It begins with introducing the speaker and their background working with large datasets. It then outlines what will be covered, including an introduction to BigData, Hadoop, Pig, HBase and Hive. Definitions and examples are provided for each. The remainder of the document demonstrates Hadoop and Pig concepts and commands through code examples and explanations.
Hadoop and Hive Development at Facebook (elliando dias)
Facebook generates large amounts of user data daily from activities like status updates, photo uploads, and shared content. This data is stored in Hadoop using Hive for analytics. Some key facts:
- Facebook adds 4TB of new compressed data daily to its Hadoop cluster.
- The cluster has 4800 cores and 5.5PB of storage, with 12TB of disk per node.
- Hive is used for over 7500 jobs daily and by around 200 engineers/analysts monthly.
- Performance improvements to Hive include lazy deserialization, map-side aggregation, and join optimizations.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
Hadoop is a scalable distributed system for storing and processing large datasets across commodity hardware. It consists of HDFS for storage and MapReduce for distributed processing. A large ecosystem of additional tools like Hive, Pig, and HBase has also developed. Hadoop provides significantly lower costs for data storage and analysis compared to traditional systems and is well-suited to unstructured or structured big data. It has seen wide adoption at companies like Yahoo, Facebook, and eBay for applications like log analysis, personalization, and fraud detection.
Chicago Data Summit: Apache HBase: An Introduction (Cloudera, Inc.)
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
This document summarizes the evolution of Hive, a data warehouse infrastructure built on top of Hadoop. It discusses Hive's origins at Facebook to manage large, unstructured data. Key points include Hive now functioning as a parallel SQL database using Hadoop for storage and execution. The document outlines new features in versions 0.6 and 0.7 like views, dynamic partitioning, and pluggable indexing. It also discusses Hive's roadmap for testing, performance improvements, and new capabilities.
Hive Training -- Motivations and Real World Use Cases (nzhang)
Hive is an open source data warehouse system built on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
Hw09 Hadoop Development At Facebook Hive And Hdfs (Cloudera, Inc.)
This document discusses Hadoop and Hive development at Facebook, including how they generate large amounts of user data daily, how they store the data in Hadoop clusters, and how they use Hive as a data warehouse to efficiently query the Hadoop data with a SQL-like language. It also outlines some of Hive's architecture and features like partitioning, buckets, and UDF/UDAF support, as well as its performance improvements over time and future planned work.
The document discusses Hadoop, its components, and how they work together. It covers HDFS, which stores and manages large files across commodity servers; MapReduce, which processes large datasets in parallel; and other tools like Pig and Hive that provide interfaces for Hadoop. Key points are that Hadoop is designed for large datasets and hardware failures, HDFS replicates data for reliability, and MapReduce moves computation instead of data for efficiency.
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was created in 2005 and is designed to reliably handle large volumes of data and complex computations in a distributed fashion. The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing data in parallel across large clusters of computers. It is widely adopted by companies handling big data like Yahoo, Facebook, Amazon and Netflix.
Hadoop is an open-source software platform for distributed storage and processing of large datasets across clusters of computers. It was created to enable applications to work with data beyond the limits of a single computer by distributing workloads and data across a cluster. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed processing, and YARN for distributed resource management.
This document provides an overview of big data concepts, including NoSQL databases, batch and real-time data processing frameworks, and analytical querying tools. It discusses scalability challenges with traditional SQL databases and introduces horizontal scaling with NoSQL systems like key-value, document, column, and graph stores. MapReduce and Hadoop are described for batch processing, while Storm is presented for real-time processing. Hive and Pig are summarized as tools for running analytical queries over large datasets.
5. OMG - NoSQL looks like a DBMS!
- Zookeeper for coordination: atomic primitives (CAS), notifications, heartbeats. Not a key-value store, not a file system, not a database. Like a DLM in a database; locks/leases can be built easily on top.
- HDFS for large objects: shared storage, like NFS or a SAN.
- HBase for small objects: every DB has an index at its heart; a shared-storage DB (like Oracle).
- Map-Reduce/Hive/Pig etc. for analytics.
- Missing: transaction monitor, automatic/transactional view/secondary-index maintenance, triggers, remote replication, etc.
6. Why Hive? Human cost >> machine cost.
- Hadoop's programming model is awesome, but map-reduce is impossible for non-engineers, and training 100s of engineers in Java map-reduce is hard. Map-reduce is non-trivial even for engineers (sort vs. partition vs. grouping comparator, anyone?).
- It is much, much easier to write a SQL query: almost zero training cost, and hard things become easy.
- Files are an insufficient data management abstraction. Tables, schemas, partitions, and indices provide metadata that allows optimization, discovery, and browsing.
- Embrace all data formats: complex data types, columnar or not, lazy retrieval.
7. Quick examples.
Create some tables:
  CREATE TABLE ad_imps (ad_id string, userid bigint, url string) PARTITIONED BY (ds string);
  CREATE TABLE dim_ads (ad_id string, campaign_id string) STORED AS TEXTFILE;
Group-by + join:
  SELECT a.campaign_id, count(1), count(DISTINCT b.user_id)
  FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
  WHERE b.dateid = '2008-12-01'
  GROUP BY a.campaign_id;
Custom transform + view:
  ADD FILE url_to_cat.py;
  CREATE VIEW tmp_adid_cat AS
  SELECT TRANSFORM (ad_id, url) USING 'url_to_cat.py' AS (ad_id, cat)
  FROM ad_imps WHERE ds = '2008-12-01';
  SELECT a.campaign_id, b.cat, count(1)
  FROM dim_ads a JOIN tmp_adid_cat b ON (b.ad_id = a.ad_id)
  GROUP BY a.campaign_id;
9. Data model

| Hive Entity | Sample Metastore Entity | Sample HDFS Location |
|---|---|---|
| Table | T | /wh/T |
| Partition | date=d1 | /wh/T/date=d1 |
| Bucketing column | userid | /wh/T/date=d1/part-0000, /wh/T/date=d1/part-1000 (hashed on userid) |
| External Table | extT | /wh2/existing/dir (arbitrary location) |
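A minimal HiveQL sketch of how the entities in this table are declared; the column names, data types, and bucket count are illustrative rather than taken from the deck:

```sql
-- Managed table: data lives under the warehouse directory (e.g. /wh/T),
-- partitioned by date and bucketed (hashed) on userid.
CREATE TABLE T (userid BIGINT, url STRING)
PARTITIONED BY (`date` STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Adding a partition creates a directory such as /wh/T/date=d1;
-- the bucket files (part-0000, part-1000, ...) live inside it.
ALTER TABLE T ADD PARTITION (`date` = 'd1');

-- External table over an arbitrary existing directory.
CREATE EXTERNAL TABLE extT (userid BIGINT, url STRING)
LOCATION '/wh2/existing/dir';
```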
10. Using Hive: a quick planner
- How to store data: binary or text? Compressed or not? RCFile saves space over SequenceFile.
- How to partition and load data: initial data load using dynamic partitioning; incremental loading (appending data to existing partitions? mutable partitions/data? hard!); managing space consumption (use RETENTION).
- Performance tuning: learn to read EXPLAIN plans.
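A hedged sketch of these storage and loading choices in HiveQL (roughly Hive 0.6/0.7-era features); ad_imps_rc and ad_imps_staging are hypothetical tables used only for illustration:

```sql
-- Columnar, compressed storage: RCFile saves space over SequenceFile.
SET hive.exec.compress.output=true;
CREATE TABLE ad_imps_rc (ad_id STRING, userid BIGINT, url STRING)
PARTITIONED BY (ds STRING)
STORED AS RCFILE;

-- Initial load with dynamic partitioning: partitions are created from
-- the ds values found in the staging data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE ad_imps_rc PARTITION (ds)
SELECT ad_id, userid, url, ds FROM ad_imps_staging;

-- Incremental load of one partition (overwrite semantics; true appends
-- to an existing partition are harder, as the slide notes).
INSERT OVERWRITE TABLE ad_imps_rc PARTITION (ds = '2008-12-02')
SELECT ad_id, userid, url FROM ad_imps_staging WHERE ds = '2008-12-02';

-- Read the plan before tuning further.
EXPLAIN SELECT count(1) FROM ad_imps_rc WHERE ds = '2008-12-02';
```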
11. Join processing
- Sort-merge joins: vanilla map-reduce, using the join key for sorting. A single MR job handles multiple joins on the same key: FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key. Put the largest table last (reduces memory usage).
- Map-joins: load the smaller table into memory on each mapper; no sorting. Automatic map-join in Hive 0.7.
- Bucketed (map) join: a map-side join when the join key is the same as the bucketing key.
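The join variants above, sketched in HiveQL against the earlier example tables; the configuration parameters are standard Hive options, though exact version behavior should be treated as approximate:

```sql
-- Explicit map-join: hint that dim_ads (the small table) should be loaded
-- into memory on each mapper, avoiding the reduce-side sort entirely.
SELECT /*+ MAPJOIN(a) */ a.campaign_id, count(1)
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
GROUP BY a.campaign_id;

-- Hive 0.7: let the planner convert eligible joins to map-joins automatically.
SET hive.auto.convert.join=true;

-- Bucketed map-join: applies when both tables are bucketed on the join key.
SET hive.optimize.bucketmapjoin=true;
```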
12. Group-by processing
- Hash-based map-side aggregation: ~90% improvement for the count(1) aggregate; automatically reverts to regular map-reduce aggregation if cardinality is too high; can be turned off to fall back to a regular sort-based group-by.
- Handling skews in groups with a 2-stage MR job: the first MR partitions on a random value (or the distinct column, for distinct queries) and computes partial aggregates; the second MR computes the full aggregates.
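The settings that drive these behaviors, shown as a sketch against the ad_imps example table (the parameter names are standard Hive options; defaults vary by version):

```sql
-- Hash-based map-side (partial) aggregation; Hive falls back to sort-based
-- aggregation if the in-memory hash table grows too large.
SET hive.map.aggr=true;

-- Skewed group keys: rewrite into two MR jobs, the first spreading rows
-- across reducers randomly and computing partial aggregates, the second
-- producing the final aggregates.
SET hive.groupby.skewindata=true;

SELECT userid, count(1)
FROM ad_imps
WHERE ds = '2008-12-01'
GROUP BY userid;
```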
13. Common issues
- Too many files: merge (small) files at the end of the MR job; ARCHIVE partitions with many small files; use CombineHiveInputFormat to reduce the number of mappers over a partition with many small files.
- High latencies for small jobs: scheduling latencies are the killer; optimize the number of map-reduce jobs; use automatic local mode where possible.
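A hedged sketch of the corresponding knobs (standard Hive settings, again using ad_imps as the illustrative table):

```sql
-- Merge small output files at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Feed many small files to fewer mappers.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Archive a partition that has accumulated many small files.
SET hive.archive.enabled=true;
ALTER TABLE ad_imps ARCHIVE PARTITION (ds = '2008-12-01');

-- Run small queries in local mode to dodge cluster scheduling latency.
SET hive.exec.mode.local.auto=true;
```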
14. Other goodies
- Locks using Zookeeper (to prevent read/write races)
- Indexes
- Partitioned views
- Storage handlers (to query HBase directly)
- Statistics collection
- Security
- Future: block sampling and faster LIMIT queries, help for query authoring and data exploration, Hive Server, much faster query execution
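Two of these goodies sketched in Hive 0.7-era syntax (indexing and statistics collection); the index name and table are illustrative:

```sql
-- Compact index on ad_id; built explicitly with REBUILD.
CREATE INDEX ad_imps_adid_idx
ON TABLE ad_imps (ad_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX ad_imps_adid_idx ON ad_imps REBUILD;

-- Collect statistics for one partition (used by the planner and tooling).
ANALYZE TABLE ad_imps PARTITION (ds = '2008-12-01') COMPUTE STATISTICS;
```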
15. Hive vs. ...
- Pig: Hive is not a procedural language, but views are similar to Pig variables. Hive's core philosophy is to integrate with other languages via streaming. Pig does not provide a SQL interface and does not have a metastore or data management model. More similar than different.
- Cascading: no metastore or data management model, no declarative language; seems similar to Hive's internal execution operators.
16. Hive warehouse @ Facebook
- Two primary warehouses (HDFS + MR): high-SLA pipelines and the core data warehouse.
- Core warehouse, Jan 2011: ~2800 nodes, 30 petabytes of disk space, growing to 100PB by end of year.
- Data access per day: ~40 terabytes added (compressed) per day, 25000 map/reduce jobs per day, 300-400 users per month.
17. Hive is just part of the story
- HDFS improvements: HDFS RAID (saved ~5PB), high availability (AvatarNode), NameNode improvements (locking, restart, decommissioning).
- JobScheduler improvements: more efficient/concurrent JobTracker, monitoring and killing of runaway tasks, FairScheduler features (preemption, speculation, FIFO+FAIR).
- Hadoop/Hive administration: rolling upgrades for Hive, tools/configuration for managing multiple clusters, Hive replication.
19. Why/when use HBase?
- A very large indexed data store, with lower management cost than building a MySQL cluster.
- Very high write throughput: a log-structured index trumps a standard B-tree.
- No need for complex multi-row transactions, but durability and strong consistency are required.
- Read performance is not critical: reads from memory are not as fast as memcache, and random reads from disk suffer because of the log structure; reading recently written data is (potentially) faster.
- Killer map-reduce integration is needed.
20. Data model
- HBase table: a collection of column families. HBase table + column family == MySQL table.
- HBase column family: a collection of columns; each column has a key (the column qualifier) and a value. A column in HBase == a row in MySQL.
- Data organization: the table is sharded on row-key; data is indexed on row-key + column family + column qualifier; multiple timestamped versions are kept for each cell (MySQL row).
21. Example
User database in MySQL:
  CREATE TABLE friends (userid INT, friendid INT, since INT, PRIMARY KEY (userid, friendid));
  CREATE TABLE pictures (userid INT, picid BIGINT, at VARCHAR(256), PRIMARY KEY (userid, picid));
Sharded MySQL layout: each MySQL db has friends and pictures tables, and all entries for a given user live in one MySQL db.
Equivalent HBase schema: friends and pictures are column families in a single HBase table with userid as the row-key; friendid and picid values become column qualifiers. The new terminology is bizarre (thanks, Google).
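Tying this back to Hive (and to the storage handlers mentioned earlier), here is a hedged sketch of exposing such an HBase table through Hive's HBase storage handler; the Hive table name, the HBase table name user_data, and the column mapping are purely illustrative:

```sql
-- Hive table mapped onto an HBase table keyed on userid; the whole
-- 'friends' column family is exposed as a MAP of friendid -> since.
CREATE EXTERNAL TABLE hbase_friends (
  userid  BIGINT,
  friends MAP<STRING, STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,friends:")
TBLPROPERTIES ("hbase.table.name" = "user_data");

-- Query one user's friend edges from HBase through Hive.
SELECT userid, friends FROM hbase_friends WHERE userid = 12345;
```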
23. HBase index internals
- LSM trees: data is stored in a series of index files (an index-organized table), with Bloom filters for skipping index files entirely.
- Short-circuiting lookups: retrieve the last value, assuming the client stores the greatest timestamp last; retrieve values stored in a given time range (don't scan all files).
- Multiple index files are compacted into one periodically.
- In-memory caching: recent writes are cached in the MemStore (a skip-list); index headers and Bloom filters are pinned in memory; a block cache (not a row cache) holds index file data blocks.
24. BigTable (HBase) vs. Dynamo (Cassandra, ...)
- Provides strong consistency with a low read penalty; Cassandra requires R > 1 for strong consistency.
- The data resilience story is much better thanks to HDFS: CRCs, replication, block placement.
- Equally fast at disk reads: sticky regions in HBase, fast-path local reads in HDFS.
- No partition tolerance and lower availability: no read replicas, no built-in conflict resolution.
27. HDFS
- Separation of metadata from data: metadata == inodes, attributes, block locations, block replication. A file is a set of data blocks (typically 128MB).
- Architected for large files and streaming reads.
- Highly reliable: each data block is typically replicated 3x to different datanodes; clients compute and verify block checksums (end-to-end).
- Single namenode: all metadata stored in memory, with a passive standby.
- Clients talk to both the namenode and the datanodes; bulk data flows from datanode to client, giving linear scalability.
- Custom client library in Java/C/Thrift; not POSIX, not NFS.
30. Programming with Map/Reduce
Find the most imported package in the Hive source:
  $ find . -name '*.java' -exec egrep '^import' '{}' \; | awk '{print $2}' | sort | uniq -c | sort -nr +0 -1 | head -1
  208 org.apache.commons.logging.LogFactory;
In Map-Reduce:
  1a. Map using: egrep '^import' | awk '{print $2}'
  1b. Reduce on the first column (package name)
  1c. Reduce function: uniq -c
  2a. Map using: awk '{printf "%05d\t%s\n", 100000-$1, $2}'
  2b. Reduce using the first column (inverse counts), 1 reducer
  2c. Reduce function: identity
Scales to terabytes.
31. Rubbing it in ...
  hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
  $ cat > /tmp/reducer.sh
  uniq -c | awk '{print $2"\t"$1}'
  $ cat > /tmp/map.sh
  awk -F '\001' '{if ($1 > 100) print $1}'
  $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
  $ bin/hadoop dfs -cat /tmp/largekey/part*
32. Hive optimizations: merge sequential map-reduce jobs
SQL: FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key SELECT ...
[Slide diagram: tables A (key, av), B (key, bv), and C (key, cv), each with a sample row (key 1; values 111, 222, 333), are joined on key. Instead of one map-reduce job producing AB and a second producing ABC, Hive evaluates both joins on the same key in a single map-reduce job, yielding ABC (key 1: 111, 222, 333).]
Editor's Notes
#2: Offline and near-real-time data processing; not online.
#7: Simple map-reduce is easy but it can get complicated very quickly.
#29: Rack-local and node-local access rocks; scalability is bound by switches.
#30: Point out that, now that we know how HDFS works, we can run maps close to the data.
#31: Point out how data-local computing is useful in this example. It exposes some of the features we need in Hadoop: the output of a reducer can be sent directly to another reducer. As an exercise to the reader, the results from the shell do not equal those from Hadoop; it is interesting to find out why.