Big data refers to the massive amounts of unstructured data that are growing exponentially. Hadoop is an open-source framework that allows processing and storing large data sets across clusters of commodity hardware. It provides reliability and scalability through its distributed file system HDFS and MapReduce programming model. The Hadoop ecosystem includes components like Hive, Pig, HBase, Flume, Oozie, and Mahout that provide SQL-like queries, data flows, NoSQL capabilities, data ingestion, workflows, and machine learning. Microsoft integrates Hadoop with its BI and analytics tools to enable insights from diverse data sources.
This document provides an overview of big data and Hadoop. It defines big data as large volumes of diverse data that cannot be processed by traditional systems. Key characteristics are volume, velocity, variety, and veracity. Popular sources of big data include social media, emails, videos, and sensor data. Hadoop is presented as an open-source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for storage and MapReduce as a programming model. Major tech companies like Google, Facebook, and Amazon are discussed as big players in big data.
This document discusses big data, including what it is, common data sources, its volume, velocity and variety characteristics, solutions like Hadoop and its HDFS and MapReduce components, and the impact and future of big data. It explains that big data refers to large and complex datasets that are difficult to process using traditional tools. Hadoop provides a framework to store and process big data across clusters of commodity hardware.
This document provides an overview of big data. It begins by defining big data and noting that it first emerged in the early 2000s among online companies like Google and Facebook. It then discusses the three key characteristics of big data: volume, velocity, and variety. The document outlines the large quantities of data generated daily by companies and sensors. It also discusses how big data is stored and processed using tools like Hadoop and MapReduce. Examples are given of how big data analytics can be applied across different industries. Finally, the document briefly discusses some risks and benefits of big data, as well as its impact on IT jobs.
This document provides an overview of big data, including:
- A brief history of big data from the 1920s to the coining of the term in 1989.
- An introduction explaining that big data requires different techniques and tools than traditional "small data" due to its larger size.
- A definition of big data as the storage and analysis of very large digital datasets that cannot be processed with traditional methods.
- The three key characteristics (3Vs) of big data: volume, velocity, and variety.
This document provides an overview of big data and how it can be used to forecast and predict outcomes. It discusses how large amounts of data are now being collected from various sources like the internet, sensors, and real-world transactions. This data is stored and processed using technologies like MapReduce, Hadoop, stream processing, and complex event processing to discover patterns, build models, and make predictions. Examples of current predictions include weather forecasts, traffic patterns, and targeted marketing recommendations. The document outlines challenges in big data like processing speed, security, and privacy, but argues that with the right techniques big data can help further human goals of understanding, explaining, and anticipating what will happen in the future.
Big data is data that is too large or complex for traditional data processing applications to analyze in a timely manner. It is characterized by high volume, velocity, and variety. Big data comes from a variety of sources, including business transactions, social media, sensors, and call center notes. It can be structured, unstructured, or semi-structured. Tools used for big data include NoSQL databases, MapReduce, HDFS, and analytics platforms. Big data analytics extracts useful insights from large, diverse data sets. It has applications in various domains like healthcare, retail, and transportation.
This document provides a syllabus for a course on big data. The course introduces students to big data concepts like characteristics of data, structured and unstructured data sources, and big data platforms and tools. Students will learn data analysis using R software, big data technologies like Hadoop and MapReduce, mining techniques for frequent patterns and clustering, and analytical frameworks and visualization tools. The goal is for students to be able to identify domains suitable for big data analytics, perform data analysis in R, use Hadoop and MapReduce, apply big data to problems, and suggest ways to use big data to increase business outcomes.
This document provides an introduction to big data. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It discusses the three V's of big data - volume, variety and velocity. Volume refers to the large scale of data. Variety means different data types. Velocity means the speed at which data is generated and processed. The document outlines topics that will be covered, including Hadoop, MapReduce, data mining techniques and graph databases. It provides examples of big data sources and challenges in capturing, analyzing and visualizing large and diverse data sets.
This document provides an overview of big data, including its definition, characteristics, sources, tools used, applications, benefits, and impact on IT. Big data is a term used to describe volumes of data, both structured and unstructured, so large that they are difficult to process using traditional database and software techniques. It is characterized by high volume, velocity, variety, and veracity. Common sources of big data include mobile devices, sensors, social media, and software/application logs. Tools like Hadoop, MongoDB, and MapReduce are used to store, process, and analyze big data. Key application areas include homeland security, healthcare, manufacturing, and financial trading. Benefits include better decision making and cost reductions.
Disclaimer:
The images, company, product and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention is to present the big picture of Big Data & Hadoop.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for distributed storage and fault tolerance, YARN for resource management, and MapReduce for parallel processing of large datasets. It provides details on the architecture of HDFS including the name node, data nodes, and clients. It also explains the MapReduce programming model and job execution involving map and reduce tasks. Finally, it states that as data volumes continue rising, Hadoop provides an affordable solution for large-scale data handling and analysis through its distributed and scalable architecture.
Big Data may well be the Next Big Thing in the IT world. The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
The document discusses big data issues and challenges. It defines big data as large volumes of structured and unstructured data that is growing exponentially due to increased data generation. Some key challenges discussed include storage and processing limitations of exabytes of data, privacy and security risks, and the need for new skills and training to manage and analyze big data. Examples are given of large data projects in various domains like science, healthcare, and commerce that are driving big data growth.
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
This document provides an overview of big data, including its definition, characteristics, sources, tools, applications, risks and benefits. It defines big data as large volumes of diverse data that can be analyzed to reveal patterns and trends. The three key characteristics are volume, velocity and variety. Examples of big data sources include social media, sensors and user data. Tools used for big data include Hadoop, MongoDB and analytics programs. Big data has many applications and benefits but also risks regarding privacy and regulation. The future of big data is strong with the market expected to grow significantly in coming years.
Introduction to Hadoop and Hadoop components (rebeccatho)
This document provides an introduction to Apache Hadoop, which is an open-source software framework for distributed storage and processing of large datasets. It discusses Hadoop's main components of MapReduce and HDFS. MapReduce is a programming model for processing large datasets in a distributed manner, while HDFS provides distributed, fault-tolerant storage. Hadoop runs on commodity computer clusters and can scale to thousands of nodes.
It is a brief overview of Big Data. It covers the history, applications and characteristics of Big Data.
It also includes some concepts on Hadoop.
It also gives statistics on big data and its impact all over the world.
This document defines big data and discusses techniques for integrating large and complex datasets. It describes big data as collections that are too large for traditional database tools to handle. It outlines the "3Vs" of big data: volume, velocity, and variety. It also discusses challenges like heterogeneous structures and dynamic, continuous changes to data sources. The document summarizes techniques for big data integration, including schema mapping, record linkage, data fusion, MapReduce, and adaptive blocking, that help address these challenges at scale.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
What is Big Data?
Big Data Laws
Why Big Data?
Industries using Big Data
Current process/SW in SCM
Challenges in SCM industry
How can Big Data solve these problems?
Migration to Big Data for the SCM industry
Big Data PPT prepared by Hritika Raj (Shivalik College of Engg.)
This document provides an overview of big data, including its definition, characteristics, sources, tools used, applications, risks and benefits. Big data is characterized by volume, velocity and variety of structured and unstructured data that is growing exponentially. It is generated from sources like mobile devices, sensors, social media and more. Tools like Hadoop, MapReduce and data analytics are used to extract value from big data. Potential applications include healthcare, security, manufacturing and more. Risks include privacy and scale, while benefits include improved decision making and new business opportunities. The big data industry is rapidly growing and transforming IT and business.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
Cheap data storage and high-performance analytics are going to change the face of the retail sector, and big data is going to play a pivotal role in this technological revolution. You can find other reports related to Big Data at http://www.marketresearchreports.com/big-data
Societal Impact of Applied Data Science on the Big Data Stack (Stealth Project)
This document discusses big data and data science projects at a Center for Data Science. It provides an overview of various research areas like healthcare informatics, intelligent systems, social computing, and big data security. It also describes technologies used for big data like machine learning, distributed databases, and data integration. Specific projects are summarized, such as predicting hospital readmissions for congestive heart failure patients and detecting malware activity based on domain names. The document outlines the steps involved in building predictive models, from data understanding to predictive modeling. Performance of initial models is discussed, with areas for improvement noted.
This document summarizes the history of big data from 1944 to 2013. It outlines key milestones such as the first use of the term "big data" in 1997, the growth of internet traffic in the late 1990s, Doug Laney coining the three V's of big data in 2001, and the focus of big data professionals shifting from IT to business functions that utilize data in 2013. The document serves to illustrate how data storage and analysis have evolved over time due to technological advances and changing needs.
What are today's challenges of big medical data, and how can we turn this immense data into potential, e.g. for precision medicine? Get insights into application examples where big medical data is incorporated, and how in-memory database technology can enable its instantaneous analysis.
How Big Data is Transforming Medical Information Insights - DIA 2014 (CREATION)
Daniel Ghinn's presentation at the DIA 8th Annual European Medical Information and Communications Conference explores the use of big data in medical information...
Using Big Data for Improved Healthcare Operations and Analytics (Perficient, Inc.)
Big Data technologies represent a major shift that is here to stay. Big Data enables the use of all types of data, including unstructured data like clinical notes and medical images, for new insights. Advanced analytics like predictive modeling and text mining will become more prevalent and intelligent with Big Data. Big Data will impact application development and require changes to data management approaches. Technologies like Hadoop, NoSQL databases, and semantic modeling will be important for healthcare Big Data.
Many believe Big Data is a brand new phenomenon. It isn't; it is part of an evolution that reaches far back in history. Here are some of the key milestones in this development.
Big data is large amounts of unstructured data that require new techniques and tools to analyze. Key drivers of big data growth are increased storage capacity, processing power, and data availability. Big data analytics can uncover hidden patterns to provide competitive advantages and better business decisions. Applications include healthcare, homeland security, finance, manufacturing, and retail. The global big data market is expected to grow significantly, with India's market projected to reach $1 billion by 2015. This growth will increase demand for data scientists and analysts to support big data solutions and technologies like Hadoop and NoSQL databases.
Cloud computing, big data, and mobile are three major trends that will change the world. Cloud computing provides scalable and elastic IT resources as services over the internet. Big data involves large amounts of both structured and unstructured data that can generate business insights when analyzed. The Hadoop ecosystem, including components like HDFS, MapReduce, Pig, and Hive, provides an architecture for distributed storage and processing of big data across commodity hardware.
This document provides an overview of Hadoop, including:
- Prerequisites for getting the most out of Hadoop include programming skills in languages like Java and Python, SQL knowledge, and basic Linux skills.
- Hadoop is a software framework for distributed processing of large datasets across computer clusters using MapReduce and HDFS.
- Core Hadoop components include HDFS for storage, MapReduce for distributed processing, and YARN for resource management.
- The Hadoop ecosystem also includes components like HBase, Pig, Hive, Mahout, Sqoop and others that provide additional functionality.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It was designed to scale up from single servers to thousands of machines, with very high fault tolerance. Hadoop features two main components: the Hadoop Distributed File System (HDFS) for storage, and MapReduce for parallel, distributed processing of large datasets. Hadoop has seen widespread adoption for applications such as log analysis, data mining, and large-scale graph processing.
Cloud computing, big data, and mobile technologies are driving major changes in the IT world. Cloud computing provides scalable computing resources over the internet. Big data involves extremely large data sets that are analyzed to reveal business insights. Hadoop is an open-source software framework that allows distributed processing of big data across commodity hardware. It includes tools like HDFS for storage and MapReduce for distributed computing. The Hadoop ecosystem also includes additional tools for tasks like data integration, analytics, workflow management, and more. These emerging technologies are changing how businesses use and analyze data.
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook (Amr Awadallah)
Hadoop was developed to solve problems with data warehousing systems at Yahoo and Facebook that were limited in processing large amounts of raw data in real-time. Hadoop uses HDFS for scalable storage and MapReduce for distributed processing. It allows for agile access to raw data at scale for ad-hoc queries, data mining and analytics without being constrained by traditional database schemas. Hadoop has been widely adopted for large-scale data processing and analytics across many companies.
Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience on the primary tenets of Big Data and the Hadoop stack. It also includes a walk-through of Hadoop and parts of the Hadoop stack, i.e. Pig, Hive, and HBase.
The document provides statistics on the amount of data generated and shared on various digital platforms each day: over 1 terabyte of data from NYSE, 144.8 billion emails sent, 340 million tweets, 684,000 pieces of content shared on Facebook, 72 hours of new video uploaded to YouTube per minute, and more. It outlines the massive scale of data creation and sharing occurring across social media, financial, and other digital platforms.
This document provides an overview of big data and Hadoop. It discusses what big data is, its types including structured, semi-structured and unstructured data. Some key sources of big data are also outlined. Hadoop is presented as a solution for managing big data through its core components like HDFS for storage and MapReduce for processing. The Hadoop ecosystem including other related tools like Hive, Pig, Spark and YARN is also summarized. Career opportunities in working with big data are listed in the end.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
Introduction to Apache Hadoop. Covers Hadoop v1.0 with HDFS and MapReduce through v2.0. Includes Impala, YARN, Tez and the entire arsenal of projects for Apache Hadoop.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Overview of big data & hadoop version 1 - Tony Nguyen (Thanh Nguyen)
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and existing data warehouses can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
Overview of Big data, Hadoop and Microsoft BI - version1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
This document discusses MySQL and Hadoop. It provides an overview of Hadoop, Cloudera Distribution of Hadoop (CDH), MapReduce, Hive, Impala, and how MySQL can interact with Hadoop using Sqoop. Key use cases for Hadoop include recommendation engines, log processing, and machine learning. The document also compares MySQL and Hadoop in terms of data capacity, query languages, and support.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for massive data storage, enormous processing power, and the ability to handle large numbers of concurrent tasks across clusters of commodity hardware. The framework includes Hadoop Distributed File System (HDFS) for reliable data storage and MapReduce for parallel processing of large datasets. An ecosystem of related projects like Pig, Hive, HBase, Sqoop and Flume extend the functionality of Hadoop.
2. What is BIG DATA?
http://www.forbes.com/sites/oreillymedia/2012/01/19/volume-velocity-variety-what-you-need-to-know-about-big-data
4. Data Scalability Problems
Search Engine: 10KB/doc * 20B docs = 200TB; reindex every 30 days: 200TB / 30 days = 6TB/day
Log Processing / Data Warehousing: 0.5KB/event * 3B pageview events/day = 1.5TB/day; 100M users * 5 events * 100 feeds/event * 0.1KB/feed = 5TB/day
Multipliers: 3 copies of data, 3-10 passes over the raw data
Processing Speed (Single Machine): 2-20MB/second * 100K seconds/day = 0.2-2TB/day
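These figures can be sanity-checked with a few lines of arithmetic. The following small Python sketch is not part of the original deck; it simply recomputes the slide's numbers, assuming decimal units (1 TB = 10^12 bytes) and noting where the slide rounds.

```python
# Back-of-the-envelope check of the slide's scale figures (decimal units).
KB, MB, TB = 1e3, 1e6, 1e12

# Search engine: 10 KB/doc * 20 billion docs, reindexed every 30 days.
corpus = 10 * KB * 20e9
print(corpus / TB, "TB total;", round(corpus / TB / 30, 1), "TB/day to reindex")  # 200.0 TB; ~6.7 (slide rounds to 6)

# Log processing: 0.5 KB/event * 3 billion pageview events per day.
print(0.5 * KB * 3e9 / TB, "TB/day of pageview logs")            # 1.5 TB/day

# Feeds: 100M users * 5 events * 100 feeds/event * 0.1 KB/feed.
print(100e6 * 5 * 100 * 0.1 * KB / TB, "TB/day of feed data")    # 5.0 TB/day

# Single-machine throughput: 2-20 MB/s over ~100K seconds in a day.
for rate_mb_s in (2, 20):
    print(rate_mb_s * MB * 100e3 / TB, "TB/day at", rate_mb_s, "MB/s")  # 0.2 and 2.0 TB/day
```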
5. What's the social sentiment for my brand or products?
How do I better predict future outcomes?
How do I optimize my fleet based on weather and traffic patterns?
9. Hadoop
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
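As a concrete illustration of this Map/Reduce paradigm, here is a minimal word-count job written in the Hadoop Streaming style, where the map and reduce steps are plain scripts reading stdin and writing stdout. This is a sketch rather than part of the deck; the file names and the exact submission command depend on the cluster setup.

```python
#!/usr/bin/env python3
# mapper.py - the Map step: emit a (word, 1) pair for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - the Reduce step: Hadoop delivers mapper output sorted by key,
# so consecutive lines with the same word can be summed into one total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally the pair can be smoke-tested with a pipeline such as cat input.txt | python3 mapper.py | sort | python3 reducer.py; on a cluster, Hadoop runs many mapper instances in parallel against HDFS blocks and re-executes any task whose node fails.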
10. Hadoop History
Jan 2006: Doug Cutting joins Yahoo
Feb 2006: Hadoop splits out of Nutch and Yahoo starts using it
Dec 2006: Yahoo creating 100-node Webmap with Hadoop
Apr 2007: Yahoo on 1000-node cluster
Dec 2007: Yahoo creating 1000-node Webmap with Hadoop
Jan 2008: Hadoop made a top-level Apache project
Sep 2008: Hive added to Hadoop as a contrib project
11. Big Data Economics
Commodity hardware compatibility
Reduction in storage cost
Open source system
The Web economy
14. Why Column Store?
Can be significantly faster than row stores for some applications: fetch only the required columns for a query, better cache effects, and better compression (similar attribute values within a column).
But can be slower for other applications: OLTP with many row inserts, etc.
Long war between the column store and row store camps :-)
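The trade-off can be made concrete with a toy example. The sketch below is illustrative only (plain in-memory Python, not a real storage engine): an analytic scan of one column touches only that column in the columnar layout, while inserting a row is a single append for the row layout but one append per column for the columnar one.

```python
# Toy illustration of row store vs column store layouts.
rows = [
    {"id": 1, "country": "US", "amount": 120.0},
    {"id": 2, "country": "IN", "amount": 80.0},
    {"id": 3, "country": "US", "amount": 60.5},
]

# Column store: one list per attribute; values of a column sit together,
# which helps scans, caching, and compression of similar values.
columns = {
    "id": [r["id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Analytic query "SUM(amount)": the column store reads only the 'amount' column...
print(sum(columns["amount"]))
# ...while the row store walks every full row just to pick out one field.
print(sum(r["amount"] for r in rows))

# OLTP-style insert: one append for the row store...
new = {"id": 4, "country": "DE", "amount": 42.0}
rows.append(new)
# ...but one append per column for the column store, all of which must stay in sync.
for key, value in new.items():
    columns[key].append(value)
```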
21. Hadoop Ecosystem
HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The Map function divides a query into multiple parts and processes data at the node level. The Reduce function aggregates the results of the Map function to determine the answer to the query.
Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
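To show how these layers fit together, here is a hedged sketch of what a Hive-style query such as SELECT country, SUM(amount) FROM sales GROUP BY country boils down to: a map phase emitting (key, value) pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group. The table name and data are invented for illustration; Hive generates the equivalent distributed job automatically, whereas this runs in plain Python.

```python
from itertools import groupby
from operator import itemgetter

# Input records, as if read from files stored in HDFS.
sales = [("US", 120.0), ("IN", 80.0), ("US", 60.5), ("DE", 42.0), ("IN", 10.0)]

# Map: emit (country, amount) key/value pairs.
mapped = [(country, amount) for country, amount in sales]

# Shuffle/sort: the framework brings all values for the same key together.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key -- here, SUM(amount) per country.
for country, group in groupby(mapped, key=itemgetter(0)):
    print(country, sum(amount for _, amount in group))
```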
22. Hadoop Ecosystem
HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.
Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig and Hive, and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.
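To illustrate the kind of low-latency lookups, inserts and deletes HBase offers, here is a hedged Python sketch using the happybase Thrift client. The host, table name and column family are assumptions for illustration (none of them come from the deck), and the cluster's HBase Thrift server must be running.

```python
import happybase

# Connect to the HBase Thrift server (hostname is an assumption for this sketch).
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")  # assumes a table with column family 'info' exists

# Insert/update: write a few cells for one row key.
table.put(b"user-42", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Low-latency lookup by row key.
print(table.row(b"user-42"))

# Delete the row again.
table.delete(b"user-42")

connection.close()
```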
23. Hadoop Ecosystem
Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.
Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.