Are you deploying Hadoop and want enterprise infrastructure manageability, reliability, and availability? The new EMC Hadoop Starter Kit shows you how to do this without building HDFS data silos.
EMC Hadoop Starter Kit - ViPR Edition
1. © Copyright 2014 EMC Corporation. All rights reserved.
EMC Hadoop Starter Kit
ViPR Edition
EMC Open Innovation Lab
2.
The Digital Universe
Less than 1% of the World's Data is Analyzed
By 2020, the Internet will connect 7.6B people and 200B things (sensors, machines, cars, appliances…)
Data Volumes
2000: 2 Exabytes a year
2011: 2 Exabytes a day
3.
Location & Types Of Big (& Fast!) Data
[Diagram: Structured Data and Unstructured Data across Enterprise, Partner, and Public sources. Labeled examples: Forecast Data, Location Data, Credit Data, Shipping Data, Social and Video Data, Telemetry Data.]
4.
Hadoop Challenges
Depends on HDFS for data repository
- Must make legacy data accessible through HDFS
Hadoop HDFS inefficiencies:
- 3 copies for protection
- No advanced data efficiency: de-duplication, thin provisioning
- Security
Integration with robust traditional data center products: compute virtualization, enterprise storage
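The cost of "3 copies for protection" can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes plain block replication with no compression or de-duplication.

```python
def raw_capacity_tb(usable_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity needed to hold `usable_tb` of unique data when
    every block is stored `replication_factor` times."""
    if usable_tb < 0 or replication_factor < 1:
        raise ValueError("usable_tb must be >= 0 and replication_factor >= 1")
    return usable_tb * replication_factor


def storage_efficiency(replication_factor: int = 3) -> float:
    """Fraction of raw capacity that ends up holding unique data."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return 1 / replication_factor


if __name__ == "__main__":
    # 100 TB of unique data at three copies needs 300 TB of raw disk.
    print(raw_capacity_tb(100))
    # Only about a third of the raw capacity stores unique data.
    print(round(storage_efficiency(), 3))
```

Under these assumptions, two thirds of the raw capacity goes to protection copies, which is the inefficiency the slide points at.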
5.
Hadoop Storage Options
Hadoop HDFS
• Leverage Hadoop distro HDFS data services
• Compute and data converged on cluster of servers
Storage Array
• Name node and Data node services from storage array (e.g. EMC Isilon)
Storage OS
• Name node and Data node services from storage OS (e.g. EMC ViPR)
6.
ViPR HDFS
HDFS is becoming the de facto file system for distributed applications
ViPR is a great platform for HDFS
- Addresses limitations of off-the-shelf HDFS
- Brings HDFS to existing storage hardware
- Enables HDFS/object/file scenarios
- Flexible software model allows colocation
7.
Support Mixed Workloads
Object, File and HDFS operations on the same data
[Diagram: a virtual array spanning Isilon, VNX 5500, and 3rd-party storage, hosting Object, HDFS, and Object & HDFS buckets]
ViPR Data Services offer three bucket options:
- Object
- HDFS
- ObjectandHDFS
ObjectandHDFS provides user with access to either S3 or HDFS
- Full compatibility with existing object based APIs
  • Amazon S3, OpenStack Swift, Atmos
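The three bucket options can be modeled as a small lookup from bucket type to the access paths it permits. This is a minimal sketch, not ViPR's API: the enum, the "object" and "hdfs" labels, and both helper functions are hypothetical. It also assumes that an Object bucket is reachable only through object APIs and an HDFS bucket only through HDFS, which the slide implies but does not state.

```python
from enum import Enum


class BucketType(Enum):
    OBJECT = "Object"
    HDFS = "HDFS"
    OBJECT_AND_HDFS = "ObjectandHDFS"


# Access paths each bucket option is assumed to allow.
# "object" stands for the object APIs (S3, Swift, Atmos); "hdfs" for HDFS.
ALLOWED_ACCESS = {
    BucketType.OBJECT: frozenset({"object"}),
    BucketType.HDFS: frozenset({"hdfs"}),
    BucketType.OBJECT_AND_HDFS: frozenset({"object", "hdfs"}),
}


def can_access(bucket_type: BucketType, interface: str) -> bool:
    """True if the bucket type allows the given access path."""
    return interface.lower() in ALLOWED_ACCESS[bucket_type]


def buckets_for(interfaces) -> list:
    """Bucket types that allow every requested access path."""
    needed = {name.lower() for name in interfaces}
    return [b for b in BucketType if needed <= ALLOWED_ACCESS[b]]


if __name__ == "__main__":
    # Data loaded over S3 and then read by a Hadoop job needs both paths.
    print([b.value for b in buckets_for(["object", "hdfs"])])
```

In this model, a workload that writes through S3 and analyzes through HDFS can only use the ObjectandHDFS option, which matches the mixed-workload scenario on the slide.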
8.
Simple, Easy, Cost Effective
EMC Starter Kit for Hadoop - ViPR Edition
Deployment guides for major Hadoop distributions:
- Pivotal, Cloudera, and Hortonworks
Four-step deployment:
- Deploy preferred Hadoop distribution
- Deploy EMC ViPR with Object and HDFS data services
- Configure Hadoop distribution to use ViPR HDFS target
- Validation process
  • Load data file via S3 interface
  • Test MapReduce job
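The slides do not say which MapReduce job the validation step runs. A word count is a common smoke test, so the sketch below assumes one and walks through its map, shuffle, and reduce phases in plain Python. It runs locally and does not touch Hadoop or ViPR; all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for each whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data", "Big fast data"]
    print(word_count(sample))
```

On a real cluster the input would be the file loaded through the S3 interface and read back through the HDFS target, which is what makes the job a check of both access paths.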
Editor's Notes
#3: We are experiencing a perfect storm of technology and analytic innovation. In the past, analysis started with a hypothesis and a corresponding set of data with specific elements that needed to be collected. The collected data was scrubbed and stored in neat columns and rows, and analysis depended on precise data collection. Today, with the reduction in the cost of storing and computing data, along with the amount of data we can collect, analysis is based on discovering correlation.
#4: Today data is being collected and stored, and that data is available for analysis. Analytics processing no longer depends on neat data, because the size of the data sets minimizes the impact of anomalies. New analytic systems such as Hadoop have been created and optimized for this type of analysis. As an IT provider, what are the challenges associated with deploying Hadoop?
#7: The Hadoop Distributed File System (HDFS) is becoming increasingly popular as a file system layer for distributed applications, beyond Hadoop. Scenarios: high aggregate throughput access to data (e.g. MapReduce) and, in some cases, low latency access. Concerns: scale, durability, cost, and management. HDFS is becoming a de facto file system for distributed applications, but it has challenges and limitations that have slowed adoption within enterprises. Many enterprises don't have Big Data analytics expertise. Also, enterprises have experimented in the lab, which means they have a dedicated Hadoop cluster and need a way to move or copy data into the cluster for analytics. Data that has value can be hard to identify or to move from where it primarily resides. ViPR offers a great platform for HDFS. By delivering HDFS as a data service rather than as a file system on dedicated infrastructure, it brings the capabilities of HDFS to the data where it resides. Much like the object data service, it enables hybrid data types such as HDFS-on-object or HDFS-on-file. The HDFS data service is built in software, so it allows for colocation.
#8: In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets which logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.