© Copyright 2014 EMC Corporation. All rights reserved.
EMC Hadoop Starter Kit
ViPR Edition
EMC Open Innovation Lab
The Digital Universe
Less than 1% of the World's Data is Analyzed
By 2020, the Internet will connect 7.6B people and 200B things
(sensors, machines, cars, appliances…)
Data Volumes
2000: 2 Exabytes a year
2011: 2 Exabytes a day
Location & Types Of Big (& Fast!) Data
[Figure: Structured Data and Unstructured Data across Enterprise, Partner, and Public locations. Labels: Forecast Data, Location Data, Credit Data, Shipping Data, Telemetry Data, Social, Video Data]
Hadoop Challenges
Depends on HDFS for data repository
– Must make legacy data accessible through HDFS
Hadoop HDFS inefficiencies:
– 3 copies for protection
– No advanced data efficiency: de-duplication, thin provisioning
– Security
Integration with robust traditional data center products: compute virtualization, enterprise storage
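
The "3 copies for protection" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 100 TB data set are assumptions made for the example, and 3 is the stock HDFS default replication factor rather than a figure measured on any particular cluster.

```python
def raw_capacity_tb(usable_tb: float, replication: int = 3) -> float:
    """Raw disk capacity needed to hold `usable_tb` of data at a replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return usable_tb * replication


def storage_efficiency(replication: int = 3) -> float:
    """Fraction of raw capacity that ends up holding unique data."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return 1.0 / replication


if __name__ == "__main__":
    # Hypothetical 100 TB data set stored with three full copies.
    print(raw_capacity_tb(100))
    print(round(storage_efficiency(), 3))
```

With three full copies, only about a third of the raw capacity holds unique data, which is the inefficiency the slide refers to.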
Hadoop Storage Options
Hadoop HDFS
• Leverage Hadoop distro HDFS data services
• Compute and data converged on a cluster of servers
Storage Array
• NameNode and DataNode services from the storage array (e.g. EMC Isilon)
Storage OS
• NameNode and DataNode services from the storage OS (e.g. EMC ViPR)
ViPR HDFS
HDFS is becoming the de facto file system for distributed applications
ViPR is a great platform for HDFS
– Addresses limitations of off-the-shelf HDFS
– Brings HDFS to existing storage hardware
– Enables HDFS/object/file scenarios
– Flexible software model allows colocation
Support Mixed Workloads
Object, File and HDFS operations on the same data
[Figure: virtual array spanning Isilon, 3rd Party, and VNX 5500 storage]
ViPR Data Services offer three bucket options:
– Object
– HDFS
– ObjectandHDFS
ObjectandHDFS provides the user with access through either S3 or HDFS
– Full compatibility with existing object-based APIs
• Amazon S3, OpenStack Swift, Atmos
[Figure: bucket types: Object, HDFS, Object & HDFS]
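
One way to picture the three bucket options is as a lookup from bucket type to the access protocols it accepts. This is a hypothetical model, not ViPR's API: the names `BucketType`, `ACCESS`, and `can_access` are invented for illustration, and extending Swift and Atmos access to ObjectandHDFS buckets is an assumption (the slide names only S3 and HDFS for that bucket type).

```python
from enum import Enum


class BucketType(Enum):
    OBJECT = "Object"
    HDFS = "HDFS"
    OBJECT_AND_HDFS = "ObjectandHDFS"


# Object APIs listed on the slide.
OBJECT_APIS = frozenset({"S3", "Swift", "Atmos"})
HDFS_API = frozenset({"HDFS"})

# Assumption: an ObjectandHDFS bucket accepts every object API plus HDFS.
ACCESS = {
    BucketType.OBJECT: OBJECT_APIS,
    BucketType.HDFS: HDFS_API,
    BucketType.OBJECT_AND_HDFS: OBJECT_APIS | HDFS_API,
}


def can_access(bucket: BucketType, protocol: str) -> bool:
    """Return True if the bucket type accepts requests over `protocol`."""
    return protocol in ACCESS[bucket]


if __name__ == "__main__":
    for bucket in BucketType:
        print(bucket.value, sorted(ACCESS[bucket]))
```

The point of the mixed bucket type is visible in the last entry: data written through an object API can be read through HDFS without being copied into a separate cluster.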
Simple, Easy, Cost Effective
EMC Starter Kit for Hadoop – ViPR Edition
Deployment guides for major Hadoop distributions:
– Pivotal, Cloudera, and Hortonworks
Four-step deployment:
– Deploy preferred Hadoop distribution
– Deploy EMC ViPR with Object and HDFS data services
– Configure Hadoop distribution to use ViPR HDFS target
– Validation process
• Load data file via S3 interface
• Test MapReduce job
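
The validation step ends with a test MapReduce job. The slides do not say which job is run; word count is assumed here because it is the customary smoke test. The sketch below runs the map, shuffle, and reduce phases locally in plain Python to show what such a job computes. It does not connect to Hadoop, ViPR, or S3, and every function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    # Stand-in for the data file loaded in the previous validation step.
    sample = ["big data", "Big fast data"]
    print(word_count(sample))
```

Running the same counts over a small known file on the cluster and comparing them with a local result like this is a quick way to confirm that the job read the data it was supposed to read.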

Editor's Notes

  • #3: We are experiencing a perfect storm of technology and analytic innovation. In the past, analysis started with a hypothesis and a corresponding set of data with specific elements that needed to be collected. The collected data was scrubbed and stored in neat columns and rows. Analysis depended on precise data collection. Today, with the reduced cost of storing and processing data and the amount of data we can collect, analysis is based on discovering correlations.
  • #4: Today data is being collected and stored, and that data is available for analysis. Analytics processing today does not depend on neat data because the size of the data sets minimizes the impact of anomalies. New analytic systems such as Hadoop have been created and are optimized for this type of analysis. As an IT provider, what are the challenges associated with deploying Hadoop?
  • #7: The Hadoop Distributed File System (HDFS) is becoming increasingly popular as a file system layer for distributed applications beyond Hadoop. Scenarios: high aggregate throughput access to data, e.g. MapReduce; in some cases, low-latency access. Concerns: scale, durability, cost, management. HDFS is becoming a de facto file system for distributed applications, but it has some challenges and limitations that have slowed adoption within enterprises. Many enterprises don't have Big Data analytics expertise. Also, enterprises have experimented in the lab, which means they have a dedicated Hadoop cluster and need a way to move or copy data into the cluster for analytics. Data that has value can be hard to identify or to move from where it primarily resides. ViPR offers a great platform for HDFS. By delivering HDFS as a data service rather than as a file system on dedicated infrastructure, it brings the capabilities of HDFS to the data where it resides. Much like the object data service, it enables hybrid data types such as HDFS-on-object or HDFS-on-file. The HDFS data service is built in software, so it allows for colocation.
  • #8: In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets that logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.