The document provides an overview of Hadoop at Nokia including:
- The technical setup of Hadoop clusters in Berlin and London including hardware specifications and architecture
- Applications of Hadoop at Nokia for reporting, data exploration, and productizing insights from data
- Infrastructure management using Puppet to configure Hadoop clusters
- Examples of data analyses including visualizing Ikea search patterns in Berlin and London
1. HADOOP AT NOKIA
JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011
BERLIN
Wednesday, April 20, 2011
Two parts:
* technical setup
* applications
before starting, a question for the audience: what are your Hadoop experience levels, from none to some to lots? and what about cluster management?
3. GLOBAL ARCHITECTURE
Scribe for logging
agents on local machines forward to downstream collector nodes
collectors forward on to more downstream nodes or to final destination(s) like HDFS
buffering at each stage to deal with network outages
must consider the storage available on ALL nodes where Scribe is running to determine your risk of potential data loss, since Scribe buffers to local disk
global Scribe not deployed yet, but the London DC is done
do it all over again? probably use Flume, but Scribe was being researched before Flume existed
* much more flexible and easily extensible
* more reliability guarantees, and tunable (data loss acceptable or not)
* can also do the syslog wire protocol, which is nice for compatibility's sake
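The data-loss point above is easy to put numbers on: a Scribe node can only buffer for as long as its local disk holds out. A back-of-the-envelope sketch in Python (the disk size and ingest rate are illustrative assumptions, not our real figures):

```python
def buffer_window_hours(free_disk_gb, ingest_mb_per_sec):
    """How long a Scribe node can buffer to local disk while downstream is unreachable."""
    free_mb = free_disk_gb * 1024
    return free_mb / ingest_mb_per_sec / 3600.0

# Illustrative: a collector with 500GB free, absorbing 4MB/sec of logs.
hours = buffer_window_hours(free_disk_gb=500, ingest_mb_per_sec=4)
print(round(hours, 1))  # 35.6 hours of outage tolerance
```

The same arithmetic applies at every hop, which is why you have to check the storage on ALL nodes running Scribe.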
4. DATA NODE HARDWARE
DC       LONDON        BERLIN
cores    12x (w/ HT)   4x 2.00 GHz (w/ HT)
RAM      48GB          16GB
disks    12x 2TB       4x 1TB
storage  24TB          4TB
LAN      1Gb           2x 1Gb (bonded)
http://www.flickr.com/photos/torkildr/3462607995/in/photostream/
BERLIN
HP DL160 G6
1x Quad-core Intel Xeon E5504 @ 2.00 GHz (4-cores total)
16GB DDR3 RAM
4x 1TB 7200 RPM SATA
2x 1Gb LAN
iLO Lights-Out 100 Advanced
5. MASTER NODE HARDWARE
DC       LONDON        BERLIN
cores    12x (w/ HT)   8x 2.00 GHz (w/ HT)
RAM      48GB          32GB
disks    12x 2TB       4x 1TB
storage  24TB          4TB
LAN      10Gb          4x 1Gb (bonded, DRBD/Heartbeat)
BERLIN
HP DL160 G6
2x Quad-core Intel Xeon E5504 @ 2.00 GHz (8-cores total)
32GB DDR3 RAM
4x 1TB 7200 RPM SATA
4x 1Gb Ethernet (2x LAN, 2x DRBD/Heartbeat)
iLO Lights-Out 100 Advanced (hadoop-master[01-02]-ilo.devbln)
6. MEANING?
Size
Berlin: 2 master nodes, 13 data nodes, ~17TB HDFS
London: large enough to handle a year's worth of activity log data, with plans for rapid expansion
Scribe
250,000 1KB msg/sec (244MB/sec)
14.3GB/hr, 343GB/day
Berlin: 1 rack, 2 switches for...
London: it's a secret!
7. PHYSICAL OR CLOUD?
Physical
* capital cost: 1 rack w/ 2x switches, 15x HP DL160 servers: ~€20,000
* annual operating costs:
  power and cooling: €5,265 @ €0.24/kWh
  rent: €3,600
  hardware support contract: €2,000 (disks replaced on warranty)
  total: €10,865
Cloud (AWS)
* Elastic MR, 15 extra large nodes, 10% utilized: $1,560
* S3, 5TB: $7,800
* total: $9,360 (~€6,835)
http://www.flickr.com/photos/dumbledad/4745475799/
Question: How many run own clusters of physical hardware vs AWS or virtualized?
the actual decision is completely dependent on many factors, including maybe an existing DC, data set sizes, etc.
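The totals above check out; a quick sketch of the arithmetic (all figures are the slide's 2011 estimates, EUR for physical and USD for AWS):

```python
# Physical hardware: annual operating costs (EUR).
power_cooling = 5265
rent = 3600
support_contract = 2000
physical_annual = power_cooling + rent + support_contract
print(physical_annual)  # 10865

# Cloud (AWS): annual costs (USD).
elastic_mr = 1560
s3_storage = 7800
aws_annual = elastic_mr + s3_storage
print(aws_annual)  # 9360
```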
10. GRAPHING AND MONITORING
Ganglia for graphing/trending
native support in Hadoop to push metrics to Ganglia
map or reduce tasks running, slots open, HDFS I/O, etc.
excellent for system graphing like CPU, memory, etc.
scales out horizontally
no configuration needed: just push metrics from nodes to collectors and they will graph it
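Hadoop's native Ganglia support is configured in hadoop-metrics.properties by pointing each metrics context at the Ganglia collectors. A sketch (the collector hostname is an assumption; newer versions need GangliaContext31 for Ganglia 3.1+):

```properties
# hadoop-metrics.properties: push DFS, MapReduce and JVM metrics to Ganglia.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=ganglia-collector.devbln:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=ganglia-collector.devbln:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=ganglia-collector.devbln:8649
```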
Nagios for monitoring
built into our Puppet infrastructure
machines come up and are automatically added to Nagios with basic system checks
scriptable to easily check other things like JMX
basically always have up: jobtracker, Oozie, Ganglia
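Scripting extra checks is straightforward because the Nagios plugin contract is just a status line plus an exit code (0 = OK, 1 = WARNING, 2 = CRITICAL). A minimal sketch; the metric name and thresholds are invented, and a real check would read the value over JMX:

```python
def check(value, warn, crit, label="heap_used_pct"):
    """Nagios-style check: print one status line, return the plugin exit code."""
    if value >= crit:
        print("CRITICAL - %s=%s" % (label, value))
        return 2
    if value >= warn:
        print("WARNING - %s=%s" % (label, value))
        return 1
    print("OK - %s=%s" % (label, value))
    return 0

# A real plugin would fetch the value (e.g. over JMX) and sys.exit(check(...)).
status = check(value=72.5, warn=80, crit=95)  # prints "OK - heap_used_pct=72.5"
```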
13. GANGLIA
in the detail view you can actually see the phases of a MapReduce job
not totally accurate here since there were multiple jobs running at the same time
15. INFRASTRUCTURE MANAGEMENT
we use Puppet
Question: Who here has used or knows of Puppet/Chef/etc?
pros
* used throughout the rest of our infrastructure
* all configuration of Hadoop and the machines is in source control (Subversion)
cons
* no push from a central server
* can only pull from each node (pssh is your friend: poke Puppet on all nodes)
* that's it, Puppet rules
16. PUPPET FOR HADOOP
more or less there are 3 steps in the Puppet chain
17. PUPPET FOR HADOOP
package {
  'hadoop':    ensure => '0.20.2+320-14';
  'rsync':     ensure => installed;
  'lzo':       ensure => installed;
  'lzo-devel': ensure => installed;
}
service {
  'iptables':
    ensure => stopped,
    enable => false;
}
# Hadoop account
include account::users::la::hadoop
file {
  '/home/hadoop/.ssh/id_rsa':
    mode   => '0600',
    source => 'puppet:///modules/hadoop/home/hadoop/.ssh/id_rsa';
}
example Puppet manifest
note: we rolled our own RPMs from the Cloudera packages since we didn't like where Cloudera put stuff on the servers and wanted a bit more control
19. APPLICATIONS
http://www.flickr.com/photos/thomaspurves/1039363039/sizes/o/in/photostream/
that wraps up the setup stuff, any questions on that?
20. REPORTING
operational - access logs, throughput, general usage, dashboards
business reporting - what are all of the products doing, how do they compare to other
months
ad-hoc - random business queries
almost all of this goes through Pig
there are several pipelines that use Oozie to tie the parts together
lots of parsing and decoding in Java MR job, then Pig for the heavy lifting
mostly goes into an RDBMS using Sqoop, for display and querying in other tools
currently using Tableau to do live dashboards
21. IKEA!
other than reporting, we also occasionally do some data exploration, which can be quite fun
any guesses what this is a plot of?
geo-searches for Ikea!
22. Prenzl Berg Yuppies
Ikea Spandau
Ikea Schoenefeld
Ikea Tempelhof
Ikea geo-searches bounded to Berlin
can we make any assumptions about what the actual locations are?
kind of, but not much data here
clearly there is a Tempelhof cluster but the others are not very evident
certainly shows the relative popularity of all the locations
Ikea Lichtenberg was not open yet during this time frame
23. Ikea Edmonton
Ikea Wembley
Ikea Lakeside
Ikea Croydon
Ikea geo-searches bounded to London
can we make any assumptions about what the actual locations are?
turns out we can!
using a clustering algorithm like K-Means (maybe from Mahout) we probably could guess
> this is considering search location, what about time?
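The clustering idea can be sketched with a toy K-Means: treat the centroids of the search coordinates as guesses at the store locations. A minimal pure-Python version (Mahout would run this at scale; the sample points below are invented):

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Plain K-Means over (lat, lon) pairs; centroids approximate hot spots."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids

# Invented example: two tight groups of "searches" around two fake store sites.
points = ([(52.500 + 0.001 * i, 13.20) for i in range(10)] +
          [(52.470, 13.380 + 0.001 * i) for i in range(10)])
print(sorted(kmeans(points, k=2)))
```

With well-separated groups of searches, the centroids land on the group means, i.e. roughly where the stores are.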
24. Berlin
distribution of searches over days of the week and hours of the day
certainly can make some comments about the hours that Berliners are awake
can we make assumptions about average opening hours?
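A distribution like this falls out of a simple count over the search timestamps, keyed by weekday and hour. A sketch with Python's standard library (the sample timestamps are invented; the real job would aggregate millions of records from HDFS):

```python
from collections import Counter
from datetime import datetime

# Invented sample of search timestamps.
timestamps = [
    "2011-01-03 10:15:00",  # Monday
    "2011-01-03 11:40:00",  # Monday
    "2011-01-08 14:05:00",  # Saturday
    "2011-01-08 14:55:00",  # Saturday
    "2011-01-08 19:30:00",  # Saturday
]

by_weekday_hour = Counter()
for ts in timestamps:
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # Key on (weekday, hour): weekday 0 = Monday.
    by_weekday_hour[(dt.weekday(), dt.hour)] += 1

for (weekday, hour), count in sorted(by_weekday_hour.items()):
    print(weekday, hour, count)
```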
25. Berlin
upwards trend a couple hours before opening
can also clearly make some statements about the best time to visit Ikea in Berlin - Sat night!
BERLIN
* Mon-Fri 10am-9pm
* Saturday 10am-10pm
27. London
LONDON
* Mon-Fri 10am-10pm
* Saturday 9am-10pm
* Sunday 11am-5pm
> potential revenue stream?
> what to do with this data or data like this?
28. PRODUCTIZING
taking data and ideas and turning them into something useful: features that mean something
often the next step after data mining and exploration
either static data shipped to devices or web products, or live data that is constantly fed back
to web products/web services
29. BERLIN
another example of something that can be productized
Berlin
* traffic sensors
* map tiles
31. BERLIN LA
the Starbucks index comes from a POI data set, not from the heatmaps you just saw
32. JOIN US!
Nokia is hiring in Berlin!
analytics engineers, smart data folks
software engineers
operations
josh.devins@nokia.com
www.nokia.com/careers
33. THANKS!
JOSH DEVINS www.joshdevins.net @joshdevins