
BIG DATA
The rise of the data scientist

http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
Tuesday, June 8, 2010

Holidaycheck
Travel platform: review + book

12+ countries (.de ... .cn)

30% growth / year, profitable

Almost 1.5 mio hotel reviews

1.6 mio + pics


Data @ HC
internet-driven company

traditional: MVC / 3-Tier / RDBMS / caching

50+ Apache instances

15 Gb operational data

12 Gb logs / day

5 searches / second

My scientist friend: “That’s neat, but it’s not data science.”


The I/O Bottleneck
“The problem is simple: Memory, Disk size and CPU and even
network performance continue to grow much faster than disk I/O
performance.”
2004 to 2009

CPU: still following Moore's Law (transistor count x2 every 18 months)

Memory Bandwidth (Intel): 9.3x

Disk Density (SATA): 8x

Disk I/O: 0.8x

Network speed: routers can easily saturate the fastest hard drives

http://blogs.cisco.com/datacenter/comments/networking_delivering_more_by_exceeding_the_law_of_moore/
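
A rough back-of-the-envelope reading of the figures above, as a sketch only. It assumes that "Disk Density: 8x" means capacity per drive and "Disk I/O: 0.8x" means sustained throughput; the slide does not say this explicitly. The function name is made up for illustration.

```python
# Growth factors quoted on the slide for 2004 to 2009.
DISK_DENSITY_GROWTH = 8.0  # assumed to mean capacity per drive
DISK_IO_GROWTH = 0.8       # assumed to mean sustained throughput


def full_scan_slowdown(capacity_growth: float, io_growth: float) -> float:
    """How much longer it takes to read a whole drive once, all else equal."""
    return capacity_growth / io_growth


if __name__ == "__main__":
    factor = full_scan_slowdown(DISK_DENSITY_GROWTH, DISK_IO_GROWTH)
    print(f"Reading a full drive takes roughly {factor:.0f}x longer")
```

Under these assumptions, drives hold far more data while getting no faster to read, which is the bottleneck the slide describes.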


I/O Repercussions

Turn to memcache

Try out SSD

Try out asynchronous writes (e.g. message queues)

Try to solve/hack the I/O problem: Sharding, in-memory DB

Our problems seem big, but are they really?
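
To illustrate the "asynchronous writes" item above, here is a minimal sketch using only the Python standard library. The in-process queue and the list used as a "sink" are stand-ins chosen for illustration; a real setup would more likely use an external message queue and a database, and nothing here describes the actual HC implementation.

```python
import queue
import threading


def start_writer(sink: list):
    """Start a background thread that drains a queue into `sink`."""
    q: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = q.get()
            if item is None:       # sentinel: stop the worker
                break
            sink.append(item)      # stand-in for a slow disk or DB write

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t


def demo() -> list:
    sink: list = []
    q, t = start_writer(sink)
    for i in range(5):
        q.put({"search_id": i})    # the request path returns immediately
    q.put(None)                    # signal shutdown
    t.join()                       # wait until everything is written
    return sink


if __name__ == "__main__":
    print(demo())
```

The request handler only pays the cost of `q.put`, while the slow write happens off the critical path. The trade-off is that queued items can be lost if the process dies before they are written.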


So what is Big Data anyway?
“The term Big data from software engineering and computer science
describes datasets that grow so large that they become awkward to work
with using on-hand database management tools”

kilo to mega to giga to tera to peta to exa to zetta to yotta


NoSQL = Not Only SQL

Trade-Offs, e.g. transactions, data loss

e.g. Document Stores (MongoDB)

e.g. Key-Value Stores (MemcacheDB)

e.g. Graph Databases (Neo4j)

Map/Reduce algorithm
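
For readers who have not seen the Map/Reduce pattern, here is a single-process sketch of its three phases (map, shuffle, reduce) as a word count. The function names and the sample review snippets are invented for illustration. Real Map/Reduce frameworks distribute these phases across many machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(doc: str):
    """Emit a (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(docs):
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["great hotel", "great pool great view"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across machines. That independence is what makes the pattern attractive for large datasets.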


Medium Data
“With yesterday's scientific technology most businesses should be able to
handle their data analysis needs.”

HC: 12 Gb logfiles / day = medium data problem

Solved (?) with: RDBMS + NoSQL

(2006) Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

(2004) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat


3 sexy skills of data geeks

“The sexy job in the next ten years will be statisticians… The ability
to take data—to be able to understand it, to process it, to extract
value from it, to visualize it, to communicate it.” Hal Varian (Google)

http://dataspora.com/blog/sexy-data-geeks/


3 skills: statistics

sentiment analysis

machine learning

natural language processing

recommendation engines

good old-fashioned regression


3 skills: visualization
Q: Are you hiring statisticians, visualization experts & data plumbers?

The Oatmeal vs. Edward Tufte, Ben Fry


3 skills: data plumbing

Glue languages: Python, Perl, regex, XSLT

Admin: setting up, maintaining clusters

Affinity with OSS & *nix

NoSQL = NoSchema = Transform Data

/^([\w!#$%&'*+\-\/=?^`{|}~]+\.)*[\w!#$%&'*+\-\/=?^`{|}~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)$/i
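
As a small data-plumbing illustration, the e-mail pattern above can be used from Python roughly as follows. This is a sketch under one assumption: the backslashes (`\w`, `\d`, `\.`, `\-`) were lost when the slide was extracted and are restored here. The helper name and the sample addresses are made up.

```python
import re

# Reconstruction of the slide's pattern with the assumed backslashes restored.
EMAIL_RE = re.compile(
    r"^([\w!#$%&'*+\-/=?^`{|}~]+\.)*[\w!#$%&'*+\-/=?^`{|}~]+@"
    r"((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})"
    r"|(\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)$",
    re.IGNORECASE,
)


def is_valid_email(text: str) -> bool:
    """Return True if `text` matches the reconstructed pattern."""
    return EMAIL_RE.match(text) is not None


if __name__ == "__main__":
    for candidate in ["alice@example.com", "bob@192.168.0.1:8080", "not-an-email"]:
        print(candidate, is_valid_email(candidate))
```

A pattern like this is typical glue code: it filters or cleans records while they are transformed from one store or format to another.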


More Data beats smart algorithms

face recognition

spelling correction

machine translation

http://videos.syntience.com/ai-meetups/peternorvig.html
http://dataspora.com/blog/tipping-points-and-big-data/


Ethics of data

Black Hat vs. White Hat <=> Black Data vs. White data

White: Amazon free public datasets (e.g. human genome)

Black: Scientific climate data (or the lack of PUBLIC data)

Just like money, information flows to the least taxed location in a
global world.


Take-Away & Discuss
“Don't throw away data if you don’t have to, because
unlike material goods, data becomes more valuable the
more of it is created. As a society, I don't think we
understand this completely yet.”
q: Who is using a NoSQL db? Share stories?

q: Do you know how much data you are throwing away?

q: Do you hire statisticians?

q: Do you hire visualization experts?

q: Any tips on introducing NoSQL in companies?

q: Share: how big is your data?

q: Do you own your customer data or does Facebook?

q: Do you own your analytics data?

q: How are you exploiting asynchronicity?

q: Do you own your content or does Google?

q: Should information be regulated (privacy)? Can it?



Big Data @ Bodensee Barcamp 2010
