Orbitz used Hadoop and Hive to address the challenge of processing and analyzing large amounts of log and user data. They were able to improve their hotel sorting and ranking by using machine learning algorithms on data stored in Hadoop. Statistical analysis of the Hadoop data provided insights into user behaviors and helped optimize aspects of the user experience like hotel search and recommendations. Orbitz found Hadoop to be a cost-effective solution that has expanded to more uses across the company.
Hadoop and Hive at Orbitz, Hadoop World 2010
1. Hadoop and Hive at Orbitz
Jonathan Seidman and Ramesh Venkataramaiah
Hadoop World 2010
2. Agenda
• Orbitz Worldwide
• The challenge of big data at Orbitz
• Hadoop as a solution to the data challenge
• Applications of Hadoop and Hive at Orbitz: improving hotel sort
• Sample analysis and data trends
• Other uses of Hadoop and Hive at Orbitz
• Lessons learned and conclusion
5. Data Challenges at Orbitz
On Orbitz alone we do millions of searches and transactions daily,
which leads to hundreds of gigabytes of log data every day.
So how do we store and process all of this data?
8. • Adding data to our data warehouse also requires a lengthy plan/implement/deploy cycle.
• Because of the expense and time, our data teams need to be very judicious about which data gets added. This means that potentially valuable data may not be saved.
• We needed a solution that would allow us to economically store and process the growing volumes of data we collect.
10. • It's important to note that Hadoop is not a replacement for a data warehouse, but rather a complement to it.
• On the other hand, Hadoop offers benefits other than just cost.
12. How can we improve hotel ranking?
Hey! Let's use machine learning!
All the cool kids are doing it!
13. Requires data… lots of data
• Web analytics software provides session data about user behavior.
• Unfortunately, the specific data fields we needed weren't loaded into our data warehouse, and to make things worse, the only archive of raw logs went back just a few days.
• We decided to turn to Hadoop to provide a long-term archive for these logs.
• Storing raw data in HDFS provides access to data not available elsewhere, for example hotel impression data:
115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00!
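As a minimal sketch of how a downstream job might parse that impression string, assuming each semicolon-separated triple is hotel id, display position, and displayed price (an inference from the sample; the class and field names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the impression format shown above:
// semicolon-separated triples of hotelId,position,price, with a
// trailing '!' terminator as in the sample.
public class HotelImpression {
    final long hotelId;   // e.g. 115004
    final int position;   // 1-based display position on the page
    final double price;   // price shown to the user

    HotelImpression(long hotelId, int position, double price) {
        this.hotelId = hotelId;
        this.position = position;
        this.price = price;
    }

    static List<HotelImpression> parseRecord(String record) {
        if (record.endsWith("!")) {               // strip terminator
            record = record.substring(0, record.length() - 1);
        }
        List<HotelImpression> result = new ArrayList<HotelImpression>();
        for (String triple : record.split(";")) {
            String[] f = triple.split(",");
            result.add(new HotelImpression(
                Long.parseLong(f[0]),
                Integer.parseInt(f[1]),
                Double.parseDouble(f[2])));
        }
        return result;
    }
}
```

Each search result page then yields one impression per hotel shown, the kind of positional data examined later in the deck.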
14. Now we need to process the data
• Extract data from raw Webtrends logs for input to a trained classification process.
• Logs provide input to MapReduce processing, which extracts the required fields (see the mapper sketch after this list).
• The previous process used a series of Perl and Bash scripts to extract the data serially.
• Performance comparison, for a month's worth of data:
  Manual process: 109m14s
  MapReduce process: 25m58s
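A minimal sketch of what the extraction mapper might look like, assuming a tab-delimited Webtrends line layout (the delimiter, field positions, and class name are assumptions for illustration, not the actual production code):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper that pulls the fields needed by the trained
// classifier out of raw log lines. The delimiter and the field
// positions below are placeholders, not the real Webtrends layout.
public class LogFieldExtractorMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 5) {
            return; // skip lines missing the expected fields
        }
        // Emit only the columns the classification process needs,
        // e.g. session id, search terms, and the impression string.
        outValue.set(fields[0] + "\t" + fields[2] + "\t" + fields[4]);
        context.write(NullWritable.get(), outValue);
    }
}
```

Because mappers run in parallel across the cluster, the same month of logs that took the serial scripts roughly 109 minutes finished in about 26 minutes.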
21. Once data is in Hive
• Provides input data to machine learning processes.
• Used to create data exports for further analysis with R scripts, allowing us to derive more complex statistics and visualizations of our data (a sketch of the export step follows this list).
• Provides useful metrics, many of which were unavailable with our existing data stores.
• Used for aggregating data for import into our data warehouse for creation of new data cubes, giving analysts access to data unavailable in existing cubes.
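As an illustration of the export step, a hypothetical aggregation could be run through Hive's JDBC client and written out as CSV for R to pick up (the table and column names are invented, and the driver class shown is the HiveServer JDBC driver of that era):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run an aggregation over a (hypothetical) hotel_bookings
// table via Hive's JDBC driver and print CSV rows for an R script.
public class HiveExportSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT booking_month, AVG(stay_nights) "
            + "FROM hotel_bookings GROUP BY booking_month");
        while (rs.next()) {
            // CSV output that R can load with read.csv().
            System.out.println(rs.getString(1) + "," + rs.getDouble(2));
        }
        con.close();
    }
}
```

An R script could then read the output with read.csv() and produce the statistics and visualizations described on the following slides.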
22. Statistical Analysis: Infrastructure and Dataset
• Hive + R platform for query processing and statistical analysis.
• R: an open-source statistics package with visualization.
• Hive dataset: customer hotel bookings on our sites and user ratings of hotels.
• Investigation:
  Are there built-in data biases? Any lurking variables?
  What approximations and biases exist?
  Are variables pairwise correlated?
  Are there macro patterns?
23. Statistical Analysis - Positional Bias
• The lurking variable is positional bias.
• Top positions are invariably picked the most.
• We aim to position the best-ranked hotels at the top, based on customer search criteria and user ratings.
24. Statistical Analysis - Kernel Density
• Histograms of user ratings of hotels are strongly affected by the number of bins used.
• Kernel density plots are usually a much more effective way to overcome the limitations of histograms.
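For reference, a kernel density estimate replaces the histogram's fixed bins with a smooth kernel centered at every observation:

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)

where K is a kernel function (commonly Gaussian) and the bandwidth h plays the role of the bin width, without the histogram's sensitivity to bin count and edge placement.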
26. Statistical Analysis - More seasonal variations
• Customer hotel stays get longer during the summer months.
• This could help in designing season-aware search.
• Outliers removed.
27. Analysis: takeaways
• The cost of cleaning and processing data is significant.
• Beware the tendency to create stories out of noise.
• The median is not the message; find macro patterns first.
• If the data originated from a website, watch for hidden biases in data collection.
28. Lessons Learned
• Make sure you're using the appropriate tool: avoid the temptation to start throwing all of your data into Hadoop when a relational store may be a better choice.
• Expect the unexpected in your data. When processing billions of records, it's inevitable that you'll encounter at least one bad record that will blow up your processing (see the defensive mapper sketch after this list).
• To get buy-in from upper management, present a long-term, unstructured data growth story and explain how this will help harness long-tail opportunities.
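A common defensive pattern is to catch parse failures in the mapper and count them with Hadoop's counter API rather than letting one bad line kill the job (a sketch; the record layout and counter names are our own):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: count malformed records instead of letting one bad line
// out of billions blow up the whole job.
public class DefensiveMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String[] fields = value.toString().split(",");
            long hotelId = Long.parseLong(fields[0]); // may throw on bad input
            outKey.set(Long.toString(hotelId));
            context.write(outKey, ONE);
        } catch (RuntimeException e) {
            // Don't fail the job; count the bad record and move on.
            context.getCounter("DataQuality", "MALFORMED_RECORDS")
                   .increment(1);
        }
    }
}
```

The counter totals appear in the job's status output, so a spike in malformed records stays visible without failing the run.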
29. Lessons Learned (continued)
• Hadoop's limited security model creates challenges when trying to deploy Hadoop in the enterprise.
• Configuration currently seems to be a black art. It can be difficult to understand which parameters to set and how to determine an optimal configuration.
• Watch your memory use. Sloppy programming practices will bite you when your code needs to process large volumes of data (see the reducer sketch after this list).
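For example, a reducer should stream over its values and keep a running aggregate rather than buffering them in a collection; a minimal sketch (the key and value types are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: aggregate in O(1) memory per key. Collecting the values
// into an ArrayList first would hold an entire key's records in RAM,
// which fails once a hot key arrives with millions of values.
public class StreamingSumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable result = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get(); // running total; no per-value buffering
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

The buffering version works fine in testing, then dies with an OutOfMemoryError the first time production data delivers a hot key.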
31. Just a few more examples of how Hadoop is being used at Orbitz
• Measuring page download performance: using web analytics logs as input, a set of MapReduce scripts is used to derive detailed client-side performance metrics, which allow us to track trends in page download times.
• Searching production logs: an effort is underway to use Hadoop to store and process our large volume of production logs, allowing developers and analysts to perform tasks such as troubleshooting production issues.
• Cache analysis: extraction and aggregation of data to provide input to analyses intended to improve the performance of the data caches used by our web sites.
32. Applications of Hadoop at Orbitz are just beginning
• We're in the process of quadrupling the capacity of our production cluster.
• Multiple teams are working on new applications of Hadoop.
• We continue to explore the use of associated tools: HBase, Pig, Flume, etc.
33. References
• Hadoop project: http://hadoop.apache.org/
• Hive project: http://hadoop.apache.org/hive/
• Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf
• Hadoop: The Definitive Guide, Tom White, O'Reilly Media, 2009
• Why Model?, J. Epstein, 2008
• Beautiful Data, T. Segaran & J. Hammerbacher, 2009
• Karmasphere Developer Study: http://www.karmasphere.com/images/documents/Karmasphere-HadoopDeveloperResearch.pdf
34. Contact
• Jonathan Seidman:
  jseidman@orbitz.com
  @jseidman
  Chicago area Hadoop User Group: http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/
• Ramesh Venkataramaiah:
  rvenkataramaiah@orbitz.com