This document discusses an agile approach to analyzing web log data from uSwitch, an online business. It describes acquiring log data from many distributed applications, analyzing the data using tools like Apache Hive and Cascalog, and taking action by exposing insights to help personalize websites for visitors. The goal is to explore and exploit the data through a collaborative team that acquires, analyzes, and takes action on the data in an iterative way without unnecessary complexity.
An agile approach to knowledge discovery on web log data
1. An agile approach to
knowledge discovery
of web log data
Paul Lam, Thibaut Sacreste, Paul Ingles
OR54, Edinburgh, 4 September 2012
7. Product personalisation
30% of Amazon sales come from its recommendation engine [1]
Examples on
uSwitch homepage
[1] Schumpeter, Building with big data, Economist, 26 May 2011
9. Data team at uSwitch
a core team of 3 people with complementary skills:
data scientist
back-end developer
software architect
not a boundary of our roles
guess who loves ggplot and who does the NLP work
collaborate with domain experts (designers, marketers, product
managers, developers, etc) across the company
12. Data extraction considerations
hundreds of applications distributed over ~50 Amazon EC2 instances
10+ of the apps are actively worked on at any given time
projects are owned by small, autonomous teams
great for the business, not so great to get data from
18. TF-IDF
Extended from word count example
Single-purpose methods
Composition of functions
github.com/Quantisan/Impatient
github.com/Cascading/Impatient
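The three points on this slide (start from word count, keep each method single-purpose, compose the functions) can be sketched in a few lines. This is a minimal illustration in plain Python, not the Cascalog code from the repositories above; the function names and the `log(N / df)` weighting are assumptions chosen for the sketch, and other TF-IDF variants smooth the denominator differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Split one document into lowercase word tokens."""
    return text.lower().split()


def term_frequency(tokens):
    """Word count for a single document (the starting point)."""
    return Counter(tokens)


def document_frequency(tokenized_docs):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def inverse_document_frequency(df, n_docs):
    """idf(t) = log(N / df(t)); an assumed, unsmoothed variant."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(docs):
    """Compose the single-purpose steps into one TF-IDF score per term per document."""
    tokenized = [tokenize(doc) for doc in docs]
    idf = inverse_document_frequency(document_frequency(tokenized), len(tokenized))
    return [
        {term: count * idf[term] for term, count in term_frequency(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical toy corpus, for illustration only.
scores = tf_idf(["gas electricity gas", "electricity broadband"])
```

Each function does one job and can be tested alone; `tf_idf` only wires them together, which is the composition idea the slide refers to.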
19. Our data processing methodology
No monolithic framework
Only build what we need as we go
Composability, extensibility, maintainability
21. 80/20
[Diagram: Acquire, Analyse, Action; labels: "80% of work", "80% of result"]
22. Three Es
Enlighten
R with rhdfs and ggplot, Sinatra + D3.js
Expose
Scheduled Hadoop jobs to load processed data into MySQL for
everyone to use
Exploit
Real-time customer intelligence to personalise website for each
visitor
23. Result
Data from all levels are accessible
Information is easy
"Sweet! I don't have to do anything!" -- Hemal, uSwitch developer
Opening dialogue about using data
24. Summary
Develop incrementally and iterate
Mitigate unnecessary complexity
25. Contact
Paul Lam, data scientist at uSwitch
@Quantisan
paul.lam@forward.co.uk
#3: So what’s so special about web log data?
#4: Web log contains visitor information such as: what page they’re looking at, what browser or device they’re using, and how they came about to our site.
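As a rough illustration of the three fields named in note #4 (page, browser or device, and how the visitor arrived), here is a minimal sketch that pulls them out of one log line. It assumes the widely used Apache/NGINX "combined" log format; the talk does not say which format uSwitch's applications write, and the sample line, its path, and the function name are hypothetical.

```python
import re

# Combined log format: host ident user [time] "request" status bytes "referrer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return page, referrer and user agent for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "page": fields["path"],              # what page they are looking at
        "user_agent": fields["user_agent"],  # what browser or device they are using
        "referrer": fields["referrer"],      # how they came to the site
    }


# Hypothetical sample line, for illustration only.
SAMPLE = (
    '203.0.113.7 - - [04/Sep/2012:10:15:32 +0100] '
    '"GET /gas-electricity/ HTTP/1.1" 200 5120 '
    '"http://www.example.com/search?q=energy" '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X)"'
)
record = parse_line(SAMPLE)
```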
#5: uSwitch is the second largest price comparison website in the UK. In terms of data, we’re collecting about 100 GB of data per month, most of which is web log data.
#6: It is literally a trail of footprints left by each and every one of our customers. By studying and analysing our web log data, we can better understand our customers. (Explain graph: 3 clusters = 3 businesses.)