An agile approach to
       knowledge discovery
                   of web log data
Paul Lam, Thibaut Sacreste, Paul Ingles

OR54, Edinburgh, 4 September 2012
Why web log data
Visitor information

   web page requested

   client IP address

   request timestamp

   query string

   bytes served

   user agent

   referrer
uSwitch



   an online business

   100 GB of uncompressed
    data per month
Behavioural analysis
Purchasing habits
Product personalisation


   30% of Amazon
    sales come from its
    recommendation
    engine [1]

   Examples on
    uSwitch homepage




                           [1] Schumpeter, Building with big data, Economist, 26 May 2011
Goals



   Exploration of data

   Exploitation of data
Data team at uSwitch

   a core team of 3 people with complementary skills:

       data scientist

       back-end developer

       software architect

   our roles are not boundaries

       guess who loves ggplot and who does the NLP work

   collaborate with domain experts (designers, marketers, product
    managers, developers, etc) across the company
Challenges and Solutions


               Acquire




      Action             Analyse
Acquire
Data extraction considerations


   hundreds of applications distributed over ~50 Amazon EC2 instances

   10+ of the apps are actively worked on at any given time

   projects are owned by small, autonomous teams

   great for the business, not so great for getting data out
Distributed data pipeline




Ingles, P., Users as Data, http://vimeo.com/45136211, EuroClojure, 24 May, 2012
Analyse
One of two million a day

   {:status 200,
    :scheme http,
    :pipe .,
    :request-uri /broadband/?gclid=CPnYgdqj0bECFa4mtAodVEsAYA,
    :http-x-forwarded-for 92.9.200.50,
    :msec 1344196910.137,
    :sent-http-set-cookie -,
    :body-bytes-sent 18836,
    :query-string gclid=CPnYgdj0bECa4mtAdVEsAYA,
    :request-content-type -,
    :cookie-urefs -,
    :request GET /broadband/?gclid=CPnYgdj0bECa4mtAdVEsAYA HTTP/1.1,
    :upstream-response-time 0.164,
    :sent-http-content-type text/html,
    :hostname nginx-lb-20120229-1942-24.uswitchinternal.com,
    :sent-http-location -,
    :time-local 05/Aug/2012:20:01:50 +0000,
    :http-referer http://www.google.co.uk/aclk?sa=l&ai=D1556&rct=j&q=best%20value%20internet%20uk,
    :http-user-agent Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1,
    :request-time 0.164,
    :request-body -,
    :http-host www.uswitch.com,
    :upstream-addr 178.32.60.100:80,
    :sent-http-server -,
    :upstream-status 200,
    :uscc <ANON>}
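
A record like the one above is only semi-structured: values are unquoted and can themselves contain commas (the user agent does). The sketch below shows one way such a printed record could be split into fields in Python. It is an illustration, not the team's actual parser. The helper `parse_record`, the shortened `RAW` sample and the field-boundary rule are all assumptions made for this example.

```python
import re
from urllib.parse import parse_qs

# A shortened copy of the record above, flattened to one line.
RAW = (
    "{:status 200, :scheme http, :request-uri /broadband/?gclid=CPnYgdqj0bECFa4mtAodVEsAYA, "
    ":body-bytes-sent 18836, :query-string gclid=CPnYgdj0bECa4mtAdVEsAYA, "
    ":http-user-agent Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) "
    "Chrome/21.0.1180.60 Safari/537.1, :http-host www.uswitch.com, :uscc <ANON>}"
)

# Assumption: a new field starts at a comma followed by ":key " (lowercase
# letters and hyphens). A comma inside a value, as in "KHTML, like Gecko",
# is not followed by ":" and is therefore left alone.
FIELD_BOUNDARY = re.compile(r",\s+(?=:[a-z][a-z-]* )")


def parse_record(raw):
    """Split a printed record into a {key: value} dict of strings (hypothetical helper)."""
    body = raw.strip()[1:-1]  # drop the surrounding braces
    record = {}
    for field in FIELD_BOUNDARY.split(body):
        key, _, value = field.partition(" ")
        record[key.lstrip(":")] = value
    return record


record = parse_record(RAW)

# The query string can then be decoded with the standard library.
params = parse_qs(record["query-string"])
```

Every value stays a string here; converting `status` or `body-bytes-sent` to numbers would be a separate, explicit step.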
Ad-hoc queries - Apache Hive
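
As a rough illustration of the kind of ad-hoc question this slide refers to, the sketch below counts successful requests per site section. It is written in plain Python rather than HiveQL, and the sample records, the `section` helper and the paths are hypothetical.

```python
from collections import Counter

# Hypothetical sample of parsed log records; only two fields are kept.
records = [
    {"request-uri": "/broadband/?gclid=abc", "status": "200"},
    {"request-uri": "/gas-electricity/", "status": "200"},
    {"request-uri": "/broadband/compare/", "status": "200"},
    {"request-uri": "/broadband/", "status": "404"},
]


def section(uri):
    """Return the first path segment, e.g. '/broadband/?x=1' -> 'broadband'."""
    path = uri.split("?", 1)[0]
    return path.strip("/").split("/")[0] or "(home)"


# Roughly the shape of:
#   SELECT section, COUNT(*) FROM logs WHERE status = 200 GROUP BY section
hits = Counter(section(r["request-uri"]) for r in records if r["status"] == "200")
```

The point of a SQL-like layer is that a question of this shape can be asked without writing a batch job by hand.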
Word Count - Cascalog
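
A minimal word-count sketch, in plain Python rather than Cascalog, to show the two steps the classic example is built from: split each line into tokens, then count occurrences per token. The function names and the sample lines are assumptions for illustration only.

```python
import re
from collections import Counter


def tokenise(line):
    """Lowercase a line and return its word tokens."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines):
    """Count how often each token appears across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenise(line))
    return counts


# Hypothetical search phrases standing in for real log fields.
lines = ["best value internet uk", "Best broadband deals", "value broadband"]
counts = word_count(lines)
```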
TF-IDF

   Extended from word
    count example

   Single-purpose
    methods

   Composition of
    functions



   github.com/Quantisan/Impatient


   github.com/Cascading/Impatient
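
The idea of single-purpose methods composed into a larger computation can be sketched as follows. This is plain Python, not the Cascalog code in the repositories above; the function names, the toy documents and the unsmoothed IDF variant, log(N / DF), are assumptions chosen for brevity.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: the share of a document's tokens taken up by each term."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}


def document_frequency(docs):
    """DF: the number of documents containing each term at least once."""
    return Counter(term for tokens in docs.values() for term in set(tokens))


def inverse_document_frequency(docs):
    """IDF = log(N / DF); one common variant, others add smoothing."""
    n_docs = len(docs)
    return {term: math.log(n_docs / df) for term, df in document_frequency(docs).items()}


def tf_idf(docs):
    """Compose the pieces: score[doc][term] = TF * IDF."""
    idf = inverse_document_frequency(docs)
    return {
        doc_id: {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for doc_id, tokens in docs.items()
    }


# Hypothetical, already tokenised documents (each must be non-empty).
docs = {
    "d1": ["cheap", "broadband", "deals"],
    "d2": ["broadband", "speed", "test"],
    "d3": ["cheap", "energy", "deals"],
}
scores = tf_idf(docs)
```

Each function does one thing and can be tested alone; `tf_idf` only wires them together, which is what makes the word-count step reusable.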
Our data processing methodology


   No monolithic framework

   Only build what we need as
    we go

   Composability, extensibility,
    maintainability
Action
80/20


                   Acquire        80% of work




          Action             Analyse



80% of result
Three Es

   Enlighten

       R with rhdfs and ggplot, Sinatra + D3.js

   Expose

       Scheduled Hadoop jobs to load processed data into MySQL for
        everyone to use

   Exploit

       Real-time customer intelligence to personalise website for each
        visitor
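
The "Expose" step, loading processed data into a relational table that anyone can query, might look like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for MySQL. The table name, columns and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical output of a batch job: (day, section, visits).
processed = [
    ("2012-08-05", "broadband", 1200),
    ("2012-08-05", "energy", 950),
    ("2012-08-06", "broadband", 1310),
]

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute(
    "CREATE TABLE daily_visits ("
    " day TEXT NOT NULL, section TEXT NOT NULL, visits INTEGER NOT NULL,"
    " PRIMARY KEY (day, section))"
)


def load(rows):
    """Idempotent load: re-running a scheduled job overwrites the same keys."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO daily_visits VALUES (?, ?, ?)", rows)


load(processed)
load(processed)  # a re-run does not duplicate rows

# Anyone with SQL access can now ask questions of the processed data.
totals = dict(
    conn.execute("SELECT section, SUM(visits) FROM daily_visits GROUP BY section")
)
```

`INSERT OR REPLACE` is SQLite syntax; MySQL offers `REPLACE INTO` and `INSERT ... ON DUPLICATE KEY UPDATE` for the same purpose. Keying the table on (day, section) is what makes a repeated scheduled run safe.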
Result


   Data from all levels are accessible

   Information is easy to access

       "Sweet! I don't have to do anything!" -- Hemal, uSwitch developer

   Opening dialogue about using data
Summary



   Develop incrementally and iterate

   Mitigate unnecessary complexity
Contact



   Paul Lam, data scientist at uSwitch

   @Quantisan

   paul.lam@forward.co.uk


Editor's Notes

  • #3: So what's so special about web log data?
  • #4: Web logs contain visitor information such as what page they're looking at, what browser or device they're using, and how they came to our site.
  • #5: uSwitch is the second largest price comparison website in the UK. We collect about 100 GB of data per month, most of which is web log data.
  • #6: It is literally a trail of footprints left by each and every one of our customers. By studying and analysing our web log data, we can better understand our customers. Explain graph. 3 clusters = 3 businesses.
  • #8: In addition to providing information, we can also make use of the data within the website itself.
  • #12: There are 3 stages to our data process.
  • #15: distributed and asynchronous message queue; push data
  • #17: semi-structured; postcode field example