Ana Martinez
Kin Lane

February 2012   M.C. Escher
CityGrid Architecture + API Overview from O'Reilly Strata Conference
The problem
Big Bottleneck!
Single point of failure!
Places Processing
  Source 1          Source 2          Source 3
  • Name            • Name            • Name
  • Address         • Address         • Address
  • Phone           • Phone           • Phone
  • Images          • Reviews         • Menu

                -> CityGrid Place
Why is it hard?
Book is to ISBN what Product is to UPC and what Place is to ______


No centrally regulated unique ID (a tax ID is one, but it is not public). Now what?

Spago
176 Canon Dr
Beverly Hills, CA 90210
310-944-3924



R. French Ac & Heating Inc        Ray French Air Conditioning & Heating Service
2211 martin luther king blvd      2211 MLK boulevard #104
los angeles, CA, 90069            west Hollywood, CA, 90069
310-358-5903                      866-465-5303
Problem Definition
• Medium size data set
  – 21 million rows, 120 columns

• Time to process: Daily

• Hybrid environment

• Not all data is from same source
Solution




       Normalizer -> Matcher -> Merger
Normalizer

  • Soundex
  • Metaphone
  • NYSIIS
  • Match Rating Approach
  • Caverphone
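
As a rough illustration of what one of these phonetic encoders does, here is a minimal sketch of American Soundex in Python. The slides do not show CityGrid's implementation; the function name `soundex` and the way non-letter characters are dropped are assumptions made for this example.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name (sketch)."""
    # Consonant groups and their Soundex digits.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    # Assumption: anything that is not a letter is ignored.
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again

    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike collapse to the same code (both names above encode to `R163`), which is why such codes are often used as a cheap first-pass key before a more expensive comparison.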
Know Your Data
Stop Words
 • The Viper Room -> Viper Room

Stemming
 • av, aven, avenu, avenue, avn, avnue

Compression
 • county line, county rd, county road

Truncation
 • apt, unit, #
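
A minimal sketch of how stop words, stemming of suffix variants, and truncation at unit markers might be applied to a name or street line. The word lists, the canonical form `ave`, and the function name `normalize` are assumptions for illustration; the slides only list the example variants.

```python
import re

# Hypothetical lookup tables built from the examples on the slide.
STOP_WORDS = {"the"}
SUFFIX_VARIANTS = {v: "ave" for v in
                   ("av", "aven", "avenu", "avenue", "avn", "avnue")}
UNIT_MARKERS = {"apt", "unit", "#"}


def normalize(text: str) -> str:
    """Lowercase, drop stop words, unify suffix variants, cut at unit markers."""
    tokens = re.findall(r"[a-z0-9]+|#", text.lower())
    kept = []
    for tok in tokens:
        if tok in UNIT_MARKERS:
            break  # truncation: ignore everything from the unit marker on
        if tok in STOP_WORDS:
            continue  # stop word removal
        kept.append(SUFFIX_VARIANTS.get(tok, tok))  # stemming via lookup
    return " ".join(kept)


print(normalize("The Viper Room"))
print(normalize("123 Main Avnue Apt 4"))
```

A lookup table is used here rather than a general-purpose stemmer because address vocabulary is small and domain specific, which fits the "know your data" advice.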
Normalizer
         123 Martin Luther King.n

           123 MartinLutherKing.

           123 martinlutherking.

    Martin Luther King | martinlutherking
                  canon column



          the | n | ave | (tokens)
Matching Strategy




   Do what you can in an automated fashion and
       complement it with manual steps.
Matching Strategy




• Exact matching
• Set similarity joins
• Custom fuzzy matching
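
To make the "set similarity join" idea concrete, here is a small sketch that compares names by the Jaccard similarity of their character trigrams. It uses a naive nested loop; production set-similarity joins normally add indexing or prefix filtering, and nothing here is taken from CityGrid's code. The names `shingles`, `jaccard`, `similarity_join`, and the default threshold are assumptions.

```python
def shingles(text: str, n: int = 3) -> set:
    """Character n-grams of the lowercased text, padded with spaces."""
    s = f" {text.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_join(left, right, threshold: float = 0.5):
    """Return (left, right, score) for every pair scoring at least threshold."""
    right_sets = [(r, shingles(r)) for r in right]
    pairs = []
    for l in left:
        l_set = shingles(l)
        for r, r_set in right_sets:
            score = jaccard(l_set, r_set)
            if score >= threshold:
                pairs.append((l, r, score))
    return pairs


for pair in similarity_join(["Viper Room"], ["viper room", "Spago"], 0.9):
    print(pair)
```

Exact matching is the special case where the score is 1.0; lowering the threshold trades precision for recall, which is where the custom fuzzy step and manual review come in.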
Matching Strategy
• C-Support Vector Machine

• Threshold: 0.996
  – Precision: 98.1%
  – Recall: 97.5%




        84% + manual -> % Match Rate
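
For readers unfamiliar with how a decision threshold relates to precision and recall, the following sketch computes both from a list of classifier scores and true labels. The scores below are made-up toy values, not CityGrid data, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are called matches."""
    tp = fp = fn = 0
    for score, is_match in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1  # correctly accepted match
        elif predicted and not is_match:
            fp += 1  # accepted, but not a real match
        elif not predicted and is_match:
            fn += 1  # real match that was rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy scores and labels, invented for illustration only.
scores = [0.999, 0.998, 0.500, 0.997, 0.200]
labels = [True, False, True, True, False]
print(precision_recall(scores, labels, threshold=0.996))
```

Raising the threshold generally increases precision and lowers recall; pairs that fall just below it are natural candidates for the manual step.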
Merger

Rules:
   Provider trustworthiness
   Voting rules
   New data vs. old data
   Super providers

History:
   Accepted
   Rejected
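
One way the trustworthiness, voting, and super-provider rules could interact is sketched below: each provider's value for a field gets a vote weighted by a trust score, and a super provider overrides the vote. The provider names, trust weights, default weight, and the function `merge_field` are all hypothetical; the slides name the rules but do not specify how they are combined.

```python
from collections import defaultdict

# Hypothetical trust weights and super-provider list.
PROVIDER_TRUST = {"source_1": 0.9, "source_2": 0.4, "source_3": 0.4}
SUPER_PROVIDERS = {"owner_feed"}
DEFAULT_TRUST = 0.1  # weight for providers with no known track record


def merge_field(candidates, trust=PROVIDER_TRUST, super_providers=SUPER_PROVIDERS):
    """Pick one value for a field from (provider, value) candidates."""
    # Super providers win outright when they supply a value.
    for provider, value in candidates:
        if provider in super_providers and value:
            return value

    # Otherwise run a trust-weighted vote over the non-empty values.
    votes = defaultdict(float)
    for provider, value in candidates:
        if value:
            votes[value] += trust.get(provider, DEFAULT_TRUST)
    if not votes:
        return None
    return max(votes, key=votes.get)


phones = [("source_1", "310-358-5903"),
          ("source_2", "866-465-5303"),
          ("source_3", "866-465-5303")]
print(merge_field(phones))
```

In this toy setup one highly trusted source outweighs two weaker sources that agree with each other. Recording accepted and rejected merges, as the History column suggests, would be a natural input for tuning such weights.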
Example
123 M L K Road Ste 45                 123 Martin Luther King Rd             123 Martin L King Drive #45
123 m l k road ste 45                 123 martin luther king rd             123 martin l king drive #45
(123) (m) (l) (k) (road) (ste) (45)   (123) (martin) (luther) (king) (rd)   (123) (martin) (l) (king) (drive) (#) (45)
123 mlk road ste 45                   123 martinlutherkingrd                123 martinlking drive # 45
123 mlkrdste 45                       123 mlkrd                             123 mlkdr #45
123 mlkrd                             123 mlkrd                             123 mlkdr
123 mlk                               123 mlk                               123 mlk

MATCH!                                MATCH!                                MATCH!
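
The reduction in this example can be approximated with a few lines of Python. This is a simplified sketch that reaches the same final key (`123 mlk`) for the three addresses, not a reconstruction of the intermediate steps on the slide. The alias table, the street-type and unit-marker lists, and the name `canonical_address` are assumptions.

```python
import re

# Hypothetical domain tables; a real system would hold many more entries.
ALIASES = {"martin luther king": "mlk", "martin l king": "mlk", "m l k": "mlk"}
STREET_TYPES = {"road", "rd", "drive", "dr", "boulevard", "blvd"}
UNIT_MARKERS = {"ste", "suite", "apt", "unit", "#"}


def canonical_address(address: str) -> str:
    """Reduce a street line to a comparison key."""
    tokens = re.findall(r"[a-z0-9]+|#", address.lower())
    kept = []
    for tok in tokens:
        if tok in UNIT_MARKERS:
            break  # drop the unit marker and everything after it
        if tok not in STREET_TYPES:
            kept.append(tok)  # street types are dropped from the key
    key = " ".join(kept)
    # Collapse known spellings of the same street name to one alias.
    for phrase, short in ALIASES.items():
        key = re.sub(rf"\b{re.escape(phrase)}\b", short, key)
    return key


for line in ("123 M L K Road Ste 45",
             "123 Martin Luther King Rd",
             "123 Martin L King Drive #45"):
    print(canonical_address(line))
```

Dropping the street type makes "Road" and "Drive" compare equal, which is aggressive; in practice such a key would more likely be one signal among several than a final decision.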
Findings & Tips
• Domain Knowledge
• Automation
• Mechanical Turk
• Machine Learning

  Run every 2 hrs -> Match Rate of %
Solution for Search APIs
Solution for Places API
Performance Results
Updates
• Hours
• Real Time
Places Detail – Demo Time!
• Details by ID

  – http://api.citygridmedia.com/content/places/v2/detail?listing_id=11280452&client_ip=123.4.56.78&publisher=test

  – http://api.citygridmedia.com/content/places/v2/detail?public_id=pinks-hot-dogs-los-angeles-2&client_ip=123.4.56.78&publisher=test
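
A small sketch of building these detail URLs programmatically. The endpoint and the parameter names `listing_id`, `public_id`, `client_ip`, and `publisher` come from the URLs above; the helper name `detail_url` and the rule that exactly one identifier is passed are assumptions, and the sketch only constructs the URL without calling the service.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.citygridmedia.com/content/places/v2/detail"


def detail_url(*, listing_id=None, public_id=None, client_ip, publisher):
    """Build a Places detail URL from either a listing_id or a public_id."""
    # Assumption: a request identifies the place by exactly one of the two ids.
    if (listing_id is None) == (public_id is None):
        raise ValueError("pass exactly one of listing_id or public_id")
    if listing_id is not None:
        params = {"listing_id": listing_id}
    else:
        params = {"public_id": public_id}
    params["client_ip"] = client_ip
    params["publisher"] = publisher
    return f"{BASE_URL}?{urlencode(params)}"


print(detail_url(listing_id=11280452, client_ip="123.4.56.78", publisher="test"))
print(detail_url(public_id="pinks-hot-dogs-los-angeles-2",
                 client_ip="123.4.56.78", publisher="test"))
```

Using `urlencode` rather than string concatenation keeps identifiers with special characters safely escaped.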
Improvements
• Shard Listing and Content Data

• Integrate Mongo across all APIs
APIs
        Now we have rich Places APIs

How do we make developers aware they exist?

How do we get them to integrate successfully?
APIs – Supporting Developer Area
 Common Building Blocks

   • Getting Started
   • Publisher Overview
   • Documentation
   • FAQ
   • Terms of Use
APIs – Supporting Developer Area
 Developers Tools
   • Code Samples
   • Libraries
   • Mobile SDKs
   • Starter Kits
   • Hackathon Toolkits
   • Partner APIs
APIs – Evangelism - Online
 •   Blogging
 •   Twitter
 •   LinkedIn
 •   Facebook
 •   Github
 •   Stack Overflow
 •   Quora
 •   Hacker News
 •   StumbleUpon
 •   Reddit
APIs – Evangelism - Offline


 •   Conferences
 •   Hackathons
 •   Meetups
 •   Workshops
APIs – Easy Start + Engage Immediately

•   Testable APIs
•   Self-Service
•   Email After Registration
•   Follow on Twitter
•   Follow on LinkedIn
APIs – Feedback Loop + Voice

•   Email Support
•   Forum(s)
•   Twitter
•   LinkedIn
APIs – Monetization = Sustainability

•   Local Web Advertising
•   Local Mobile Advertising
•   Local Custom Ads
•   Places that Pay
APIs – Evangelize Internally

•   Developer Feedback
•   Roadmap Suggestions
•   Landscape Analysis
•   Technology Awareness
•   Trends
•   Internal Hackathons
APIs – Measure & Repeat


