The document discusses strategies for normalizing, matching, and merging place data from multiple sources to create a centralized place database. It describes techniques for preprocessing strings like removing stop words and stemming. It also outlines a matching strategy that uses techniques like exact matching, similarity joins, and fuzzy matching combined with manual review. Metrics like precision and recall are used to measure the matching rate over time. The solution allows for building APIs and services on top of the centralized place database.
CityGrid Architecture + API Overview from O'Reilly Strata Conference
9. Places Processing
Source 1
• Name
• Address
• Phone
• Images
Source 2
• Name
• Address
• Phone
• Reviews
Source 3
• Name
• Address
• Phone
• Menu
CityGrid Place
10. Why is it hard?
Book is to ISBN what Product is to UPC and what Place is to ______
No centrally regulated unique ID exists (a tax ID is unique, but not public). Now what?
Spago
176 Canon Dr
Beverly Hills, CA 90210
310-944-3924
Listing A
R. French Ac & Heating Inc
2211 martin luther king blvd
los angeles, CA, 90069
310-358-5903
Listing B
Ray French Air Conditioning & Heating Service
2211 MLK boulevard #104
west Hollywood, CA, 90069
866-465-5303
11. Problem Definition
• Medium size data set
– 21 million rows, 120 columns
• Time to process: Daily
• Hybrid environment
• Not all data is from same source
14. Know Your Data
Stop Words
• The Viper Room | Viper Room
Stemming
• av | aven | avenu | avenue | avn | avnue
Compression
• county line | county rd | county road
Truncation
• apt | unit | #
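The stop-word, stemming, and truncation ideas above can be sketched in Python as follows. This is a minimal illustration, not the deck's actual implementation: the lookup tables, the target form "ave", and the function names `truncate_unit` and `normalize_tokens` are all assumptions made for the example.

```python
import re

# Assumed lookup tables: the slide lists only a handful of variants.
STOP_WORDS = {"the"}
AVENUE_VARIANTS = {"av", "aven", "avenu", "avenue", "avn", "avnue"}
# Everything from "apt", "unit", or "#" to the end of the string.
UNIT_SUFFIX = re.compile(r"\s*(?:\bapt\b|\bunit\b|#).*$", re.IGNORECASE)


def truncate_unit(address: str) -> str:
    """Truncation: cut the address off at the unit designator."""
    return UNIT_SUFFIX.sub("", address)


def normalize_tokens(text: str) -> str:
    """Lowercase, drop stop words, and map avenue variants to one form."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = ["ave" if t in AVENUE_VARIANTS else t
            for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)


print(normalize_tokens("The Viper Room"))           # viper room
print(truncate_unit("2211 MLK boulevard #104"))     # 2211 MLK boulevard
```

Truncation is kept as a separate function because it only makes sense for address fields; applied to a business name, a word such as "Unit" would cut the name short.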
15. Normalizer
123 Martin Luther King.n
123 MartinLutherKing.
123 martinlutherking.
Martin Luther King | martinlutherking
canon column
the | n | ave | (tokens)
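One way to produce a canon column like the one on this slide is sketched below. It assumes the canonical form is simply the lowercased value with every non-alphanumeric character removed; the slide does not spell out its exact rules, and `canon_key` and `add_canon_column` are hypothetical names.

```python
import re


def canon_key(text: str) -> str:
    """Lowercase, then drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def add_canon_column(rows, field="street"):
    """Return copies of the rows with an extra 'canon' entry (assumed layout)."""
    return [{**row, "canon": canon_key(row[field])} for row in rows]


rows = add_canon_column([{"street": "Martin Luther King"}])
print(rows[0]["canon"])  # martinlutherking
```

Keeping the original value next to its canonical form lets later matching steps compare on `canon` while still displaying the source text.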
16. Matching Strategy
Do what you can in an automated fashion and
complement it with manual steps.
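A rough sketch of that split between automated and manual work: accept exact matches automatically, score the rest with a fuzzy comparison, and queue borderline pairs for a person to review. The thresholds, the record fields, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the deck does not say which measure or cut-offs were used.

```python
import re
from difflib import SequenceMatcher

AUTO_MATCH = 0.90  # assumed cut-off: at or above this, accept automatically
REVIEW = 0.60      # assumed cut-off: at or above this, send to manual review


def canon_key(text: str) -> str:
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def classify_pair(a: dict, b: dict) -> str:
    """Return 'match', 'review', or 'no_match' for two place records."""
    name_a, name_b = canon_key(a["name"]), canon_key(b["name"])
    # Step 1: exact match on canonical name and phone.
    if name_a == name_b and canon_key(a["phone"]) == canon_key(b["phone"]):
        return "match"
    # Step 2: fuzzy match on the canonical name.
    score = SequenceMatcher(None, name_a, name_b).ratio()
    if score >= AUTO_MATCH:
        return "match"
    if score >= REVIEW:
        return "review"  # a person decides
    return "no_match"


a = {"name": "Spago", "phone": "310-944-3924"}
b = {"name": "SPAGO", "phone": "(310) 944-3924"}
print(classify_pair(a, b))  # match
```

Moving the two thresholds trades manual workload against the risk of wrong automatic merges, so in practice they would be tuned against reviewed samples.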
34. APIs
Now we have rich Places APIs
How do we make developers aware they exist?
How do we get them to successfully integrate?
35. APIs – Supporting Developer Area
Common Building Blocks
• Getting Started
Publisher Overview
• Documentation
• FAQ
• Terms of Use
36. APIs – Supporting Developer Area
Developers Tools
• Code Samples
Libraries
• Mobile SDKs
• Starter Kits
• Hackathon Toolkits
• Partner APIs
38. APIs – Evangelism - Offline
• Conferences
• Hackathons
• Meetups
• Workshops
39. APIs – Easy Start + Engage Immediately
• Testable APIs
• Self-Service
• Email After Registration
• Follow on Twitter
• Follow on LinkedIn
40. APIs – Feedback Loop + Voice
• Email Support
• Forum(s)
• Twitter
• LinkedIn
41. APIs – Monetization = Sustainability
• Local Web Advertising
• Local Mobile Advertising
• Local Custom Ads
• Places that Pay