Petar Petrovski, Volha Bryl, Christian Bizer. Integrating Product Data from Websites offering Microdata Markup. The 4th Workshop on Data Extraction and Object Search (DEOS) @ WWW 2014
1. Integrating Product Data from Websites offering Microdata Markup
School of Business Informatics and Mathematics
Petar Petrovski, Volha Bryl, Christian Bizer
Data and Web Science Research Group
University of Mannheim, Germany
2. Outline
1. HTML-embedded Data on the Web
2. The Data Integration Pipeline
   1. Microdata extraction
   2. Classification
   3. Feature extraction
   4. Identity resolution
   5. Data Fusion
3. Conclusions
2Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer
3. HTML-embedded Data
More and more websites semantically mark up the
content of their HTML pages.
Microformats
Microdata
RDFa
4. Schema.org
Search engines ask site owners to embed
data to enrich search results.
200+ classes: Product, Review, LocalBusiness, Person, Place, Event, etc.
Encoding: Microdata or RDFa
5. Usage of Schema.org Data @ Google
Data snippets within search results
Data snippets within info boxes
6. Websites Containing Structured Data
(November 2013)
1.7 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26%).
http://webdatacommons.org/structureddata/
Google, October 2013:
15% of all websites provide structured data.
7. Top Classes, Microdata (2013)
Legend: schema = Schema.org; datavoc = Google's Rich Snippet Vocabulary
8. Example: Microdata, Local Business
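The example shown on this slide is not recoverable from the text. As a stand-in, the sketch below shows what Microdata for a schema.org LocalBusiness can look like and how its itemprop values could be read with Python's standard html.parser. The sample markup, the business details, and the names ItempropCollector and extract_item are hypothetical; the parser is flat (it ignores nested items) and is not how the any23 library works.

```python
from html.parser import HTMLParser

# Hypothetical sample markup (not the example from the original slide).
SAMPLE = """
<div itemscope itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Example Cafe</span>
  <span itemprop="telephone">+49 621 000000</span>
  <span itemprop="addressLocality">Mannheim</span>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collects the text of every element that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.item_type = None   # value of the first itemtype attribute seen
        self.properties = {}    # itemprop name -> text content
        self._current = None    # itemprop whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs and self.item_type is None:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data.strip()

    def handle_endtag(self, tag):
        # Flat simplification: any closing tag ends the current property.
        self._current = None

def extract_item(html):
    """Return (itemtype, {itemprop: text}) for a single flat Microdata item."""
    parser = ItempropCollector()
    parser.feed(html)
    return parser.item_type, parser.properties

item_type, props = extract_item(SAMPLE)
```

Each itemprop becomes one property-value pair of the entity whose class is given by itemtype, which is the kind of statement the extraction step collects.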
10. The Data Integration Pipeline
Objective: integrate all data found on the web
describing a specific entity (e.g. product or organization)
Motivation: enables creation of powerful applications, e.g.
comparison shopping portals
Use case: product data
Implemented pipeline: Microdata extraction, classification, feature extraction, identity resolution, data fusion
12. Web Data Commons Extraction Framework
Web Data Commons project: extracts structured data from
the Common Crawl
http://webdatacommons.org/
http://commoncrawl.org/
Code available at:
https://subversion.assembla.com/svn/commondata/
Based on Anything To Triples (any23) library for extracting
structured data: http://any23.apache.org
Common Crawl 2012
3 billion HTML pages, 40.6 million websites
7.3 billion statements describing 1.15 billion things
9.4 million product offers from 9240 e-shops
13. Looking Deeper into E-Commerce Data
Microdata Product (2013)
15. Example: Title and Description
Offer 1
Title: Apple MacBook Air MC968/A 11.6-Inch Laptop
Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6 LED display with the best resolution
Offer 2
Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon
Observations
Various abbreviations are used to describe the same features
Numeric values are often imprecise due to rounding
Different descriptions follow different levels of detail
17. Product Classification
Starting from 9.4 million products:
Products with English descriptions longer than 20 words
=> 1,986,359 products from 9,240 e-shops
Training set
18,000 labeled products, 9 classes
Training the model
Naïve Bayes Classifier
Features generation
4-step process: tokenizing and removing stop words, pruning,
n-grams, TF-IDF
~3600 features
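The classification step above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it covers tokenizing, stop-word removal, and n-grams (unigrams and bigrams) and feeds raw counts into a multinomial Naïve Bayes with Laplace smoothing, while pruning and TF-IDF weighting are omitted for brevity. The stop-word list, the toy training texts, and all function and class names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "for", "in"}

def features(text):
    """Lowercase, tokenize, drop stop words, then emit unigrams and bigrams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

class NaiveBayes:
    """Multinomial Naive Bayes over feature counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            feats = features(text)
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(doc_count / total_docs)  # class prior
            for feat in features(text):
                if feat in self.vocabulary:  # ignore features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training data; the slides report 18,000 labeled products and 9 classes.
train_texts = [
    "Paperback novel by a bestselling author, 320 pages",
    "Hardcover cookbook with 100 recipes",
    "Laptop with Intel Core i5 processor and 64 GB SSD",
    "Smartphone with 16 GB storage and HD camera",
]
train_labels = ["Books", "Books", "Electronics & Computers", "Electronics & Computers"]

model = NaiveBayes().fit(train_texts, train_labels)
```

Calling model.predict on a new description returns the class with the highest log-probability; with only four toy examples the result is illustrative, not indicative of the accuracy reported in the slides.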
18. Classification Performance
Category                     Precision %   Recall %   # Products
Books                        86.58         87.95      233,249
Movies, Music & Games        89.81         70.63      186,832
Electronics & Computers      92.98         88.00      219,118
Home, Garden & Tools         73.81         60.78      186,495
Grocery, Health & Beauty     70.20         72.86      120,573
Toys, Kids, Baby & Pets      75.00         64.85      114,236
Clothing, Shoes & Jewelry    88.56         89.93      206,315
Sports & Outdoors            72.83         67.90      143,156
Automotive & Industrial      73.06         65.50      168,567
Average (total for #)        80.31         74.26      1,578,541
The offers originate from 9,240 e-shops
20. Product Feature Extraction
Low precision (69%) for identity resolution without product feature
extraction
Used later as a baseline for identity resolution
We developed the Free Text Preprocessor
Makes the data more structured by extracting new property-value
pairs from free-text properties
https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor
23. Silk Free Text Preprocessor by Example
<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
<http://wdc.org/resource/2> <http://schema.org/Product/description>
"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g.
Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
Free Text Preprocessor
Specification
24. Extractors: Bag-of-words
Learning
Creating a list of words for every feature in the training set
Extraction
Matching tokens against the learned lists
Pros
Good for extracting nominal and numerical (with units of measurement) attributes
Cons
Bad for extracting multi-token values
Inconclusive for values that refer to more than one feature
Example word lists
Brand: Samsung Benq Apple Cannon
Storage: 64 GB megabytes 512GB
Display: 42-inch 3.5-inches Inches 15.24cm
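A minimal sketch of the bag-of-words idea described on this slide, assuming structured training records given as plain dictionaries. The function names, the tokenization rule, and the training records are hypothetical and do not reproduce the Free Text Preprocessor's actual code.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split into tokens, keeping inner dots and hyphens (e.g. 1.5-inch)."""
    raw = re.findall(r"[\w.\-]+", text.lower())
    return [t.strip(".-") for t in raw if t.strip(".-")]

def learn_word_lists(training_records):
    """Learning: build one set of known tokens per feature from structured records."""
    word_lists = defaultdict(set)
    for record in training_records:
        for feature, value in record.items():
            word_lists[feature].update(tokenize(value))
    return word_lists

def extract(text, word_lists):
    """Extraction: assign each token to every feature whose word list contains it."""
    found = defaultdict(list)
    for token in tokenize(text):
        for feature, words in word_lists.items():
            if token in words:
                found[feature].append(token)
    return dict(found)

# Hypothetical structured training records.
training = [
    {"Brand": "Samsung", "Storage": "64GB", "Display": "42-inch"},
    {"Brand": "Apple", "Storage": "8GB", "Display": "1.5-inch"},
]
word_lists = learn_word_lists(training)
result = extract("Apple iPod nano (8GB, 6th generation, Graphite) 1.5-inch display", word_lists)
```

Because matching happens token by token, a multi-token value such as "iPod nano" cannot be recovered as a unit, and a token that appears in the lists of two features is assigned to both, which mirrors the two drawbacks listed above.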
25. Extractors: Feature-Value Pairs
Learns feature-value pairs from the structured data
Extraction
Tagging: take n-grams of up to 4 tokens and match them against the values from the training set
Parsing: take the combination of feature-value pairs that best describes an object
from the training dataset
Pros
Extracting multi-token values
Cons
Inconclusive for values that refer to more than one feature
<Model, Asus EEE 10.1 Inch>
<Processor, 1.66 GHz Intel Atom N445>
<Display, 10.1-inches>
..
<Model, Panasonic Viera>
<Display, 42-Inch>
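The tagging step can be sketched as follows. This is a simplification under stated assumptions: values are matched as exact token sequences, and the parsing step is replaced by a greedy longest-match-first rule instead of a search for the best-describing combination. The function names and the test sentence are hypothetical; the training pairs are taken from the examples on this slide.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    return [t.strip(".,()") for t in text.lower().split() if t.strip(".,()")]

def learn_pairs(training_records):
    """Map each known value (as a token tuple) to the features it was seen with."""
    known = {}
    for record in training_records:
        for feature, value in record.items():
            known.setdefault(tuple(tokenize(value)), set()).add(feature)
    return known

def tag(text, known, max_n=4):
    """Greedy tagging: at each position prefer the longest matching n-gram (n <= max_n)."""
    tokens = tokenize(text)
    pairs, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            ngram = tuple(tokens[i:i + n])
            if ngram in known:
                # A value seen with several features is reported for each of them.
                for feature in sorted(known[ngram]):
                    pairs.append((feature, " ".join(ngram)))
                i += n
                break
        else:
            i += 1  # no n-gram starting here matched; move on
    return pairs

training = [
    {"Model": "Asus EEE 10.1 Inch", "Display": "10.1-inches"},
    {"Model": "Panasonic Viera", "Display": "42-Inch"},
]
known = learn_pairs(training)
pairs = tag("New Panasonic Viera 42-Inch LED TV", known)
```

Matching whole n-grams is what allows multi-token values such as "Panasonic Viera" to be extracted as a single value.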
26. Extractors: Manual Configuration
Manually configure features and extraction methods
1. Regular expressions
E.g. Processor: \d*\.?\d+GHz
2. Dictionary search
E.g. Dictionary of brands (Samsung, Panasonic, Lenovo, Apple)
Pros
Extraction process can be fine-tuned according to the data
Good solution when no training (structured) data are available
Cons
Needs domain knowledge
Non-trivial to efficiently pick extraction methods manually
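A short sketch of a manually configured extractor that combines both methods. The regular expression follows the one on the slide, with backslashes assumed to have been lost in extraction and with optional whitespace before the unit added as a further assumption; the brand dictionary is the one listed above. The function name extract_manual is hypothetical.

```python
import re

# Slide regex with assumed backslashes; the optional \s? before the unit is an added assumption.
PROCESSOR_RE = re.compile(r"\d*\.?\d+\s?GHz", re.IGNORECASE)

# Dictionary of brands as listed on the slide.
BRANDS = ["Samsung", "Panasonic", "Lenovo", "Apple"]

def extract_manual(text):
    """Apply the regular expression and the dictionary search to one free-text value."""
    features = {}
    match = PROCESSOR_RE.search(text)
    if match:
        features["Processor"] = match.group(0)
    lowered = text.lower()
    for brand in BRANDS:
        # Word boundaries avoid matching a brand name inside a longer word.
        if re.search(r"\b" + re.escape(brand.lower()) + r"\b", lowered):
            features["Brand"] = brand
            break
    return features

example = extract_manual("Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB")
```

Each feature needs its own hand-written rule, which is why this approach depends on domain knowledge but works without any structured training data.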
27. Extraction Experiments
Dataset for extraction: 5,000 electronics products from WDC
Training dataset (structured data): 20 electronics products, Amazon dataset
28. Extraction Accuracy
Product         Brand   Model   Storage   Display   Processor   Dimension
iPod Nano       .92     .98     .86       .49       .12         .78
Galaxy SII      .72     .87     .89       .81       .40         .91
GalaxyTab 7.7   .80     .92     .89       .85       .72         .93
Ixus 120IS      1       .96     N/A       .89       N/A         .56
Vaio VPC        .99     .65     .81       .77       .73         .32
Viera 42        .95     .72     N/A       .82       N/A         .64
Sandisk         1       1       .85       N/A       N/A         .31
Extraction using the Combination configuration (bag-of-words for Brand, Storage and Display; feature-value pairs for Model and Dimension; custom regular expression for the Processor)
30. Identity Resolution
We used Silk, a tool for discovering relationships
between data items within different Linked Data
sources
Provides an expressive language for defining linkage rules
Uses genetic programming to learn linkage rules
Has shown high performance on various datasets
https://www.assembla.com/spaces/silk/wiki/Home
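To illustrate what a linkage rule does, the sketch below compares two offers on their extracted features, aggregates the per-feature similarities into a weighted average, and applies a threshold. It is a plain Python stand-in, not Silk's rule language or API; the weights, the threshold, the choice of similarity measures, and all names are hypothetical, whereas Silk would configure or learn such parameters.

```python
def normalize(value):
    """Lowercase and keep only letters and digits, so '8 GB' equals '8GB'."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Hypothetical weights and threshold.
WEIGHTS = {"Brand": 1.0, "Model": 2.0, "Storage": 1.0}
THRESHOLD = 0.8

def same_product(offer_a, offer_b):
    """Weighted average of per-feature similarities over features both offers have."""
    score, total = 0.0, 0.0
    for feature, weight in WEIGHTS.items():
        if feature in offer_a and feature in offer_b:
            if feature == "Model":
                sim = jaccard(offer_a[feature], offer_b[feature])
            else:
                sim = 1.0 if normalize(offer_a[feature]) == normalize(offer_b[feature]) else 0.0
            score += weight * sim
            total += weight
    return total > 0 and score / total >= THRESHOLD

# Hypothetical offers after feature extraction.
a = {"Brand": "Apple", "Model": "iPod nano", "Storage": "8GB"}
b = {"Brand": "apple", "Model": "iPod Nano", "Storage": "8 GB"}
c = {"Brand": "Apple", "Model": "iPhone 4", "Storage": "16GB"}
```

A rule of this kind can only compare features that exist as separate properties, which is why the preceding feature extraction step matters for identity resolution.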
31. Identity Resolution Experiments
Gold standard: 5,000 links manually annotated
2,500 positive/2,500 negative
20 electronics products Amazon dataset (reference set)
Experiment on 5 configurations
Baseline (no feature extraction step)
Bag-of-words
Feature-value pairs
Manual configuration
Combinations
35. Data Fusion
Input: clusters of products after identity resolution
Properties worth fusing/combining
AggregateRating and Review
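A minimal sketch of fusing the offers in each cluster, assuming every offer carries a cluster identifier assigned by identity resolution. Reviews are concatenated and aggregate ratings are combined as a mean weighted by rating count; this combination rule, the record layout, and the sample offers are assumptions for illustration, since the slides only name the properties worth fusing.

```python
from collections import defaultdict

def fuse(offers):
    """Group offers by cluster id and combine their reviews and aggregate ratings."""
    clusters = defaultdict(lambda: {"offers": 0, "reviews": [], "ratings": []})
    for offer in offers:
        fused = clusters[offer["cluster"]]
        fused["offers"] += 1
        fused["reviews"].extend(offer.get("reviews", []))
        if "rating" in offer:
            fused["ratings"].append((offer["rating"], offer.get("rating_count", 1)))
    result = {}
    for cluster_id, fused in clusters.items():
        votes = sum(count for _, count in fused["ratings"])
        # Mean rating weighted by the number of votes behind each aggregate rating.
        mean = sum(r * c for r, c in fused["ratings"]) / votes if votes else None
        result[cluster_id] = {
            "offers": fused["offers"],
            "reviews": fused["reviews"],
            "rating": mean,
            "rating_count": votes,
        }
    return result

# Hypothetical offers; cluster ids stand for the output of identity resolution.
offers = [
    {"cluster": "iPod Nano 8GB", "reviews": ["Great sound"], "rating": 4.0, "rating_count": 10},
    {"cluster": "iPod Nano 8GB", "reviews": ["Small screen", "Good value"], "rating": 5.0, "rating_count": 30},
    {"cluster": "iPhone 4 16GB"},
]
fused = fuse(offers)
```

Counting offers, reviews, and ratings per fused cluster gives per-product totals of the kind a fusion results table reports.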
36. Fusion Results
Product                      Offers   Reviews   Ratings
iPod Nano 8GB                829      84        0
iPhone 4 16GB                624      35        52
Sony Ericsson Xperia Mini    450      31        12
iPad 16GB                    423      40        48
Motorola XOOM 32GB           270      12        0
Samsung Galaxy SII           142      8         0
37. Conclusions
By using Microdata, thousands of websites help us to
understand their content
We have implemented a 5-step data integration pipeline
From Microdata markup to an integrated dataset
A newly introduced feature extraction step is crucial for the
precision of data integration
Identity resolution precision increases from 69% to 85%
Future work
Automatically learning regular expressions
Automatically discovering combinations of extractors