Concept matching, ANN scoring.
Apprentice: Concept matching, ANN scoring, ANN training from positive
examples.
- JCC 2011 - Curic , Chile -
o 11.11.11 37 / 40
Focused Crawling for Vertical Search Conclusion
Summary
- Vertical search is important due to the Web size and user needs.
- Focused crawling is needed for vertical search.
- Link analysis and machine learning techniques have been applied.
- Adaptive and ontology-based approaches have been proposed.
- Open challenges: efficiency, coverage, freshness, dynamic content.
- Future work: distributed architectures, deep
1 of 40
Downloaded 63 times
More Related Content
Focused Crawling for Vertical Search
1. Focused Crawling for Vertical Search
Focused Crawling for Vertical Search
Marcelo Mendoza
11.11.11
- JCC 2011 - Curic卒, Chile -
o 11.11.11 1 / 40
3. Focused Crawling for Vertical Search Vertical Search
Why Web Vertical Search Matters?
Web size: More than 20 billion pages.
Millions of users, millions of queries, millions of needs.
Advantages:
1 Greater precision due to limited scope
2 Leverage domain knowledge (ontologies)
Domains: business, medicine, science, education, ...
- JCC 2011 - Curic卒, Chile -
o 11.11.11 3 / 40
7. Focused Crawling for Vertical Search Crawling
Hyperlinks among web pages
- JCC 2011 - Curic卒, Chile -
o 11.11.11 7 / 40
8. Focused Crawling for Vertical Search Crawling
The Web as a graph
web pages
hyperlinks
- JCC 2011 - Curic卒, Chile -
o 11.11.11 8 / 40
9. Focused Crawling for Vertical Search Crawling
The Web: Some facts
The size of the Web: 11.5 billion of pages (indexable, 2005).
The deep Web: available by quering databases.
Static / dynamic pages.
Graph model: Free-scale network, degree distribution power law.
The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 9 / 40
10. Focused Crawling for Vertical Search Crawling
Crawler architecture
Online resource: C. Castillo, E鍖ective Web Crawling (PhD Thesis) URL
- JCC 2011 - Curic卒, Chile -
o 11.11.11 10 / 40
11. Focused Crawling for Vertical Search Crawling
Crawling strategies
Breadth-鍖rst crawlers: URL frontier implemented as a FIFO queue.
Preferential crawlers: URL frontier implemented as a priority queue.
Priority scores:
1 Topological properties (e.g. indegree of the target page).
2 Content properties (e.g. similarity between a query and the source
page).
3 Hybrid measures.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 11 / 40
12. Focused Crawling for Vertical Search Crawling
Universal / Focused crawling
Universal crawlers: General purpose.
Challenges:
1 Scalability
2 Coverage / Freshness
Focused crawlers: We may want to crawl pages in certain topics.
Challenges:
1 Coverage / Accuracy
- JCC 2011 - Curic卒, Chile -
o 11.11.11 12 / 40
17. Focused Crawling for Vertical Search State-of-the-art
Early algorithms: Fish search
Bra, P., and Post, R. (1994)
Query (keywords), source page terms, term-based distance, best-鍖rst
- JCC 2011 - Curic卒, Chile -
o 11.11.11 17 / 40
18. Focused Crawling for Vertical Search State-of-the-art
Early algorithms: Shark search
Hersovici et al. (1998)
Query (keywords), anchor text, term-based distance, best-鍖rst
- JCC 2011 - Curic卒, Chile -
o 11.11.11 18 / 40
19. Focused Crawling for Vertical Search State-of-the-art
Early algorithms: ARACHNID
Menczer, F. (1997)
Multi-agents, evolutionary inspired: mutation (new seeds), 鍖tness (score
acc.), term-based scores.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 19 / 40
20. Focused Crawling for Vertical Search State-of-the-art
Context: Link Analysis
The Web graph as an information source (beyond the text)
Kleinberg, J. (1998)
HITS: authoritative pages (OUT), hub pages (IN).
Brin, S. & Page, L. (1998)
PageRank: Random walk over the Web graph, stationary probability
vector.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 20 / 40
21. Focused Crawling for Vertical Search State-of-the-art
Link-based algorithms
Cho, J., Garcia-Molina, H., Page L. (1998)
Link-based scores: Backlinks count, PageRank
Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)
Topic distillation: Text-based classi鍖er over web page examples per
category (o鍖-line dataset construction, human labeling, content text
positive and negative examples). On-line phase: Anchor-based score (ML)
+ HITS-based score for distillation.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 21 / 40
22. Focused Crawling for Vertical Search State-of-the-art
Link-based algorithms: Basic assumptions
Seed
Target
Davidson, B. (2000)
Topical locality: Locality based on anchor text and links.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 22 / 40
23. Focused Crawling for Vertical Search State-of-the-art
Link-based algorithms: Basic assumptions
Menczer, F. (2004)
Link cluster conjecture: Related pages tend to be linked.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 23 / 40
24. Focused Crawling for Vertical Search State-of-the-art
Link-based algorithms: Backlink graph
Considering how far is the target: Layered backlink graph!
Diligenti et al. (2000)
Using the backlink graph for multiclass learning. Greedy approach.
Babaria et al. (2007)
Using the backling graph for ordinal regression. Greedy approach.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 24 / 40
25. Focused Crawling for Vertical Search State-of-the-art
O鍖-line learning-based algorithms
Kinds of features
The content of the web pages which are known to link to the
candidate URL.
URL tokens from the candidate URL.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 25 / 40
26. Focused Crawling for Vertical Search State-of-the-art
O鍖-line learning-based algorithms
Rennie & McCallum (1999)
1st stage (O鍖-line): Text-based features (anchor + header + title of the
target). 2nd stage (On-line): Candidate URL scoring based on the text
classi鍖er (candidate URL (anchor + URL text)).
Li et al. (2005)
1st stage (O鍖-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er
(candidate URL (anchor)).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 26 / 40
27. Focused Crawling for Vertical Search State-of-the-art
O鍖-line learning-based algorithms
Pant & Srinivasan (2006)
1st stage (O鍖-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er
(candidate URL (surrounding text)).
Feng et al. (2010)
1st stage (O鍖-line): Term-based weights. Weighted graph construction.
2nd stage (O鍖-line): PageRank over the weighted graph. 3rd stage
(O鍖-line): Labeling based on PageRank. Term-based learning. 4th stage
(On-line): Candidate URL scoring based on the text classi鍖er (candidate
URL (anchor)).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 27 / 40
28. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Learning on-the-鍖y from the context
- JCC 2011 - Curic卒, Chile -
o 11.11.11 28 / 40
29. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Learning on-the-鍖y from the context
"Ba
ch"
"Bach"
candidate URL
Aggarwal et al. (2000)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er +
feature selection based on interest ratio (candidate URL (anchor)).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 29 / 40
30. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Learning on-the-鍖y from the context
Chakrabarti et al. (2002)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (On-line):
Training from positive examples using fetched pages (more sophisticated
features such as DOM tree). 3rd stage (On-line): URL scoring based on
the apprentice learner.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 30 / 40
31. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Learning to skip o鍖-topic pages
Seed
Target
- JCC 2011 - Curic卒, Chile -
o 11.11.11 31 / 40
33. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Learning to skip o鍖-topic pages: Tunneling!
Bergmark et al. (2002)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (O鍖-line):
Tunneling module construction. Cuto鍖 threshold learning based on
nugget-dud paths. 3rd stage (On-line): Apprentice tunneling learner.
Adaptive cuto鍖 based on paths evaluated by using fetched pages.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 33 / 40
34. Focused Crawling for Vertical Search State-of-the-art
Machine Learning-based adaptive algorithms
Agents for path detection: Ants
Gasparetti & Micarelli (2004)
Close in aim to ARACHNID (multi agents, multi seeds). Back and forth
trips to relevant resources generates pheromone trails. Shortest paths
attract more ants.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 34 / 40
35. Focused Crawling for Vertical Search State-of-the-art
Ontology driven crawling strategies
Knowledge representation: Ontologies
sc : SubClassOf
dom : Domain
range : Range Camp Nou
i : InstanceOf
eq : Equivalent i
range city Barcelona
sp : SubPropertyOf
i dom
sc
sports stadiums
country coastal_city
sp sp
eq range dom i
football soccer plays_in Spain
sp
national i
teams Barcelona F.C.
- JCC 2011 - Curic卒, Chile -
o 11.11.11 35 / 40
37. Focused Crawling for Vertical Search State-of-the-art
Ontology driven crawling strategies
Ontology-based learning strategy
Zheng et al. (2008)
Relevance scoring for fetched pages. 1st stage: Concept matching
(ontology + lexicon), Concept distances, Doc. scoring. 2nd stage: ANN
training. 3rd stage (On-line): term-based URL scoring (ANN, anchor as
input).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 37 / 40
38. Focused Crawling for Vertical Search State-of-the-art
More features for unvisited URL scoring
Feng et al. (2010)
On-line PageRank + term scoring (anchor, surrounding)
Patel & Schmidt (2011)
Term scoring based on matching and document structure (structure of the
current page).
- JCC 2011 - Curic卒, Chile -
o 11.11.11 38 / 40