
Focused Crawling for Vertical Search


Marcelo Mendoza

11.11.11

- JCC 2011 - Curicó, Chile -


Overview

1 Vertical Search

2 Crawling

3 State-of-the-art

4 Conclusion


Vertical Search

Why Does Web Vertical Search Matter?

Web size: More than 20 billion pages.
Millions of users, millions of queries, millions of needs.
Advantages:
1 Greater precision due to limited scope
2 Leverage domain knowledge (ontologies)
Domains: business, medicine, science, education, ...



Science Vertical Search

scienceresearch.com


Business Vertical Search

biznar.com


Education Vertical Search

contentcompass.cl (Fondef D08I1155)

Crawling

Hyperlinks among web pages



The Web as a graph

web pages

hyperlinks



The Web: Some facts

The size of the Web: 11.5 billion pages (indexable, 2005).
The deep Web: available by querying databases.
Static / dynamic pages.
Graph model: Scale-free network, degree distribution ≈ power law.
The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).



Crawler architecture

Online resource: C. Castillo, Effective Web Crawling (PhD Thesis) URL



Crawling strategies

Breadth-first crawlers: URL frontier implemented as a FIFO queue.
Preferential crawlers: URL frontier implemented as a priority queue.
Priority scores:
1 Topological properties (e.g. indegree of the target page).
2 Content properties (e.g. similarity between a query and the source
page).
3 Hybrid measures.
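
The two frontier types above can be sketched as follows. This is a minimal illustration, not a production crawler: the class names FifoFrontier and PriorityFrontier and their add/pop interface are hypothetical, and the priority score is assumed to be computed elsewhere (topological, content-based, or hybrid).

```python
import heapq
from collections import deque
from itertools import count


class FifoFrontier:
    """Breadth-first frontier: URLs leave in the order they were added."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url, score=0.0):
        # The score is ignored; it is accepted only so both frontiers share an interface.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


class PriorityFrontier:
    """Preferential frontier: the URL with the highest score leaves first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # insertion counter, breaks score ties in FIFO order

    def add(self, url, score=0.0):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated to pop the best URL first.
            heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Both classes drop URLs that were already added once, so swapping one frontier for the other changes only the visiting order.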



Universal / Focused crawling

Universal crawlers: General purpose.
Challenges:
1 Scalability
2 Coverage / Freshness

Focused crawlers: We may want to crawl pages on certain topics.
Challenges:
1 Coverage / Accuracy



Focused Crawling
Breadth-first: depth 1

[Figure: Web graph with a Seed page and a Target page]



Focused Crawling
Breadth-first: unreachable pages, excessive computational costs!


State-of-the-art

Early algorithms: Fish search

De Bra, P., and Post, R. (1994)
Query (keywords), source page terms, term-based distance, best-first


Early algorithms: Shark search

Hersovici et al. (1998)
Query (keywords), anchor text, term-based distance, best-first
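
A term-based score between a keyword query and an anchor text can be sketched as below. This is a generic cosine similarity over raw term frequencies, assumed here for illustration; it is not the exact scoring formula of fish search or shark search, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(query, text):
    """Cosine of the angle between the term-frequency vectors of two strings."""
    q, t = term_vector(query), term_vector(text)
    # Counter returns 0 for terms missing from the text.
    dot = sum(q[term] * t[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0
```

In a best-first crawler, a score of this kind would serve as the priority of the link whose anchor text was scored.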


Early algorithms: ARACHNID

Menczer, F. (1997)
Multi-agent, evolution-inspired: mutation (new seeds), fitness (score
acc.), term-based scores.


Context: Link Analysis

The Web graph as an information source (beyond the text)

Kleinberg, J. (1998)
HITS: authoritative pages (many in-links from hubs), hub pages (many out-links to authorities).
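
A minimal sketch of the HITS iteration, assuming the graph is given as a dict from each page to the list of pages it links to. The function name, the fixed iteration count and the L2 normalization are choices made for this illustration.

```python
def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    graph: dict mapping each page to the list of pages it links to.
    """
    pages = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in graph.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            if norm:
                for p in scores:
                    scores[p] /= norm
    return hub, auth
```

On a toy graph where h1 links to a and b and h2 links to a only, a gets the highest authority score and h1 the highest hub score.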

Brin, S. & Page, L. (1998)
PageRank: Random walk over the Web graph, stationary probability
vector.
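
The stationary vector of the random walk can be approximated by power iteration, sketched below. The damping factor of 0.85 and the uniform redistribution of rank from pages without out-links are assumptions of this sketch, not details given on the slide.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    graph: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(graph) | {dst for targets in graph.values() for dst in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            targets = graph.get(src, [])
            for dst in targets:
                # Each page splits its rank evenly among its out-links.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank
```

The scores sum to 1 after every iteration, so they can be read as the probability that the random surfer is on each page.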



Link-based algorithms

Cho, J., Garcia-Molina, H., Page, L. (1998)
Link-based scores: Backlinks count, PageRank

Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)
Topic distillation: Text-based classifier over web page examples per
category (off-line dataset construction, human labeling, content text
positive and negative examples). On-line phase: Anchor-based score (ML)
+ HITS-based score for distillation.



Link-based algorithms: Basic assumptions

[Figure: Seed and Target pages]

Davison, B. (2000)
Topical locality: Locality based on anchor text and links.


Link-based algorithms: Basic assumptions

Menczer, F. (2004)
Link cluster conjecture: Related pages tend to be linked.


Link-based algorithms: Backlink graph
Considering how far away the target is: Layered backlink graph!

Diligenti et al. (2000)
Using the backlink graph for multiclass learning. Greedy approach.

Babaria et al. (2007)
Using the backlink graph for ordinal regression. Greedy approach.
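
One way to picture the layered backlink graph is a breadth-first search over reversed links, starting from the target pages. The sketch below assumes an in-memory link graph; the function name and the max_layer parameter are hypothetical. It shows only the layer assignment, not the learning step used in the cited papers.

```python
from collections import deque


def backlink_layers(graph, targets, max_layer=3):
    """Assign each page the length of its shortest link path to any target.

    graph: dict mapping each page to the list of pages it links to.
    Layer 0 holds the targets, layer 1 the pages linking to them, and so on.
    """
    # Reverse the edges so that backlinks can be followed from the targets.
    backlinks = {}
    for src, dsts in graph.items():
        for dst in dsts:
            backlinks.setdefault(dst, []).append(src)
    layer = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        if layer[page] == max_layer:
            continue  # do not expand beyond the deepest layer of interest
        for parent in backlinks.get(page, []):
            if parent not in layer:
                layer[parent] = layer[page] + 1
                queue.append(parent)
    return layer
```

The resulting layer numbers could then serve as class labels (multiclass learning) or as ordered labels (ordinal regression).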


Off-line learning-based algorithms
Kinds of features:
The content of the web pages which are known to link to the
candidate URL.
URL tokens from the candidate URL.
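
The URL-token features can be sketched as follows. The tokenization rule (lower-casing and splitting on every non-alphanumeric character) is an assumption of this example, and the URL in the usage note is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split the host, path and query of a URL into lower-case alphanumeric tokens."""
    parts = urlsplit(url)
    text = f"{parts.netloc} {parts.path} {parts.query}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]
```

For instance, url_tokens("http://www.example.org/Music/bach-cantatas.html?page=2") yields the tokens www, example, org, music, bach, cantatas, html, page and 2, which a text classifier can treat like ordinary terms.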




Rennie & McCallum (1999)
1st stage (Off-line): Text-based features (anchor + header + title of the
target). 2nd stage (On-line): Candidate URL scoring based on the text
classifier (candidate URL (anchor + URL text)).

Li et al. (2005)
1st stage (Off-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier
(candidate URL (anchor)).




Pant & Srinivasan (2006)
1st stage (Off-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier
(candidate URL (surrounding text)).

Feng et al. (2010)
1st stage (Off-line): Term-based weights. Weighted graph construction.
2nd stage (Off-line): PageRank over the weighted graph. 3rd stage
(Off-line): Labeling based on PageRank. Term-based learning. 4th stage
(On-line): Candidate URL scoring based on the text classifier (candidate
URL (anchor)).



Machine Learning-based adaptive algorithms

Learning on-the-fly from the context



[Figure: candidate URL with the anchor text "Bach"]

Aggarwal et al. (2000)
1st stage (Off-line): Crawling for dataset construction. Human labeling
(positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier +
feature selection based on interest ratio (candidate URL (anchor)).



Chakrabarti et al. (2002)
(positive examples). Content text-based features. 2nd stage (On-line):
Training from positive examples using fetched pages (more sophisticated
features such as DOM tree). 3rd stage (On-line): URL scoring based on
the apprentice learner.


Learning to skip off-topic pages

[Figure: link graph from Seed to Target with a relevance score between 0.1 and 0.8 on each page; one page is labeled Dud]



Learning to skip off-topic pages: Tunneling!

Bergmark et al. (2002)
(positive examples). Content text-based features. 2nd stage (Off-line):
Tunneling module construction. Cutoff threshold learning based on
nugget-dud paths. 3rd stage (On-line): Apprentice tunneling learner.
Adaptive cutoff based on paths evaluated by using fetched pages.
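
The idea of tunneling through off-topic pages can be sketched as a best-first crawl that gives up on a path only after too many consecutive duds. This is a simplified illustration, not the method of the cited paper: the cutoff is fixed rather than learned or adaptive, the graph and relevance scores are held in memory, and every name and parameter below is hypothetical. A real crawler would also know a page's score only after fetching it, so the frontier priority would be an estimate.

```python
import heapq


def crawl_with_tunneling(graph, scores, seed, threshold=0.5, cutoff=2):
    """Best-first crawl that tolerates up to `cutoff` consecutive off-topic pages.

    graph: dict page -> list of linked pages.
    scores: dict page -> relevance in [0, 1].
    Returns the on-topic pages (nuggets) in the order they were fetched.
    """
    # Frontier entries: (negated score, page, consecutive duds on the path so far).
    frontier = [(-scores[seed], seed, 0)]
    visited = {seed}
    nuggets = []
    while frontier:
        _, page, duds = heapq.heappop(frontier)
        if scores[page] >= threshold:
            nuggets.append(page)
            duds = 0  # an on-topic page resets the tunnel length
        else:
            duds += 1
            if duds > cutoff:
                continue  # the path stayed off-topic for too long, stop expanding it
        for child in graph.get(page, []):
            if child not in visited:
                visited.add(child)
                heapq.heappush(frontier, (-scores[child], child, duds))
    return nuggets
```

With a chain seed, d1, d2, target in which d1 and d2 are duds, a cutoff of 2 reaches the target while a cutoff of 1 stops at d2. Pages are marked visited when first enqueued, so a page first reached through a long dud path is not revisited through a shorter one; this is a simplification of the sketch.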




Agents for path detection: Ants

Gasparetti & Micarelli (2004)
Close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth
trips to relevant resources generate pheromone trails. Shortest paths
attract more ants.



Ontology-driven crawling strategies

Knowledge representation: Ontologies

[Figure: example ontology graph]
Edge labels: sc = SubClassOf, dom = Domain, range = Range, i = InstanceOf, eq = Equivalent, sp = SubPropertyOf
Nodes: Camp Nou, city, Barcelona, sports stadiums, country, coastal_city, football, soccer, plays_in, Spain, national teams, Barcelona F.C.




Ontology-based match expansion

Ehrig & Maedche (2003)
Relevance scoring. 1st stage: Concept matching (ontology + lexicon). 2nd
stage: Ontology-based expansion. 3rd stage: Summarization.



Ontology-based learning strategy

Zheng et al. (2008)
Relevance scoring for fetched pages. 1st stage: Concept matching
(ontology + lexicon), Concept distances, Doc. scoring. 2nd stage: ANN
training. 3rd stage (On-line): term-based URL scoring (ANN, anchor as
input).


More features for unvisited URL scoring

Feng et al. (2010)
On-line PageRank + term scoring (anchor, surrounding)

Patel & Schmidt (2011)
Term scoring based on matching and document structure (structure of the
current page).


Conclusion

Challenges

Precision / Recall trade-off
Benchmarking
Ontology IE for effective crawling
Unbiased seed identification
Efficiency issues (scalability,...)
