際際滷

際際滷Share a Scribd company logo
Focused Crawling for Vertical Search




Focused Crawling for Vertical Search

                       Marcelo Mendoza


                              11.11.11




                        - JCC 2011 - Curic卒, Chile -
                                          o            11.11.11   1 / 40
Focused Crawling for Vertical Search


Overview



1   Vertical Search


2   Crawling


3   State-of-the-art


4   Conclusion




                                         - JCC 2011 - Curic卒, Chile -
                                                           o            11.11.11   2 / 40
Focused Crawling for Vertical Search   Vertical Search


Why Web Vertical Search Matters?




   Web size: More than 20 billion pages.
   Millions of users, millions of queries, millions of needs.
   Advantages:
     1   Greater precision due to limited scope
     2   Leverage domain knowledge (ontologies)
   Domains: business, medicine, science, education, ...




                                      - JCC 2011 - Curic卒, Chile -
                                                        o              11.11.11   3 / 40
Focused Crawling for Vertical Search   Vertical Search


Science Vertical Search




scienceresearch.com
                                      - JCC 2011 - Curic卒, Chile -
                                                        o              11.11.11   4 / 40
Focused Crawling for Vertical Search   Vertical Search


Business Vertical Search




biznar.com
                                     - JCC 2011 - Curic卒, Chile -
                                                       o              11.11.11   5 / 40
Focused Crawling for Vertical Search   Vertical Search


Education Vertical Search




contentcompass.cl1
  1
      Fondef D08I1155
                                         - JCC 2011 - Curic卒, Chile -
                                                           o              11.11.11   6 / 40
Focused Crawling for Vertical Search   Crawling


Hyperlinks among web pages




                                  - JCC 2011 - Curic卒, Chile -
                                                    o            11.11.11   7 / 40
Focused Crawling for Vertical Search   Crawling


The Web as a graph

           web pages

           hyperlinks




                                  - JCC 2011 - Curic卒, Chile -
                                                    o            11.11.11   8 / 40
Focused Crawling for Vertical Search   Crawling


The Web: Some facts

   The size of the Web: 11.5 billion of pages (indexable, 2005).
   The deep Web: available by quering databases.
   Static / dynamic pages.
   Graph model: Free-scale network, degree distribution  power law.
   The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).




                                     - JCC 2011 - Curic卒, Chile -
                                                       o            11.11.11   9 / 40
Focused Crawling for Vertical Search   Crawling


Crawler architecture




Online resource: C. Castillo, E鍖ective Web Crawling (PhD Thesis) URL

                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   10 / 40
Focused Crawling for Vertical Search   Crawling


Crawling strategies




    Breadth-鍖rst crawlers: URL frontier implemented as a FIFO queue.
    Preferential crawlers: URL frontier implemented as a priority queue.
    Priority scores:
      1   Topological properties (e.g. indegree of the target page).
      2   Content properties (e.g. similarity between a query and the source
          page).
      3   Hybrid measures.




                                       - JCC 2011 - Curic卒, Chile -
                                                         o            11.11.11   11 / 40
Focused Crawling for Vertical Search   Crawling


Universal / Focused crawling



    Universal crawlers: General purpose.
    Challenges:
      1   Scalability
      2   Coverage / Freshness


    Focused crawlers: We may want to crawl pages in certain topics.
    Challenges:
      1   Coverage / Accuracy




                                       - JCC 2011 - Curic卒, Chile -
                                                         o            11.11.11   12 / 40
Focused Crawling for Vertical Search   Crawling


Focused Crawling
Breadth-鍖rst: depth 1



                 Seed
                 Target




                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   13 / 40
Focused Crawling for Vertical Search   Crawling


Focused Crawling
Breadth-鍖rst: depth 2



                 Seed
                 Target




                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   14 / 40
Focused Crawling for Vertical Search   Crawling


Focused Crawling
Breadth-鍖rst: depth 3



                 Seed
                 Target




                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   15 / 40
Focused Crawling for Vertical Search   Crawling


Focused Crawling
Breadth-鍖rst: unreacheble pages, excessive computational costs!



                 Seed
                 Target




                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   16 / 40
Focused Crawling for Vertical Search   State-of-the-art


Early algorithms: Fish search




Bra, P., and Post, R. (1994)
Query (keywords), source page terms, term-based distance, best-鍖rst
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   17 / 40
Focused Crawling for Vertical Search   State-of-the-art


Early algorithms: Shark search




Hersovici et al. (1998)
Query (keywords), anchor text, term-based distance, best-鍖rst
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   18 / 40
Focused Crawling for Vertical Search   State-of-the-art


Early algorithms: ARACHNID




Menczer, F. (1997)
Multi-agents, evolutionary inspired: mutation (new seeds), 鍖tness (score
acc.), term-based scores.
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   19 / 40
Focused Crawling for Vertical Search   State-of-the-art


Context: Link Analysis



The Web graph as an information source (beyond the text)

Kleinberg, J. (1998)
HITS: authoritative pages (OUT), hub pages (IN).

Brin, S. & Page, L. (1998)
PageRank: Random walk over the Web graph, stationary probability
vector.




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   20 / 40
Focused Crawling for Vertical Search   State-of-the-art


Link-based algorithms



Cho, J., Garcia-Molina, H., Page L. (1998)
Link-based scores: Backlinks count, PageRank

Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)
Topic distillation: Text-based classi鍖er over web page examples per
category (o鍖-line dataset construction, human labeling, content text
positive and negative examples). On-line phase: Anchor-based score (ML)
+ HITS-based score for distillation.




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   21 / 40
Focused Crawling for Vertical Search   State-of-the-art


Link-based algorithms: Basic assumptions


                        Seed
                        Target




Davidson, B. (2000)
Topical locality: Locality based on anchor text and links.
                                       - JCC 2011 - Curic卒, Chile -
                                                         o               11.11.11   22 / 40
Focused Crawling for Vertical Search   State-of-the-art


Link-based algorithms: Basic assumptions




Menczer, F. (2004)
Link cluster conjecture: Related pages tend to be linked.
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   23 / 40
Focused Crawling for Vertical Search   State-of-the-art


Link-based algorithms: Backlink graph
Considering how far is the target: Layered backlink graph!




Diligenti et al. (2000)
Using the backlink graph for multiclass learning. Greedy approach.

Babaria et al. (2007)
Using the backling graph for ordinal regression. Greedy approach.
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   24 / 40
Focused Crawling for Vertical Search   State-of-the-art


O鍖-line learning-based algorithms
Kinds of features
    The content of the web pages which are known to link to the
    candidate URL.
    URL tokens from the candidate URL.




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   25 / 40
Focused Crawling for Vertical Search   State-of-the-art


O鍖-line learning-based algorithms



Rennie & McCallum (1999)
1st stage (O鍖-line): Text-based features (anchor + header + title of the
target). 2nd stage (On-line): Candidate URL scoring based on the text
classi鍖er (candidate URL (anchor + URL text)).

Li et al. (2005)
1st stage (O鍖-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er
(candidate URL (anchor)).




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   26 / 40
Focused Crawling for Vertical Search   State-of-the-art


O鍖-line learning-based algorithms


Pant & Srinivasan (2006)
1st stage (O鍖-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er
(candidate URL (surrounding text)).

Feng et al. (2010)
1st stage (O鍖-line): Term-based weights. Weighted graph construction.
2nd stage (O鍖-line): PageRank over the weighted graph. 3rd stage
(O鍖-line): Labeling based on PageRank. Term-based learning. 4th stage
(On-line): Candidate URL scoring based on the text classi鍖er (candidate
URL (anchor)).



                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   27 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms

Learning on-the-鍖y from the context




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   28 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms
Learning on-the-鍖y from the context
             "Ba
                ch"



    "Bach"
                         candidate URL




Aggarwal et al. (2000)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classi鍖er +
feature selection based on interest ratio (candidate URL (anchor)).
                                         - JCC 2011 - Curic卒, Chile -
                                                           o               11.11.11   29 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms
Learning on-the-鍖y from the context




Chakrabarti et al. (2002)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (On-line):
Training from positive examples using fetched pages (more sophisticated
features such as DOM tree). 3rd stage (On-line): URL scoring based on
the apprentice learner.
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   30 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms
Learning to skip o鍖-topic pages



                Seed
                Target




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   31 / 40
Focused Crawling for Vertical Search              State-of-the-art


Machine Learning-based adaptive algorithms
Learning to skip o鍖-topic pages

                                                                                                111
                                                                                                000
                                                                                                111
                                                                                                000
                                                                                                111
                                                                                                000   Dud
                                                                                                111
                                                                                                000

                Seed
                                                         0.8     0.7 0.25 0.1
                Target                                                             0.2

                                                         0.7         0.6
                                                                 111
                                                                 000 0.45
                                        0.8                      111
                                                                 000
                                                                 111
                                                                 000                     0.7
                      0.7            111
                                     000
                                                         0.7               0.7
                              0.5    111
                                     000          111
                                                  000
                                     111
                                     000          111
                                                  000
                                             0.75 000
                                                  111          0.5          0.75
                        0.5                                                                     0.5
                               0.4    0.2         0.15
                                                                                   0.8
                                                                                          0.7
                               0.5




                                            - JCC 2011 - Curic卒, Chile -
                                                              o                                             11.11.11   32 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms
Learning to skip o鍖-topic pages: Tunneling!




Bergmark et al. (2002)
1st stage (O鍖-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (O鍖-line):
Tunneling module construction. Cuto鍖 threshold learning based on
nugget-dud paths. 3rd stage (On-line): Apprentice tunneling learner.
Adaptive cuto鍖 based on paths evaluated by using fetched pages.

                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   33 / 40
Focused Crawling for Vertical Search   State-of-the-art


Machine Learning-based adaptive algorithms

Agents for path detection: Ants




Gasparetti & Micarelli (2004)
Close in aim to ARACHNID (multi agents, multi seeds). Back and forth
trips to relevant resources generates pheromone trails. Shortest paths
attract more ants.

                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   34 / 40
Focused Crawling for Vertical Search            State-of-the-art


Ontology driven crawling strategies


Knowledge representation: Ontologies
      sc      :   SubClassOf
      dom     :   Domain
      range   :   Range                Camp Nou
      i       :   InstanceOf
      eq      :   Equivalent                                                                                i
                                               range                       city                                 Barcelona
      sp      :   SubPropertyOf
                                         i               dom
                                                                                                    sc
                             sports          stadiums
                                                                       country               coastal_city
                                  sp     sp
                      eq                                           range          dom                       i
       football              soccer                plays_in                                                       Spain

                                         sp

                                              national         i
                                               teams                        Barcelona F.C.




                                                   - JCC 2011 - Curic卒, Chile -
                                                                     o                                             11.11.11   35 / 40
Focused Crawling for Vertical Search   State-of-the-art


Ontology driven crawling strategies

Ontology-based match expansion




Ehrig & Maedge (2003)
Relevance scoring. 1st stage: Concept matching (ontology + lexicon). 2nd
stage: Ontology-based expansion. 3rd stage: Summarization.



                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   36 / 40
Focused Crawling for Vertical Search   State-of-the-art


Ontology driven crawling strategies
Ontology-based learning strategy




Zheng et al. (2008)
Relevance scoring for fetched pages. 1st stage: Concept matching
(ontology + lexicon), Concept distances, Doc. scoring. 2nd stage: ANN
training. 3rd stage (On-line): term-based URL scoring (ANN, anchor as
input).
                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   37 / 40
Focused Crawling for Vertical Search   State-of-the-art


More features for unvisited URL scoring




Feng et al. (2010)
On-line PageRank + term scoring (anchor, surrounding)

Patel & Schmidt (2011)
Term scoring based on matching and document structure (structure of the
current page).




                                      - JCC 2011 - Curic卒, Chile -
                                                        o               11.11.11   38 / 40
Focused Crawling for Vertical Search   Conclusion


Challenges




   Precision / Recall trade o鍖
   Benchmarking
   Ontology IE for e鍖ective crawling
   Unbiased seed identi鍖cation
   E鍖ciency issues (scalability,...)




                                     - JCC 2011 - Curic卒, Chile -
                                                       o            11.11.11   39 / 40
Focused Crawling for Vertical Search   Conclusion


References




References here




                                      - JCC 2011 - Curic卒, Chile -
                                                        o            11.11.11   40 / 40

More Related Content

Focused Crawling for Vertical Search

  • 1. Focused Crawling for Vertical Search Focused Crawling for Vertical Search Marcelo Mendoza 11.11.11 - JCC 2011 - Curic卒, Chile - o 11.11.11 1 / 40
  • 2. Focused Crawling for Vertical Search Overview 1 Vertical Search 2 Crawling 3 State-of-the-art 4 Conclusion - JCC 2011 - Curic卒, Chile - o 11.11.11 2 / 40
  • 3. Focused Crawling for Vertical Search Vertical Search Why Web Vertical Search Matters? Web size: More than 20 billion pages. Millions of users, millions of queries, millions of needs. Advantages: 1 Greater precision due to limited scope 2 Leverage domain knowledge (ontologies) Domains: business, medicine, science, education, ... - JCC 2011 - Curic卒, Chile - o 11.11.11 3 / 40
  • 4. Focused Crawling for Vertical Search Vertical Search Science Vertical Search scienceresearch.com - JCC 2011 - Curic卒, Chile - o 11.11.11 4 / 40
  • 5. Focused Crawling for Vertical Search Vertical Search Business Vertical Search biznar.com - JCC 2011 - Curic卒, Chile - o 11.11.11 5 / 40
  • 6. Focused Crawling for Vertical Search Vertical Search Education Vertical Search contentcompass.cl1 1 Fondef D08I1155 - JCC 2011 - Curic卒, Chile - o 11.11.11 6 / 40
  • 7. Focused Crawling for Vertical Search Crawling Hyperlinks among web pages - JCC 2011 - Curic卒, Chile - o 11.11.11 7 / 40
  • 8. Focused Crawling for Vertical Search Crawling The Web as a graph web pages hyperlinks - JCC 2011 - Curic卒, Chile - o 11.11.11 8 / 40
  • 9. Focused Crawling for Vertical Search Crawling The Web: Some facts The size of the Web: 11.5 billion of pages (indexable, 2005). The deep Web: available by quering databases. Static / dynamic pages. Graph model: Free-scale network, degree distribution power law. The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS). - JCC 2011 - Curic卒, Chile - o 11.11.11 9 / 40
  • 10. Focused Crawling for Vertical Search Crawling Crawler architecture Online resource: C. Castillo, E鍖ective Web Crawling (PhD Thesis) URL - JCC 2011 - Curic卒, Chile - o 11.11.11 10 / 40
  • 11. Focused Crawling for Vertical Search Crawling Crawling strategies Breadth-鍖rst crawlers: URL frontier implemented as a FIFO queue. Preferential crawlers: URL frontier implemented as a priority queue. Priority scores: 1 Topological properties (e.g. indegree of the target page). 2 Content properties (e.g. similarity between a query and the source page). 3 Hybrid measures. - JCC 2011 - Curic卒, Chile - o 11.11.11 11 / 40
  • 12. Focused Crawling for Vertical Search Crawling Universal / Focused crawling Universal crawlers: General purpose. Challenges: 1 Scalability 2 Coverage / Freshness Focused crawlers: We may want to crawl pages in certain topics. Challenges: 1 Coverage / Accuracy - JCC 2011 - Curic卒, Chile - o 11.11.11 12 / 40
  • 13. Focused Crawling for Vertical Search Crawling Focused Crawling Breadth-鍖rst: depth 1 Seed Target - JCC 2011 - Curic卒, Chile - o 11.11.11 13 / 40
  • 14. Focused Crawling for Vertical Search Crawling Focused Crawling Breadth-鍖rst: depth 2 Seed Target - JCC 2011 - Curic卒, Chile - o 11.11.11 14 / 40
  • 15. Focused Crawling for Vertical Search Crawling Focused Crawling Breadth-鍖rst: depth 3 Seed Target - JCC 2011 - Curic卒, Chile - o 11.11.11 15 / 40
  • 16. Focused Crawling for Vertical Search Crawling Focused Crawling Breadth-鍖rst: unreacheble pages, excessive computational costs! Seed Target - JCC 2011 - Curic卒, Chile - o 11.11.11 16 / 40
  • 17. Focused Crawling for Vertical Search State-of-the-art Early algorithms: Fish search Bra, P., and Post, R. (1994) Query (keywords), source page terms, term-based distance, best-鍖rst - JCC 2011 - Curic卒, Chile - o 11.11.11 17 / 40
  • 18. Focused Crawling for Vertical Search State-of-the-art Early algorithms: Shark search Hersovici et al. (1998) Query (keywords), anchor text, term-based distance, best-鍖rst - JCC 2011 - Curic卒, Chile - o 11.11.11 18 / 40
  • 19. Focused Crawling for Vertical Search State-of-the-art Early algorithms: ARACHNID Menczer, F. (1997) Multi-agents, evolutionary inspired: mutation (new seeds), 鍖tness (score acc.), term-based scores. - JCC 2011 - Curic卒, Chile - o 11.11.11 19 / 40
  • 20. Focused Crawling for Vertical Search State-of-the-art Context: Link Analysis The Web graph as an information source (beyond the text) Kleinberg, J. (1998) HITS: authoritative pages (OUT), hub pages (IN). Brin, S. & Page, L. (1998) PageRank: Random walk over the Web graph, stationary probability vector. - JCC 2011 - Curic卒, Chile - o 11.11.11 20 / 40
  • 21. Focused Crawling for Vertical Search State-of-the-art Link-based algorithms Cho, J., Garcia-Molina, H., Page L. (1998) Link-based scores: Backlinks count, PageRank Chakrabarti, S., Van den Berg, M., and Dom, B. (1999) Topic distillation: Text-based classi鍖er over web page examples per category (o鍖-line dataset construction, human labeling, content text positive and negative examples). On-line phase: Anchor-based score (ML) + HITS-based score for distillation. - JCC 2011 - Curic卒, Chile - o 11.11.11 21 / 40
  • 22. Focused Crawling for Vertical Search State-of-the-art Link-based algorithms: Basic assumptions Seed Target Davidson, B. (2000) Topical locality: Locality based on anchor text and links. - JCC 2011 - Curic卒, Chile - o 11.11.11 22 / 40
  • 23. Focused Crawling for Vertical Search State-of-the-art Link-based algorithms: Basic assumptions Menczer, F. (2004) Link cluster conjecture: Related pages tend to be linked. - JCC 2011 - Curic卒, Chile - o 11.11.11 23 / 40
  • 24. Focused Crawling for Vertical Search State-of-the-art Link-based algorithms: Backlink graph Considering how far is the target: Layered backlink graph! Diligenti et al. (2000) Using the backlink graph for multiclass learning. Greedy approach. Babaria et al. (2007) Using the backling graph for ordinal regression. Greedy approach. - JCC 2011 - Curic卒, Chile - o 11.11.11 24 / 40
  • 25. Focused Crawling for Vertical Search State-of-the-art O鍖-line learning-based algorithms Kinds of features The content of the web pages which are known to link to the candidate URL. URL tokens from the candidate URL. - JCC 2011 - Curic卒, Chile - o 11.11.11 25 / 40
  • 26. Focused Crawling for Vertical Search State-of-the-art O鍖-line learning-based algorithms Rennie & McCallum (1999) 1st stage (O鍖-line): Text-based features (anchor + header + title of the target). 2nd stage (On-line): Candidate URL scoring based on the text classi鍖er (candidate URL (anchor + URL text)). Li et al. (2005) 1st stage (O鍖-line): ID3 learning strategy. Anchor text-based features. 2nd stage (On-line): Candidate URL scoring based on the text classi鍖er (candidate URL (anchor)). - JCC 2011 - Curic卒, Chile - o 11.11.11 26 / 40
  • 27. Focused Crawling for Vertical Search State-of-the-art O鍖-line learning-based algorithms Pant & Srinivasan (2006) 1st stage (O鍖-line): SVM learning strategy. Content text-based features. 2nd stage (On-line): Candidate URL scoring based on the text classi鍖er (candidate URL (surrounding text)). Feng et al. (2010) 1st stage (O鍖-line): Term-based weights. Weighted graph construction. 2nd stage (O鍖-line): PageRank over the weighted graph. 3rd stage (O鍖-line): Labeling based on PageRank. Term-based learning. 4th stage (On-line): Candidate URL scoring based on the text classi鍖er (candidate URL (anchor)). - JCC 2011 - Curic卒, Chile - o 11.11.11 27 / 40
  • 28. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning on-the-鍖y from the context - JCC 2011 - Curic卒, Chile - o 11.11.11 28 / 40
  • 29. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning on-the-鍖y from the context "Ba ch" "Bach" candidate URL Aggarwal et al. (2000) 1st stage (O鍖-line): Crawling for dataset construction. Human labeling (positive examples). Bayes learning strategy. Content text-based features. 2nd stage (On-line): Candidate URL scoring based on the text classi鍖er + feature selection based on interest ratio (candidate URL (anchor)). - JCC 2011 - Curic卒, Chile - o 11.11.11 29 / 40
  • 30. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning on-the-鍖y from the context Chakrabarti et al. (2002) 1st stage (O鍖-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features. 2nd stage (On-line): Training from positive examples using fetched pages (more sophisticated features such as DOM tree). 3rd stage (On-line): URL scoring based on the apprentice learner. - JCC 2011 - Curic卒, Chile - o 11.11.11 30 / 40
  • 31. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning to skip o鍖-topic pages Seed Target - JCC 2011 - Curic卒, Chile - o 11.11.11 31 / 40
  • 32. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning to skip o鍖-topic pages 111 000 111 000 111 000 Dud 111 000 Seed 0.8 0.7 0.25 0.1 Target 0.2 0.7 0.6 111 000 0.45 0.8 111 000 111 000 0.7 0.7 111 000 0.7 0.7 0.5 111 000 111 000 111 000 111 000 0.75 000 111 0.5 0.75 0.5 0.5 0.4 0.2 0.15 0.8 0.7 0.5 - JCC 2011 - Curic卒, Chile - o 11.11.11 32 / 40
  • 33. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Learning to skip o鍖-topic pages: Tunneling! Bergmark et al. (2002) 1st stage (O鍖-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features. 2nd stage (O鍖-line): Tunneling module construction. Cuto鍖 threshold learning based on nugget-dud paths. 3rd stage (On-line): Apprentice tunneling learner. Adaptive cuto鍖 based on paths evaluated by using fetched pages. - JCC 2011 - Curic卒, Chile - o 11.11.11 33 / 40
  • 34. Focused Crawling for Vertical Search State-of-the-art Machine Learning-based adaptive algorithms Agents for path detection: Ants Gasparetti & Micarelli (2004) Close in aim to ARACHNID (multi agents, multi seeds). Back and forth trips to relevant resources generates pheromone trails. Shortest paths attract more ants. - JCC 2011 - Curic卒, Chile - o 11.11.11 34 / 40
  • 35. Focused Crawling for Vertical Search State-of-the-art Ontology driven crawling strategies Knowledge representation: Ontologies sc : SubClassOf dom : Domain range : Range Camp Nou i : InstanceOf eq : Equivalent i range city Barcelona sp : SubPropertyOf i dom sc sports stadiums country coastal_city sp sp eq range dom i football soccer plays_in Spain sp national i teams Barcelona F.C. - JCC 2011 - Curic卒, Chile - o 11.11.11 35 / 40
  • 36. Focused Crawling for Vertical Search State-of-the-art Ontology driven crawling strategies Ontology-based match expansion Ehrig & Maedge (2003) Relevance scoring. 1st stage: Concept matching (ontology + lexicon). 2nd stage: Ontology-based expansion. 3rd stage: Summarization. - JCC 2011 - Curic卒, Chile - o 11.11.11 36 / 40
  • 37. Focused Crawling for Vertical Search State-of-the-art Ontology driven crawling strategies Ontology-based learning strategy Zheng et al. (2008) Relevance scoring for fetched pages. 1st stage: Concept matching (ontology + lexicon), Concept distances, Doc. scoring. 2nd stage: ANN training. 3rd stage (On-line): term-based URL scoring (ANN, anchor as input). - JCC 2011 - Curic卒, Chile - o 11.11.11 37 / 40
  • 38. Focused Crawling for Vertical Search State-of-the-art More features for unvisited URL scoring Feng et al. (2010) On-line PageRank + term scoring (anchor, surrounding) Patel & Schmidt (2011) Term scoring based on matching and document structure (structure of the current page). - JCC 2011 - Curic卒, Chile - o 11.11.11 38 / 40
  • 39. Focused Crawling for Vertical Search Conclusion Challenges Precision / Recall trade o鍖 Benchmarking Ontology IE for e鍖ective crawling Unbiased seed identi鍖cation E鍖ciency issues (scalability,...) - JCC 2011 - Curic卒, Chile - o 11.11.11 39 / 40
  • 40. Focused Crawling for Vertical Search Conclusion References References here - JCC 2011 - Curic卒, Chile - o 11.11.11 40 / 40