際際滷

際際滷Share a Scribd company logo
The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li
Contents Motivation Related work Page Rank & Random Surfer Model Implementation Application Conclusion
Motivation Web: heterogeneous and unstructured Free of quality control on the web Commercial interest to manipulate ranking
Related Work Academic citation analysis Link-based analysis Clustering methods of link structure Hubs & Authorities Model
Backlink Link Structure of the Web Approximation of importance / quality
PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
Escape Term Solution: Rank Source c is maximized and  = 1 E(u) is some vector over the web pages  uniform, favorite page etc.
Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of  because c is maximized
Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks      - compute normalizing factor   - add escape term   - control parameter while  - stop when converged
Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
Implementation Computing resources   24 million pages   75 million URLs Memory and disk storage Weight Vector  (4 byte float)   Matrix  A  (linear access)
Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
Convergence Properties Graph (V, E) is an  expander  with factor    if for all (not too large) subsets S: |As|     |s| Eigenvalue separation:  Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk  converges fast to a limiting probability distribution on a set of nodes in the graph.
Convergence Properties (con't) PageRank computation is O(log(|V|)) due to rapidly  mixing graph  G of the web.
Personalized PageRank Rank Source E can be initialized :   uniformly over all pages:  e.g. copyright  warnings, disclaimers, mailing lists archives    result in overly high ranking   total weight on a single page,  e.g .  Netscape, McCarthy    great variation of ranks under different single pages  as rank source   and everything in-between,  e.g. server root pages    allow manipulation by commercial interests
Applications I Estimate web traffic   Server/page aliases   Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor   Citation counts have been used to predict future citations    very difficult to map the citation structure of the web completely   avoid the local maxima that citation counts get stuck in and get better performance
Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
Issues Users are no random walkers   Content based methods Starting point distribution     Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam   PageRank favors pages that managed to get other pages to link to  them   Linkage not necessarily a sign of relevancy, only of promotion  (advertisement)
Evaluation I
Evaluation II
Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications

More Related Content

The Pagerank

  • 1. The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li
  • 2. Contents Motivation Related work Page Rank & Random Surfer Model Implementation Application Conclusion
  • 3. Motivation Web: heterogeneous and unstructured Free of quality control on the web Commercial interest to manipulate ranking
  • 4. Related Work Academic citation analysis Link-based analysis Clustering methods of link structure Hubs & Authorities Model
  • 5. Backlink Link Structure of the Web Approximation of importance / quality
  • 6. PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
  • 7. Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
  • 8. Escape Term Solution: Rank Source c is maximized and = 1 E(u) is some vector over the web pages uniform, favorite page etc.
  • 9. Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of because c is maximized
  • 10. Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks - compute normalizing factor - add escape term - control parameter while - stop when converged
  • 11. Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
  • 12. Implementation Computing resources 24 million pages 75 million URLs Memory and disk storage Weight Vector (4 byte float) Matrix A (linear access)
  • 13. Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
  • 14. Convergence Properties Graph (V, E) is an expander with factor if for all (not too large) subsets S: |As| |s| Eigenvalue separation: Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
  • 15. Convergence Properties (con't) PageRank computation is O(log(|V|)) due to rapidly mixing graph G of the web.
  • 16. Personalized PageRank Rank Source E can be initialized : uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives result in overly high ranking total weight on a single page, e.g . Netscape, McCarthy great variation of ranks under different single pages as rank source and everything in-between, e.g. server root pages allow manipulation by commercial interests
  • 17. Applications I Estimate web traffic Server/page aliases Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor Citation counts have been used to predict future citations very difficult to map the citation structure of the web completely avoid the local maxima that citation counts get stuck in and get better performance
  • 18. Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
  • 19. Issues Users are no random walkers Content based methods Starting point distribution Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam PageRank favors pages that managed to get other pages to link to them Linkage not necessarily a sign of relevancy, only of promotion (advertisement)
  • 22. Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications