The document introduces PageRank, an algorithm developed by Larry Page for ranking web pages. It discusses the motivation for developing PageRank to bring order to the unstructured web. It provides an overview of how PageRank models the random surfing behavior of users to determine the importance of pages based on both the number and quality of backlinks. It also summarizes some implementations and applications of PageRank, as well as limitations and areas for further refinement.
1 of 22
Download to read offline
More Related Content
The Pagerank
1. The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University Presented by Guoqiang Su & Wei Li
6. PageRank Pages with lots of backlinks are important Backlinks coming from important pages convey more importance to a page Problem: Rank Sink
7. Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
8. Escape Term Solution: Rank Source c is maximized and = 1 E(u) is some vector over the web pages uniform, favorite page etc.
9. Matrix Notation R is the dominant eigenvector and c is the dominant eigenvalue of because c is maximized
10. Computing PageRank - initialize vector over web pages loop: - new ranks sum of normalized backlink ranks - compute normalizing factor - add escape term - control parameter while - stop when converged
11. Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
12. Implementation Computing resources 24 million pages 75 million URLs Memory and disk storage Weight Vector (4 byte float) Matrix A (linear access)
13. Implementation (Con't) Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute
14. Convergence Properties Graph (V, E) is an expander with factor if for all (not too large) subsets S: |As| |s| Eigenvalue separation: Largest eigenvalue is sufficiently larger than the second-largest eigenvalue Random walk converges fast to a limiting probability distribution on a set of nodes in the graph.
16. Personalized PageRank Rank Source E can be initialized : uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives result in overly high ranking total weight on a single page, e.g . Netscape, McCarthy great variation of ranks under different single pages as rank source and everything in-between, e.g. server root pages allow manipulation by commercial interests
17. Applications I Estimate web traffic Server/page aliases Link/traffic disparity, e.g. porn sites, free web-mail Backlink predictor Citation counts have been used to predict future citations very difficult to map the citation structure of the web completely avoid the local maxima that citation counts get stuck in and get better performance
18. Applications II - Ranking Proxy Surfer's Navigation Aid Annotating links by PageRank (bar graph) Not query dependent
19. Issues Users are no random walkers Content based methods Starting point distribution Actual usage data as starting vector Reinforcing effects/bias towards main pages How about traffic to ranking pages? No query specific rank Linkage spam PageRank favors pages that managed to get other pages to link to them Linkage not necessarily a sign of relevancy, only of promotion (advertisement)
22. Conclusion PageRank is a global ranking based on the web's graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications