This document discusses graph theory, network science, and bibliometrics. It provides an overview of PageRank, describing how it uses a finite state Markov chain and the steady state of the transition matrix to assign scores based on the structure of citation networks. PageRank aims to improve on simple citation counts by considering both the number and quality of citations. The document notes some limitations of PageRank and describes practical considerations for applying it, such as using sparse matrix operations and accounting for the directed nature of citations.
1 of 23
Download to read offline
More Related Content
finding nobel prize window by PageRank
1. Finding Nobel prize window by
PageRank
FUJITA Yuji, Turnstone Research Inst., Nihon Univ.
2. Graph and Network
¢ Graph theory
C Part of mathmatics
¢ Network science
C Inter-disciplinary study of
¢ Graph theory
¢ Physics
¢ Social science
¢ Informatics
¢ particular topics from finance, biology, ...
3. Graph theory
Date back to 1730's
¢ Objectives
C Lower dimensional topological structure
C Combinatorial and topological studies
¢ Topics
C Four colour theorem
C Invariants
From Wikipedia
4. Network science
¢ Objectives
C Statistics and dynamics
C Social, Financial, Technological themes
¢ Topics
C 6 degrees of separation
C Scale-free networks
C PageRank
Title:syms.eps
Creator:gnuplot 4.0 patchlevel 0
CreationDate:Sun Jan 13 23:04:28 2008
5. Bibliometrics
¢ Quantitative evaluation of (academic) documents
¢ Conventional approach: number of citation
¢ Citation network
C Node: paper Edge: citation
C directed graph
¢ More true metric: PageRank
7. Top articles
Clinical Effects of an angiotensin-converting-enzyme inhibitor,
Medicine ramipril, on cardiovascular events in high-risk patients
Clinical Vitamin E supplementation and cardiovascular events in high-
Medicine risk patients
Immunology Cytotoxic T lymphocyte-associated antigen 4 plays an
essential role in the function of CD25(+)CD4(+) regulatory
cells that control intestinal inflammation
Immunology Immunologic self-tolerance maintained by CD25(+)CD4(+)
regulatory T cells constitutively expressing cytotoxic T
lymphocyte-associated antigen 4
Physics String theory and noncommutative geometry
Physics Large-N limit of non-commutative gauge theories
Molecular Smac, a mitochondrial protein that promotes cytochrome c-
Biology & dependent caspase activation by eliminating IAP inhibition
Genetics
Molecular Identification of DIABLO, a mammalian protein that promotes
Biology & apoptosis by binding to and antagonizing IAP proteins
Genetics
Molecular Systematic variation in gene expression patterns in human
9. PageRank overview
¢ Link from a great node is more important
? degree as a score
¢ But how can it be done? - the process can be
lost in a loop..
Figure from ^The PageRank Citation Ranking: Bringing Order to the Web ̄
10. Finite state Markov chain
¢ Node: status, Transition matrix: moving along
the edge
C Row: linked (cited) vector
C Column: link (cite) vector
¢ Probability vector refreshed by multiplying the
transition matrix
11. Steady state gives PageRank
¢ Some Markov chain has a unique steady state
¢ Steady state given by eigenvector
C A vector such that Mx = ax
¢ Eigenvector given by linear algebra
C Widely known how to compute
12. Why PageRank works?
¢ Not all citations are equally significant
¢ Less citation can be a signal of even more
great work
C Fundamental work not cited directly
¢ Academic cascade
13. Meanings of citation
¢ Brainchild
¢ History
¢ Respect
¢ Identity
something more than <a>tag</a>
14. To reach the top
¢ Many great children
C Each child give birth to many works
= great scientific achievement
15. Limitations
¢ Prof. Yamanaka's work (CELL, 2006) has poor
PageRank score, which is a shame to say at
least.
¢ SPAM issues; not so serious as naiive citation
count
16. To practice
¢ Get citation data
C Product or scrape
¢ Transition matrix
C Random surfer model
¢ Iterate matrix-vector product operation
C Sparse matrix operation
17. Data
¢ Tomson-Reuter, Elsevier, ´
¢ Scrape the web (arXive..)
¢ Common SQL server will hold the data
¢ NLP required
18. Transition matrix
¢ Not all transition matrix has unique
eigenvector
¢ Random surfer model: let the graph be
connected and get out of loop
+ =
19. Adaptation to papers
¢ Old paper cannot cite newer one
C Non-uniform random surfing
¢ Adjust decay rate
20. Sparse matrix
¢ Most of the elements are Zeros
¢ Compressed form reduces space and time
¢ libcsparse
C made by UFL people and others, distributed under
LGPL
21. Reference
L Page, S Brin, R Motwani, T The PageRank citation ranking: bringing order to
Winograd the web.
Dylan Walker1,2 , Huafeng Ranking Scientific Publications Using a Simple
Xie2,3 , Koon-Kiu Yan1,2 , Model of Network Traffic
Sergei Maslov2
P. Chen,1, ? H. Xie,2, 3, ? S. Finding Scientific Gems with Google
Maslov,3, ? and S. Redner1, ′
Hajime BABA Google の蜘畜 - PageRank 莿彌瞠h
22. Acknowledgment
¢ Mr. Kazuhisa Takei for ruby interface of
libcsparse in ffi
¢ Dr. Mari Jibu for citation data handling
¢ Dr. Wataru Souma for network scientific
suggestions and comments
¢ Dr. Yoshi Fujiwara for choosing this topic and
invitation
¢ Free software developers
23. About me
¢ 2010- Turnstone Research, Inst.
¢ 2011- Nihon Univ. researcher
¢ 2009-2010 finance sector
¢ 2007-2009 Network analysis at NiCT
¢ 2001-2007 Venture firm CEO
¢ 1994-2002 Discrete math graduate student
¢ Ski, climbing, bicycle, art
#9: The Protein Data Bank Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients The genome sequence of Drosophila melanogaster String theory and noncommutative geometry The complete atomic structure of the large ribosomal subunit at 2.4 angstrom resolution Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammationnil