際際滷

際際滷Share a Scribd company logo
Finding Nobel prize window by
         PageRank




           FUJITA Yuji, Turnstone Research Inst., Nihon Univ.
Graph and Network
¢   Graph theory
    C   Part of mathmatics
¢   Network science
    C   Inter-disciplinary study of
        ¢   Graph theory
        ¢   Physics
        ¢   Social science
        ¢   Informatics
        ¢   particular topics from finance, biology, ...
Graph theory
    Date back to 1730's
¢   Objectives
    C   Lower dimensional topological structure
    C   Combinatorial and topological studies
¢   Topics
    C   Four colour theorem
    C   Invariants


                                                  From Wikipedia
Network science
¢   Objectives
    C   Statistics and dynamics
    C   Social, Financial, Technological themes
¢   Topics
    C   6 degrees of separation
    C   Scale-free networks
    C   PageRank
                          Title:syms.eps
                          Creator:gnuplot 4.0 patchlevel 0
                          CreationDate:Sun Jan 13 23:04:28 2008
Bibliometrics
¢   Quantitative evaluation of (academic) documents
¢   Conventional approach: number of citation

¢   Citation network
    C   Node: paper Edge: citation
    C   directed graph
¢   More true metric: PageRank
Citation vs PageRank




Best cited do not have the best score
Top articles
Clinical     Effects of an angiotensin-converting-enzyme inhibitor,
Medicine     ramipril, on cardiovascular events in high-risk patients
Clinical     Vitamin E supplementation and cardiovascular events in high-
Medicine     risk patients
Immunology   Cytotoxic T lymphocyte-associated antigen 4 plays an
             essential role in the function of CD25(+)CD4(+) regulatory
             cells that control intestinal inflammation
Immunology   Immunologic self-tolerance maintained by CD25(+)CD4(+)
             regulatory T cells constitutively expressing cytotoxic T
             lymphocyte-associated antigen 4
Physics      String theory and noncommutative geometry
Physics      Large-N limit of non-commutative gauge theories
Molecular    Smac, a mitochondrial protein that promotes cytochrome c-
Biology &    dependent caspase activation by eliminating IAP inhibition
Genetics
Molecular    Identification of DIABLO, a mammalian protein that promotes
Biology &    apoptosis by binding to and antagonizing IAP proteins
Genetics
Molecular    Systematic variation in gene expression patterns in human
Graph expression
¢   Embedding: drawing on sphere/space
¢   Matrix
PageRank overview
¢   Link from a great node is more important
      ? degree as a score
¢   But how can it be done? - the process can be
    lost in a loop..




                        Figure from ^The PageRank Citation Ranking: Bringing Order to the Web ̄
Finite state Markov chain
¢   Node: status, Transition matrix: moving along
    the edge
    C   Row: linked (cited) vector
    C   Column: link (cite) vector
¢   Probability vector refreshed by multiplying the
    transition matrix
Steady state gives PageRank
¢   Some Markov chain has a unique steady state
¢   Steady state given by eigenvector
    C   A vector such that Mx = ax
¢   Eigenvector given by linear algebra
    C   Widely known how to compute
Why PageRank works?
¢   Not all citations are equally significant
¢   Less citation can be a signal of even more
    great work
    C   Fundamental work not cited directly
¢   Academic cascade
Meanings of citation
¢   Brainchild
¢   History
¢   Respect
¢   Identity


    something more than <a>tag</a>
To reach the top
¢   Many great children
    C   Each child give birth to many works


    = great scientific achievement
Limitations
¢   Prof. Yamanaka's work (CELL, 2006) has poor
    PageRank score, which is a shame to say at
    least.
¢   SPAM issues; not so serious as naiive citation
    count
To practice
¢   Get citation data
    C   Product or scrape
¢   Transition matrix
    C   Random surfer model
¢   Iterate matrix-vector product operation
    C   Sparse matrix operation
Data
¢   Tomson-Reuter, Elsevier, ´
¢   Scrape the web (arXive..)
¢   Common SQL server will hold the data
¢   NLP required
Transition matrix
¢   Not all transition matrix has unique
    eigenvector
¢   Random surfer model: let the graph be
    connected and get out of loop


                 +                   =
Adaptation to papers
¢   Old paper cannot cite newer one
    C   Non-uniform random surfing
¢   Adjust decay rate
Sparse matrix
¢   Most of the elements are Zeros
¢   Compressed form reduces space and time

¢   libcsparse
    C   made by UFL people and others, distributed under
        LGPL
Reference
L Page, S Brin, R Motwani, T      The PageRank citation ranking: bringing order to
Winograd                          the web.
Dylan Walker1,2 , Huafeng         Ranking Scientific Publications Using a Simple
Xie2,3 , Koon-Kiu Yan1,2 ,        Model of Network Traffic
Sergei Maslov2
P. Chen,1, ? H. Xie,2, 3, ? S.    Finding Scientific Gems with Google
Maslov,3, ? and S. Redner1, ′
Hajime BABA                       Google の蜘畜 - PageRank 莿彌瞠h
Acknowledgment
¢   Mr. Kazuhisa Takei for ruby interface of
    libcsparse in ffi
¢   Dr. Mari Jibu for citation data handling
¢   Dr. Wataru Souma for network scientific
    suggestions and comments
¢   Dr. Yoshi Fujiwara for choosing this topic and
    invitation
¢   Free software developers
About me
¢   2010- Turnstone Research, Inst.
¢   2011- Nihon Univ. researcher
¢   2009-2010 finance sector
¢   2007-2009 Network analysis at NiCT
¢   2001-2007 Venture firm CEO
¢   1994-2002 Discrete math graduate student

¢   Ski, climbing, bicycle, art

More Related Content

finding nobel prize window by PageRank

  • 1. Finding Nobel prize window by PageRank FUJITA Yuji, Turnstone Research Inst., Nihon Univ.
  • 2. Graph and Network ¢ Graph theory C Part of mathmatics ¢ Network science C Inter-disciplinary study of ¢ Graph theory ¢ Physics ¢ Social science ¢ Informatics ¢ particular topics from finance, biology, ...
  • 3. Graph theory Date back to 1730's ¢ Objectives C Lower dimensional topological structure C Combinatorial and topological studies ¢ Topics C Four colour theorem C Invariants From Wikipedia
  • 4. Network science ¢ Objectives C Statistics and dynamics C Social, Financial, Technological themes ¢ Topics C 6 degrees of separation C Scale-free networks C PageRank Title:syms.eps Creator:gnuplot 4.0 patchlevel 0 CreationDate:Sun Jan 13 23:04:28 2008
  • 5. Bibliometrics ¢ Quantitative evaluation of (academic) documents ¢ Conventional approach: number of citation ¢ Citation network C Node: paper Edge: citation C directed graph ¢ More true metric: PageRank
  • 6. Citation vs PageRank Best cited do not have the best score
  • 7. Top articles Clinical Effects of an angiotensin-converting-enzyme inhibitor, Medicine ramipril, on cardiovascular events in high-risk patients Clinical Vitamin E supplementation and cardiovascular events in high- Medicine risk patients Immunology Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation Immunology Immunologic self-tolerance maintained by CD25(+)CD4(+) regulatory T cells constitutively expressing cytotoxic T lymphocyte-associated antigen 4 Physics String theory and noncommutative geometry Physics Large-N limit of non-commutative gauge theories Molecular Smac, a mitochondrial protein that promotes cytochrome c- Biology & dependent caspase activation by eliminating IAP inhibition Genetics Molecular Identification of DIABLO, a mammalian protein that promotes Biology & apoptosis by binding to and antagonizing IAP proteins Genetics Molecular Systematic variation in gene expression patterns in human
  • 8. Graph expression ¢ Embedding: drawing on sphere/space ¢ Matrix
  • 9. PageRank overview ¢ Link from a great node is more important ? degree as a score ¢ But how can it be done? - the process can be lost in a loop.. Figure from ^The PageRank Citation Ranking: Bringing Order to the Web ̄
  • 10. Finite state Markov chain ¢ Node: status, Transition matrix: moving along the edge C Row: linked (cited) vector C Column: link (cite) vector ¢ Probability vector refreshed by multiplying the transition matrix
  • 11. Steady state gives PageRank ¢ Some Markov chain has a unique steady state ¢ Steady state given by eigenvector C A vector such that Mx = ax ¢ Eigenvector given by linear algebra C Widely known how to compute
  • 12. Why PageRank works? ¢ Not all citations are equally significant ¢ Less citation can be a signal of even more great work C Fundamental work not cited directly ¢ Academic cascade
  • 13. Meanings of citation ¢ Brainchild ¢ History ¢ Respect ¢ Identity something more than <a>tag</a>
  • 14. To reach the top ¢ Many great children C Each child give birth to many works = great scientific achievement
  • 15. Limitations ¢ Prof. Yamanaka's work (CELL, 2006) has poor PageRank score, which is a shame to say at least. ¢ SPAM issues; not so serious as naiive citation count
  • 16. To practice ¢ Get citation data C Product or scrape ¢ Transition matrix C Random surfer model ¢ Iterate matrix-vector product operation C Sparse matrix operation
  • 17. Data ¢ Tomson-Reuter, Elsevier, ´ ¢ Scrape the web (arXive..) ¢ Common SQL server will hold the data ¢ NLP required
  • 18. Transition matrix ¢ Not all transition matrix has unique eigenvector ¢ Random surfer model: let the graph be connected and get out of loop + =
  • 19. Adaptation to papers ¢ Old paper cannot cite newer one C Non-uniform random surfing ¢ Adjust decay rate
  • 20. Sparse matrix ¢ Most of the elements are Zeros ¢ Compressed form reduces space and time ¢ libcsparse C made by UFL people and others, distributed under LGPL
  • 21. Reference L Page, S Brin, R Motwani, T The PageRank citation ranking: bringing order to Winograd the web. Dylan Walker1,2 , Huafeng Ranking Scientific Publications Using a Simple Xie2,3 , Koon-Kiu Yan1,2 , Model of Network Traffic Sergei Maslov2 P. Chen,1, ? H. Xie,2, 3, ? S. Finding Scientific Gems with Google Maslov,3, ? and S. Redner1, ′ Hajime BABA Google の蜘畜 - PageRank 莿彌瞠h
  • 22. Acknowledgment ¢ Mr. Kazuhisa Takei for ruby interface of libcsparse in ffi ¢ Dr. Mari Jibu for citation data handling ¢ Dr. Wataru Souma for network scientific suggestions and comments ¢ Dr. Yoshi Fujiwara for choosing this topic and invitation ¢ Free software developers
  • 23. About me ¢ 2010- Turnstone Research, Inst. ¢ 2011- Nihon Univ. researcher ¢ 2009-2010 finance sector ¢ 2007-2009 Network analysis at NiCT ¢ 2001-2007 Venture firm CEO ¢ 1994-2002 Discrete math graduate student ¢ Ski, climbing, bicycle, art

Editor's Notes

  • #4: s雰議にはグラフ尖が枠佩しているが , ネットワ`ク親僥徭悶は芙氏ネットワ`ク蛍裂がコンピュ`タ鞠參念から芙氏僥宀によってg樹されてきた . 旋喘辛嬬な秤鵑I尖嬬薦の寄から , y議な返隈が吭龍を隔ったり , y薦僥が址喘されるようになったのは , インタ`ネットの噸式參週
  • #5: s雰議にはグラフ尖が枠佩しているが , ネットワ`ク親僥徭悶は芙氏ネットワ`ク蛍裂がコンピュ`タ鞠參念から芙氏僥宀によってg樹されてきた . 旋喘辛嬬な秤鵑I尖嬬薦の寄から , y議な返隈が吭龍を隔ったり , y薦僥が址喘されるようになったのは , インタ`ネットの噸式參週
  • #6: s雰議にはグラフ尖が枠佩しているが, ネットワ`ク親僥徭悶は芙氏ネットワ`ク蛍裂がコンピュ`タ鞠參念から芙氏僥宀によってg樹されてきた. 旋喘辛嬬な秤鵑I尖嬬薦の寄から, y議な返隈が吭龍を隔ったり, y薦僥が址喘されるようになったのは, インタ`ネットの噸式參週
  • #9: The Protein Data Bank Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients The genome sequence of Drosophila melanogaster String theory and noncommutative geometry The complete atomic structure of the large ribosomal subunit at 2.4 angstrom resolution Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammationnil
  • #10: 販吭のグラフは 3 肝圷に托めzみ辛嬬 . N方 0 の爆中 ( 峠中 ) に托めzみ辛嬬なものや , そうでないものなど , ラ採僥議な燕Fを嚥えたものの麿に , 佩双として燕Fすることもできる . そして書指弊になるのは , こっちのほう .
  • #11: 枠羨つものがQまってないと書みてるノ`ドのスコアもQまらないけど , それがQまらないとY蕉枠羨つノ`ドのスコアもQまらないよ ? どうすりゃいいの ?