Course Overview:
An Introduction to Information
 Retrieval and Applications


           J. H. Wang
          Feb. 22, 2012
Instructor & TA
• Instructor
    – J. H. Wang (王正豪)
    – Assistant Professor, CSIE, NTUT
    – Office: R1534, Technology Building
    – E-mail: jhwang@csie.ntut.edu.tw
    – Tel: ext. 4238
    – Office Hours: 9:00 am-12:00 noon, every Tuesday and
      Wednesday
• TA
    – Mr. Liu
    – R1424, Technology Building
IR, Spring 2012      NTUT CSIE                2
Course Description
• Course Web Page
    – http://www.ntut.edu.tw/~jhwang/IR/
• Time: 9:10 am-12:00 noon, Thu.
• Classroom: R1322, Technology Building
• Textbook:
    – Christopher D. Manning, Prabhakar Raghavan, and Hinrich
      Schuetze, Introduction to Information Retrieval, Cambridge
      University Press, 2008.
        • Available online
        • International Student Edition, imported by Kai-Fa (開發)
          Publishing
• Prerequisites:
    – Basic knowledge of data structures and algorithms, linear
      algebra, and probability theory
    – Programming experience is *required* for homework and
      projects
Additional References
• References:
    – Ricardo Baeza-Yates and Berthier Ribeiro-Neto,
      Modern Information Retrieval: The Concepts and
      Technology behind Search, Addison-Wesley, 2011.
        • This is the second edition of their 1999 book Modern
          Information Retrieval. (華通)
    – Stefan Buettcher, Charles L.A. Clarke, and Gordon V.
      Cormack, Information Retrieval: Implementing and
      Evaluating Search Engines, MIT Press, 2010.
    – Bruce Croft, Donald Metzler, and Trevor Strohman,
      Search Engines: Information Retrieval in Practice,
      Addison-Wesley, 2010. (全華)
More Books on IR
• Gerard Salton, Automatic Information Organization and
  Retrieval, McGraw-Hill, 1968.
• Gerard Salton and M.J. McGill, Introduction to Modern
  Information Retrieval, McGraw-Hill, 1983.
    – Two classics, but out of print.
• C. J. van Rijsbergen, Information Retrieval, Butterworths,
  1979.
    – The classic. More than 30 years old, but still worth reading.
• K. Sparck Jones, P. Willett, Readings in Information
  Retrieval, Morgan Kaufmann, 1997.
    – A collection of classic IR papers. (out of print)
• I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes,
  2nd edition, Morgan Kaufmann, 1999.
    – The authority on index construction and compression.
Grading Policy
• Homework assignments and
  programming exercises: 40%
• Mid-term exam: 25%
• Term project: 35%
    – Including the proposal and final report
Programming Exercises and Term Project
• About 3 programming exercises
    – Team-based (at most 2 persons per team)
    – You can either write your own code or reuse existing
      open source code
• The term project
    – Either team-based system development (the same as
      programming exercises)
    – Or academic paper presentation
        • Only one person per team allowed
    – A proposal is required before midterm (Apr. 12, 2012)
About the Term Project
• The score you get depends on the difficulty and
  quality of your project
    – For system development:
        • System functions and correctness
    – For academic paper presentation:
        • Quality of the paper and of your presentation
        • Major methods/experimental results *must* be presented
        • Papers from top conferences are strongly suggested
            – E.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
        • Proposals are *required* for each team, and will be counted in
          the score
Online Submission
• Submission instructions
    – Programs, project proposals, and project
      reports in electronic files must be submitted to
      the TA online at:
        • http://140.124.183.39/ir/
    – Before submission:
        • User name: Your student ID
        • Please change your default password at your first
          login
What this Course is NOT about
• This course will NOT tell you
    – The tips and tricks of using search engines,
      although power users might have better ideas on how
      to improve them
        • There are plenty of books and websites on that…
    – How to find books in libraries,
      although it's somewhat related to the basic IR
      concepts
    – How to make money on the Web,
      although the currently largest search engine did it
[Screenshot slides; images not preserved. Slide titles:]
• What's Information Retrieval
• On Wikipedia
• On Google Images
• On Google Video Search
• On Google News (TW)
• On Google News (US)
• On Blogs
• On Google Translate…
Or More Related Keywords
• NBA
• New York Knicks
• Linsanity
• …
What if We Search in Chinese
[Screenshot; image not preserved]
And More…
• 紐約尼克 (New York Knicks)
• 哈佛 (Harvard)
• 台裔球員 (player of Taiwanese descent)
• …
• And other languages…
• And other search engines…
• And social websites…
[Screenshot slides; images not preserved. Slide titles:]
• In Google Trends
• And More…
• And Other Keywords… (two slides)
• Palanteer – TW Election
What Is Information Retrieval?
• "Information retrieval is a field concerned
  with the structure, analysis, organization,
  storage, searching, and retrieval of
  information." (Salton, 1968)
Goal
• Information retrieval (IR): a research field
  that aims at searching information in text and
  multimedia documents effectively and efficiently
• In this course, we will introduce the basic
  text and query models in IR, retrieval
  evaluation, indexing and searching, and
  applications of IR
A Big Picture
[Diagram of the retrieval process; image not preserved]
• Components: User Interface, Text Operations, Query Expansion,
  Indexing, Retrieval, Ranking, Inverted Index, Document Collection
• Data-flow labels: user need, text, logical view, doc representation,
  user feedback, query, inverted file, retrieved docs, ranked docs
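The components named in the big picture (text operations, indexing, an inverted index, retrieval, ranking) can be sketched in a few lines of Python. This is only an illustrative toy, not the course's reference implementation: the function names, the three-document collection, the conjunctive (AND) matching, and the term-frequency ranking are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def text_operations(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple logical view)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(collection):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in collection.items():
        for term, tf in Counter(text_operations(text)).items():
            index[term][doc_id] = tf
    return index

def retrieve(index, query):
    """Retrieval: doc ids containing every query term (assumed AND semantics)."""
    terms = text_operations(query)
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)

def rank(index, query, doc_ids):
    """Ranking: order by summed term frequency, ties broken by doc id."""
    terms = text_operations(query)

    def score(doc_id):
        return sum(index[t][doc_id] for t in terms)

    return sorted(doc_ids, key=lambda d: (-score(d), d))

# Hypothetical document collection, made up for this sketch.
collection = {
    "d1": "Information retrieval is about searching information.",
    "d2": "Indexing and searching text documents.",
    "d3": "Retrieval evaluation and ranking of information.",
}
index = build_inverted_index(collection)
hits = retrieve(index, "information retrieval")
ranked = rank(index, "information retrieval", hits)
print(ranked)
```

Real systems replace each stage with something more careful (stemming, compressed postings lists, weighted scoring), but the data flow from document collection to ranked docs stays the same.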
Topics
• Text IR
    – Indexing and searching
    – Query languages and operations
• Retrieval evaluation
• Modeling
    – Boolean model
    – Vector space model
    – Probabilistic model
• Applications for IR
    – Multimedia IR
    – Web search
    – Digital libraries
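As a rough preview of the vector space model listed above, the sketch below weights terms by tf-idf and compares a query with each document by cosine similarity. The weighting variant (raw term frequency times log(N/df)), the helper names, and the tiny pre-tokenized documents are assumptions chosen for illustration; other weighting schemes are equally valid.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return sparse {term: weight} vectors plus document frequencies."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors, df

def query_vector(query_tokens, df, n):
    """Weight query terms with the same idf; unseen terms are ignored."""
    tf = Counter(t for t in query_tokens if t in df)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already tokenized documents.
docs = [
    ["boolean", "retrieval", "model"],
    ["vector", "space", "model"],
    ["probabilistic", "retrieval", "model"],
]
vectors, df = tf_idf_vectors(docs)
q = query_vector(["vector", "space"], df, len(docs))
scores = [cosine(q, v) for v in vectors]
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Note that "model" occurs in every document, so its idf is zero and it contributes nothing to any score, which is the intended effect of idf weighting.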
Organization of the Textbook
• Basics in IR (focus)
    – Inverted indexes for Boolean queries (Ch. 1-5)
    – Term weighting and vector space model (Ch. 6-7)
    – Evaluation in IR (Ch. 8)
• Advanced Topics
    – Relevance feedback (Ch. 9)
    – XML retrieval (Ch. 10)
    – Probabilistic IR (Ch. 11)
    – Language models (Ch. 12)
• Machine learning in IR (useful)
    – Text classification (Ch. 13-15)
    – Document clustering (Ch. 16-18)
• Web Search
    – Web crawling and indexes (Ch. 19-20)
    – Link analysis (Ch. 21)
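For the evaluation part of the basics, a minimal sketch of set-based precision, recall, and F1 may help fix the definitions. The function name and the example document ids are hypothetical; ranked-retrieval measures are not covered here.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical judgments: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r, f1)
```

Here precision is 2/4 and recall is 2/3, which shows how a system can trade one measure against the other by returning more or fewer documents.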
Pointers to Other Topics
• Cross-language IR
• Image, video, and multimedia IR
• Speech retrieval
• Music retrieval
• User interfaces
• Parallel, distributed, and P2P IR
• Digital libraries
• Information science perspective
• Logic-based approaches to IR
• Natural language processing techniques
Tentative Schedule
• Before midterm
    – Boolean retrieval (1 wk)
    – Indexing (2 wks)
    – Vector space model and evaluation (2 wks)
    – Relevance feedback (1 wk)
    – Probabilistic IR (2 wks)
• After midterm
    – Text classification (1-2 wks)
    – Document clustering (1-2 wks)
    – Web search (2 wks)
    – Advanced topics: CLIR, IE, … (2 wks)
    – Term project presentations (3 wks)
Generic Resources
• Wikipedia page on Information Retrieval:
  http://en.wikipedia.org/wiki/Information_re
• Information Retrieval Resources:
  http://www-csli.stanford.edu/~hinrich/information-retrieval.html
Academic Resources
• Journals
    – ACM TOIS: Transactions on Information Systems
    – JASIST: Journal of the American Society for Information Science
      and Technology
    – IP&M: Information Processing and Management
    – IEEE TKDE: Transactions on Knowledge and Data Engineering
• Conferences
    – ACM SIGIR: International Conference on Research and
      Development in Information Retrieval
    – WWW: World Wide Web Conference
    – ACM CIKM: Conference on Information and Knowledge
      Management
    – JCDL: ACM/IEEE Joint Conference on Digital Libraries
    – ACM WSDM: International Conference on Web Search and Data
      Mining
    – TREC: Text REtrieval Conference
Thanks for Your Attention!



