This document provides an overview and syllabus for the course Introduction to Information Retrieval and Applications. The course meets on Thursdays from 9:10 am to 12:00 noon in classroom R1322. It covers indexing, vector space models, evaluation methods, relevance feedback, and probabilistic models, along with applications such as text classification, document clustering, and web search. Assessment consists of homework and programming exercises, a midterm exam, and a term project.
2. Instructor & TA
• Instructor
  – J. H. Wang (王正豪)
  – Assistant Professor, CSIE, NTUT
  – Office: R1534, Technology Building
  – E-mail: jhwang@csie.ntut.edu.tw
  – Tel: ext. 4238
  – Office hours: 9:00 am-12:00 noon, every Tuesday and Wednesday
• TA
  – Mr. Liu
  – R1424, Technology Building
IR, Spring 2012 NTUT CSIE 2
3. Course Description
• Course Web Page
  – http://www.ntut.edu.tw/~jhwang/IR/
• Time: 9:10 am-12:00 noon, Thu.
• Classroom: R1322, Technology Building
• Textbook:
  – Christopher D. Manning, Prabhakar Raghavan and Hinrich Schuetze, Introduction to Information Retrieval, Cambridge University Press, 2008.
    • Available online
    • International Student Edition, imported by Kai-Fa Publishing
• Prerequisites:
  – Basic knowledge of data structures and algorithms, linear algebra, and probability theory
  – Programming experience is *required* for homework and projects
4. Additional References
• References:
  – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.
    • This is the second edition of their 1999 book Modern Information Retrieval.
  – Stefan Buettcher, Charles L.A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
  – Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
5. More Books on IR
• Gerard Salton, Automatic Information Organization and Retrieval, McGraw-Hill, 1968.
• Gerard Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
  – Two classics, but out of print.
• C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
  – The classic. More than 30 years old, but still worth reading.
• K. Sparck Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997.
  – A collection of classic IR papers (out of print).
• I.H. Witten, A. Moffat, T.C. Bell, Managing Gigabytes, 2nd edition, Morgan Kaufmann, 1999.
  – The authority on index construction and compression.
6. Grading Policy
• Homework assignments and programming exercises: 40%
• Mid-term exam: 25%
• Term project: 35%
  – Including the proposal and final report
7. Programming Exercises and Term Project
• About 3 programming exercises
  – Team-based (at most 2 persons per team)
  – You can either write your own code or reuse existing open source code
• The term project
  – Either team-based system development (the same as the programming exercises)
  – Or academic paper presentation
    • Only one person per team allowed
  – A proposal is required before the midterm (Apr. 12, 2012)
8. About the Term Project
• The score you get depends on the difficulty and quality of your project
  – For system development:
    • System functions and correctness
  – For academic paper presentation:
    • Quality of the paper and your presentation of it
    • Major methods and experimental results *must* be presented
    • Papers from top conferences are strongly suggested
      – E.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
• Proposals are *required* for each team, and will be counted in the score
9. Online Submission
• Submission instructions
  – Programs, project proposals, and project reports must be submitted as electronic files to the TA online at:
    • http://140.124.183.39/ir/
  – Before submission:
    • User name: your student ID
    • Please change your default password at your first login
10. What this Course is NOT about
• This course will NOT tell you
  – The tips and tricks of using search engines, although power users might have better ideas on how to improve them
    • There are plenty of books and websites on that…
  – How to find books in libraries, although it's somewhat related to the basic IR concepts
  – How to make money on the Web, although the currently largest search engine did so
19. Or More Related Keywords
• NBA
• New York Knicks
• Linsanity
• …
20. What if We Search in Chinese
21. And More…
• 紐約尼克 (New York Knicks)
• 哈佛 (Harvard)
• 台裔球員 (player of Taiwanese descent)
• …
• And other languages…
• And other search engines…
• And social websites…
29. What Is Information Retrieval?
• “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
30. Goal
• Information retrieval (IR): a research field that aims to search information in text and multimedia documents effectively and efficiently
• In this course, we will introduce the basic text and query models in IR, retrieval evaluation, indexing and searching, and applications of IR
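Retrieval effectiveness is commonly measured with precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The following is a minimal sketch of that computation for a single query; the function name, document IDs, and relevance judgments are made up for illustration and are not part of the course material.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant (hypothetical judgments)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 documents returned, 3 relevant documents exist,
# and 2 of the returned documents are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Here precision is 2/4 and recall is 2/3, which illustrates why the two measures are usually reported together.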
32. [Figure: diagram of an IR system. Components: User Interface, Text Operations, Query Expansion, Indexing, Retrieval, Ranking, Inverted Index, Document Collection. Flow labels: user need, text, logical view, doc representation, query, user feedback, inverted file, retrieved docs, ranked docs.]
33. Topics
• Text IR
  – Indexing and searching
  – Query languages and operations
• Retrieval evaluation
• Modeling
  – Boolean model
  – Vector space model
  – Probabilistic model
• Applications for IR
  – Multimedia IR
  – Web search
  – Digital libraries
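To make the first topics concrete, here is a minimal sketch of an inverted index with a Boolean AND query, assuming whitespace tokenization and lowercasing only. The function names and the three toy documents are hypothetical; a real system would add proper tokenization, normalization, and compressed postings lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy collection (made up for illustration).
docs = {
    1: "new york knicks win",
    2: "harvard graduate joins knicks",
    3: "new search engine from harvard",
}
index = build_inverted_index(docs)
result = boolean_and(index, ["harvard", "knicks"])
```

Only document 2 contains both terms, so it is the only match. The Boolean model returns an unranked set; ordering the results is what the vector space and probabilistic models add.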
34. Organization of the Textbook
• Basics in IR (focus)
  – Inverted indexes for Boolean queries (Ch. 1-5)
  – Term weighting and vector space model (Ch. 6-7)
  – Evaluation in IR (Ch. 8)
• Advanced Topics
  – Relevance feedback (Ch. 9)
  – XML retrieval (Ch. 10)
  – Probabilistic IR (Ch. 11)
  – Language models (Ch. 12)
• Machine learning in IR (useful)
  – Text classification (Ch. 13-15)
  – Document clustering (Ch. 16-18)
• Web Search
  – Web crawling and indexes (Ch. 19-20)
  – Link analysis (Ch. 21)
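As an illustration of term weighting and the vector space model, the sketch below ranks documents by cosine similarity between tf-idf vectors. It assumes one simple weighting variant, raw term frequency times log(N/df); other variants exist, and this one is not necessarily the one used in class. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({term: tf * log(N / df)} per document, df table, N)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors, df, n_docs


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indexes ordered by decreasing cosine similarity."""
    vectors, df, n_docs = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: q_tf[t] * math.log(n_docs / df[t]) for t in q_tf if t in df}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Toy collection (made up for illustration).
docs = [
    "information retrieval and web search",
    "text classification with machine learning",
    "web search engines",
]
ranking = rank("web search", docs)
```

Documents 0 and 2 both contain the two query terms, but document 2 is shorter, so cosine normalization ranks it first; document 1 shares no term with the query and comes last.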
35. Pointers to Other Topics
• Cross-language IR
• Image, video, and multimedia IR
• Speech retrieval
• Music retrieval
• User interfaces
• Parallel, distributed, and P2P IR
• Digital libraries
• Information science perspective
• Logic-based approaches to IR
• Natural language processing techniques
36. Tentative Schedule
• Before midterm
  – Boolean retrieval (1 wk)
  – Indexing (2 wks)
  – Vector space model and evaluation (2 wks)
  – Relevance feedback (1 wk)
  – Probabilistic IR (2 wks)
• After midterm
  – Text classification (1-2 wks)
  – Document clustering (1-2 wks)
  – Web search (2 wks)
  – Advanced topics: CLIR, IE, … (2 wks)
  – Term project presentation (3 wks)
37. Generic Resources
• Wikipedia page on Information Retrieval: http://en.wikipedia.org/wiki/Information_re
• Information Retrieval Resources: http://www-csli.stanford.edu/~hinrich/information-retrieval.html
38. Academic Resources
• Journals
  – ACM TOIS: Transactions on Information Systems
  – JASIST: Journal of the American Society for Information Science and Technology
  – IP&M: Information Processing and Management
  – IEEE TKDE: Transactions on Knowledge and Data Engineering
• Conferences
  – ACM SIGIR: International Conference on Information Retrieval
  – WWW: World Wide Web Conference
  – ACM CIKM: Conference on Information and Knowledge Management
  – JCDL: ACM/IEEE Joint Conference on Digital Libraries
  – ACM WSDM: International Conference on Web Search and Data Mining
  – TREC: Text Retrieval Conference