The document provides an introduction to various mathematical concepts commonly used in information retrieval and summarizes several models including vector space models, inductive models like probabilistic and neural networks, and hybrid models. It gives examples of how vector space and probabilistic models can be used for applications like document clustering, targeted advertising, and part-of-speech tagging. It also notes some uncertainties and pitfalls with these models.
PIRST Presentation
1. Mathematical Roadmap for IR
Introduction to the mathematical concepts
commonly used in IR/ST
- Abhay Shete, 42
2. Why bother about the maths ?
Lots of open source libraries out there
General approach for external software
Include it
Build it
Forget it !!
3. IR is fuzzy
Consider the following sentences
India beat Pakistan
Sachin has scored over 10,000 runs in Test cricket
Sachin Tendulkar was unable to run to the non-
striker's end
How do we humans interpret this information ?
How do machines interpret it ?
4. What excites me about IR ?
It's all about indulging and flirting with sexy
models !!!!
5. Models of interest
Vector space models
Inductive models
Probabilistic
Neural networks, decision trees
Hybrid vector space and inductive
Others...
6. Vector Space Model
Recall junior college mathematics
Dot-product, cross-product, projections, etc..
7. Vector Space Model
A1, A2 are articles about animals
P1 contains politics
To find similarity scores between
a) A1, A2 ==> assume score is S1
b) A1, P1 ==> assume score is S2
Intuitively S1 > S2
How do you create a model around this ?
8. Vector Space Model
A1 = {dog(1), cat(2), lion(1), rhino(1), tiger(2)}
A2 = {eagle(1), cat(1), tiger(1), fox(1), hyena(1)}
N Dimensional Space !!!
d1 = 1(dog) + 2(cat) + 1(lion) + 1(rhino) +
2(tiger) + 0(eagle) + 0(fox) + 0(hyena)
d2 = 0(dog) + 1(cat) + 0(lion) + 0(rhino) +
1(tiger) + 1(eagle) + 1(fox) + 1(hyena)
Similarity is measured by the angle between the
two vectors !!
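A minimal sketch of how the two articles become vectors in a shared N-dimensional space. The helper name `to_vector` and the ordering of the vocabulary are assumptions made for illustration; the counts are the ones on the slide.

```python
# Term counts for the two animal articles, as listed on the slide.
A1 = {"dog": 1, "cat": 2, "lion": 1, "rhino": 1, "tiger": 2}
A2 = {"eagle": 1, "cat": 1, "tiger": 1, "fox": 1, "hyena": 1}

# One dimension per distinct term; the order is arbitrary but must be
# the same for every document.
vocabulary = ["dog", "cat", "lion", "rhino", "tiger", "eagle", "fox", "hyena"]


def to_vector(counts, vocabulary):
    """Map a term-count dictionary onto the shared vocabulary axes."""
    return [counts.get(term, 0) for term in vocabulary]


d1 = to_vector(A1, vocabulary)
d2 = to_vector(A2, vocabulary)
print(d1)
print(d2)
```

Terms missing from an article simply get a coefficient of 0, which is why both vectors end up with the same length.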
9. Similarity measured by the angle between
two vectors !!
If X and Y are two n-dimensional vectors
<xi> and <yi>, the angle θ between them
satisfies:
X · Y = |X| |Y| cos θ
cos θ = X · Y / (|X||Y|)
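As a rough illustration, the cosine formula can be applied to the two animal-article vectors d1 and d2 from the earlier slide. The function name `cosine_similarity` is an assumption for this sketch, not something from the deck.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Axes: dog, cat, lion, rhino, tiger, eagle, fox, hyena
d1 = [1, 2, 1, 1, 2, 0, 0, 0]
d2 = [0, 1, 0, 0, 1, 1, 1, 1]

similarity = cosine_similarity(d1, d2)
angle_degrees = math.degrees(math.acos(similarity))
print(similarity, angle_degrees)
```

Only cat and tiger are shared, so the dot product is 4 and the similarity is 4 / sqrt(55), roughly 0.54. A politics article with no shared terms would score 0, matching the intuition that S1 > S2.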
11. Linear Algebra
A set of document vectors can be stacked into a matrix
Important Linear Algebra concepts:
Singular Value Decomposition
EigenVectors
Latent Semantic Indexing (LSI)
Face recognition
12. Latent Semantic Indexing
Exact keyword match not required
No predefined semantic knowledge base
Demo
http://lsi.research.telcordia.com/lsi-bin/lsiQuery
How does it work ?
Fish Tank Analogy
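One way to see why an exact keyword match is not required is a toy LSI run with NumPy's `np.linalg.svd`. This is only a sketch: the five terms, four documents and the choice of k = 2 are assumptions invented for illustration, not data from the demo.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
terms = ["cat", "tiger", "lion", "election", "government"]
A = np.array([
    [1, 0, 0, 0],   # cat
    [1, 1, 0, 0],   # tiger
    [0, 1, 0, 0],   # lion
    [0, 0, 1, 1],   # election
    [0, 0, 1, 0],   # government
], dtype=float)

# Truncated SVD: keep only the k strongest latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T          # one k-dimensional row per document


def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Query containing only the word "cat", folded into the latent space.
query = np.array([1, 0, 0, 0, 0], dtype=float)
query_k = (query @ U_k) / s_k

raw_scores = [cosine(query, A[:, j]) for j in range(A.shape[1])]
latent_scores = [cosine(query_k, doc) for doc in docs_k]
print(raw_scores)
print(latent_scores)
```

Document 1 (tiger, lion) never mentions "cat", so its raw keyword score is 0, yet in the latent space it scores close to 1 because "cat" and "lion" both co-occur with "tiger". The politics documents stay near 0.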
13. Uncertainties, Pitfalls
Not suitable for fine grained search
Web Scale corpus !!
Does anybody find Google suggestions really
helpful ?
14. Face recognition application
EigenValues, covariance and SVD
Intuitive Understanding
Create a vector space from the pixels modelling the
face
Find average face vector
Find the specific features of every face by computing
the variance from the average vector
Take the top N specific features of this vector space
and create the classifier vector
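The steps above can be sketched with NumPy. The "faces" here are random 4x4 pixel grids generated purely as a stand-in (an assumption, since the deck has no data), and `nearest_face` is a hypothetical helper that uses nearest-neighbour matching as one possible classifier.

```python
import numpy as np

# Hypothetical data: 6 "faces", each a flattened 4x4 grey-scale image.
rng = np.random.default_rng(0)
faces = rng.random((6, 16))

# Average face, and each face's deviation from it.
mean_face = faces.mean(axis=0)
centered = faces - mean_face

# The right singular vectors of the centered data are the eigenvectors
# of its covariance matrix, ordered by explained variance.
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top N directions ("eigenfaces") as the feature space.
top_n = 3
eigenfaces = Vt[:top_n]
weights = centered @ eigenfaces.T   # each known face as N numbers


def nearest_face(face):
    """Index of the known face closest to `face` in eigenface space."""
    w = (face - mean_face) @ eigenfaces.T
    distances = np.linalg.norm(weights - w, axis=1)
    return int(np.argmin(distances))


print(nearest_face(faces[2]))
```

Each face is reduced from 16 pixel values to 3 weights, and recognition becomes a distance comparison in that smaller space.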
16. Inductive Models
Require prior information (Training set) to
extrapolate to future unseen cases.
Create a function from the input training data
which is applied on test data to produce output.
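A minimal sketch of "create a function from the training data", using a 1-nearest-neighbour learner as a stand-in for any inductive model. The feature pairs (count of animal words, count of politics words) and the labels are hypothetical.

```python
def train(examples):
    """Build a predict function from (features, label) training pairs."""
    memory = list(examples)

    def predict(features):
        def squared_distance(example):
            known_features, _ = example
            return sum((a - b) ** 2 for a, b in zip(known_features, features))

        # Label of the closest training example.
        return min(memory, key=squared_distance)[1]

    return predict


# Hypothetical training set: (animal-word count, politics-word count).
training_set = [
    ((3, 0), "animals"),
    ((2, 1), "animals"),
    ((0, 4), "politics"),
    ((1, 3), "politics"),
]

predict = train(training_set)
print(predict((4, 1)))
print(predict((0, 2)))
```

The returned `predict` knows nothing beyond the training set, which is exactly why the quality of that set matters so much for unseen cases.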
17. Bag 1 Bag 2 Bag 3
Ball drawn at random from a bag
and found to be white. What are the
chances of it being drawn from Bag
3?
18. Bag 1 Bag 2 Bag 3
Ball drawn at random from a bag
and found to be green. What are
the chances that it is from Bag 2 ?
19. Any way to model this ?
Bayes Theorem !!!
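Bayes' theorem gives P(Bag | colour) = P(colour | Bag) P(Bag) / P(colour). The sketch below applies it to the two questions above. The bag contents are not in the transcript, so the counts here are made up, and each bag is assumed equally likely to be picked.

```python
from fractions import Fraction

# Hypothetical contents: the original slide images are not available.
bags = {
    "Bag 1": {"white": 3, "green": 1},
    "Bag 2": {"white": 1, "green": 3},
    "Bag 3": {"white": 2, "green": 2},
}

# Assumption: every bag is equally likely to be chosen.
prior = {bag: Fraction(1, len(bags)) for bag in bags}


def posterior(colour):
    """P(bag | colour) for every bag, via Bayes' theorem."""
    joint = {}
    for bag, contents in bags.items():
        likelihood = Fraction(contents.get(colour, 0), sum(contents.values()))
        joint[bag] = prior[bag] * likelihood
    evidence = sum(joint.values())      # P(colour)
    return {bag: p / evidence for bag, p in joint.items()}


print(posterior("white")["Bag 3"])
print(posterior("green")["Bag 2"])
```

With these made-up counts a white ball comes from Bag 3 with probability 1/3, and a green ball comes from Bag 2 with probability 1/2.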
20. Cricket Bag: Sachin, cricket, Tendulkar, run, wicket, centre, crowd, cheered
Politics Bag: Sachin, Pilot, political, career, drew, crowds, elected, year, government, elections
Music Bag: Rock, band, took, crowd, storm, grammy
Test sentence:- Sachin's 10,000th run was
highly appreciated by the crowd
What are the chances this is from the Cricket Bag ?
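The same Bayes reasoning extends to words in bags, which is the idea behind a naive Bayes text classifier. The sketch below is one possible reading, not the deck's method: the word lists are my reading of the slide, and the uniform prior, add-one smoothing and dropping of unknown words are all assumptions.

```python
import re
from collections import Counter

# Word bags as read from the slide (the exact grouping is an assumption).
bags = {
    "Cricket": "Sachin cricket Tendulkar run wicket centre crowd cheered",
    "Politics": "Sachin Pilot political career drew crowds elected year "
                "government elections",
    "Music": "Rock band took crowd storm grammy",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


counts = {label: Counter(tokenize(words)) for label, words in bags.items()}
vocabulary = set().union(*counts.values())


def classify(sentence):
    """Posterior over bags: uniform prior, add-one (Laplace) smoothing."""
    # Words never seen in any bag carry no evidence and are dropped.
    tokens = [t for t in tokenize(sentence) if t in vocabulary]
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = 1.0 / len(counts)
        for token in tokens:
            score *= (word_counts[token] + 1) / (total + len(vocabulary))
        scores[label] = score
    evidence = sum(scores.values())
    return {label: s / evidence for label, s in scores.items()}


posterior = classify("Sachin's 10,000th run was highly appreciated by the crowd")
best = max(posterior, key=posterior.get)
print(best, posterior)
```

Only "sachin", "run" and "crowd" are known words. There is no stemming, so "crowds" in the Politics bag does not match "crowd", and the Cricket bag comes out as the most likely source.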
21. Other applications of probabilistic
models
Targeted Advertising Demo
POS Tagging
http://www.infogistics.com/posdemo.htm
Named Entity Recognition
22. Neural Networks
Demo video (5 minutes) Machine learns how to
steer the vehicle by observing the driver
After training, capable of steering the vehicle on
its own.
23. Uncertainties, Pitfalls
Depends on the training set
Training set may not be in line with the
application for which it is being used.
e.g. POS tagging trained on the Wall Street Journal corpus,
then applied to analyzing teen text !!
dove makes my skin smooth.
Can you deal with The Black Swan!!
Is inductive learning really needed ?
Combine with other features/approaches ?
More data is better than a good algorithm.