
Mathematical Roadmap for IR

Introduction to the mathematical concepts
commonly used in IR/ST

- Abhay Shete, 42

Why bother about the maths ?

● Lots of open source libraries out there
● General approach for external software
– Include it
– Build it
– Forget it !!

IR is fuzzy
● Consider the following sentences
– India beat Pakistan
– Sachin has scored over 10,000 runs in Test cricket
– Sachin Tendulkar was unable to run to the non-striker's end
● How do we humans interpret this information ?
● How do machines interpret it ?

What excites me about IR ?

● It's all about indulging and flirting with sexy
models !!!!

Models of interest

● Vector space models
● Inductive models
– Probabilistic
– Neural networks, decision trees
● Hybrid – vector space and inductive
● Others...

Vector Space Model
● Recall junior college mathematics
● Dot product, cross product, projections, etc.

Vector Space Model

● A1, A2 are articles about animals
● P1 contains politics
● To find similarity scores between
– a) A1, A2 ==> assume score is S1
– b) A1, P1 ==> assume score is S2
● Intuitively S1 > S2
● How do you create a model around this ?

Vector Space Model
● A1 = {dog(1), cat(2), lion(1), rhino(1), tiger(2)}
● A2 = {eagle(1), cat(1), tiger(1), fox(1), hyena(1)}
● N Dimensional Space !!!
● d1 = 1(dog) + 2(cat) + 1(lion) + 1(rhino) +
2(tiger) + 0(eagle) + 0(fox) + 0(hyena)
● d2 = 0(dog) + 1(cat) + 0(lion) + 0(rhino) +
1(tiger) + 1(eagle) + 1(fox) + 1(hyena)
● Similarity is measured by the angle between the
two vectors !!
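
The step from term counts to vectors can be sketched in a few lines of Python. This is only an illustration: the function name `to_vectors` and the dictionary layout are assumptions made for this example, not part of the original slides. Each distinct term becomes one axis, and a document that lacks a term gets 0 on that axis.

```python
def to_vectors(docs):
    """Map {doc_name: {term: count}} onto a shared term axis.

    Returns the vocabulary (one entry per dimension, in first-seen order)
    and one count vector per document.
    """
    vocab = []
    for counts in docs.values():
        for term in counts:
            if term not in vocab:
                vocab.append(term)
    vectors = {
        name: [counts.get(term, 0) for term in vocab]
        for name, counts in docs.items()
    }
    return vocab, vectors


# Term counts for the two animal articles from the slide.
docs = {
    "A1": {"dog": 1, "cat": 2, "lion": 1, "rhino": 1, "tiger": 2},
    "A2": {"eagle": 1, "cat": 1, "tiger": 1, "fox": 1, "hyena": 1},
}

vocab, vectors = to_vectors(docs)
print(vocab)
print(vectors["A1"])
print(vectors["A2"])
```

With these two articles the space has eight dimensions, one per distinct animal name.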

Similarity measured by the angle between
two vectors !!

If X and Y are two n-dimensional vectors
<xi> and <yi>, the angle θ between them
satisfies:
X · Y = |X| |Y| cos θ
cos θ = (X · Y) / (|X| |Y|)
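
As a minimal sketch, the formula can be applied to the two animal vectors d1 and d2 from the earlier slide. The helper name `cosine_similarity` is chosen for this example only; it uses nothing beyond the Python standard library.

```python
import math


def cosine_similarity(x, y):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Axis order: dog, cat, lion, rhino, tiger, eagle, fox, hyena
d1 = [1, 2, 1, 1, 2, 0, 0, 0]
d2 = [0, 1, 0, 0, 1, 1, 1, 1]

# dot = 2*1 + 2*1 = 4, |d1| = sqrt(11), |d2| = sqrt(5)
print(round(cosine_similarity(d1, d2), 3))  # 4 / sqrt(55), about 0.539
```

A value of 1 means the vectors point the same way, and 0 means the documents share no terms. With non-negative term counts the score always falls between those two.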

Linear Algebra

● Vectors and matrices are equivalent
● Important Linear Algebra concepts:
– Singular Value Decomposition
– Eigenvectors
● Latent Semantic Indexing (LSI)
● Face recognition

Latent Semantic Indexing

● Exact keyword match not required
● No predefined semantic knowledge base
● Demo
– http://lsi.research.telcordia.com/lsi-bin/lsiQuery
● How does it work ?
– Fish Tank Analogy

Uncertainties, Pitfalls

● Not suitable for fine grained search
● Web Scale corpus !!
● Does anybody find Google suggestions really
helpful ?

Face recognition application

● Eigenvalues, covariance and SVD
● Intuitive Understanding
– Create a vector space from the pixels modelling the
face
– Find “average” face vector
– Find the specific features of every face by computing
the variance from the average vector
– Take the top N specific features of this vector space
and create the “classifier” vector
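
The steps above can be sketched on toy data. Everything here is an assumption made for illustration: the four 4-pixel "faces" are invented, the helper names are hypothetical, and only the single strongest feature (N = 1) is extracted. To stay within the standard library, the sketch finds that feature by power iteration on the covariance matrix rather than by a full SVD.

```python
import math


def mean_vector(rows):
    """Average face: the per-pixel mean over all face vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def centre(rows, mean):
    """Deviation of every face from the average face."""
    return [[v - m for v, m in zip(row, mean)] for row in rows]


def covariance(centred):
    """Sample covariance matrix (pixels x pixels) of the centred faces."""
    n = len(centred)
    dim = len(centred[0])
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def top_eigenvector(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration."""
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Invented data: two "faces" that brighten left to right, two that darken.
faces = [
    [10, 20, 30, 40],
    [12, 22, 29, 41],
    [50, 40, 30, 20],
    [52, 41, 31, 19],
]

mean_face = mean_vector(faces)
centred = centre(faces, mean_face)
component = top_eigenvector(covariance(centred))

# Each face is summarised by its coordinate along the strongest feature.
projections = [sum(c * x for c, x in zip(component, row)) for row in centred]
print(projections)
```

In this toy set, the two similar pairs of faces land on opposite sides of zero along the extracted feature. That separation is what a classifier built on such features would use.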

Document Clustering

● Clusty.com demo
● Clustering approaches

Inductive Models

● Require prior information (Training set) to
extrapolate to future unseen cases.

● Create a “function” from the input training data
which is applied on test data to produce output.

Bag 1 Bag 2 Bag 3

Ball drawn at random from a bag
and found to be white. What are the
chances of it being drawn from Bag
3?

Bag 1 Bag 2 Bag 3

Ball drawn at random from a bag
and found to be green. What are
the chances that it is from Bag 2 ?

● Any way to model this ?
● Bayes Theorem !!!
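
A minimal sketch of how Bayes' theorem answers the bag questions. The slide's pictures are not reproduced here, so the ball counts below are invented purely for illustration. The sketch also assumes that each bag is equally likely to be chosen. The function name `posterior` is hypothetical.

```python
from fractions import Fraction


def posterior(bags, colour):
    """Return P(bag | colour) for every bag.

    Assumes each bag is picked with equal prior probability and a ball
    is then drawn uniformly at random from the chosen bag.
    """
    prior = Fraction(1, len(bags))
    joint = {}
    for name, contents in bags.items():
        total = sum(contents.values())
        likelihood = Fraction(contents.get(colour, 0), total)  # P(colour | bag)
        joint[name] = prior * likelihood                      # P(bag, colour)
    evidence = sum(joint.values())                            # P(colour)
    if evidence == 0:
        raise ValueError("no bag contains a ball of that colour")
    return {name: p / evidence for name, p in joint.items()}


# Hypothetical contents; the real counts were shown only as pictures.
bags = {
    "Bag 1": {"white": 1, "green": 3},
    "Bag 2": {"white": 2, "green": 2},
    "Bag 3": {"white": 3, "green": 1},
}

print(posterior(bags, "white")["Bag 3"])  # 1/2 for these invented counts
print(posterior(bags, "green")["Bag 2"])  # 1/3 for these invented counts
```

The posterior is the joint probability of (bag, colour) divided by the total probability of seeing that colour from any bag.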

● Cricket Bag: Sachin, cricket, Tendulkar, run, wicket, centre, crowd, cheered
● Politics Bag: Sachin, Pilot, political, career, drew, crowds, elected, year, government, elections
● Music Bag: Rock, band, took, crowd, storm, grammy

Test sentence:- Sachin's 10,000th run was
highly appreciated by the crowd

What are the chances that this is from the “Cricket” bag ?
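
One way to put a number on this is a naive Bayes classifier over the word bags. The sketch below rests on several assumptions. The word lists are one reading of the slide's three bags. The three bags are treated as equally likely a priori. Words that appear in no bag are ignored. Add-one (Laplace) smoothing is applied so that an unseen word does not zero out a bag. The function name `classify` is hypothetical.

```python
import math
import re

# One reading of the three word bags (lower-cased).
BAGS = {
    "cricket": ["sachin", "cricket", "tendulkar", "run", "wicket",
                "centre", "crowd", "cheered"],
    "politics": ["sachin", "pilot", "political", "career", "drew", "crowds",
                 "elected", "year", "government", "elections"],
    "music": ["rock", "band", "took", "crowd", "storm", "grammy"],
}


def classify(sentence, bags):
    """Return P(bag | sentence) under a naive Bayes model with add-one smoothing."""
    vocab = {w for words in bags.values() for w in words}
    # Keep only the tokens that occur in at least one bag.
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t in vocab]
    log_scores = {}
    for label, words in bags.items():
        log_p = math.log(1 / len(bags))  # equal prior for every bag
        for t in tokens:
            log_p += math.log((words.count(t) + 1) / (len(words) + len(vocab)))
        log_scores[label] = log_p
    # Normalise in a numerically safe way.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}


test_sentence = "Sachin's 10,000th run was highly appreciated by the crowd"
scores = classify(test_sentence, BAGS)
print(max(scores, key=scores.get))  # cricket
```

Only "Sachin", "run" and "crowd" match the bags. "Sachin" also appears in the Politics bag and "crowd" in the Music bag, but the Cricket bag is the only one that contains all three, so it receives the highest score.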

Other applications of probabilistic models
● Targeted Advertising – Demo
● POS Tagging
http://www.infogistics.com/posdemo.htm
● Named Entity Recognition

Neural Networks

● Demo video (5 minutes) – Machine learns how to
steer the vehicle by observing the driver
● After training, capable of steering the vehicle on
its own.

Uncertainties, Pitfalls
● Depends on the training set
● Training set may not be in line with the
application for which it is being used.
– POS tagging trained on the Wall Street Journal corpus
– “Like” Applying the same for “like” analyzing teen
text !!
– “dove makes my skin smooth.”
● Can you deal with “The Black Swan”!!
● Is inductive learning really needed ?
● Combine with other features/approaches ?
● More data better than a good algorithm.
