Mathematical Roadmap for IR

Introduction to the mathematical concepts
commonly used in IR/ST

               - Abhay   Shete, 42
Why bother about the maths ?

   Lots of open source libraries out there
   General approach for external software
       Include it
       Build it
       Forget it !!
IR is fuzzy
   Consider the following sentences
       India beat Pakistan
       Sachin has scored over 10,000 runs in Test cricket
       Sachin Tendulkar was unable to run to the
        non-striker's end
   How do we humans interpret this information ?
   How do machines interpret it ?
What excites me about IR ?

    It's all about indulging and flirting with sexy
     models !!!!
Models of interest

   Vector space models
   Inductive models
       Probabilistic
       Neural networks, decision trees
   Hybrid: vector space and inductive
   Others...
Vector Space Model
   Recall junior college mathematics
   Dot-product, cross-product, projections, etc..
Vector Space Model

   A1, A2 are articles about animals
   P1 contains politics
   To find similarity scores between
       a) A1, A2 ==> assume score is S1
       b) A1, P1 ==> assume score is S2
    Intuitively S1 > S2
    How do you create a model around this ?
Vector Space Model
   A1 = {dog(1), cat(2), lion(1), rhino(1), tiger(2)}
   A2 = {eagle(1), cat(1), tiger(1), fox(1), hyena(1)}

    N Dimensional Space !!!
     d1 = 1(dog) + 2(cat) + 1(lion) + 1(rhino) +
    2(tiger) + 0(eagle) + 0(fox) + 0(hyena)
    d2 = 0(dog) + 1(cat) + 0(lion) + 0(rhino) +
    1(tiger) + 1(eagle) + 1(fox) + 1(hyena)
    Similarity is measured by the angle between the
    two vectors !!
Similarity measured by the angle between two vectors !!

 If X and Y are two n-dimensional vectors
 <x_i> and <y_i>, the angle θ between them
 satisfies:
       X · Y = |X| |Y| cos θ
       cos θ = (X · Y) / (|X| |Y|)
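
As a rough sketch of the formula above, the snippet below builds d1 and d2 from the term counts on the earlier slide and computes the cosine of the angle between them. The politics article `p1` and its words are hypothetical, added only to show the S1 > S2 intuition. The helper names `to_vector` and `cosine_similarity` are illustrative and not from the talk.

```python
import math

# Term counts from the slide (fox and hyena count as 1, as in d2).
a1 = {"dog": 1, "cat": 2, "lion": 1, "rhino": 1, "tiger": 2}
a2 = {"eagle": 1, "cat": 1, "tiger": 1, "fox": 1, "hyena": 1}
# HYPOTHETICAL politics article, not from the slides.
p1 = {"election": 2, "minister": 1, "vote": 1}

# One dimension per distinct term across all articles.
vocabulary = sorted(set(a1) | set(a2) | set(p1))

def to_vector(counts, vocabulary):
    """Map a term-count dict onto the shared N-dimensional space."""
    return [counts.get(term, 0) for term in vocabulary]

def cosine_similarity(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|); 0.0 if either vector is all zeros."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

d1 = to_vector(a1, vocabulary)
d2 = to_vector(a2, vocabulary)
d3 = to_vector(p1, vocabulary)

s1 = cosine_similarity(d1, d2)  # animals vs animals
s2 = cosine_similarity(d1, d3)  # animals vs politics
print(round(s1, 4))  # 0.5394
print(s2)            # 0.0
```

A1 and A2 share only "cat" and "tiger", so the dot product is 4 and S1 = 4 / (√11 · √5). A1 and the hypothetical P1 share no terms, so S2 is 0.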
News alerts demo...
Linear Algebra

   Vectors and matrices are equivalent
   Important Linear Algebra concepts:
       Singular Value Decomposition
       EigenVectors
            Latent Semantic Indexing (LSI)
            Face recognition
Latent Semantic Indexing

   Exact keyword match not required
   No predefined semantic knowledge base
   Demo
       http://lsi.research.telcordia.com/lsi-bin/lsiQuery
   How does it work ?
       Fish Tank Analogy
Uncertainties, Pitfalls

   Not suitable for fine grained search
   Web Scale corpus !!
   Does anybody find Google suggestions really
    helpful ?
Face recognition application

   EigenValues, covariance and SVD
   Intuitive Understanding
       Create a vector space from the pixels modelling the
        face
       Find average face vector
       Find the specific features of every face by computing
        the variance from the average vector
       Take the top N specific features of this vector space
        and create the classifier vector
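
The steps above can be sketched in a few lines, assuming that "variance from the average vector" means each face's deviation from the mean face, and that the "specific features" are the leading eigenvectors of the covariance of those deviations. The 3-pixel faces are hypothetical toy data. Power iteration stands in for a full SVD, and only the top feature (N = 1) is extracted.

```python
import math

# HYPOTHETICAL 3-pixel "faces"; real images have thousands of pixels.
faces = [
    [8.0, 8.0, 5.0],
    [2.0, 2.0, 5.0],
    [6.0, 4.0, 5.0],
    [4.0, 6.0, 5.0],
]

def mean_vector(vectors):
    """Average face: the per-pixel mean over all faces."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def covariance(deviations):
    """Covariance matrix of the mean-centred face vectors."""
    n, dim = len(deviations), len(deviations[0])
    return [[sum(d[i] * d[j] for d in deviations) / n for j in range(dim)]
            for i in range(dim)]

def top_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and normalise to find the
    eigenvector with the largest eigenvalue."""
    dim = len(matrix)
    v = [1.0 / (i + 1) for i in range(dim)]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

average_face = mean_vector(faces)
deviations = [[p - m for p, m in zip(face, average_face)] for face in faces]
feature = top_eigenvector(covariance(deviations))

# Each face is summarised by its weight along the top feature.
weights = [sum(d * f for d, f in zip(dev, feature)) for dev in deviations]
print(average_face)                      # [5.0, 5.0, 5.0]
print([round(x, 4) for x in feature])    # [0.7071, 0.7071, 0.0]
```

In this toy data the first two pixels vary together most strongly, so the top feature points along that shared direction. A new face would be compared to known faces by its weights along the top N such features.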
Document Clustering

   Clusty.com demo
   Clustering approaches
Inductive Models

   Require prior information (Training set) to
    extrapolate to future unseen cases.

   Create a function from the input training data
    which is applied on test data to produce output.
[Figure: three bags of balls labelled Bag 1, Bag 2, Bag 3]

Ball drawn at random from a bag
and found to be white. What are the
chances of it being drawn from Bag
3?
[Figure: three bags of balls labelled Bag 1, Bag 2, Bag 3]


Ball drawn at random from a bag
and found to be green. What are
the chances that it is from Bag 2 ?
   Any way to model this ?
   Bayes Theorem !!!
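
A minimal sketch of how Bayes' theorem answers the two bag questions. The ball counts were shown only as pictures on the slides, so the contents below are hypothetical, as is the assumption that each bag is equally likely to be picked. The `posterior` helper is illustrative.

```python
from fractions import Fraction

# HYPOTHETICAL bag contents; the slides showed these only as images.
bags = {
    "Bag 1": {"white": 3, "green": 1},
    "Bag 2": {"white": 2, "green": 2},
    "Bag 3": {"white": 1, "green": 3},
}

def posterior(bags, colour):
    """P(bag | colour) = P(colour | bag) * P(bag) / P(colour),
    assuming every bag is equally likely to be chosen."""
    prior = Fraction(1, len(bags))
    joint = {}
    for name, contents in bags.items():
        likelihood = Fraction(contents.get(colour, 0), sum(contents.values()))
        joint[name] = prior * likelihood
    evidence = sum(joint.values())  # P(colour) over all bags
    if evidence == 0:
        raise ValueError("no bag contains a ball of colour %r" % colour)
    return {name: p / evidence for name, p in joint.items()}

print(posterior(bags, "white")["Bag 3"])  # 1/6
print(posterior(bags, "green")["Bag 2"])  # 1/3
```

With these assumed counts, a white ball is least likely to have come from Bag 3 because Bag 3 holds the fewest white balls.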
  Cricket Bag:   Sachin, cricket, Tendulkar, run, wicket, crowd, cheered
  Politics Bag:  Sachin, Pilot, political, career, drew, crowds, centre,
                 government, elections
  Music Bag:     Rock, band, took, crowd, storm, elected, year, grammy



  Test sentence:- Sachin's 10,000th run was
  highly appreciated by the crowd

  What are the chances that this is from the Cricket Bag ?
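
One way to put a number on this is a naive Bayes classifier over the three word bags. The sketch below assumes equal priors for the bags, add-one (Laplace) smoothing, and a crude lowercase tokenizer with no stemming. None of these choices come from the slides, and the function names are illustrative.

```python
import math
import re

# Word bags as listed on the slide.
bags = {
    "Cricket": "Sachin cricket Tendulkar run wicket crowd cheered",
    "Politics": "Sachin Pilot political career drew crowds centre "
                "government elections",
    "Music": "Rock band took crowd storm elected year grammy",
}

def tokenize(text):
    """Lowercase and keep alphabetic runs only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())

# Per-bag word counts and the shared vocabulary.
counts = {}
for label, text in bags.items():
    counts[label] = {}
    for word in tokenize(text):
        counts[label][word] = counts[label].get(word, 0) + 1
vocabulary = set().union(*counts.values())

def classify(sentence):
    """Return P(bag | sentence) under naive Bayes with add-one smoothing.
    Words outside the vocabulary are ignored."""
    words = [w for w in tokenize(sentence) if w in vocabulary]
    log_scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = math.log(1.0 / len(counts))  # assumed equal priors
        for w in words:
            score += math.log((word_counts.get(w, 0) + 1)
                              / (total + len(vocabulary)))
        log_scores[label] = score
    # Normalise in log space to avoid underflow.
    top = max(log_scores.values())
    weights = {l: math.exp(s - top) for l, s in log_scores.items()}
    norm = sum(weights.values())
    return {l: w / norm for l, w in weights.items()}

result = classify("Sachin's 10,000th run was highly appreciated by the crowd")
best = max(result, key=result.get)
print(best)  # Cricket
```

Only "sachin", "run" and "crowd" from the test sentence appear in the bags. All three occur in the Cricket Bag, so it scores highest. Note that "crowds" in the Politics Bag does not match "crowd" because the sketch does no stemming.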
Other applications on probabilistic
models
   Targeted Advertising: Demo
   POS Tagging
    http://www.infogistics.com/posdemo.htm
   Named Entity Recognition
Neural Networks

   Demo video (5 minutes): machine learns how to
    steer the vehicle by observing the driver
   After training, capable of steering the vehicle on
    its own.
Uncertainties, Pitfalls
   Depends on the training set
   Training set may not be in line with the
    application for which it is being used.
       POS tagger trained on the Wall Street Journal corpus
       Like applying the same tagger to analyze teen
        text !!
       "dove makes my skin smooth."
   Can you deal with "The Black Swan" !!
   Is inductive learning really needed ?
   Combine with other features/approaches ?
   More data better than a good algorithm.
