K Nearest Neighbour Classifier
•  Tejas Bubane (I-05)

•  Shriyansh Jain (H-43)

•  Mitesh Butala (J-15)

•  Gaurav Jagtap (H-42)



                             Project Guide: Asst. Prof. P.A. Bailke
                                                         VIT, Pune
TF-IDF Values
•  Term Frequency (TF): importance of the term within that document → raw
   frequency
   i.e. TF(d,t) = number of occurrences of the term t in the document d

•  Inverse Document Frequency (IDF): importance of the term in the corpus

   IDF(t) = log(D / DF(t))
      where, D = total number of documents
             DF(t) = number of documents in which the term has occurred

•  Word occurs in many documents → less useful → low IDF value (and vice-versa)

•  TF-IDF(d,t) = TF(d,t) × IDF(t)
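Below is a minimal Python sketch of the TF-IDF computation defined above (raw-count TF, IDF(t) = log(D / DF(t))); the corpus, tokenization, and function names are illustrative, not part of the original slides.

from collections import Counter
import math

def tf_idf(docs):
    # docs: list of tokenized documents (lists of terms)
    D = len(docs)                                  # total number of documents
    df = Counter()                                 # DF(t): documents containing t
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                          # TF(d,t): raw term frequency
        vectors.append({t: tf[t] * math.log(D / df[t]) for t in tf})
    return vectors

# A term occurring in every document gets IDF = log(1) = 0, i.e. no weight.
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
print(tf_idf(docs))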
KNN - Introduction
•  Learning by analogy → comparison with similar items from the training set

•  Training tuples are described by n attributes → each document represents a
   point in n-dimensional space

•  Closeness is defined in terms of a distance metric,
   e.g. Euclidean distance, cosine similarity, Manhattan distance

   [Figure: angle θ between two document vectors d₁ and d₂]

      cos θ = (d₁ · d₂) / (‖d₁‖ ‖d₂‖)

•  cos θ = 1, i.e. angle θ = 0° → documents are similar
•  cos θ = 0, i.e. angle θ = 90° → documents are not similar
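A small Python sketch of this cosine similarity over the sparse {term: weight} vectors produced by the TF-IDF sketch above; that representation is an assumption carried over from the earlier example.

import math

def cosine_similarity(a, b):
    # cos θ = (a · b) / (‖a‖ ‖b‖) for sparse vectors stored as {term: weight}
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity({"cat": 1.0}, {"cat": 2.0}))   # same direction → 1.0
print(cosine_similarity({"cat": 1.0}, {"dog": 1.0}))   # no shared terms → 0.0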
KNN Algorithm

•  Find the cosine distance of the query document to each document
   in the training set

•  Find the k documents that are closest / nearest to the query document

•  The class of the query is the majority class among those nearest neighbours
   (the class of each document in the training set is known)
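A hedged sketch of these three steps, reusing cosine_similarity from the previous sketch and ranking by similarity (highest similarity = nearest); the data layout and the k = 3 default are illustrative choices.

from collections import Counter

def knn_classify(query, training, k=3):
    # training: list of (tf_idf_vector, label) pairs
    # 1. score the query against every training document
    scored = sorted(training,
                    key=lambda pair: cosine_similarity(query, pair[0]),
                    reverse=True)
    # 2. keep the k nearest neighbours; 3. take their majority class
    neighbours = [label for _, label in scored[:k]]
    return Counter(neighbours).most_common(1)[0][0]

training = [({"ball": 2.0, "goal": 1.0}, "sports"),
            ({"vote": 1.5, "poll": 1.0}, "politics"),
            ({"goal": 2.0}, "sports")]
print(knn_classify({"goal": 1.0, "ball": 1.0}, training))  # → "sports"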
Further Analysis of Classification
•  Lazy learner: starts operating only after a query is provided
   e.g. KNN (calculates TF-IDF values after receiving the query)

•  Eager learner: operates and keeps learning until a query is received
   e.g. ANN (adjusts weights before receiving the query)

•  Supervised learning: labelled training data
   e.g. classification

•  Unsupervised learning: finds hidden structure in unlabelled data
   e.g. clustering

•  KNN is a supervised learning algorithm and follows the lazy learning approach
Scaling KNN
•  Vocabulary → the set of all words occurring in all documents

•  Large data set → drastic increase in the vocabulary → difficult to handle

•  Feature selection → relation between terms in the vocabulary and the classes;
   remove words that are weakly related (below a threshold) to all classes,
   reducing the vocabulary to a manageable size

•  e.g. chi-square test (see the sketch below)
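A minimal sketch of chi-square feature selection under the scheme described above: score each (term, class) pair from a 2×2 contingency table and keep terms that pass a threshold for at least one class. The helper names and the 3.84 cutoff (the χ² critical value at p = 0.05 with 1 degree of freedom) are illustrative assumptions, not from the slides.

def chi_square(term, cls, docs):
    # docs: list of (token_set, label) pairs; build the 2x2 contingency table
    n11 = sum(1 for toks, lab in docs if term in toks and lab == cls)
    n10 = sum(1 for toks, lab in docs if term in toks and lab != cls)
    n01 = sum(1 for toks, lab in docs if term not in toks and lab == cls)
    n00 = sum(1 for toks, lab in docs if term not in toks and lab != cls)
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

def select_features(vocabulary, classes, docs, threshold=3.84):
    # keep a term only if it is sufficiently related to at least one class
    return {t for t in vocabulary
            if max(chi_square(t, c, docs) for c in classes) >= threshold}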
