K Nearest Neighbour Classifier
● Tejas Bubane (I-05)
● Shriyansh Jain (H-43)
● Mitesh Butala (J-15)
● Gaurav Jagtap (H-42)

Project Guide: Asst. Prof. P.A. Bailke
VIT, Pune

TF-IDF Values
● Term Frequency (TF): Importance of the term within a document, measured as its raw frequency
i.e. TF(d,t) = Number of occurrences of the term (t) in the document (d)

● Inverse Document Frequency (IDF): Importance of the term in the corpus

IDF(t) = log(D / Dt)
where, D = total number of documents
Dt = number of documents in which the term t has occurred

A word that occurs in many documents is less useful for telling them apart, so its IDF value is low (and vice versa)

● TF-IDF(d,t) = TF(d,t) × IDF(t)
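The three definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `tf_idf`, the sample documents, and the use of the natural logarithm are assumptions (the slides do not state the base of the log).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: TF-IDF weight} dict per tokenized document."""
    n_docs = len(docs)  # D: total number of documents
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append(
            {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        )
    return weights

# Hypothetical toy corpus, already split into words.
docs = [
    ["cricket", "bat", "ball", "ball"],
    ["football", "ball", "goal"],
    ["election", "vote", "goal"],
]
weights = tf_idf(docs)
```

A term that occurs in every document gets IDF = log(1) = 0, so its TF-IDF weight is 0 regardless of how often it appears.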

KNN - Introduction
● Learning by analogy – comparison with similar items from training set

● Training tuples described by n attributes – each document represents a
point in n dimensional space

● Closeness defined in terms of distance metric
eg. Euclidean distance, Cosine similarity, Manhattan distance

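● Cosine similarity: cos(θ) = (A · B) / (|A| |B|), where A and B are the vectors of the two documents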

● Cos = 1 i.e. Angle = 0°: documents are similar
● Cos = 0 i.e. Angle = 90°: documents are not similar
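A minimal sketch of the cosine measure, assuming each document is stored as a sparse `{term: weight}` dict (for example, TF-IDF weights). The function name and the handling of empty vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity({"a": 1, "b": 2}, {"a": 2, "b": 4})
no_overlap = cosine_similarity({"a": 1}, {"b": 1})
```

Here `same_direction` is 1 up to floating-point rounding (angle 0), and `no_overlap` is 0 (angle 90) because the two documents share no terms.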

KNN Algorithm

● Compute the cosine similarity between the query document and each document
in the training set

● Find the k documents that are closest / nearest to the query document (highest similarity)

● Class of query is the class of majority of the nearest neighbours
(classes of each document in the training set are known)
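The steps above can be sketched as follows. This is an illustrative version, not a tuned implementation: the names `knn_classify` and `training`, the sample vectors, and the tie-breaking behaviour (the tied class seen first among the nearest neighbours wins) are all assumptions.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(query, training, k=3):
    """training is a list of (vector, label) pairs with known classes."""
    # Step 1: similarity of the query to every training document,
    # sorted so that the most similar documents come first.
    ranked = sorted(
        training,
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    # Step 2: keep the k nearest neighbours.
    # Step 3: majority vote over their classes.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with made-up weights.
training = [
    ({"ball": 1.0, "bat": 2.0}, "sports"),
    ({"ball": 2.0, "goal": 1.0}, "sports"),
    ({"vote": 2.0, "election": 1.0}, "politics"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
]
predicted = knn_classify({"ball": 1.0, "goal": 1.0}, training, k=3)
```

Every query scans the whole training set, which is why the cost of KNN grows with the size of the data.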

Further Analysis of Classification
● Lazy Learner : Starts operation only after a query is provided
eg. KNN (calculates TF-IDF values after receiving query)

● Eager Learner : Builds its model from the training data before any query is received
eg. ANN (adjusts weights before receiving query)

● Supervised Learning : Labelled training data
eg. Classification

● Unsupervised Learning : Find hidden structure in unlabelled data
eg. Clustering

● KNN is a Supervised Learning algorithm and follows the Lazy Learning approach

Scaling KNN
● Vocabulary – Set of all words occurring in all documents

● Large data set – Drastic increase in the vocabulary – difficult to handle

● Feature Selection – Relation between terms in vocabulary and classes
Remove words which are less related (below threshold) to all classes
Reduce vocabulary to make it manageable

● eg. Chi-square test
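One way the chi-square test could be used for feature selection is sketched below. It uses the standard chi-square statistic for a 2x2 table of term presence against class membership. The function names, the sample documents, and the threshold value are assumptions chosen for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # term or class covers all/no documents: no information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_terms(docs, labels, threshold):
    """Keep terms whose score reaches the threshold for at least one class."""
    vocabulary = set(t for doc in docs for t in doc)
    kept = set()
    for term in vocabulary:
        best = 0.0
        for c in set(labels):
            pairs = list(zip(docs, labels))
            n11 = sum(1 for d, l in pairs if term in d and l == c)
            n10 = sum(1 for d, l in pairs if term in d and l != c)
            n01 = sum(1 for d, l in pairs if term not in d and l == c)
            n00 = sum(1 for d, l in pairs if term not in d and l != c)
            best = max(best, chi_square(n11, n10, n01, n00))
        if best >= threshold:
            kept.add(term)
    return kept

# Hypothetical labelled documents, each represented as a set of words.
docs = [
    {"ball", "goal", "the"},
    {"ball", "bat", "the"},
    {"vote", "party", "the"},
    {"vote", "election", "the"},
]
labels = ["sports", "sports", "politics", "politics"]
reduced_vocabulary = select_terms(docs, labels, threshold=3.0)
```

With this toy data, "the" occurs in every document and scores 0, so it is dropped, while "ball" and "vote" separate the two classes perfectly and are kept.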
