Performance analysis of KNN & K-Means using internet advertisements data

Machine Learning
Muhammad Gulraj
BS GIKI, Pakistan
MS UET Peshawar, Pakistan
Muhammad.gulraj@yahoo.com
Abstract
Advertisements are undeniably one of the main sources of revenue on the internet. In this article we analyze the performance of two different classifiers, K nearest neighbor (KNN) and K-Means, using internet advertisements data [1]. The classifiers are used to identify advertisement and non-advertisement instances, and their performance is evaluated based on the results.
K-nearest neighbor:
In machine learning and pattern recognition, K nearest neighbor (KNN) is a non-parametric algorithm that can be used for classification and regression. It is among the simplest algorithms in machine learning. The output of KNN depends on whether the algorithm is used for regression or classification.
For classification, KNN assigns an object by a majority vote of its neighbors: the object is assigned to the class that has the most members among the object's nearest neighbors.
KNN is an instance-based (lazy) learning method. It is a simple but powerful algorithm: no training is required, and new training examples can be added easily. However, it is slow and expensive at query time, with a computational complexity of O(md) for m training examples with d features. Its run-time performance can be improved significantly by removing redundant data, computing approximate distances, or pre-sorting the training examples into fast data structures.
KNN can be used for handwritten character classification, content-based image retrieval, intrusion detection, and fault detection.
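As a hedged illustration of the pre-sorting idea above, the following sketch builds a KD-tree over the training examples once and then answers nearest-neighbor queries quickly. The use of SciPy's cKDTree and the random toy data are assumptions for the demo, not tooling from the original study:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))     # 1000 hypothetical training vectors, 4 features each

tree = cKDTree(X_train)             # pre-sort the training examples into a KD-tree once
x_new = rng.random(4)
dist, idx = tree.query(x_new, k=3)  # the 3 nearest neighbors, far faster than a full scan
print(idx, dist)
```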
Let's consider a simple example of an object whose neighbors include many sea bass and salmon. Assume K = 3, which means we consider the 3 nearest neighbors of the object. If those neighbors are 2 sea bass and 1 salmon, the KNN algorithm will classify the object as a sea bass.
KNN Algorithm:
The KNN algorithm is very simple. The training phase consists of storing all instances and their class labels. If feature selection has to be performed, n-fold cross validation should be used on the training set. To classify a new instance X given a training set Y, the following steps are needed (a code sketch follows the list).
i. Compute the distance of X from each instance in set Y.
ii. Sort the distances in increasing order and pick the first K elements (the K nearest).
iii. Find the most repeated class among the K nearest neighbors.
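A minimal sketch of these three steps in plain Python/NumPy. The toy data mirrors the sea bass/salmon example; the points and function names are assumptions made up for the demo:

```python
import numpy as np
from collections import Counter

def knn_classify(x, Y, labels, k=3):
    """Classify instance x against training set Y by majority vote."""
    # i. Compute the distance of x from each instance in set Y
    distances = np.linalg.norm(Y - x, axis=1)
    # ii. Sort the distances in increasing order and keep the first K elements
    nearest = np.argsort(distances)[:k]
    # iii. Return the most repeated class among the K nearest neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy example: the 3 nearest neighbors are 2 sea bass and 1 salmon -> "sea bass"
Y = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [0.8, 1.1]])
labels = ["sea bass", "sea bass", "salmon", "salmon"]
print(knn_classify(np.array([1.0, 1.0]), Y, labels, k=3))
```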
K-Means Clustering:
K-Means clustering is a popular method for vector quantization. It partitions n observations into K clusters, where each observation belongs to the cluster with the nearest mean. K-Means clustering is an NP-hard problem (computationally difficult), but efficient heuristic algorithms exist that converge rapidly to a local minimum.
Assume a set of observations x_1, x_2, ..., x_n, where every observation is a d-dimensional vector. The aim of K-Means clustering is to partition the n observations into k sets (clusters) S = {S_1, S_2, ..., S_k}, with k <= n, so as to minimize the within-cluster sum of squares:

\arg\min_S \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2

where \mu_i is the mean of the points in S_i.
K-Means Algorithm:
The K-Means clustering algorithm is as follows (a code sketch follows the steps below).
i. Specify K (the number of clusters).
ii. Select K points randomly as the initial cluster centers.
iii. Assign each object/instance to the closest cluster center.
iv. Find the centroid (mean) of every cluster and use it as the new cluster center.
v. Reassign all objects/instances to the closest cluster center.
vi. Iterate until the cluster centers no longer change.
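A minimal sketch of these steps (Lloyd's algorithm) in Python/NumPy. The toy two-blob data is an assumption for the demo, and the empty-cluster edge case is ignored for brevity:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # ii. Select K points randomly as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # iii./v. Assign each instance to the closest cluster center
        assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # iv. Use the mean of each cluster as its new center
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # vi. Stop once the cluster centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

# Toy demo: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, assign = k_means(X, k=2)
print(centers)
```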
Training Data:
The data [1] used for training the classifiers is already pre-classified into ad and non-ad instances, along with some test data.
- Every image inside an <a> anchor tag is a candidate ad; images not inside an anchor tag are mostly non-ads. Udest is the URL to which the anchor <a> points, while Uimg is the image's URL.
- The geometric information of the image is captured by its aspect ratio, height, and width. If any of the geometric features is missing, it is represented by "?".
- Another feature is the binary "local" flag, which is 1 when Uimg is served from the page's own host (e.g., data.host.com/image.jpg) and 0 when it points elsewhere (e.g., elsewhere.com/picture.gif).
- The alt (alternative) text used in the anchor tag can be used as a feature in the data set.
- Another feature used in the data set is the text inside the <a> anchor tag.
The whole set of features (geometry, locality, alt text, and anchor text) can be represented as a single record per candidate image.
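Below is a hedged sketch of assembling such a record in Python. The field names and the "?" missing-value convention follow the description above, but the exact encoding used in the original dataset may differ:

```python
from urllib.parse import urlparse

def image_features(height, width, u_base, u_img, u_dest, alt_text, anchor_text):
    """Assemble one per-image feature record as described in the text."""
    local = int(urlparse(u_img).netloc == urlparse(u_base).netloc)
    aspect = round(width / height, 3) if height and width else "?"
    return {
        "height": height if height else "?",   # geometric features, "?" if missing
        "width": width if width else "?",
        "aspect_ratio": aspect,
        "local": local,                         # 1 if image served from the page's host
        "alt": alt_text.lower().split(),        # alt text terms
        "anchor": anchor_text.lower().split(),  # text inside the <a> anchor tag
    }

print(image_features(60, 468, "http://data.host.com/page.html",
                     "http://data.host.com/image.jpg",
                     "http://elsewhere.com/ad.html", "Click Here", "Buy now"))
```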
More sets of features are provided by the base URL (Ubase), destination URL (Udest), and image URL (Uimg). Single-word phrases are ignored if they belong to low-information terminology such as html, jpeg, gif, www, and http. This procedure generates a number of encodings (the encoding used here has 1558 features), one for each combination of K (maximum phrase length) and M (minimum phrase count).
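A hedged sketch of this phrase-encoding step. The parameters K and M mirror the description above, while the tokenization and the exact stop list are assumptions:

```python
from collections import Counter

def phrase_features(token_lists, K=3, M=2,
                    stop={"html", "jpeg", "gif", "www", "http"}):
    """Keep phrases of up to K consecutive tokens occurring at least M times,
    ignoring single-word phrases drawn from the low-information stop set."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, K + 1):
            for i in range(len(tokens) - n + 1):
                phrase = tuple(tokens[i:i + n])
                if n == 1 and phrase[0] in stop:
                    continue
                counts[phrase] += 1
    return {p for p, c in counts.items() if c >= M}

urls = [["www", "shop", "com", "buy", "now"], ["shop", "com", "buy", "now"]]
print(phrase_features(urls, K=2, M=2))
```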
Improving the existing data:
The existing data can be improved by using the words/phrases at the end of the anchor and image URLs. For example:
<a href="www.example.com/sales.html">
<img src="www.example.com/mobile.jpg">
The above links have sales.html and mobile.jpg as their last components. If we separate out low-information words like jpg and html, we can use the words sales and mobile as features.
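One possible implementation of this idea (a sketch; the splitting rules and the low-information word list are assumptions):

```python
import re

LOW_INFO = {"html", "htm", "jpg", "jpeg", "gif", "www", "http", "com"}

def last_words(url):
    """Return the informative words in a URL's final path component."""
    last = url.rstrip("/").rsplit("/", 1)[-1]      # e.g. "sales.html"
    words = re.split(r"[^a-zA-Z]+", last.lower())  # split on non-letter characters
    return [w for w in words if w and w not in LOW_INFO]

print(last_words("www.example.com/sales.html"))  # ['sales']
print(last_words("www.example.com/mobile.jpg"))  # ['mobile']
```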
Learning Rules:
The learning classifier algorithms should have the following properties:
- The algorithm should not be oversensitive to missing information.
- The learning algorithm should be insensitive to irrelevant information and must scale with the number of features.
- Some URLs or phrases change over time, so the classifier must be relearned; rather than relearning from scratch, the learning algorithm would ideally be incremental (as in the case of K nearest neighbor).
Two example rules of the kind learned from this data (a code sketch follows the list):
- If the aspect ratio is greater than 4.5, alt contains "to" but does not contain "click+here", and Udest does not contain "http+www", then the instance is an advertisement (ad).
- If Ubase does not contain "messier" while Udest contains "redirect+cgi", then the instance is an advertisement (ad).
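These two rules could be expressed directly in code. The sketch below uses a hypothetical feature record (a dict with the fields described in the training-data section); the original system's rule representation may differ:

```python
def is_ad(inst):
    """Hypothetical encoding of the two example rules above."""
    # Rule 1: wide banner shape + suggestive alt text + external destination
    if (inst["aspect_ratio"] > 4.5
            and "to" in inst["alt"] and "click+here" not in inst["alt"]
            and "http+www" not in inst["u_dest"]):
        return True
    # Rule 2: redirect script on a non-"messier" site
    if "messier" not in inst["u_base"] and "redirect+cgi" in inst["u_dest"]:
        return True
    return False

inst = {"aspect_ratio": 6.0, "alt": ["to"], "u_dest": ["redirect+cgi"], "u_base": []}
print(is_ad(inst))  # True (both example rules happen to fire for this instance)
```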
Evaluation:
We conducted several experiments using the cross-validation methodology. The data is randomly divided into a training set containing 89% of the data and a test set containing the remainder. The results are
cross-validated 12 times after invoking the rules mentioned above on the data. The accuracy of the learned rules is found to be 95.6% for K nearest neighbor (K-NN) and 93.5% for the K-Means classifier. To further understand their limitations, we measured a learning curve for both classifiers.
Learning curves for K-NN and K-Means:
To calculate a learning curve, we provide each classifier with 10, 20, 30, ..., 90, 100 percent of the training data, after which we calculate the 11-fold cross-validated accuracy on the remainder. The resulting curves reach 95.6% for KNN and 93.5% for the K-Means classifier.
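A hedged sketch of computing such a learning curve with scikit-learn, which postdates the original experiments; the random data is only a placeholder standing in for the 1558-feature advertisement dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import learning_curve

# Placeholder data standing in for the advertisement dataset
X = np.random.rand(500, 20)
y = np.random.randint(0, 2, 500)

sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10%, 20%, ..., 100% of training data
    cv=11)                                  # 11-fold cross-validation, as in the text
print(sizes)
print(test_scores.mean(axis=1))             # accuracy at each training-set size
```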
Related Work:
There are different systems available over the internet for removing unwanted advertisements from web pages, for example muffin (muffin.doit.org), web-filter (www.uni-paderborn.de), and By-proxy (www.besiex.org/Byproxy). These systems rely on different filtering rules. As the internet grows, good learning algorithms and classifiers are required for these systems to scale.
Future Work:
A system/software based on this work could be developed to remove unwanted and annoying advertisements from web pages while browsing the internet. The data is very old (1998) and could be improved, as the internet and web tools are now far more advanced.
References
1. Andrew Ng (2013), Machine Learning, online course, Stanford University, https://class.coursera.org/ml-004/class.
2. Duda and Hart (2001), Pattern Classification, Wiley, New York.
3. http://en.wikipedia.org
4. http://www.mathworks.com/help/stats/hidden-markov-models-hmm.html
5. Tsung-Hsien Chiang (2012), A Ranking-based KNN Approach for Multi-Label Classification, National Taiwan University.
6. D. T. Pham, S. S. Dimov, and C. D. Nguyen, Selection of K in K-means clustering, Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.