Machine Learning
Performance analysis of KNN & K-Means using internet advertisements data

Muhammad Gulraj
Muhammad.gulraj@yahoo.com
BS GIKI, Pakistan
MS UET Peshawar, Pakistan
Abstract
Advertisements are a major source of revenue on the internet. In this article we analyze the performance of two classifiers, k-nearest neighbors (KNN) and K-means, on internet advertisements data [1]. The classifiers are used to distinguish advertisement from non-advertisement instances, and their performance is evaluated on that task.
K-Nearest Neighbor:
In machine learning and pattern recognition, k-nearest neighbors (KNN) is a non-parametric algorithm that can be used for classification and regression. It is among the simplest algorithms in machine learning. The output of KNN depends on whether the algorithm is used for regression or classification.
For classification, KNN assigns an object by a majority vote of its neighbors: the object is assigned to the class that is most common among the training instances nearest to it.
MuhammadGulraj
BS GIKI, Pakistan
MS UET Peshawar Pakistan
3
KNN is an instance-based (lazy) learning method. It is a simple but powerful algorithm: no training is required, and new training examples can be added easily. However, it is slow and expensive at query time, with a computational complexity of O(md) per query (m stored instances, d features). Its run-time performance can be improved significantly by removing redundant data, computing approximate distances, or pre-sorting the training examples into fast data structures such as k-d trees.
KNN can be used in handwritten character classification, content-based image retrieval, intrusion detection, and fault detection.
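As an illustration of the last point, a library such as scikit-learn can index the training examples with a k-d tree. The following is only a sketch: it assumes scikit-learn is installed, and the toy feature values and labels are invented for illustration.

```python
# A minimal sketch: speeding up KNN lookups with a k-d tree index.
# Assumes scikit-learn is installed; the toy data are invented.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_train = ["ad", "ad", "non-ad", "non-ad"]

# algorithm="kd_tree" pre-sorts the training examples into a fast
# spatial index instead of brute-force scanning every instance.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X_train, y_train)
print(clf.predict([[1.5, 1.5]]))  # -> ['ad']
```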
Let's consider a simple example of an object that has many sea bass and salmon as neighbors. Assume K = 3, meaning we consider the 3 nearest neighbors of the object. If those neighbors are 2 sea bass and 1 salmon, the KNN algorithm will classify the object as sea bass.
KNN Algorithm:
The KNN algorithm is very simple. The training phase consists of storing all instances along with their class labels. If feature selection has to be performed, n-fold cross-validation should be used on the training set. To classify a new instance X given a training set Y, the following steps are needed:
i. Compute the distance of X from each instance in set Y.
ii. Sort the distances in increasing order and pick the K smallest elements.
iii. Return the most frequent class among those K nearest neighbors.
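The following is a minimal sketch of steps i-iii in plain Python; the Euclidean distance metric and the toy sea bass/salmon feature values are assumptions made for illustration.

```python
# A minimal KNN sketch following steps i-iii above.
# Euclidean distance and the toy data are assumptions for illustration.
import math
from collections import Counter

def knn_classify(x, training_set, k=3):
    # i. Compute the distance of x from each training instance.
    distances = [(math.dist(x, xi), label) for xi, label in training_set]
    # ii. Sort the distances in increasing order and keep the k closest.
    k_nearest = sorted(distances)[:k]
    # iii. Return the most common class among the k nearest neighbors.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Toy example: 2 sea bass and 1 salmon among the 3 nearest neighbors.
training_set = [
    ((1.0, 1.0), "sea bass"),
    ((1.2, 0.9), "sea bass"),
    ((0.8, 1.1), "salmon"),
    ((5.0, 5.0), "salmon"),
]
print(knn_classify((1.0, 1.0), training_set, k=3))  # -> "sea bass"
```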
K-Means Clustering:
K-means clustering is a popular method for vector quantization. K-means partitions n observations into K clusters such that each observation belongs to the cluster with the nearest mean. Finding an optimal K-means clustering is an NP-hard problem (computationally difficult), but efficient heuristic algorithms exist that converge rapidly to a local minimum.
Assume a set of observations x1, x2, x3, ..., xn, where every observation is a d-dimensional vector. The aim of K-means clustering is to partition the n observations into k sets (clusters) S = {S1, S2, ..., Sk}, with k <= n, so as to minimize the within-cluster sum of squares, where μi is the mean of the points in Si.
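Written out, the within-cluster sum-of-squares objective (the standard formulation, consistent with the description above) is:

```latex
\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
```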
K-Means Algorithm:
The algorithm of K-means clustering is as follows:
i. Specify K (the number of clusters).
ii. Select K points at random as initial cluster centers.
iii. Assign each object/instance to the closest cluster center.
iv. Compute the centroid (mean) of every cluster and use it as the new cluster center.
v. Reassign all objects/instances to the closest cluster center.
vi. Iterate until the cluster centers no longer change.
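A minimal sketch of steps i-vi in Python follows; the squared Euclidean distance and the simple exact-equality convergence test are assumptions, not the only possible choices.

```python
# A minimal K-means sketch following steps i-vi above.
# Squared Euclidean distance and random initialization are assumptions.
import random

def kmeans(points, k, max_iters=100):
    # i-ii. Specify k and pick k points at random as initial centers.
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # iii/v. Assign each point to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # iv. Recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(cs) / len(cs) for cs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # vi. Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = kmeans(points, k=2)
print(centers)
```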
Training Data:
The data [1] used for training the classifiers are already pre-classified into ad and non-ad instances, along with some test data.
• Every image inside an <a> tag is a candidate ad; images not inside an anchor tag are mostly non-ads. Udest is the URL to which the anchor <a> points, while Uimg is the image's URL.
• The geometric information of the image is captured by its aspect ratio, height, and width. If any geometric feature is missing, it is represented by a question mark (?).
• Another feature, local, is binary: it is 1 when Uimg is served from the page's own host (e.g., data.host.com/image.jpg) and 0 when it comes from elsewhere (e.g., elsewhere.com/picture.gif). A sketch of this encoding appears after this list.
• The alt text used in the anchor tag is alternative text, which can also be used as a feature in the data set.
• Another feature used in the data set is the text inside the <a> anchor tag.
• The whole set of features thus combines the geometric, URL, alt-text, and anchor-text attributes described above.
• More sets of features are provided by the base URL (Ubase), the destination URL (Udest), and the image URL (Uimg). Single-word phrases are ignored if they are members of low-information terms such as html, jpeg, gif, www, and http.
• This procedure generates a number of encodings (1558 features per encoding), one for each choice of K = maximum phrase length and M = minimum phrase count.
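To make the geometric and local features concrete, the following hypothetical sketch encodes them for a single image; the function name and dictionary layout are our own, not the dataset's actual extraction code.

```python
# A hypothetical sketch of encoding the geometric and "local" features
# described above; the dataset's real extraction code is not shown in [1].
from urllib.parse import urlparse

def encode_image_features(height, width, img_url, base_url):
    # Missing geometric values are marked "?" in the data set.
    if height is None or width is None:
        aspect_ratio = "?"
    else:
        aspect_ratio = width / height
    # "local" is 1 when the image is served from the same host as the page.
    local = int(urlparse(img_url).hostname == urlparse(base_url).hostname)
    return {"height": height, "width": width,
            "aspect_ratio": aspect_ratio, "local": local}

print(encode_image_features(60, 468, "http://elsewhere.com/picture.gif",
                            "http://data.host.com/page.html"))
# -> {'height': 60, 'width': 468, 'aspect_ratio': 7.8, 'local': 0}
```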
Improving the existing data:
The existing data can be improved by using the words/phrases that appear at the end of the anchor tag. For example:
• <a href="www.example.com/sales.html">
• <img src="www.example.com/mobile.jpg">
The above links end in sales.html and mobile.jpg. If we strip low-information words like html and jpg, we can use the words sales and mobile as features.
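A small sketch of this idea in Python; the regular expression and the list of low-information words are assumptions made for illustration.

```python
# A sketch of extracting the last meaningful word from a URL,
# dropping low-information suffixes such as "html" and "jpg".
import re

LOW_INFO = {"html", "htm", "jpg", "jpeg", "gif", "www", "http"}

def last_word_feature(url):
    # Split the URL into alphanumeric words and walk backwards
    # past any low-information terms.
    words = re.findall(r"[a-z0-9]+", url.lower())
    for word in reversed(words):
        if word not in LOW_INFO:
            return word
    return None

print(last_word_feature("www.example.com/sales.html"))  # -> "sales"
print(last_word_feature("www.example.com/mobile.jpg"))  # -> "mobile"
```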
Learning Rules:
A learning algorithm for this task should have the following properties:
• It should not be overly sensitive to missing information.
• It should be insensitive to irrelevant information and must scale with the number of features.
• Some URLs or phrases may change over time, so the classifier may have to be relearned from scratch; ideally the learning algorithm would be incremental (as is the case for k-nearest neighbors).
Two examples of learned rules:
• If the aspect ratio is greater than 4.5, alt contains "to" but does not contain "click+here", and Udest does not contain "http+www", then the instance is an advertisement (ad).
• If Ubase does not contain "messier" while Udest contains "redirect+cgi", then the instance is an advertisement (ad).
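The two example rules above can be written directly as a predicate. In the following sketch the feature names and the phrase-set representation are assumptions about how an encoded instance would be queried, not the paper's actual rule engine.

```python
# A sketch of the two example rules above as a predicate.
# The feature names and the phrase-set representation are assumptions.
def is_ad(instance):
    # Rule 1: wide banner-like images whose alt text contains "to"
    # but not "click+here", and whose Udest lacks "http+www".
    if (instance["aspect_ratio"] != "?" and instance["aspect_ratio"] > 4.5
            and "to" in instance["alt_phrases"]
            and "click+here" not in instance["alt_phrases"]
            and "http+www" not in instance["udest_phrases"]):
        return True
    # Rule 2: redirect-style destination URLs.
    if ("messier" not in instance["ubase_phrases"]
            and "redirect+cgi" in instance["udest_phrases"]):
        return True
    return False

example = {"aspect_ratio": 7.8,
           "alt_phrases": {"to", "sales"},
           "udest_phrases": {"redirect+cgi"},
           "ubase_phrases": {"example+com"}}
print(is_ad(example))  # -> True
```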
Evaluation:
We conducted several experiments using a cross-validation methodology. The data are randomly divided into a training set (89%) and a test set containing the remaining data. The results are
cross-validated 12 times after applying the rules mentioned above to the data. The accuracy of the learned rules is 95.6% for k-nearest neighbors (KNN) and 93.5% for the K-means classifier.
Learning curves for K-NN and K-Means:
To further understand the limitations, we measured a learning curve for both classifiers. To compute a learning curve, we provide each classifier with 10, 20, 30, ..., 90, 100 percent of the training data, and at each step calculate the 11-fold cross-validated accuracy on the remainder. The resulting curves reach 95.6% for KNN and 93.5% for the K-means classifier.
Related Work:
There are various systems available on the internet for removing unwanted advertisements from web pages, for example Muffin (muffin.doit.org), web-filter (www.uni-paderborn.de), and By-proxy (www.besiex.org/Byproxy). These systems rely on their own sets of filtering rules. As the internet grows, good learning algorithms and classifiers are required for these systems to scale.
Future Work:
A system could be built on this work to remove unwanted and annoying advertisements from web pages while browsing. The data set is quite old (1998) and could be improved, as the internet and web tools have advanced considerably since then.
References
1. Andrew Ng (2013). Machine Learning, online course, Stanford University. https://class.coursera.org/ml-004/class
2. Duda, R. O., and Hart, P. E. (2001). Pattern Classification. Wiley, New York.
3. http://en.wikipedia.org
4. http://www.mathworks.com/help/stats/hidden-markov-models-hmm.html
5. Chiang, T.-H. (2012). A Ranking-based KNN Approach for Multi-Label Classification. National Taiwan University.
6. Pham, D. T., Dimov, S. S., and Nguyen, C. D. Selection of K in K-means Clustering. Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.