The document discusses the K-means clustering algorithm. It begins by explaining that K-means is an unsupervised learning algorithm that partitions observations into K clusters by minimizing the within-cluster sum of squares. It then provides details on how K-means works, including initializing cluster centers, assigning observations to the nearest center, recalculating centers, and repeating until convergence. The document also discusses evaluating the number of clusters K, dealing with issues like local optima and sensitivity to initialization, and techniques for improving K-means such as K-means++ initialization and feature scaling.
1. K-Means Clustering
- Nikita
2. Machine Learning
Supervised Learning - Labeled output data are provided and used to train the
machine to produce the desired outputs.
Categorized into "regression" and "classification" problems. In a regression problem, we are
trying to predict results within a continuous output, meaning that we are trying to map input
variables to some continuous function. In a classification problem, we are instead trying to
predict results in a discrete output. In other words, we are trying to map input variables
into discrete categories.
Unsupervised Learning
Allows us to approach problems with little or no idea what our results should look like. We can
derive structure from data where we don't necessarily know the effect of the variables.
No labeled outputs are provided; instead, the data is clustered into different classes.
3. What is Clustering?
Clustering is the classification of objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so
that the data in each subset (ideally) share some common trait - often
according to some defined distance measure.
Clustering is unsupervised classification: no predefined classes
4. Partitional clustering
Partitional algorithms determine all clusters at once. They include:
K-means and derivatives
Fuzzy c-means clustering
QT clustering algorithm
5. What Is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used
by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover
some or all of the hidden patterns.
6. Common Distance measures:
The distance measure determines how the similarity of two elements is
calculated, and it influences the shape of the clusters. Common measures include:
1. The Euclidean distance (also called 2-norm distance), given by d(x, y) = √( Σ_i (x_i − y_i)² )
2. The Manhattan distance (also called taxicab norm or 1-norm), given by d(x, y) = Σ_i |x_i − y_i|
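As a quick illustration, here is a minimal Python sketch of both measures (NumPy is my assumption; the deck names no tools), using two of the points from the worked example later in the deck:

```python
import numpy as np

x = np.array([1.0, 1.0])  # e.g. medicine A from the later example
y = np.array([5.0, 4.0])  # e.g. medicine D

euclidean = np.sqrt(np.sum((x - y) ** 2))  # 2-norm: sqrt(16 + 9) = 5.0
manhattan = np.sum(np.abs(x - y))          # 1-norm: 4 + 3 = 7.0
print(euclidean, manhattan)
```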
7. K-MEANS CLUSTERING
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
K-means is the simplest partitioning method for clustering analysis and is widely used in data mining applications.
Each cluster is represented by the centre of the cluster, and the algorithm
converges to stable centroids of clusters.
It is similar to the expectation-maximization algorithm for mixtures of Gaussians
in that they both attempt to find the centers of natural clusters in the data.
8. K-means Algorithm
Given the cluster number K, the K-means algorithm is carried out in three steps
after initialization:
Initialisation: set seed points (randomly)
1) Assign each object to the cluster of the nearest seed point measured with a
specific distance metric
2) Compute new seed points as the centroids of the clusters of the current
partition (the centroid is the centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no new assignments (i.e., membership in
each cluster no longer changes)
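The three steps map almost line-for-line onto code. A minimal NumPy sketch (an illustration under my own naming, not the deck's implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick k distinct data points as seed points (randomly)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when memberships no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid (mean) of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```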
9. The K-Means Clustering Method
[Figure: five scatter plots (axes 0-10) illustrating the method with K = 2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again and reassign until assignments stabilize.]
10.
K-means is an algorithm for partitioning (or clustering) N data points into K
disjoint subsets S_j so as to minimize the sum-of-squares criterion
J = Σ_{j=1..K} Σ_{n ∈ S_j} ‖x_n − μ_j‖²
where x_n is a vector representing the nth data point and μ_j is the
geometric centroid of the data points in S_j.
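In code, the criterion J can be evaluated directly from the assignments (a small illustrative helper; the names are my own):

```python
import numpy as np

def sum_of_squares(X, labels, centroids):
    """J = sum over clusters j of sum over n in S_j of ||x_n - mu_j||^2."""
    return sum(
        np.sum((X[labels == j] - mu) ** 2)
        for j, mu in enumerate(centroids)
    )
```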
11. Example - Problem
Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
[Figure: scatter plot of medicines A, B, C, D in the weight / pH-index plane]
Suppose we have 4 types of medicines, each with two attributes (weight and pH index).
Our goal is to group these objects into K = 2 groups of medicines.
12. Example
Step 1: Use initial seed points for partitioning
Take A and B as the initial seed points: c1 = A = (1, 1) and c2 = B = (2, 1).
For example, for medicine D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
Assign each object to the cluster
with the nearest seed point
13. Example
Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = (1, 1)   (cluster {A} keeps its single member)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
14. Example
Step 2: Renew membership based on new centroids
Compute the distance of all
objects to the new centroids
Assign each object to its nearest new centroid
15. Example
Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
16. Example
Step 3: Repeat the first two steps until convergence
Compute the distance of all
objects to the new centroids
Stop: there are no new assignments, so membership in each cluster no longer changes.
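The whole worked example can be reproduced in a few lines; here is a sketch using scikit-learn (assuming it is installed; cluster numbering may differ between runs):

```python
import numpy as np
from sklearn.cluster import KMeans

# The four medicines from the example: (weight, pH-index)
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # two clusters: {A, B} and {C, D}
print(km.cluster_centers_)  # (1.5, 1) and (4.5, 3.5), matching the derivation
```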
17. Relevant Issues
Efficient in computation - O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
Local optimum - sensitive to initial seed points; may converge to a local optimum,
which can be an unwanted solution
Other problems
Need to specify K, the number of clusters, in advance
Unable to handle noisy data and outliers (K-Medoids algorithm)
Not suitable for discovering clusters with non-convex shapes
Applicable only when the mean is defined; then what about categorical data? (K-Modes algorithm)
How do we evaluate K-means performance?
18. K-means++ - An initialization procedure for K-means
This approach acknowledges that there is probably a better choice of initial centroid
locations than simple random assignment. Specifically, K-means tends to perform better
when centroids are seeded in such a way that doesn't clump them together in space.
Step 1: Choose one of your data points at random as an initial centroid.
Step 2: Calculate D(x), the distance between your initial centroid and all other data points,
x.
Step 3: Choose your next centroid from the remaining data points with probability
proportional to D(x)² (the probability of each point is based on its distance to the
closest centroid to that point).
Step 4: Repeat until all K centroids have been assigned.
Note: D(x) should be updated as more centroids are added. It should be set to be the
distance between a data point and the nearest centroid.
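A minimal sketch of this seeding procedure in NumPy (illustrative only; function and variable names are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: returns k initial centroids chosen from X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Steps 3-4: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```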
19. How can we choose a "good" K for K-means clustering?
Choose the number of clusters by visually inspecting your data points
Elbow Method:
First of all, compute the sum of squared error (SSE) for some values of k (for example 2,
4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member
of the cluster and its centroid.
Mathematically: SSE = Σ_{i=1..k} Σ_{x ∈ C_i} ‖x − c_i‖², where c_i is the centroid of cluster C_i.
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is
because when the number of clusters increases, they should be smaller, so distortion is
also smaller. The idea of the elbow method is to choose the k at which the SSE decreases
abruptly.
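A typical way to produce the elbow plot with scikit-learn and matplotlib (a sketch, assuming X is your (n_samples, n_features) array; KMeans' inertia_ attribute is exactly the SSE defined above):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()  # pick the k where the curve bends sharply (the "elbow")
```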
20.
This produces an "elbow effect" in the graph, as you can see
in the following picture:
In this case, k=6 is the value that the
Elbow method has selected. Take into
account that the Elbow method is an
heuristic and, as such, it may or may not
work well in your particular case. Sometimes,
there are more than one elbow, or no elbow
at all. In those situations you usually end up
calculating the best k by evaluating how well
k-means performs in the context of the
particular clustering problem you are trying
to solve.
21. How can we choose a "good" K for K-means clustering?
Silhouette Coefficient
Silhouette analysis can be used to study the separation distance between the
resulting clusters. The silhouette plot displays a measure of how close each point in
one cluster is to points in the neighboring clusters and thus provides a way to
assess parameters like number of clusters visually.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a)
and the mean nearest-cluster distance (b) for each sample. The Silhouette
Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a
sample and the nearest cluster that the sample is not a part of. Note that the
Silhouette Coefficient is only defined if the number of labels satisfies
2 <= n_labels <= n_samples - 1.
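With scikit-learn this is one line per candidate k (a sketch; X is assumed to be your data matrix):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):  # defined only for 2 <= n_labels <= n_samples - 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # mean of (b - a) / max(a, b)
```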
22. Feature Scaling - Pre-processing step to improve K-Means
Feature scaling is a method used to standardize the range of independent
variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing step.
Since the range of values of raw data varies widely, in some machine
learning algorithms, objective functions will not work properly without
normalization. For example, the majority of classifiers calculate the distance
between two points by the Euclidean distance. If one of the features has a broad
range of values, the distance will be governed by this particular feature. Therefore,
the ranges of all features should be normalized so that each feature contributes
approximately proportionately to the final distance.
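In practice this is usually one extra line before clustering; a sketch with scikit-learn's StandardScaler (zero mean, unit variance per feature; the choice of k = 3 here is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # standardize each feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```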
23. Curse of Dimensionality
As the number of dimensions tends to infinity, the distances between pairs of
points in the dataset converge toward one another: the maximum distance and
minimum distance between any two points become nearly the same, so
distance-based cluster assignments lose meaning.
Mitigation: reduce the number of features.
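A quick NumPy experiment illustrates the effect (illustrative, with made-up uniform data; the ratio of maximum to minimum pairwise distance shrinks toward 1 as the dimension d grows):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))  # 100 uniform random points in d dimensions
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D = D[np.triu_indices_from(D, k=1)]  # distinct pairwise distances
    print(d, D.max() / D.min())          # ratio approaches 1 as d grows
```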
24. Summary
K-means' performance is determined by its initialization and by an appropriate
distance measure
There are several variants of K-means to overcome its weaknesses
K-Medoids: resistance to noise and/or outliers
K-Modes: extension to categorical data clustering analysis
CLARA: extension to deal with large data sets
Editor's Notes
#4: Clustering
-> the process of grouping a set of objects into classes of similar objects
-> the task is to create groups and assign each data point to a group
#5: Partitioning clustering approach -
a typical clustering-analysis approach that iteratively partitions the training data set to learn a partition of the given data space
learns a partition of a data set to produce several non-empty clusters (usually, the number of clusters is given in advance)
in principle, the optimal partition is achieved by minimising the sum of squared distances to the representative object in each cluster
#7: 3. The maximum norm is given by:
D(x, y) = max_i |x_i − y_i|, where 1 ≤ i ≤ p
4. The Mahalanobis distance corrects data for different scales and correlations in the variables.
5. Inner product space: The angle between two vectors can be used as a distance measure when clustering high dimensional data
6. Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
#8: Points are near the centroid of the cluster
-> What if clusters are overlapping?
- Hard to tell which cluster is right
- Maybe we should try to remain uncertain
-> What if a cluster has a non-circular shape?
Gaussian mixture models:
- Clusters modeled as Gaussians (not just by a mean)
- The EM algorithm assigns data to clusters with some probability
- Gives a probability model of x!
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
Probabilistic clustering: EM with a Gaussian distribution
#18: It is relatively efficient and fast. It computes the result in O(tkn), where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization), and for image segmentation.
Also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
It is sensitive to initial conditions: different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
#23: The simplest method is rescaling the range of features to [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data.
We can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean) and unit variance.
The reason is that normalization gives the same importance to all the variables.
The standard example is considering age (in years) and height (in cm). Age may range in [18, 50], while height may range in [130, 180] (made-up numbers). If you use the classical Euclidean distance, height will have disproportionately more importance in the computation with respect to age.
#24: Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
#25: - K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; the medoid is the most centrally located object in a cluster. For categorical data, use the Hamming distance instead of the Euclidean distance, i.e. if two categorical values are the same the distance is 0, otherwise 1.
- K-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster.
- The medoid is always a member of the data points.
K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori.
https://en.wikipedia.org/wiki/K-medoids
The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search. It works as follows:
Initialize: randomly select (without replacement) k of the n data points as the medoids
Associate each data point to the closest medoid.
While the cost of the configuration decreases:
For each medoid m, for each non-medoid data point o:
Swap m and o, recompute the cost (sum of distances of points to their medoid)
If the total cost of the configuration increased in the previous step, undo the swap
Other algorithms than PAM have been suggested in the literature, including the following Voronoi iteration method:
Select initial medoids
Iterate while the cost decreases:
In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
Reassign each point to the cluster defined by the closest medoid determined in the previous step.
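A minimal NumPy sketch of this Voronoi-iteration variant (an illustration with my own naming; PAM's swap search is costlier but follows the steps listed above):

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Voronoi-iteration k-medoids: medoids are always actual data points."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)        # select initial medoids
    labels = D[:, medoids].argmin(axis=1)                      # assign to closest medoid
    for _ in range(max_iter):
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # make the point minimizing the within-cluster sum of distances the medoid
            sub = D[np.ix_(members, members)]
            new_medoids[j] = members[sub.sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # cost no longer decreases
        medoids = new_medoids
        labels = D[:, medoids].argmin(axis=1)  # reassign to the closest medoid
    return medoids, labels
```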