The document discusses the K-means clustering algorithm. It begins by explaining that K-means is an unsupervised learning algorithm that partitions observations into K clusters by minimizing the within-cluster sum of squares. It then provides details on how K-means works, including initializing cluster centers, assigning observations to the nearest center, recalculating centers, and repeating until convergence. The document also discusses evaluating the number of clusters K, dealing with issues like local optima and sensitivity to initialization, and techniques for improving K-means such as K-means++ initialization and feature scaling.
1. K-Means Clustering
- Nikita
2. Machine Learning
Supervised Learning - Labeled output data are provided and used to train the
machine to produce the desired outputs.
Categorized into "regression" and "classification" problems. In a regression problem, we are
trying to predict results within a continuous output, meaning that we are trying to map input
variables to some continuous function. In a classification problem, we are instead trying to
predict results in a discrete output. In other words, we are trying to map input variables
into discrete categories.
Unsupervised Learning
Allows us to approach problems with little or no idea what our results should look like. We can
derive structure from data where we don't necessarily know the effect of the variables.
No labeled outputs are provided; instead, the data is clustered into different classes.
3. What is Clustering?
Clustering is the classification of objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so
that the data in each subset (ideally) share some common trait - often
according to some defined distance measure.
Clustering is unsupervised classification: no predefined classes
4. Partitional clustering
Partitional algorithms determine all clusters at once. They include:
K-means and derivatives
Fuzzy c-means clustering
QT clustering algorithm
5. What Is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used
by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover
some or all of the hidden patterns.
6. Common Distance measures:
The distance measure determines how the similarity of two elements is
calculated, and it influences the shape of the clusters. Common measures include:
1. The Euclidean distance (also called 2-norm distance), given by d(x, y) = √( Σ_i (x_i − y_i)² )
2. The Manhattan distance (also called taxicab norm or 1-norm), given by d(x, y) = Σ_i |x_i − y_i|
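As a quick illustration, here is a minimal Python sketch of both measures (NumPy is my assumption; the deck names no tools), using two of the points from the worked example later in the deck:

```python
import numpy as np

x = np.array([1.0, 1.0])  # e.g. medicine A from the later example
y = np.array([5.0, 4.0])  # e.g. medicine D

euclidean = np.sqrt(np.sum((x - y) ** 2))  # 2-norm: sqrt(16 + 9) = 5.0
manhattan = np.sum(np.abs(x - y))          # 1-norm: 4 + 3 = 7.0
print(euclidean, manhattan)
```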
7. K-MEANS CLUSTERING
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
K-means is the simplest partitioning method for clustering analysis and is widely used in data mining applications.
Each cluster is represented by the centre of the cluster, and the algorithm
converges to stable centroids of clusters.
It is similar to the expectation-maximization algorithm for mixtures of Gaussians
in that they both attempt to find the centers of natural clusters in the data.
8. K-means Algorithm
Given the cluster number K, the K-means algorithm is carried out in three steps
after initialization:
Initialisation: set seed points (randomly)
1) Assign each object to the cluster of the nearest seed point measured with a
specific distance metric
2) Compute new seed points as the centroids of the clusters of the current
partition (the centroid is the centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no new assignments (i.e., membership in
each cluster no longer changes)
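The three steps map almost line-for-line onto code. A minimal NumPy sketch (an illustration under my own naming, not the deck's implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick k distinct data points as seed points (randomly)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when memberships no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid (mean) of its cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```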
9. The K-Means Clustering Method
[Figure: five scatter plots (axes 0-10) illustrating the method with K = 2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again and reassign until assignments stabilize.]
10.
K-means is an algorithm for partitioning (or clustering) N data points into K
disjoint subsets S_j so as to minimize the sum-of-squares criterion
J = Σ_{j=1..K} Σ_{n ∈ S_j} ‖x_n − μ_j‖²
where x_n is a vector representing the nth data point and μ_j is the
geometric centroid of the data points in S_j.
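In code, the criterion J can be evaluated directly from the assignments (a small illustrative helper; the names are my own):

```python
import numpy as np

def sum_of_squares(X, labels, centroids):
    """J = sum over clusters j of sum over n in S_j of ||x_n - mu_j||^2."""
    return sum(
        np.sum((X[labels == j] - mu) ** 2)
        for j, mu in enumerate(centroids)
    )
```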
11. Example - Problem
Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
[Figure: scatter plot of medicines A, B, C, D in the weight / pH-index plane]
Suppose we have 4 types of medicines, each with two attributes (weight and pH index).
Our goal is to group these objects into K = 2 groups of medicines.
12. Example
Step 1: Use initial seed points for partitioning
Take A and B as the initial seed points: c1 = A = (1, 1) and c2 = B = (2, 1).
For example, for medicine D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
Assign each object to the cluster
with the nearest seed point
13. Example
Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = (1, 1)   (cluster {A} keeps its single member)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
14. Example
Step 2: Renew membership based on new centroids
Compute the distance of all
objects to the new centroids
Assign each object to its nearest new centroid
15. Example
Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
16. Example
Step 3: Repeat the first two steps until convergence
Compute the distance of all
objects to the new centroids
Stop: there are no new assignments, so membership in each cluster no longer changes.
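The whole worked example can be reproduced in a few lines; here is a sketch using scikit-learn (assuming it is installed; cluster numbering may differ between runs):

```python
import numpy as np
from sklearn.cluster import KMeans

# The four medicines from the example: (weight, pH-index)
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # two clusters: {A, B} and {C, D}
print(km.cluster_centers_)  # (1.5, 1) and (4.5, 3.5), matching the derivation
```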
17. Relevant Issues
Efficient in computation - O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
Local optimum - sensitive to initial seed points; may converge to a local optimum,
which can be an unwanted solution
Other problems
Need to specify K, the number of clusters, in advance
Unable to handle noisy data and outliers (K-Medoids algorithm)
Not suitable for discovering clusters with non-convex shapes
Applicable only when the mean is defined; then what about categorical data? (K-Modes algorithm)
How do we evaluate K-means performance?
18. K-means++ - An initialization procedure for K-means
This approach acknowledges that there is probably a better choice of initial centroid
locations than simple random assignment. Specifically, K-means tends to perform better
when centroids are seeded in such a way that doesn't clump them together in space.
Step 1: Choose one of your data points at random as an initial centroid.
Step 2: Calculate D(x), the distance between your initial centroid and all other data points,
x.
Step 3: Choose your next centroid from the remaining data points with probability
proportional to D(x)² (the probability of each point is based on its distance to the
closest centroid to that point).
Step 4: Repeat until all K centroids have been assigned.
Note: D(x) should be updated as more centroids are added. It should be set to be the
distance between a data point and the nearest centroid.
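A minimal sketch of this seeding procedure in NumPy (illustrative only; function and variable names are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: returns k initial centroids chosen from X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Steps 3-4: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```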
19. How can we choose a "good" K for K-means clustering?
Choose the number of clusters by visually inspecting your data points
Elbow Method:
First of all, compute the sum of squared error (SSE) for some values of k (for example 2,
4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member
of the cluster and its centroid.
Mathematically: SSE = Σ_{i=1..k} Σ_{x ∈ C_i} ‖x − c_i‖², where c_i is the centroid of cluster C_i.
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is
because when the number of clusters increases, they should be smaller, so distortion is
also smaller. The idea of the elbow method is to choose the k at which the SSE decreases
abruptly.
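A typical way to produce the elbow plot with scikit-learn and matplotlib (a sketch, assuming X is your (n_samples, n_features) array; KMeans' inertia_ attribute is exactly the SSE defined above):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()  # pick the k where the curve bends sharply (the "elbow")
```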
20.
This produces an "elbow effect" in the graph, as you can see
in the following picture:
In this case, k=6 is the value that the
Elbow method has selected. Take into
account that the Elbow method is an
heuristic and, as such, it may or may not
work well in your particular case. Sometimes,
there are more than one elbow, or no elbow
at all. In those situations you usually end up
calculating the best k by evaluating how well
k-means performs in the context of the
particular clustering problem you are trying
to solve.
21. How can we choose a "good" K for K-means clustering?
Silhouette Coefficient
Silhouette analysis can be used to study the separation distance between the
resulting clusters. The silhouette plot displays a measure of how close each point in
one cluster is to points in the neighboring clusters and thus provides a way to
assess parameters like number of clusters visually.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a)
and the mean nearest-cluster distance (b) for each sample. The Silhouette
Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a
sample and the nearest cluster that the sample is not a part of. Note that the
Silhouette Coefficient is only defined if the number of labels satisfies
2 <= n_labels <= n_samples - 1.
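With scikit-learn this is one line per candidate k (a sketch; X is assumed to be your data matrix):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):  # defined only for 2 <= n_labels <= n_samples - 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # mean of (b - a) / max(a, b)
```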
22. Feature Scaling - Pre-processing step to improve K-Means
Feature scaling is a method used to standardize the range of independent
variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing step.
Since the range of values of raw data varies widely, in some machine
learning algorithms, objective functions will not work properly without
normalization. For example, the majority of classifiers calculate the distance
between two points by the Euclidean distance. If one of the features has a broad
range of values, the distance will be governed by this particular feature. Therefore,
the ranges of all features should be normalized so that each feature contributes
approximately proportionately to the final distance.
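In practice this is usually one extra line before clustering; a sketch with scikit-learn's StandardScaler (zero mean, unit variance per feature; the choice of k = 3 here is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # standardize each feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```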
23. Curse of Dimensionality
As the number of dimensions tends to infinity, the distances between pairs of
points in the dataset converge toward one another: the maximum distance and
minimum distance between any two points become nearly the same, so
distance-based cluster assignments lose meaning.
Mitigation: reduce the number of features.
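A quick NumPy experiment illustrates the effect (illustrative, with made-up uniform data; the ratio of maximum to minimum pairwise distance shrinks toward 1 as the dimension d grows):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))  # 100 uniform random points in d dimensions
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D = D[np.triu_indices_from(D, k=1)]  # distinct pairwise distances
    print(d, D.max() / D.min())          # ratio approaches 1 as d grows
```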
24. Summary
K-means' performance is determined by its initialization and by an appropriate
distance measure
There are several variants of K-means to overcome its weaknesses
K-Medoids: resistance to noise and/or outliers
K-Modes: extension to categorical data clustering analysis
CLARA: extension to deal with large data sets
Editor's Notes
#4: Clustering
-> the process of grouping a set of objects into classes of similar objects
-> the task is to create groups and assign each data point to a group
#5: Partitioning clustering approach -
a typical clustering-analysis approach that iteratively partitions the training data set to learn a partition of the given data space
learns a partition of a data set to produce several non-empty clusters (usually, the number of clusters is given in advance)
in principle, the optimal partition is achieved by minimising the sum of squared distances to the representative object in each cluster
#7: 3. The maximum norm is given by:
D(x, y) = max_i |x_i − y_i|, where 1 ≤ i ≤ p
4. The Mahalanobis distance corrects data for different scales and correlations in the variables.
5. Inner product space: The angle between two vectors can be used as a distance measure when clustering high dimensional data
6. Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
#8: Points are near the centroid of the cluster
-> What if clusters are overlapping?
- Hard to tell which cluster is right
- Maybe we should try to remain uncertain
-> What if a cluster has a non-circular shape?
Gaussian mixture models:
- Clusters modeled as Gaussians (not just by a mean)
- The EM algorithm assigns data to clusters with some probability
- Gives a probability model of x!
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
Probabilistic clustering: EM with a Gaussian distribution
#18: It is relatively efficient and fast. It computes the result in O(tkn), where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization), and for image segmentation.
Also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
It is sensitive to initial conditions: different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
#23: The simplest method is rescaling the range of features to [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data.
We can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean) and unit variance.
The reason is that normalization gives the same importance to all the variables.
The standard example is considering age (in years) and height (in cm). Age may range in [18, 50], while height may range in [130, 180] (made-up numbers). If you use the classical Euclidean distance, height will have disproportionately more importance in the computation with respect to age.
#24: Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
#25: - K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; the medoid is the most centrally located object in a cluster. For categorical data, use the Hamming distance instead of the Euclidean distance, i.e. if two categorical values are the same the distance is 0, otherwise 1.
- K-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster.
- The medoid is always a member of the data points.
K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori.
https://en.wikipedia.org/wiki/K-medoids
The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search. It works as follows:
Initialize: randomly select (without replacement) k of the n data points as the medoids
Associate each data point to the closest medoid.
While the cost of the configuration decreases:
For each medoid m, for each non-medoid data point o:
Swap m and o, recompute the cost (sum of distances of points to their medoid)
If the total cost of the configuration increased in the previous step, undo the swap
Other algorithms than PAM have been suggested in the literature, including the following Voronoi iteration method:
Select initial medoids
Iterate while the cost decreases:
In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
Reassign each point to the cluster defined by the closest medoid determined in the previous step.
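A minimal NumPy sketch of this Voronoi-iteration variant (an illustration with my own naming; PAM's swap search is costlier but follows the steps listed above):

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Voronoi-iteration k-medoids: medoids are always actual data points."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)        # select initial medoids
    labels = D[:, medoids].argmin(axis=1)                      # assign to closest medoid
    for _ in range(max_iter):
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # make the point minimizing the within-cluster sum of distances the medoid
            sub = D[np.ix_(members, members)]
            new_medoids[j] = members[sub.sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # cost no longer decreases
        medoids = new_medoids
        labels = D[:, medoids].argmin(axis=1)  # reassign to the closest medoid
    return medoids, labels
```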