際際滷

際際滷Share a Scribd company logo
Large Scale Data Clustering
Algorithms
Vahid Mirjalili
Data Scientist
Feb 11th 2016
Outline
1. Overview of clustering algorithms and validation
2. Fast and accurate k-means clustering for large datasets
3. Clustering based on landmark points
4. Spectral relaxation for k-means clustering
5. Proposed methods for microbial community detection
2
Part 1:
Overview of
Data Clustering Algorithms
Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition letters 31.8 (2010): 651-666.
3
Data clustering
Goal: discover natural groupings among given data points
Unsupervised learning (unlabeled data)
Exploratory analysis (without any pre-specified model/hypothesis)
Usages
Gain insight from the underlying structure of data (salient features, anomaly detection, etc)
Identify degree of similarity between points (infer phylogenetic relationships)
Data Compression (summarizing data by cluster prototypes, removing redundant patterns)
4
Applications
Wide range of applications: computer vision, document clustering, gene clustering,
customer/product groups
An example application for Computer Visions: image
segmentation and separating the background
5
Different clustering algorithms
Literature contains 1000
clustering algorithms
Different criteria to divide
clustering algorithms
Soft vs. Hard
Clustering
Prototype vs.
Density based
Partitional vs.
Hierarchical
Clustering
6
Partitional vs. Hierarchical
1. Partitional algorithms (k-means)
Partition the data space
Finds all clusters simultaneously
2. Hierarchical algorithms
Generate nested cluster hierarchy
Agglomerative (bottom-up)
Divisive (top-down)
Distance between clusters:
Single-linkage, complete linkage,
average-linkage
7
K-means clustering
Objective function:
1. Select K initial cluster centroids
2. Assign data points to the nearest
cluster centroid
3. Update the new cluster centroids
Figure curtesy: Data clustering: 50 years beyond k-means, Anil Jain
8
K-means pros and cons
+ Simple/easy to implement
+ Order of linear complexity O(N  Iterations)
- Results highly dependent on initialization
- Prone to local minima
- Sensitive to outliers, and clusters sizes
- Globular shaped clusters
- Requiring multiple passes
- Not applicable to categorical data
Local minima
Non-globular
clusters
Outliers
9
K-means extensions
K-means++ To improve the initialization process
X-means To find the optimal number of clusters without prior knowledge
Kernel K-means To form arbitrary/non-globular shaped clusters
Fuzzy c-means Multiple cluster assignment (membership degree)
K-medians More robust to outliers (median of each feature)
K-medoids More robust to outliers, different distance metrics, categorical data
Bisecting K-means and many more ...
10
Kernel K-means vs. K-means
Pyclust: Open Source Data Clustering Pckage
11
Bisecting K-means
12
Pyclust: Open Source Data Clustering Pckage
Other approaches in data clustering
Prototype-based methods
 Clusters are formed based on similarity to a prototype
 K-means, k-medians, k-medoids, 
Density based methods (clusters are high density regions
separated by low density regions)
 Jarvis-Patrick Algorithm: similairty between patterns defined
as the number of common neighbors
 DBSCAN (MinPts, 竜-neighborhood)
Identify types noise points/border points/core points
13
DBSCAN pros and cons
+ No need to know number of clusters apriori
+ Identify arbitrary shaped clusters
+ Robust to noise and outliers
- Sensitivity to the parameters
- Problem with high dimensional data
(subspace clustering)
14
Clustering Validation
1. Internal Validity Indexes
 Assessing clustering quality based on how the data itself fit in the clustering
structure
 Silhouette
 Stability and Admissibility analyses
 Test the sensitivity of an algorithm to changes in data while keeping the structures intact
 Convext admissiblity, cluster prportion, omission, and monotone admissibility
15
  clustersotherallwithofitydissimilaraveragesmallest:)(
clustersamethewithinofitydissimilaraverage:)(
)()(max
)()(
)(
iib
iia
ibia
iaib
iS
Clustering Validation
2. Relative Indexes
Assessing how similar two clustering solutions are
3. External Indexes
Comparing a clustering solution with ground-truth labels/clusters
Purity, Rand Index (RI), Normalized Mutual Information, F硫-score, MCC, 
16
 
RePr
Re.Pr1
Re
Pr
2
2









F
FNTP
TP
FPTP
TP
   ワ緒
k
k
j
c
N
jmax
1
C,Purity 
Purity is not a reliable
measure by itself
Part 2:
Fast and Accurate K-means
for Large Datasets
Shindler, Michael, Alex Wong, and Adam W. Meyerson. "Fast and accurate k-means for large datasets."
Advances in neural information processing systems. 2011
17
Motivation
Goal: Clustering Large Datasets
Data cannot be stored in main memory
Streaming model (sequential access)
Facility Location problem:
desired facility cost is given
without prior knowledge of k
 Original K-means requires multiple passes through data
 not suitable for big data / streaming data
18
Well Clusterable Data
An instance of k-means is called -separable if reducing the number of clusters
increases the cost of optimal k-means clustering by 1/2
K=3
Jk=3(C)
K=2
Jk=2(C)
)(
)(
3
22
CJ
CJ
k
k


緒
20
K-means++ Seeding Procedure (non-streaming)
Let S be the set of already selected seeds
1. Initially choose a point uniformly at random
2. Select a new point randomly with probability according to
3. Repeat until
 An improved version allows more than k centers
Advantage: avoiding local minima traps
K-means++: The advantages of careful seeding, N. Ailon et al.
j
jSp
Spd
),(
),(2
min
kS ||
kS ||
22
Proposed Algorithm
Initialize
Guess a small facility cost
Find the closest facility point
Decide whether to create a new facility
point, or assign it to the closest
cluster facility (add the
weight=contribution to facility cost)
Number of facility points overflow:
increase f
Merge (wegihted) facility points
23
Approximate Nearest Neighbor
Finding nearest facility point is the most time consuming part
1. Construct a random vector
2. Store the facility points in order of
3. For a new point , find the two facilities
such that
4. Compute the distances to and
This approximation can increase the approximation ratio by a constant factor
w
iyw.iy
x
)(log... 1 Oywxwyw ii o 
iy 1iy
iyw.
1. iyww
24
Algorithm Analysis
Determine better facility cost
Better approximation ratio (17) much less than previous work by Braverman
Running time
Running time with approximate nearest neighbor
25
)log( nnk
))loglog(log( nkn
Part 3:
Active Clustering of
Biological Sequences
Voevodski, Konstantin, et al. "Active clustering of biological sequences." The Journal of Machine Learning
Research 13.1 (2012): 203-225.
26
Motivation
BLAST Sequence Query
 Create a hash table of all the words
 Pairwise alignment of subsequences in the same bucket
No need to calculate the distances of a new query sequence
to all the database sequences
Previous clustering algorithm for gene sequences require all pair-wise calculations
Goal: develop a clustering algorithm without computing all pair distances
Query
sequence
Hash Table
27
Landmark Clustering Algorithm
Input:
Dataset S
Desired number of clusters k
Probability of performance guarantee
Objective function
propertystability ),1( ワ
わ1
Main Procedures:
1. Landmark Selection
2. Expanding Landmarks
3. Cluster Assignment
ワワ 
緒
k
i Cx
i
i
cxdC
1
),()(
28
...},,{ 21 llL 緒
 nnLO log||:TimeRunning
Landmark Selection
Distance Calculations
L
|)|.|(| LSO
29
Expand-Landmarks
Landmark l is working if
}),(|{ rlsdSsBr
l o
Ball B around landmark l
minsBl 
30
Cluster Assignment
Construct graph using working landmark,
 Nodes represent (working) landmarks
 Edges represent the overlapping balls
Find the connected components of graph
Clustered points are the set of points in these balls
The number of clusters is
BG
}Comp,...,Comp{)(Components 1 mBG 
)(Components BG
31
Performance Analysis
Number of required landmarks
Active landmark selection
Uniform selection (degrading performance)
Good points:
Landmark spread property: any set of good points must have a landmark
closer than
With high probability (1-隆), landmark spread property is satisfied
Based on landmark spread property, Expand-Landmarks correctly
cluster most of the points in a cluster core
- Assumption of large clusters, doesnt capture smaller clusters 32
critcrit dxwxwdxw 17)()(and)( 2 鰹
)(xw
)(2 xw
critd
 /1lnkO
 /ln kkO
Part 4:
Spectral Relaxation of
K-means Clustering
Zha, Hongyuan, et al. "Spectral relaxation for k-means clustering." Advances in neural information
processing systems. 2001.
33
Motivation
K-means is prone to local minima
Reformulate k-means objective function as a trace maximization problem
Different approaches to tackle this local minima issue:
 Improving the initialization process (K-means++)
 Relaxing constraints in the objective function (spectral relaxation
method)
34
Derivation
Data matrix D Cost of a partitioning
Cost for an single cluster

Maximizing  minimizing
ワワ 

k
i
s
s
i
i
s
i
mdss
1 1
2)(
)(
 22
1
2)(
/
Fi
T
siF
T
ii
s
s
i
i
si seeIDemDmdss i
i
緒緒 ワ
ワ 







緒

k
i i
i
T
i
i
T
i
T
i
k
i
i
s
e
DD
s
e
DDssss )(trace)(
1

)(trace)(trace)( XDDXDDss TTT
緒 matrixindicatorlorthonormakniswhere X
)(trace DXDX TT
)(ss











1...1
......
1...1
e
35
Theorem
For a symmetric matrix with eigenvalues n oo ...21nnH 
)(tracemax...21 YHYT
IYY
n
k
T

緒 
36
Cluster Assignment
Global Gram matrix
Gram matrix for cluster i:
Eigenvalue decomposition largest eigenvalue
Err
DD
DD
DD
DD
k
T
k
T
T
T
















...00
............
0...0
0...0
22
11
i
T
i DD
ii yiiii
T
i yyDD 
n
njjj
T
yyy
yyDD
21
21 ...  鰹鰹鰹
DDT
37
Cluster Assignment
Method1: apply k-means to the matrix of k largest eigenvectors of global gram
matrix (pivoted k-means)
Method2: (pivoted QR decomposition of )
Davis-Khan sin(慮) Theorem: matrixorthogonaliswhere)( kkVErrOVYY kk 器
38
 
kcluster1cluster
,...,....,..., 111111 1 kkskks
T
k vyvyvyvyY k

T
kY
Part 5:
Application for
Microbial Community Detection
39
Large genetic sequence datasets
Goal: cluster in streaming model / limited passes
Landmark selection algorithm (one pass)
Expand landmarks and assigning sequences to the nearest landmark
Finding the nearest landmark: A hashing scheme to find the nearest
landmark
Require choice of hyper-parameters
Assumption of -separability, and large clusters
40
Hashing Scheme for nearest neighbor search
Shindlers approximate nearest neighbor numeric random vector
Create random sequences
Hash function: Levenshtein (edit) distance between sequence x, and random
sequences
41
New sequence
x
Hash Table
xww

.
Closest
landmark
mrrr ...,,, 21
),( ii rxEdith
Acknowledgements
Department of Computer Science and Engineering,
Michigan State University
My friend Sebastian Raschka,
Data Scientist and author of Python Machine Learning
Please visit http://vahidmirjalili.com
42

More Related Content

Viewers also liked (13)

Trajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus AlgorithmTrajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus Algorithm
Iv叩n Sanchez Vera
Beyond TrueTime
Beyond TrueTimeBeyond TrueTime
Beyond TrueTime
Vahid Mirjalili
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in Bioinformatics
Elena S端gis
Clustering
ClusteringClustering
Clustering
M Rizwan Aqeel
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' TrajectoriesSpatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Centre of Geographic Sciences (COGS)
Spanner
SpannerSpanner
Spanner
Nafeez Abrar
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
Carlos Castillo (ChaTo)
Try Cloud Spanner
Try Cloud SpannerTry Cloud Spanner
Try Cloud Spanner
Simon Su
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
guru_prasadg
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
butest
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
DoiT International
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
petitegeek
Trajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus AlgorithmTrajectory clustering - Traclus Algorithm
Trajectory clustering - Traclus Algorithm
Iv叩n Sanchez Vera
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in Bioinformatics
Elena S端gis
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' TrajectoriesSpatio-Temporal Data Mining and Classification of Ships' Trajectories
Spatio-Temporal Data Mining and Classification of Ships' Trajectories
Centre of Geographic Sciences (COGS)
Try Cloud Spanner
Try Cloud SpannerTry Cloud Spanner
Try Cloud Spanner
Simon Su
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
guru_prasadg
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
butest
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
DoiT International
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
petitegeek

Similar to Large Scale Data Clustering: an overview (20)

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Salah Amean
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
mqasimsheikh5
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
Houw Liong The
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
Houw Liong The
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
engrasi
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
Aman Jatain
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Subrata Kumer Paul
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
NaveenKumar5162
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
NaveenKumar5162
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Intel速 Software
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
Vahid Mirjalili
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
JoonyoungJayGwak
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Krish_ver2
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
Bhuvanya Raghunathan
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptx
Amy Aung
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimization
Alexander Decker
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
IJCSIS Research Publications
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Salah Amean
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
mqasimsheikh5
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
Houw Liong The
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
Houw Liong The
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
engrasi
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Subrata Kumer Paul
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Anil Yadav
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
NaveenKumar5162
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
NaveenKumar5162
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Intel速 Software
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
Vahid Mirjalili
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Krish_ver2
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptx
Amy Aung
The improved k means with particle swarm optimization
The improved k means with particle swarm optimizationThe improved k means with particle swarm optimization
The improved k means with particle swarm optimization
Alexander Decker
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
IJCSIS Research Publications

Recently uploaded (20)

Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
StatsCommunications
Selzy: Simplifying Email Marketing for Maximum Growth
Selzy: Simplifying Email Marketing for Maximum GrowthSelzy: Simplifying Email Marketing for Maximum Growth
Selzy: Simplifying Email Marketing for Maximum Growth
Selzy
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid SystemsHuman-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
ijccsa
Blinkit SQL Analysis SQL Project case study.pptx
Blinkit SQL Analysis SQL Project case study.pptxBlinkit SQL Analysis SQL Project case study.pptx
Blinkit SQL Analysis SQL Project case study.pptx
Prince Goel
Quantitative Presentation_Final.....pptx
Quantitative Presentation_Final.....pptxQuantitative Presentation_Final.....pptx
Quantitative Presentation_Final.....pptx
lenny lopez
Big-O notations, Algorithm and complexity analaysis
Big-O notations, Algorithm and complexity analaysisBig-O notations, Algorithm and complexity analaysis
Big-O notations, Algorithm and complexity analaysis
drsomya2019
data-analysis lectures for students - begginer level
data-analysis lectures for students - begginer leveldata-analysis lectures for students - begginer level
data-analysis lectures for students - begginer level
gemdimash
GenAI-powered assistants compared in a real case - 2025-03-18
GenAI-powered assistants compared in a real case - 2025-03-18GenAI-powered assistants compared in a real case - 2025-03-18
GenAI-powered assistants compared in a real case - 2025-03-18
Alessandra Bilardi
Drillingis_optimizedusingartificialneural.pptx
Drillingis_optimizedusingartificialneural.pptxDrillingis_optimizedusingartificialneural.pptx
Drillingis_optimizedusingartificialneural.pptx
singhsanjays2107
Know Your Nation In Numbers myIndia-2006
Know Your Nation In Numbers myIndia-2006Know Your Nation In Numbers myIndia-2006
Know Your Nation In Numbers myIndia-2006
sahimbarc
presentacion early classification on MTS.pdf
presentacion  early classification on MTS.pdfpresentacion  early classification on MTS.pdf
presentacion early classification on MTS.pdf
faiber13
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjjVarsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
itzmetanya05
PYTHON 1 system guessing game for python.pptx
PYTHON 1 system guessing game for python.pptxPYTHON 1 system guessing game for python.pptx
PYTHON 1 system guessing game for python.pptx
palmakyonna
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
Edward252793
Apple-Technology-A-Case-Study (all about technology ppt)
Apple-Technology-A-Case-Study (all about technology ppt)Apple-Technology-A-Case-Study (all about technology ppt)
Apple-Technology-A-Case-Study (all about technology ppt)
Immortal43
Agile Infinity: When the Customer Is an Abstract Concept
Agile Infinity: When the Customer Is an Abstract ConceptAgile Infinity: When the Customer Is an Abstract Concept
Agile Infinity: When the Customer Is an Abstract Concept
Loic Merckel
Predicting-Training-Needs-with-Machine-Learning.pptx
Predicting-Training-Needs-with-Machine-Learning.pptxPredicting-Training-Needs-with-Machine-Learning.pptx
Predicting-Training-Needs-with-Machine-Learning.pptx
Access Business Management Conferencing International
Financial Ratios and CAMEL Presentation.ppt
Financial Ratios and CAMEL Presentation.pptFinancial Ratios and CAMEL Presentation.ppt
Financial Ratios and CAMEL Presentation.ppt
PrinceAyangbesanOlam
Employee data login and attendance for region
Employee data login and attendance for regionEmployee data login and attendance for region
Employee data login and attendance for region
nagom47355
Digital Marketing Canvas for Charlotte Hornets
Digital Marketing Canvas for Charlotte HornetsDigital Marketing Canvas for Charlotte Hornets
Digital Marketing Canvas for Charlotte Hornets
DylanLee69
Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
Getting the Public on Side: How to Make Reforms Acceptable by Design- launch ...
StatsCommunications
Selzy: Simplifying Email Marketing for Maximum Growth
Selzy: Simplifying Email Marketing for Maximum GrowthSelzy: Simplifying Email Marketing for Maximum Growth
Selzy: Simplifying Email Marketing for Maximum Growth
Selzy
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid SystemsHuman-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Human-ai Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
ijccsa
Blinkit SQL Analysis SQL Project case study.pptx
Blinkit SQL Analysis SQL Project case study.pptxBlinkit SQL Analysis SQL Project case study.pptx
Blinkit SQL Analysis SQL Project case study.pptx
Prince Goel
Quantitative Presentation_Final.....pptx
Quantitative Presentation_Final.....pptxQuantitative Presentation_Final.....pptx
Quantitative Presentation_Final.....pptx
lenny lopez
Big-O notations, Algorithm and complexity analaysis
Big-O notations, Algorithm and complexity analaysisBig-O notations, Algorithm and complexity analaysis
Big-O notations, Algorithm and complexity analaysis
drsomya2019
data-analysis lectures for students - begginer level
data-analysis lectures for students - begginer leveldata-analysis lectures for students - begginer level
data-analysis lectures for students - begginer level
gemdimash
GenAI-powered assistants compared in a real case - 2025-03-18
GenAI-powered assistants compared in a real case - 2025-03-18GenAI-powered assistants compared in a real case - 2025-03-18
GenAI-powered assistants compared in a real case - 2025-03-18
Alessandra Bilardi
Drillingis_optimizedusingartificialneural.pptx
Drillingis_optimizedusingartificialneural.pptxDrillingis_optimizedusingartificialneural.pptx
Drillingis_optimizedusingartificialneural.pptx
singhsanjays2107
Know Your Nation In Numbers myIndia-2006
Know Your Nation In Numbers myIndia-2006Know Your Nation In Numbers myIndia-2006
Know Your Nation In Numbers myIndia-2006
sahimbarc
presentacion early classification on MTS.pdf
presentacion  early classification on MTS.pdfpresentacion  early classification on MTS.pdf
presentacion early classification on MTS.pdf
faiber13
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjjVarsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
Varsha 1-C , 41 - Varsha.pdfbjijgd5yjnjj
itzmetanya05
PYTHON 1 system guessing game for python.pptx
PYTHON 1 system guessing game for python.pptxPYTHON 1 system guessing game for python.pptx
PYTHON 1 system guessing game for python.pptx
palmakyonna
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
22 Nov RECSA AFRICA REGIONAL SECURITY ANALYSIS.pptx
Edward252793
Apple-Technology-A-Case-Study (all about technology ppt)
Apple-Technology-A-Case-Study (all about technology ppt)Apple-Technology-A-Case-Study (all about technology ppt)
Apple-Technology-A-Case-Study (all about technology ppt)
Immortal43
Agile Infinity: When the Customer Is an Abstract Concept
Agile Infinity: When the Customer Is an Abstract ConceptAgile Infinity: When the Customer Is an Abstract Concept
Agile Infinity: When the Customer Is an Abstract Concept
Loic Merckel
Financial Ratios and CAMEL Presentation.ppt
Financial Ratios and CAMEL Presentation.pptFinancial Ratios and CAMEL Presentation.ppt
Financial Ratios and CAMEL Presentation.ppt
PrinceAyangbesanOlam
Employee data login and attendance for region
Employee data login and attendance for regionEmployee data login and attendance for region
Employee data login and attendance for region
nagom47355
Digital Marketing Canvas for Charlotte Hornets
Digital Marketing Canvas for Charlotte HornetsDigital Marketing Canvas for Charlotte Hornets
Digital Marketing Canvas for Charlotte Hornets
DylanLee69

Large Scale Data Clustering: an overview

  • 1. Large Scale Data Clustering Algorithms Vahid Mirjalili Data Scientist Feb 11th 2016
  • 2. Outline 1. Overview of clustering algorithms and validation 2. Fast and accurate k-means clustering for large datasets 3. Clustering based on landmark points 4. Spectral relaxation for k-means clustering 5. Proposed methods for microbial community detection 2
  • 3. Part 1: Overview of Data Clustering Algorithms Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition letters 31.8 (2010): 651-666. 3
  • 4. Data clustering Goal: discover natural groupings among given data points Unsupervised learning (unlabeled data) Exploratory analysis (without any pre-specified model/hypothesis) Usages Gain insight from the underlying structure of data (salient features, anomaly detection, etc) Identify degree of similarity between points (infer phylogenetic relationships) Data Compression (summarizing data by cluster prototypes, removing redundant patterns) 4
  • 5. Applications Wide range of applications: computer vision, document clustering, gene clustering, customer/product groups An example application for Computer Visions: image segmentation and separating the background 5
  • 6. Different clustering algorithms Literature contains 1000 clustering algorithms Different criteria to divide clustering algorithms Soft vs. Hard Clustering Prototype vs. Density based Partitional vs. Hierarchical Clustering 6
  • 7. Partitional vs. Hierarchical 1. Partitional algorithms (k-means) Partition the data space Finds all clusters simultaneously 2. Hierarchical algorithms Generate nested cluster hierarchy Agglomerative (bottom-up) Divisive (top-down) Distance between clusters: Single-linkage, complete linkage, average-linkage 7
  • 8. K-means clustering Objective function: 1. Select K initial cluster centroids 2. Assign data points to the nearest cluster centroid 3. Update the new cluster centroids Figure curtesy: Data clustering: 50 years beyond k-means, Anil Jain 8
  • 9. K-means pros and cons + Simple/easy to implement + Order of linear complexity O(N Iterations) - Results highly dependent on initialization - Prone to local minima - Sensitive to outliers, and clusters sizes - Globular shaped clusters - Requiring multiple passes - Not applicable to categorical data Local minima Non-globular clusters Outliers 9
  • 10. K-means extensions K-means++ To improve the initialization process X-means To find the optimal number of clusters without prior knowledge Kernel K-means To form arbitrary/non-globular shaped clusters Fuzzy c-means Multiple cluster assignment (membership degree) K-medians More robust to outliers (median of each feature) K-medoids More robust to outliers, different distance metrics, categorical data Bisecting K-means and many more ... 10
  • 11. Kernel K-means vs. K-means Pyclust: Open Source Data Clustering Pckage 11
  • 12. Bisecting K-means 12 Pyclust: Open Source Data Clustering Pckage
  • 13. Other approaches in data clustering Prototype-based methods Clusters are formed based on similarity to a prototype K-means, k-medians, k-medoids, Density based methods (clusters are high density regions separated by low density regions) Jarvis-Patrick Algorithm: similairty between patterns defined as the number of common neighbors DBSCAN (MinPts, 竜-neighborhood) Identify types noise points/border points/core points 13
  • 14. DBSCAN pros and cons + No need to know number of clusters apriori + Identify arbitrary shaped clusters + Robust to noise and outliers - Sensitivity to the parameters - Problem with high dimensional data (subspace clustering) 14
  • 15. Clustering Validation 1. Internal Validity Indexes Assessing clustering quality based on how the data itself fit in the clustering structure Silhouette Stability and Admissibility analyses Test the sensitivity of an algorithm to changes in data while keeping the structures intact Convext admissiblity, cluster prportion, omission, and monotone admissibility 15 clustersotherallwithofitydissimilaraveragesmallest:)( clustersamethewithinofitydissimilaraverage:)( )()(max )()( )( iib iia ibia iaib iS
  • 16. Clustering Validation 2. Relative Indexes Assessing how similar two clustering solutions are 3. External Indexes Comparing a clustering solution with ground-truth labels/clusters Purity, Rand Index (RI), Normalized Mutual Information, F硫-score, MCC, 16 RePr Re.Pr1 Re Pr 2 2 F FNTP TP FPTP TP ワ緒 k k j c N jmax 1 C,Purity Purity is not a reliable measure by itself
  • 17. Part 2: Fast and Accurate K-means for Large Datasets Shindler, Michael, Alex Wong, and Adam W. Meyerson. "Fast and accurate k-means for large datasets." Advances in neural information processing systems. 2011 17
  • 18. Motivation Goal: Clustering Large Datasets Data cannot be stored in main memory Streaming model (sequential access) Facility Location problem: desired facility cost is given without prior knowledge of k Original K-means requires multiple passes through data not suitable for big data / streaming data 18
  • 19. Well Clusterable Data An instance of k-means is called -separable if reducing the number of clusters increases the cost of optimal k-means clustering by 1/2 K=3 Jk=3(C) K=2 Jk=2(C) )( )( 3 22 CJ CJ k k 緒 20
  • 20. K-means++ Seeding Procedure (non-streaming) Let S be the set of already selected seeds 1. Initially choose a point uniformly at random 2. Select a new point randomly with probability according to 3. Repeat until An improved version allows more than k centers Advantage: avoiding local minima traps K-means++: The advantages of careful seeding, N. Ailon et al. j jSp Spd ),( ),(2 min kS || kS || 22
  • 21. Proposed Algorithm Initialize Guess a small facility cost Find the closest facility point Decide whether to create a new facility point, or assign it to the closest cluster facility (add the weight=contribution to facility cost) Number of facility points overflow: increase f Merge (wegihted) facility points 23
  • 22. Approximate Nearest Neighbor Finding nearest facility point is the most time consuming part 1. Construct a random vector 2. Store the facility points in order of 3. For a new point , find the two facilities such that 4. Compute the distances to and This approximation can increase the approximation ratio by a constant factor w iyw.iy x )(log... 1 Oywxwyw ii o iy 1iy iyw. 1. iyww 24
  • 23. Algorithm Analysis Determine better facility cost Better approximation ratio (17) much less than previous work by Braverman Running time Running time with approximate nearest neighbor 25 )log( nnk ))loglog(log( nkn
  • 24. Part 3: Active Clustering of Biological Sequences Voevodski, Konstantin, et al. "Active clustering of biological sequences." The Journal of Machine Learning Research 13.1 (2012): 203-225. 26
  • 25. Motivation BLAST Sequence Query Create a hash table of all the words Pairwise alignment of subsequences in the same bucket No need to calculate the distances of a new query sequence to all the database sequences Previous clustering algorithm for gene sequences require all pair-wise calculations Goal: develop a clustering algorithm without computing all pair distances Query sequence Hash Table 27
  • 26. Landmark Clustering Algorithm Input: Dataset S Desired number of clusters k Probability of performance guarantee Objective function propertystability ),1( ワ わ1 Main Procedures: 1. Landmark Selection 2. Expanding Landmarks 3. Cluster Assignment ワワ 緒 k i Cx i i cxdC 1 ),()( 28 ...},,{ 21 llL 緒 nnLO log||:TimeRunning
  • 28. Expand-Landmarks Landmark l is working if }),(|{ rlsdSsBr l o Ball B around landmark l minsBl 30
  • 29. Cluster Assignment Construct graph using working landmark, Nodes represent (working) landmarks Edges represent the overlapping balls Find the connected components of graph Clustered points are the set of points in these balls The number of clusters is BG }Comp,...,Comp{)(Components 1 mBG )(Components BG 31
  • 30. Performance Analysis Number of required landmarks Active landmark selection Uniform selection (degrading performance) Good points: Landmark spread property: any set of good points must have a landmark closer than With high probability (1-隆), landmark spread property is satisfied Based on landmark spread property, Expand-Landmarks correctly cluster most of the points in a cluster core - Assumption of large clusters, doesnt capture smaller clusters 32 critcrit dxwxwdxw 17)()(and)( 2 鰹 )(xw )(2 xw critd /1lnkO /ln kkO
  • 31. Part 4: Spectral Relaxation of K-means Clustering Zha, Hongyuan, et al. "Spectral relaxation for k-means clustering." Advances in neural information processing systems. 2001. 33
  • 32. Motivation K-means is prone to local minima Reformulate k-means objective function as a trace maximization problem Different approaches to tackle this local minima issue: Improving the initialization process (K-means++) Relaxing constraints in the objective function (spectral relaxation method) 34
  • 33. Derivation Data matrix D Cost of a partitioning Cost for an single cluster Maximizing minimizing ワワ k i s s i i s i mdss 1 1 2)( )( 22 1 2)( / Fi T siF T ii s s i i si seeIDemDmdss i i 緒緒 ワ ワ 緒 k i i i T i i T i T i k i i s e DD s e DDssss )(trace)( 1 )(trace)(trace)( XDDXDDss TTT 緒 matrixindicatorlorthonormakniswhere X )(trace DXDX TT )(ss 1...1 ...... 1...1 e 35
  • 34. Theorem For a symmetric matrix with eigenvalues n oo ...21nnH )(tracemax...21 YHYT IYY n k T 緒 36
  • 35. Cluster Assignment Global Gram matrix Gram matrix for cluster i: Eigenvalue decomposition largest eigenvalue Err DD DD DD DD k T k T T T ...00 ............ 0...0 0...0 22 11 i T i DD ii yiiii T i yyDD n njjj T yyy yyDD 21 21 ... 鰹鰹鰹 DDT 37
  • 36. Cluster Assignment Method1: apply k-means to the matrix of k largest eigenvectors of global gram matrix (pivoted k-means) Method2: (pivoted QR decomposition of ) Davis-Khan sin(慮) Theorem: matrixorthogonaliswhere)( kkVErrOVYY kk 器 38 kcluster1cluster ,...,....,..., 111111 1 kkskks T k vyvyvyvyY k T kY
  • 37. Part 5: Application for Microbial Community Detection 39
  • 38. Large genetic sequence datasets Goal: cluster in streaming model / limited passes Landmark selection algorithm (one pass) Expand landmarks and assigning sequences to the nearest landmark Finding the nearest landmark: A hashing scheme to find the nearest landmark Require choice of hyper-parameters Assumption of -separability, and large clusters 40
  • 39. Hashing Scheme for nearest neighbor search Shindlers approximate nearest neighbor numeric random vector Create random sequences Hash function: Levenshtein (edit) distance between sequence x, and random sequences 41 New sequence x Hash Table xww . Closest landmark mrrr ...,,, 21 ),( ii rxEdith
  • 40. Acknowledgements Department of Computer Science and Engineering, Michigan State University My friend Sebastian Raschka, Data Scientist and author of Python Machine Learning Please visit http://vahidmirjalili.com 42

Editor's Notes

  • #18: Eliezer de Souza da Silva