
“A Comparative Study between Clustering Algorithms”
Pattern Discovery for Categorical Cross-Cultural Data in the Market Research Domain
September 2015
Author: Ahmed Hamada
Supervisor: Professor Plamen Angelov
Reviewer: Professor Nigel Davies
Industry Partner: Bonamy Finch

INDUSTRY PARTNER
50+ customers

THE CHALLENGE
Cross-cultural attitudinal segmentation studies based on rating scales are a
serious challenge in the market research domain: unlike clustering on
demographics, these studies contain many shared views with fuzzy boundaries.
Producing clusters that are meaningful, that realistically reflect the
respondent segments, and that also have good geometrical properties is a
further demanding problem in this domain.

GAP ANALYSIS
• 76% used K-means as the partitioning method for their segmentation.
• 93% of the segmentation studies used Euclidean distance.
• More than 60% of the examined market research studies did not include an
evaluation criterion for the developed clusters.
Source: a multivariate survey study of 243 market segmentation publications
in the tourism domain (Dolnicar, 2003).

K-MEANS PROBLEMS
Data Dimensionality
• In high-dimensional data, distances between points become relatively
uniform, so the concept of a point's nearest neighbour becomes meaningless.
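The effect can be illustrated with a small simulation. This is only a sketch on uniformly random points, not the survey data from the study; the function name `relative_contrast` and the sample sizes are assumptions made for the illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one query point to a random sample."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 20, 200, 2000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks towards zero, which is the sense in which the nearest neighbour stops being informative.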
Dissimilarity Measure
• The problem is not only the distance measure: K-means also has to compute
a mean, and there is no reasonable mean for categorical data.
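A minimal sketch of the usual workaround for categorical data, in the spirit of K-modes: replace the distance with a simple matching count and replace the mean with the attribute-wise mode. The function names and the survey answers below are hypothetical and are not taken from the study.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Attribute-wise most frequent category; stands in for the centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical survey answers (illustrative only).
records = [
    ("agree", "often", "yes"),
    ("agree", "rarely", "yes"),
    ("disagree", "often", "yes"),
]
mode = cluster_mode(records)
distances = [matching_dissimilarity(r, mode) for r in records]
print(mode, distances)
```

The mode is always a valid combination of categories, whereas an arithmetic mean of category codes would depend on an arbitrary numeric coding.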
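Non-Convex Shaped Clusters
• In Euclidean space, an object is convex if, for every pair of points within
the object, every point on the straight line segment that joins them is also
within the object. K-means assigns each point to its nearest centroid, so the
regions it produces are convex and non-convex clusters cannot be recovered
directly.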
Local Minima
• Differentiating the objective function with respect to the centroids finds
only a local minimum. More runs from more initialisation points can reach the
global minimum.
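The restart idea can be sketched as follows. This is a plain Lloyd-style K-means written for illustration on synthetic two-dimensional points; it is not the implementation used in the study, and the data, seed and number of restarts are assumptions.

```python
import math
import random

def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    centroids = rng.sample(points, k)  # random initialisation from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if the cluster is empty.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, wcss

# Synthetic data: three blobs (illustrative only).
rng = random.Random(1)
centres = [(0, 0), (5, 5), (0, 5)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in centres for _ in range(30)]

# Run from 20 random initialisations and keep the lowest objective value.
runs = [kmeans(points, 3, rng)[1] for _ in range(20)]
best_wcss, worst_wcss = min(runs), max(runs)
print(f"best WCSS = {best_wcss:.1f}, worst WCSS = {worst_wcss:.1f}")
```

Keeping the best of several runs never does worse than a single run, but it still does not guarantee the global minimum.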

EXPERIMENTS
Partitioning methods: K-means, K-modes, Kernel K-means
Hierarchical methods: ROCK

Experimental set-ups:
• K-means on raw data
• K-means on standardised rows
• MCA on raw data + K-means
• Kernel K-means on raw data
• Kernel K-means on standardised rows
• K-modes on raw data
• ROCK on raw data

Diagram labels: Euclidean distance; matching measure; arbitrary shaped
clusters; non-convex shaped clusters

21 experiments (7 × 3)
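The slides do not spell out how rows were standardised. One common reading, assumed here, is that each respondent's ratings are centred and scaled by that respondent's own mean and standard deviation, which removes differences in how people use the scale. The function name and the ratings below are hypothetical.

```python
import statistics

def standardise_rows(ratings):
    """Centre and scale each respondent's answers by their own mean and standard deviation."""
    out = []
    for row in ratings:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        # A respondent who gives the same rating everywhere has sd == 0; map that row to zeros.
        out.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return out

# Hypothetical ratings on a 5-point scale (illustrative only).
ratings = [
    [5, 5, 4, 5],  # uses the top of the scale
    [2, 2, 1, 2],  # same pattern, lower part of the scale
    [3, 3, 3, 3],  # flat responder
]
standardised = standardise_rows(ratings)
for row in standardised:
    print([round(x, 2) for x in row])
```

After standardisation the first two respondents become identical, so a clustering algorithm groups them by answer pattern rather than by scale usage.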

DETERMINING THE NUMBER OF CLUSTERS
[Figure: Gap statistic for 10 clusters]
[Figure: Within sum of squares for 10 clusters]
Resulting candidates: 5-, 6- and 7-cluster models

7-CLUSTER MODEL GEOMETRICAL COMPARISON
Algorithm                              Within cluster sum of squares   Cluster closeness index
K-means                                117,604                         21%
K-means on standardised rows           87,232                          18%
MCA + K-means                          1,644                           59%
Kernel K-means                         283,904                         0.04%
Kernel K-means on standardised rows    224,892                         0.05%

INTERNAL MEASURES COMPARISON
Dunn index (values listed in chart order within each group)
5 clusters: 0.102, 0.05, 0.09, 0.08, 0.07
6 clusters: 0.125, 0.05, 0.1, 0, 0.05
7 clusters: 0.109, 0.05, 0.1, 0.08, 0.05

Silhouette measure (values listed in chart order within each group)
5 clusters: 0.05, 0.03, -0.02, -0.01, -0.01
6 clusters: 0.05, 0.03, -0.02, 0.01, 0.01
7 clusters: 0.04, 0.03, -0.03, -0.01, 0.02
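For reference, both internal measures can be computed directly from the points and their cluster labels. The sketch below uses the classical definitions (Dunn: smallest between-cluster distance over largest within-cluster diameter; silhouette: mean of (b - a) / max(a, b)) with Euclidean distance on toy points. It assumes at least two clusters and is not the code used to produce the values above.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest within-cluster diameter."""
    min_between, max_within = math.inf, 0.0
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        if lp == lq:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_within

def mean_silhouette(points, labels):
    """Average over all points of (b - a) / max(a, b)."""
    scores = []
    for i, (p, lp) in enumerate(zip(points, labels)):
        dist_by_cluster = {}
        for j, (q, lq) in enumerate(zip(points, labels)):
            if i != j:
                dist_by_cluster.setdefault(lq, []).append(math.dist(p, q))
        if lp not in dist_by_cluster:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = sum(dist_by_cluster[lp]) / len(dist_by_cluster[lp])  # mean distance within own cluster
        b = min(sum(v) / len(v) for lab, v in dist_by_cluster.items() if lab != lp)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy points: two well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]
print(dunn_index(points, good_labels), mean_silhouette(points, good_labels))
print(dunn_index(points, mixed_labels), mean_silhouette(points, mixed_labels))
```

Higher is better for both measures. Silhouette values close to zero or below, as reported on this slide, indicate heavily overlapping clusters.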

INDUSTRY EVALUATION
                         K-means on standardised rows   Kernel K-means on standardised rows
No. of clusters          5      6      7                5      6      7
Response Bias Freedom
  Cluster 1              79%    86%    79%              70%    59%    58%
  Cluster 2              81%    77%    67%              93%    61%    71%
  Cluster 3              90%    61%    79%              77%    64%    75%
  Cluster 4              72%    81%    71%              74%    79%    83%
  Cluster 5              80%    70%    75%              79%    67%    67%
  Cluster 6              -      71%    71%              -      61%    79%
  Cluster 7              -      -      71%              -      -      79%
Reportability
  Cluster 1              71%    62%    67%              62%    76%    71%
  Cluster 2              38%    19%    19%              90%    24%    19%
  Cluster 3              19%    29%    81%              48%    81%    71%
  Cluster 4              43%    52%    29%              71%    33%    62%
  Cluster 5              62%    52%    43%              10%    33%    43%
  Cluster 6              -      71%    57%              -      33%    43%
  Cluster 7              -      -      62%              -      -      52%

5-CLUSTER MODEL SCATTER PLOT MATRIX FOR THE FIRST 4 VARIABLES
[Figure panels: K-means on standardised rows; Kernel K-means on standardised rows]

CONCLUSION
1. The results of this research showed that standardising the respondents'
rows produced better segments from a pragmatic point of view.
2. In the overall evaluation, the 5-cluster models produced by K-means and
kernel K-means on standardised rows yielded more meaningful segments than
the other methods.
3. The results showed that the ROCK algorithm and the application of MCA
followed by K-means were not suitable for multiscale categorical data and
resulted in meaningless clusters.

FURTHER RESEARCH
• Evaluate the stability of the classification accuracy using different
algorithms
• Study other clustering methods available in the literature
• Evaluate the same algorithms on various cross-cultural multiscale
data sets and test the hypothesis that multi-scaled data (i.e. Likert
scale) produce better clusters from a geometrical point of view.
• Evaluate the clustering algorithms on different types of response
scales, rather than only the multi-point biased response scales.
