
Modeling Decisions for Artificial Intelligence (MDAI)
2012, Girona, Spain

Introduction

• We benchmark two distinct algorithms for extracting community
structure from On-line Social Networks (OSNs), and consider how to
sample an OSN graph representatively while maintaining its
community structure.

• We do this by extracting the community structure from the
original and sampled versions of five well-known benchmark
datasets and comparing the results.
• We assume there is NO a priori knowledge about the
expected result.
• A supervised sampling is performed.

Extraction of the community structure

Algorithm 1: Newman's algorithm

• Extracts the communities by successively dividing the graph
into components, using Freeman's betweenness centrality
measure, until modularity Q is maximized.

• Modularity (Q): the measure used to quantify the quality of
the community partitions "on the fly". Usual range: [0.3 - 0.7].

• We have implemented it in Python (using the NetworkX library).
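To make the role of Q concrete, here is a minimal sketch of the standard Newman-Girvan modularity for an undirected, unweighted graph. It is written in plain Python rather than NetworkX so that it stays dependency-free, and it is not the authors' implementation. The function name, the toy graph, and the community labels are all assumptions chosen for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges        -- list of (u, v) pairs, each undirected edge listed once
    community_of -- dict mapping every node to a community label
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of node degrees inside community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of (fraction of intra-community edges)
    #     minus (expected fraction under random rewiring with the same degrees)
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "all" for node in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_single = modularity(edges, single)  # everything in one community
```

Splitting the toy graph into its two triangles scores higher than leaving it as one community, which is the kind of comparison a divisive method can use to decide where to stop.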

Extraction of the community structure

Algorithm 2: Blondel's method

1. The method looks for smaller communities by optimizing
modularity locally.
2. Then it aggregates nodes of the same community and builds
a new network whose nodes are communities.
Steps 1 and 2 are repeated until modularity Q is maximized.

• We used the default version provided by the Gephi graph
processing software.
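Step 2 above (aggregation) can be sketched in a few lines. This is only an illustration of the idea, not Gephi's implementation: the function name and the toy data are assumptions, and intra-community edges are counted once each as self-loop weight, a convention that differs between implementations.

```python
from collections import defaultdict

def aggregate(edges, community_of):
    """Collapse every community into a single node.

    Returns a dict {(c1, c2): weight} where weight is the number of
    original edges between communities c1 and c2. Edges inside one
    community become a self-loop (c, c).
    """
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        key = (cu, cv) if cu <= cv else (cv, cu)  # undirected: normalise order
        weights[key] += 1
    return dict(weights)

# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

coarse = aggregate(edges, community_of)
```

The coarse network has one node per community, and the local optimisation of step 1 can then be run again on it.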

Filtering / Sampling process

2-step process: to obtain a subset of a complete graph we
apply a process consisting of filtering and then sampling.

• First step: Filtering (seed node selection).
Graph nodes are filtered by their degree or their clustering
coefficient. Filtering thresholds are user defined.

  • Goal: Identify hub nodes and dense regions of the
  graph.

• Second step: Sampling.
We sample at 1 hop to obtain all the neighbours
connected to each seed node.

  • Goal: Maintain core community structure.
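The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the thresholds are treated as "greater than or equal to", and the sample keeps every edge between retained nodes (the slides only say that all neighbours of each seed are kept).

```python
from itertools import combinations

def build_adjacency(edges):
    """Adjacency sets for an undirected graph; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def select_seeds(adj, min_degree=None, min_clustering=None):
    """Step 1 (filtering): keep nodes that pass the user-defined thresholds."""
    seeds = set()
    for node in adj:
        if min_degree is not None and len(adj[node]) < min_degree:
            continue
        if min_clustering is not None and clustering_coefficient(adj, node) < min_clustering:
            continue
        seeds.add(node)
    return seeds

def sample_one_hop(adj, seeds):
    """Step 2 (sampling): seeds plus all their direct neighbours."""
    kept = set(seeds)
    for seed in seeds:
        kept |= adj[seed]
    # Assumption: keep every edge whose two endpoints were both retained.
    sampled_edges = {(u, v) for u in kept for v in adj[u] if v in kept and u < v}
    return kept, sampled_edges

# Toy graph (hypothetical): node 0 is a hub; nodes 5 and 6 hang off node 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (4, 5), (5, 6)]
adj = build_adjacency(edges)
seeds = select_seeds(adj, min_degree=3)
nodes, sampled_edges = sample_one_hop(adj, seeds)
```

With a degree threshold of 3, only the hub is selected as a seed, and the 1-hop sample keeps the hub's neighbourhood while dropping the peripheral chain.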

Datasets used

Graph Statistics

                   Karate   Dolphins   GrQc     Enron     Facebook
#Nodes             34       62         5242     10630     31720
#Edges             78       159        14496    164837    80592
Avg. degree        4.59     5.13       5.530    31.013    5.081
Clust. coef.       0.57     0.26       0.529    0.383     0.079
Avg. path length   2.408    3.356      6.049    3.160     6.432
Diameter           5        8          17       20        9
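As a quick consistency check on the statistics above: for an undirected graph the average degree is 2E / N. The short sketch below recomputes it from the reported node and edge counts. The dictionary layout and variable names are assumptions for illustration.

```python
# (nodes, edges, reported average degree) as listed in the dataset statistics
datasets = {
    "Karate":   (34, 78, 4.59),
    "Dolphins": (62, 159, 5.13),
    "GrQc":     (5242, 14496, 5.530),
    "Enron":    (10630, 164837, 31.013),
    "Facebook": (31720, 80592, 5.081),
}

def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge adds 2 to the degree sum."""
    return 2.0 * num_edges / num_nodes

computed = {name: average_degree(n, e) for name, (n, e, _) in datasets.items()}

for name, (n, e, reported) in datasets.items():
    print(f"{name:9s} computed={computed[name]:.3f} reported={reported}")
```

The recomputed values agree with the reported ones to within rounding in the last digit.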

Sample criteria

• Sampling requires user supervision

• Desired sample size: 10% to 20% of the original nodes

• Only the three large graph datasets are sampled

Resulting sample

Dataset      Filter             Value    Sample size   Sampling
ArXiv-GrQc   Degree             >= 30    17.91%        All neighbors
Enron        Clustering coef.   = 1      20.83%        All neighbors
Facebook     Clustering coef.   >= 0.5   10.75%        All neighbors

Sampling statistics

• Graph statistics for sampled vs. original datasets (sampled / original)

                   GrQc             Enron              Facebook
Filter             Degree >= 30     Clust. coef. = 1   Clust. coef. >= 0.5
#Nodes             939 / 5242       2218 / 10630       3410 / 31720
#Edges             5715 / 14496     14912 / 164837     6561 / 80592
Avg. degree        12.17 / 5.53     12.315 / 31.013    3.848 / 5.081
Clust. coef.       0.698 / 0.529    0.761 / 0.383      0.632 / 0.079
Avg. path length   4.51 / 6.049     3.143 / 3.160      8.388 / 6.432
Diameter           10 / 17          7 / 20             27 / 9

Sampling statistics

• Indicator: the clustering coefficient shows a common pattern: it
increases in every sampled dataset. This indicates that the core is
preferentially included in the samples.

Empirical Tests and Results

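1. First, we evaluate Newman's algorithm. The three large datasets are
processed in their sampled form; the two small ones (Karate, Dolphins)
are processed in their original form.

Dataset        Stop iteration   Q       Communities   Original (O) or Sampled (S)
Karate         4                0.494   5             O
Dolphins       5                0.591   6             O
GrQc           56               0.777   57            S
Enron          865              0.421   869           S
Enron Early*   51               0.325   56            S
Facebook       40               0.870   190           S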

Empirical Tests and Results

2. Blondel's method allows us to extract the communities from the
original datasets as well, given its greater execution speed compared
with Newman's method.

           Original          Sampled
           Q       C         Q       C
GrQc       0.856   390       0.789   11
Enron      0.491   43        0.560   68
Facebook   0.681   1105      0.519   33

(Q = modularity, C = number of communities)

• How do we compare the assignment of nodes to communities?
• NMI: Normalized Mutual Information

Normalized Mutual Information

• After labeling the communities, we match the nodes inside each
corresponding community in the sampled and original datasets.

• Purity: 100% means that all nodes in the same communities are matched
in both datasets.

• We compare the Top N communities (N = 10)

o Handicap
Newman's and Blondel's methods are stochastic (non-deterministic),
so they give slightly different results in each execution.
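For reference, NMI between two community assignments of the same nodes can be computed as below. This is a minimal sketch, not the authors' code. The slides do not say which normalisation was used; this version assumes the common form 2 I(A;B) / (H(A) + H(B)), and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two labelings of the same nodes.

    labels_a[i] and labels_b[i] are the community labels given to node i by
    the two partitions being compared. The result is 1.0 for identical
    partitions (up to relabeling) and 0.0 for independent ones.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual = sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (entropy_a + entropy_b)

# Toy labelings (hypothetical) for four nodes.
same_up_to_names = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

When comparing an original dataset with its sample, the two label lists would need to be aligned node by node and restricted to nodes present in both.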

           NMI orig. vs.   NMI sampled vs.   NMI orig. vs.   Net loss
           sampled (A)     sampled (B)       orig. (C)       (C - A)
GrQc       0.66559         0.82544           0.77301         0.10742
Enron      0.69069         0.86903           0.82012         0.12943
Facebook   0.58996         0.73249           0.69215         0.10219

Newman's Vs. Blondel's

• In terms of modularity (Q) and number of communities (C)

           Blondel's         Blondel's        Newman's
           Original          Sampled          Sampled
           Q       C         Q       C        Q       C
GrQc       0.856   390       0.789   11       0.777   57
Enron      0.491   43        0.560   68       0.325   56
Facebook   0.681   1105      0.519   33       0.870   190

• The best modularity values are dataset dependent.

• The methods may give distinct results in terms of the number of
communities and modularity values.

Newman's (NG) Vs. Blondel's (BN)

• In terms of NMI (Normalized Mutual Information),
comparing the Top N communities

              NMI BN vs. NG   NMI NG vs. BN   NMI orig. vs.   Net loss
              (A)             (B)             orig. (C)       (C - Avg(A, B))
GrQc          0.69116         0.87243         0.77301         -0.00878
Enron         0.31313         0.68796         0.82012         0.31958
Enron Early   0.83437         0.44320         0.82012         0.18133
Facebook      0.62056         0.54551         0.69215         0.10911

• Results show significant differences between the two methods in how
nodes are assigned to communities.
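The net-loss column can be reproduced directly from columns (A), (B) and (C) as C - Avg(A, B). The sketch below recomputes it from the reported NMI values; the variable names are assumptions for illustration.

```python
# (A, B, C, reported net loss) taken from the NG vs. BN comparison table
rows = {
    "GrQc":        (0.69116, 0.87243, 0.77301, -0.00878),
    "Enron":       (0.31313, 0.68796, 0.82012, 0.31958),
    "Enron Early": (0.83437, 0.44320, 0.82012, 0.18133),
    "Facebook":    (0.62056, 0.54551, 0.69215, 0.10911),
}

def net_loss(a, b, c):
    """Baseline NMI (C) minus the mean of the two cross-method NMI values."""
    return c - (a + b) / 2.0

recomputed = {name: net_loss(a, b, c) for name, (a, b, c, _) in rows.items()}

for name, (_, _, _, reported) in rows.items():
    print(f"{name:12s} recomputed={recomputed[name]:+.5f} reported={reported:+.5f}")
```

A value near zero, as for GrQc, means the two methods agree with each other about as well as repeated runs on the original data agree among themselves; a large value, as for Enron, means the methods assign nodes quite differently.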

Conclusions

• We have benchmarked 5 statistically and topologically distinct datasets

  • Applying 2 community structure algorithms
  • Sampling the original datasets

• Results indicate that it is possible to identify the principal communities
of large complex datasets using sampling.

  • Sampling maintains the key facets of the community structure of a
  dataset (the NMI statistic shows that high correspondence is maintained)
  • It significantly reduces the dataset size (by 80-90%)

• However, the assignment of nodes to communities differs between
executions and between methods, due to their stochastic nature.

Thank you for
your attention :-)


Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling
