TH?SE pour obtenir le grade de docteur d¨¦livr¨¦ par TELECOM ParisTech. Sp¨¦cialit¨¦ ? INFORMATIQUE et RESEAUX ?.
1 of 36
Download to read offline
More Related Content
Design and analysis of distributed k-nearest neighbors graph algorithms
1. Design and analysis of
distributed k-nearest
neighbors graph algorithms
Thibault Debatty
2018-10-05
2. Design and analysis of distributed k-nearest neighbors graph algorithms 2
k-nn graph
¡ñ Edge to k most similar nodes
¡ñ Directed graph
¡ñ Similarity assumed to be
symmetrical
3. Design and analysis of distributed k-nearest neighbors graph algorithms 3
k-nn graph
Is it useful for analyzing large
datasets ?
E.g.: triage (SPAM), proxy logs, etc.
4. Design and analysis of distributed k-nearest neighbors graph algorithms 4
k-nn graph
Challenges:
¡ñ Interactive analysis
¡ñ Avoid algorithms that require O(nx)
where x>1
¡ñ Distributed algorithms
¡ñ Similarity is not always a metric
(e.g. text datasets)
5. Design and analysis of distributed k-nearest neighbors graph algorithms 5
Outline
¡ñ Fundamental algorithms:
¨C Building from text datasets
¨C Online building
¨C Partitioning
¡ñ Applications:
¨C Text clustering
¨C APT detection
6. Design and analysis of distributed k-nearest neighbors graph algorithms 6
Building from text datasets
¡ñ Has to be built from a dataset
(like an index)
¡ñ Exact graph: O(n?) similarities
7. Design and analysis of distributed k-nearest neighbors graph algorithms 7
Building from text datasets
¡ñ We focus on large datasets
=> n? is not acceptable!
¡ñ Some algorithms if similarity is
metric
¡ñ But text similarity is not always
metric
(e.g. Jaro-Winkler)
8. Design and analysis of distributed k-nearest neighbors graph algorithms 8
Building from text datasets
¡ñ NN-Descent
Build an approximate graph
Compute O(n1.14) similarities
¡ñ BUT: iterative!
9. Design and analysis of distributed k-nearest neighbors graph algorithms 9
Building from text datasets
NNCTPH
¡ñ Hash using modified hashing
function
CTPH / ssdeep / spamsum
¡ñ Build subgraphs in parallel
¡ñ Merge subgraphs
Single iteration!
10. Design and analysis of distributed k-nearest neighbors graph algorithms 10
Building from text datasets
11. Design and analysis of distributed k-nearest neighbors graph algorithms 11
Building from text datasets
¡ñ Experimental evaluation:
¨C Apache Hadoop MapReduce
¨C SPAM dataset
¨C Jaro-Winkler string similarity
(not metric)
12. Design and analysis of distributed k-nearest neighbors graph algorithms 12
Building from text datasets
13. Design and analysis of distributed k-nearest neighbors graph algorithms 13
Building from text datasets
¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard,
and Wim Mees. Scalable graph building from text data. In
Proceedings of the 3rd International Workshop on Big
Data, Streams and Heterogeneous Source Mining:
Algorithms, Systems, Programming Models and
Applications, BigMine ¡¯14, 2014
¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard,
and Wim Mees. Building k-nn graphs from large text data.
In Proc. of IEEE BigData, 2014
¡ñ Implementation of NN-Descent and NNCTPH on GitHub
¡ñ java-string-similarity library on GitHub and Maven
¡ñ Java-lsh library on GitHub and Maven
14. Design and analysis of distributed k-nearest neighbors graph algorithms 14
Online building
¡ñ Given a distributed graph:
¨C Add nodes
¨C Remove nodes
¨C Search nearest neighbors of query node
¡ñ Requires k-medoids partitioning of
graph
15. Design and analysis of distributed k-nearest neighbors graph algorithms 15
Partitioning
¡ñ k-medoids clustering
¡ñ CLARANS is slow to converge
¡ñ Two faster methods:
¨C Inspired by Simulated Annealing
¨C Heuristic
¡ñ Impact of partitioning when we
perform distributed search
16. Design and analysis of distributed k-nearest neighbors graph algorithms 16
Text clustering
¡ñ Text dataset with Jaro-Winkler
similarity (not a metric)
¡ñ Steps:
¨C Build (approximate) k-nn graph
¨C Prune
¨C Compute connected components
17. Design and analysis of distributed k-nearest neighbors graph algorithms 17
APT Detection
¡ñ Advanced => no signatures
¡ñ Persistent => limited activity
¡ñ Threats
¡ñ Need a C2 channel
18. Design and analysis of distributed k-nearest neighbors graph algorithms 18
APT Detection
19. Design and analysis of distributed k-nearest neighbors graph algorithms 19
APT Detection
Here:
APT relying on HTTP
=> proxy logs
20. Design and analysis of distributed k-nearest neighbors graph algorithms 20
APT Detection
How hard can that be?
21. Design and analysis of distributed k-nearest neighbors graph algorithms 21
APT Detection
22. Design and analysis of distributed k-nearest neighbors graph algorithms 22
APT Detection
Displaying a page requires multiple
HTTP requests
=> link each request to its parent
using the logs from the proxy
23. Design and analysis of distributed k-nearest neighbors graph algorithms 23
APT Detection
24. Design and analysis of distributed k-nearest neighbors graph algorithms 24
APT Detection
25. Design and analysis of distributed k-nearest neighbors graph algorithms 25
APT Detection
weight is higher if:
¡ñ Requests are close in time
¡ñ Requests belong to the same domain
¡ñ Same sequence repeats
26. Design and analysis of distributed k-nearest neighbors graph algorithms 26
APT Detection
After pruning the weighted graph,
the APT remains isolated!
27. Design and analysis of distributed k-nearest neighbors graph algorithms 27
APT Detection
weight is higher if:
¡ñ Requests are close in time
¡ñ Requests belong to the same domain
¡ñ Same sequence repeats
29. Design and analysis of distributed k-nearest neighbors graph algorithms 29
APT Detection
30. Design and analysis of distributed k-nearest neighbors graph algorithms 30
APT Detection
¡ñ Experimental evaluation
¨C Proxy logs of real network
¨C Simulated APT traffic
¨C Rank suspicious domains
¡ñ Results
¨C High detection / false alarm ratio
¨C Without prior knowledge about APT
31. Design and analysis of distributed k-nearest neighbors graph algorithms 31
APT Detection
¡ñ False positives:
¨C Content Delivery Networks (CDN)
¨C Advertising domains
¨C Javascript library delivery
¨C Websites with very few visits
=> same behavior as APT
32. Design and analysis of distributed k-nearest neighbors graph algorithms 32
Conclusion
k-nn graph is an interesting tool to
analyze large datasets, but
¡ñ Only if approximation is acceptable
¡ñ Other possibilities exist
33. Design and analysis of distributed k-nearest neighbors graph algorithms 33
Perspectives...
¡ñ Broaden to other graph-like
structures:
¨C (Hierarchical) Small World Network
graphs
¨C Asymmetrical graphs
¡ñ Broaden to other applications
(clustering, nn search)
¡ñ Predict the magnitude of
approximation
34. Design and analysis of distributed k-nearest neighbors graph algorithms 34
Questions...
35. Design and analysis of distributed k-nearest neighbors graph algorithms 35
TH?SE
pour obtenir le grade de docteur d¨¦livr¨¦ par
TELECOM ParisTech
Sp¨¦cialit¨¦ ? INFORMATIQUE et RESEAUX ?
pr¨¦sent¨¦e et soutenue publiquement par
Thibault DEBATTY
le 5 octobre 2018
36. Design and analysis of distributed k-nearest neighbors graph algorithms 36
Directeurs de th¨¨se
¡ñ Pietro MICHIARDI
¡ñ Wim MEES
Jury
¡ñ M. Giovanni NEGLIA, Charg¨¦ de Recherche HDR, INRIA
¡ñ M. Jean-Michel DRICOT, Professeur associ¨¦,
Universit¨¦ Libre de Bruxelles
¡ñ M. Marc DACIER, Professeur, Eurecom
¡ñ Mme Elena BARALIS, Professeur, Politecnico di
Torino