ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Design and analysis of
distributed k-nearest
neighbors graph algorithms
Thibault Debatty
2018-10-05
Design and analysis of distributed k-nearest neighbors graph algorithms 2
k-nn graph
¡ñ Edge to k most similar nodes
¡ñ Directed graph
¡ñ Similarity assumed to be
symmetrical
Design and analysis of distributed k-nearest neighbors graph algorithms 3
k-nn graph
Is it useful for analyzing large
datasets ?
E.g.: triage (SPAM), proxy logs, etc.
Design and analysis of distributed k-nearest neighbors graph algorithms 4
k-nn graph
Challenges:
¡ñ Interactive analysis
¡ñ Avoid algorithms that require O(nx)
where x>1
¡ñ Distributed algorithms
¡ñ Similarity is not always a metric
(e.g. text datasets)
Design and analysis of distributed k-nearest neighbors graph algorithms 5
Outline
¡ñ Fundamental algorithms:
¨C Building from text datasets
¨C Online building
¨C Partitioning
¡ñ Applications:
¨C Text clustering
¨C APT detection
Design and analysis of distributed k-nearest neighbors graph algorithms 6
Building from text datasets
¡ñ Has to be built from a dataset
(like an index)
¡ñ Exact graph: O(n?) similarities
Design and analysis of distributed k-nearest neighbors graph algorithms 7
Building from text datasets
¡ñ We focus on large datasets
=> n? is not acceptable!
¡ñ Some algorithms if similarity is
metric
¡ñ But text similarity is not always
metric
(e.g. Jaro-Winkler)
Design and analysis of distributed k-nearest neighbors graph algorithms 8
Building from text datasets
¡ñ NN-Descent
Build an approximate graph
Compute O(n1.14) similarities
¡ñ BUT: iterative!
Design and analysis of distributed k-nearest neighbors graph algorithms 9
Building from text datasets
NNCTPH
¡ñ Hash using modified hashing
function
CTPH / ssdeep / spamsum
¡ñ Build subgraphs in parallel
¡ñ Merge subgraphs
Single iteration!
Design and analysis of distributed k-nearest neighbors graph algorithms 10
Building from text datasets
Design and analysis of distributed k-nearest neighbors graph algorithms 11
Building from text datasets
¡ñ Experimental evaluation:
¨C Apache Hadoop MapReduce
¨C SPAM dataset
¨C Jaro-Winkler string similarity
(not metric)
Design and analysis of distributed k-nearest neighbors graph algorithms 12
Building from text datasets
Design and analysis of distributed k-nearest neighbors graph algorithms 13
Building from text datasets
¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard,
and Wim Mees. Scalable graph building from text data. In
Proceedings of the 3rd International Workshop on Big
Data, Streams and Heterogeneous Source Mining:
Algorithms, Systems, Programming Models and
Applications, BigMine ¡¯14, 2014
¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard,
and Wim Mees. Building k-nn graphs from large text data.
In Proc. of IEEE BigData, 2014
¡ñ Implementation of NN-Descent and NNCTPH on GitHub
¡ñ java-string-similarity library on GitHub and Maven
¡ñ Java-lsh library on GitHub and Maven
Design and analysis of distributed k-nearest neighbors graph algorithms 14
Online building
¡ñ Given a distributed graph:
¨C Add nodes
¨C Remove nodes
¨C Search nearest neighbors of query node
¡ñ Requires k-medoids partitioning of
graph
Design and analysis of distributed k-nearest neighbors graph algorithms 15
Partitioning
¡ñ k-medoids clustering
¡ñ CLARANS is slow to converge
¡ñ Two faster methods:
¨C Inspired by Simulated Annealing
¨C Heuristic
¡ñ Impact of partitioning when we
perform distributed search
Design and analysis of distributed k-nearest neighbors graph algorithms 16
Text clustering
¡ñ Text dataset with Jaro-Winkler
similarity (not a metric)
¡ñ Steps:
¨C Build (approximate) k-nn graph
¨C Prune
¨C Compute connected components
Design and analysis of distributed k-nearest neighbors graph algorithms 17
APT Detection
¡ñ Advanced => no signatures
¡ñ Persistent => limited activity
¡ñ Threats
¡ñ Need a C2 channel
Design and analysis of distributed k-nearest neighbors graph algorithms 18
APT Detection
Design and analysis of distributed k-nearest neighbors graph algorithms 19
APT Detection
Here:
APT relying on HTTP
=> proxy logs
Design and analysis of distributed k-nearest neighbors graph algorithms 20
APT Detection
How hard can that be?
Design and analysis of distributed k-nearest neighbors graph algorithms 21
APT Detection
Design and analysis of distributed k-nearest neighbors graph algorithms 22
APT Detection
Displaying a page requires multiple
HTTP requests
=> link each request to its parent
using the logs from the proxy
Design and analysis of distributed k-nearest neighbors graph algorithms 23
APT Detection
Design and analysis of distributed k-nearest neighbors graph algorithms 24
APT Detection
Design and analysis of distributed k-nearest neighbors graph algorithms 25
APT Detection
weight is higher if:
¡ñ Requests are close in time
¡ñ Requests belong to the same domain
¡ñ Same sequence repeats
Design and analysis of distributed k-nearest neighbors graph algorithms 26
APT Detection
After pruning the weighted graph,
the APT remains isolated!
Design and analysis of distributed k-nearest neighbors graph algorithms 27
APT Detection
weight is higher if:
¡ñ Requests are close in time
¡ñ Requests belong to the same domain
¡ñ Same sequence repeats
Design and analysis of distributed k-nearest neighbors graph algorithms 28
APT Detection
¡ñ Batch: build graphs
¡ñ Interactive (web interface):
¨C Merge
¨C Prune
¨C Cluster
¨C Filter
¡ñ Approximate k-nn graph
(time and memory)
Design and analysis of distributed k-nearest neighbors graph algorithms 29
APT Detection
Design and analysis of distributed k-nearest neighbors graph algorithms 30
APT Detection
¡ñ Experimental evaluation
¨C Proxy logs of real network
¨C Simulated APT traffic
¨C Rank suspicious domains
¡ñ Results
¨C High detection / false alarm ratio
¨C Without prior knowledge about APT
Design and analysis of distributed k-nearest neighbors graph algorithms 31
APT Detection
¡ñ False positives:
¨C Content Delivery Networks (CDN)
¨C Advertising domains
¨C Javascript library delivery
¨C Websites with very few visits
=> same behavior as APT
Design and analysis of distributed k-nearest neighbors graph algorithms 32
Conclusion
k-nn graph is an interesting tool to
analyze large datasets, but
¡ñ Only if approximation is acceptable
¡ñ Other possibilities exist
Design and analysis of distributed k-nearest neighbors graph algorithms 33
Perspectives...
¡ñ Broaden to other graph-like
structures:
¨C (Hierarchical) Small World Network
graphs
¨C Asymmetrical graphs
¡ñ Broaden to other applications
(clustering, nn search)
¡ñ Predict the magnitude of
approximation
Design and analysis of distributed k-nearest neighbors graph algorithms 34
Questions...
Design and analysis of distributed k-nearest neighbors graph algorithms 35
TH?SE
pour obtenir le grade de docteur d¨¦livr¨¦ par
TELECOM ParisTech
Sp¨¦cialit¨¦ ? INFORMATIQUE et RESEAUX ?
pr¨¦sent¨¦e et soutenue publiquement par
Thibault DEBATTY
le 5 octobre 2018
Design and analysis of distributed k-nearest neighbors graph algorithms 36
Directeurs de th¨¨se
¡ñ Pietro MICHIARDI
¡ñ Wim MEES
Jury
¡ñ M. Giovanni NEGLIA, Charg¨¦ de Recherche HDR, INRIA
¡ñ M. Jean-Michel DRICOT, Professeur associ¨¦,
Universit¨¦ Libre de Bruxelles
¡ñ M. Marc DACIER, Professeur, Eurecom
¡ñ Mme Elena BARALIS, Professeur, Politecnico di
Torino

More Related Content

Design and analysis of distributed k-nearest neighbors graph algorithms

  • 1. Design and analysis of distributed k-nearest neighbors graph algorithms Thibault Debatty 2018-10-05
  • 2. Design and analysis of distributed k-nearest neighbors graph algorithms 2 k-nn graph ¡ñ Edge to k most similar nodes ¡ñ Directed graph ¡ñ Similarity assumed to be symmetrical
  • 3. Design and analysis of distributed k-nearest neighbors graph algorithms 3 k-nn graph Is it useful for analyzing large datasets ? E.g.: triage (SPAM), proxy logs, etc.
  • 4. Design and analysis of distributed k-nearest neighbors graph algorithms 4 k-nn graph Challenges: ¡ñ Interactive analysis ¡ñ Avoid algorithms that require O(nx) where x>1 ¡ñ Distributed algorithms ¡ñ Similarity is not always a metric (e.g. text datasets)
  • 5. Design and analysis of distributed k-nearest neighbors graph algorithms 5 Outline ¡ñ Fundamental algorithms: ¨C Building from text datasets ¨C Online building ¨C Partitioning ¡ñ Applications: ¨C Text clustering ¨C APT detection
  • 6. Design and analysis of distributed k-nearest neighbors graph algorithms 6 Building from text datasets ¡ñ Has to be built from a dataset (like an index) ¡ñ Exact graph: O(n?) similarities
  • 7. Design and analysis of distributed k-nearest neighbors graph algorithms 7 Building from text datasets ¡ñ We focus on large datasets => n? is not acceptable! ¡ñ Some algorithms if similarity is metric ¡ñ But text similarity is not always metric (e.g. Jaro-Winkler)
  • 8. Design and analysis of distributed k-nearest neighbors graph algorithms 8 Building from text datasets ¡ñ NN-Descent Build an approximate graph Compute O(n1.14) similarities ¡ñ BUT: iterative!
  • 9. Design and analysis of distributed k-nearest neighbors graph algorithms 9 Building from text datasets NNCTPH ¡ñ Hash using modified hashing function CTPH / ssdeep / spamsum ¡ñ Build subgraphs in parallel ¡ñ Merge subgraphs Single iteration!
  • 10. Design and analysis of distributed k-nearest neighbors graph algorithms 10 Building from text datasets
  • 11. Design and analysis of distributed k-nearest neighbors graph algorithms 11 Building from text datasets ¡ñ Experimental evaluation: ¨C Apache Hadoop MapReduce ¨C SPAM dataset ¨C Jaro-Winkler string similarity (not metric)
  • 12. Design and analysis of distributed k-nearest neighbors graph algorithms 12 Building from text datasets
  • 13. Design and analysis of distributed k-nearest neighbors graph algorithms 13 Building from text datasets ¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard, and Wim Mees. Scalable graph building from text data. In Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine ¡¯14, 2014 ¡ñ Thibault Debatty, Pietro Michiardi, Olivier Thonnard, and Wim Mees. Building k-nn graphs from large text data. In Proc. of IEEE BigData, 2014 ¡ñ Implementation of NN-Descent and NNCTPH on GitHub ¡ñ java-string-similarity library on GitHub and Maven ¡ñ Java-lsh library on GitHub and Maven
  • 14. Design and analysis of distributed k-nearest neighbors graph algorithms 14 Online building ¡ñ Given a distributed graph: ¨C Add nodes ¨C Remove nodes ¨C Search nearest neighbors of query node ¡ñ Requires k-medoids partitioning of graph
  • 15. Design and analysis of distributed k-nearest neighbors graph algorithms 15 Partitioning ¡ñ k-medoids clustering ¡ñ CLARANS is slow to converge ¡ñ Two faster methods: ¨C Inspired by Simulated Annealing ¨C Heuristic ¡ñ Impact of partitioning when we perform distributed search
  • 16. Design and analysis of distributed k-nearest neighbors graph algorithms 16 Text clustering ¡ñ Text dataset with Jaro-Winkler similarity (not a metric) ¡ñ Steps: ¨C Build (approximate) k-nn graph ¨C Prune ¨C Compute connected components
  • 17. Design and analysis of distributed k-nearest neighbors graph algorithms 17 APT Detection ¡ñ Advanced => no signatures ¡ñ Persistent => limited activity ¡ñ Threats ¡ñ Need a C2 channel
  • 18. Design and analysis of distributed k-nearest neighbors graph algorithms 18 APT Detection
  • 19. Design and analysis of distributed k-nearest neighbors graph algorithms 19 APT Detection Here: APT relying on HTTP => proxy logs
  • 20. Design and analysis of distributed k-nearest neighbors graph algorithms 20 APT Detection How hard can that be?
  • 21. Design and analysis of distributed k-nearest neighbors graph algorithms 21 APT Detection
  • 22. Design and analysis of distributed k-nearest neighbors graph algorithms 22 APT Detection Displaying a page requires multiple HTTP requests => link each request to its parent using the logs from the proxy
  • 23. Design and analysis of distributed k-nearest neighbors graph algorithms 23 APT Detection
  • 24. Design and analysis of distributed k-nearest neighbors graph algorithms 24 APT Detection
  • 25. Design and analysis of distributed k-nearest neighbors graph algorithms 25 APT Detection weight is higher if: ¡ñ Requests are close in time ¡ñ Requests belong to the same domain ¡ñ Same sequence repeats
  • 26. Design and analysis of distributed k-nearest neighbors graph algorithms 26 APT Detection After pruning the weighted graph, the APT remains isolated!
  • 27. Design and analysis of distributed k-nearest neighbors graph algorithms 27 APT Detection weight is higher if: ¡ñ Requests are close in time ¡ñ Requests belong to the same domain ¡ñ Same sequence repeats
  • 28. Design and analysis of distributed k-nearest neighbors graph algorithms 28 APT Detection ¡ñ Batch: build graphs ¡ñ Interactive (web interface): ¨C Merge ¨C Prune ¨C Cluster ¨C Filter ¡ñ Approximate k-nn graph (time and memory)
  • 29. Design and analysis of distributed k-nearest neighbors graph algorithms 29 APT Detection
  • 30. Design and analysis of distributed k-nearest neighbors graph algorithms 30 APT Detection ¡ñ Experimental evaluation ¨C Proxy logs of real network ¨C Simulated APT traffic ¨C Rank suspicious domains ¡ñ Results ¨C High detection / false alarm ratio ¨C Without prior knowledge about APT
  • 31. Design and analysis of distributed k-nearest neighbors graph algorithms 31 APT Detection ¡ñ False positives: ¨C Content Delivery Networks (CDN) ¨C Advertising domains ¨C Javascript library delivery ¨C Websites with very few visits => same behavior as APT
  • 32. Design and analysis of distributed k-nearest neighbors graph algorithms 32 Conclusion k-nn graph is an interesting tool to analyze large datasets, but ¡ñ Only if approximation is acceptable ¡ñ Other possibilities exist
  • 33. Design and analysis of distributed k-nearest neighbors graph algorithms 33 Perspectives... ¡ñ Broaden to other graph-like structures: ¨C (Hierarchical) Small World Network graphs ¨C Asymmetrical graphs ¡ñ Broaden to other applications (clustering, nn search) ¡ñ Predict the magnitude of approximation
  • 34. Design and analysis of distributed k-nearest neighbors graph algorithms 34 Questions...
  • 35. Design and analysis of distributed k-nearest neighbors graph algorithms 35 TH?SE pour obtenir le grade de docteur d¨¦livr¨¦ par TELECOM ParisTech Sp¨¦cialit¨¦ ? INFORMATIQUE et RESEAUX ? pr¨¦sent¨¦e et soutenue publiquement par Thibault DEBATTY le 5 octobre 2018
  • 36. Design and analysis of distributed k-nearest neighbors graph algorithms 36 Directeurs de th¨¨se ¡ñ Pietro MICHIARDI ¡ñ Wim MEES Jury ¡ñ M. Giovanni NEGLIA, Charg¨¦ de Recherche HDR, INRIA ¡ñ M. Jean-Michel DRICOT, Professeur associ¨¦, Universit¨¦ Libre de Bruxelles ¡ñ M. Marc DACIER, Professeur, Eurecom ¡ñ Mme Elena BARALIS, Professeur, Politecnico di Torino