This document outlines Oscar Mauricio Caicedo's research into applying machine learning techniques to networking problems. It discusses using unsupervised learning for heavy hitter threshold estimation and using supervised learning for multipath routing in software-defined data center networks. Specifically, it examines using k-means clustering to analyze network traffic and determine optimal thresholds for classifying heavy hitter flows, and proposes a proactive approach called NELLY that uses incremental learning at the server-side to identify elephant flows for multipath routing.
1 of 65
Download to read offline
More Related Content
Walking on AI/ML for Networking
2. Walking on AI/ML for
Networking
Ph.D Oscar Mauricio Caicedo
Grupo de Ingenier鱈a Telem叩tica
Universidad del Cauca
Colombia
3. Outline
Unsupervised Learning for Heavy-Hitters Threshold Estimation
Supervised Learning for Multipath Routing in Software-Defined Data Center
Networks
Reinforcement Learning for Dynamic and Intelligent Routing
Research Direction
3
6. Duration time between the first and last packet of the flow
Size total number of bytes transmitted in the flow
Rate given by the size divided by the duration
Burstiness characterization of burst in a flow over time t
(Duque-Torres et al., 2019)
Heavy-Hitters Classification Schemes
6
7. Outlier and anomaly detection
Load balancing
Alleviating network congestion
(Duque-Torres et al., 2019)
Need for Heavy-Hitter Flows Detection
HH detections is far from being
obvious
Threshold estimation is a major
challenge
7
8. (Duque-Torres et al., 2019)
Knowledge Discovery Process
Threshold Estimation based on Knowledge Discovery
Process
8
9. (Duque-Torres et al., 2019)
Can the Knowledge Discovery Process (KDP) to shed new light
on threshold estimation in HH detection?
Can we determine an optimal threshold or set of thresholds
that would optimally separate the flows into HHs and non-HHs
in a variety of network and traffic conditions?
Threshold Estimation based on Knowledge Discovery
Process
9
10. Duration tortoises and dragonflies
Sizes elephants and mice
Rate cheetahs and snails
Burstiness porcupines and stingrays
Understanding the Problem
10
12. Work Citations Year Appeared In Threshold Value
Curtis et al. 187 2011 IEEE INFOCOM Size
128KB, 1MB and
100MB
Chiesa et al. 29 2016 IEEE/ACM Trans. Netw. Bandwidth 10%
Moshref et al. 47 2013 ACM HotSDN Bandwidth 10%
Liu et al. 19 2013 IEEE Commun. Lett. Rate 1-10 Mbps
Trestain et al. 14 2013 IFIP/IEEE IM Rate Not specified
Lin et al. 9 2014 IEEE GLOBECOM Size 100 MB
Bi et al. 8 2013 IEEE GLOBECOM Size Adaptive
Understanding the Problem
(Duque-Torres et al., 2019)
12
13. TCP
(Duque-Torres et al., 2019)
Understanding the Data
Dataset Count Average
UNIV1
UNIV2
CAIDA2016
CAIDA2018
6886055
6092
9017007
8225731
710.74
198.84
517.31
1021.20
UDP
Count Average
1453303
9869 978
920417
1618524
390.74
758.24
433.62
674.91
TOTAL
Count Average
9999617
9 99956
9994197
9982640
582.60
749.34
509.13
959
Datasets
13
14. Feature name Feature description Direction
min_ps Packet with minimum size in the flow F & B
max_ps Packet with maximum size in the flow F & B
avg_ps Average packet size F & B
std_ps Standard deviation of packet size F & B
min_piat Minimum packet inter-arrival time F & B
max_piat Maximum packet inter-arrival time F & B
avg_piat Average packet inter-arrival time F & B
std_piat Stand. dev. of packet inter-arrival time F & B
octTotCount Number of transmitted bytes in the flow F & B
pktTotCount Number of transmitted pkts in the flow F & B
flowDur Duration of the flow (in seconds)
Data Preparation
(Duque-Torres et al., 2019)
14
15. K-Means was used due to its simplicity, speed, and accuracy
The optimal number of clusters was defined by Silhouette method
Data Mining
Value Interpretation
Si = 1 The samples are very well clustered
0.5 < Si < 0.7 The samples have a reasonable structure for clustering
Si = 0.5 The samples are probably placed in the wrong cluster
0.26 < Si < 0.5 The samples do not have a substantial structure for clustering
Si = 0 The samples lies between two cluster
Si < 0 The samples are placed in the wrong cluster
(Duque-Torres et al., 2019)
15
17. Data Mining
K-means along the
two first principal
components
k = 2
(Duque-Torres et al., 2019)
Dataset Class I Class II Total
UNIV1
UNIV2
CAIDA2016
CAIDA2018
293 962
18 056
451 696
563 869
8
1
9
3
293 970
18 057
451 705
563 872
17
18. Data Mining
K-means along the
two first principal
components
k = 4
(Duque-Torres et al., 2019)
18
20. Silhouette Analysis - Ambiguous Flows
(Class I, Ambiguous Flows) (Duque-Torres et al., 2019)
Evaluating the Discovered Knowledge
20
21. (Duque-Torres et al., 2019)
Ambiguous Flows (Class I)
Evaluating the Discovered Knowledge
max min average
11 394
18 180
7 601
14 258
1
2
1
1
13.05
108.81
14.7
13.53
281 168
28 4 648
28 465
44 251
149 047
108 230
20775
14 784
1 275
17 376
834
1 356.28
Flow Size [MB]
K Dataset flows
I
UNIV1
UNIV2
CAIDA2016
CAIDA2018
293 002
17 979
449 904
562 273
II
UNIV1
UNIV2
CAIDA2016
CAIDA2018
707
36
1 147
1 199
max min average
0.456
6.5
0.324
0.922
0.000032
0.000064
0.000028
0.000029
0.00629
0.0214
0.00660
0.00558
145.39
28.87
1.34
3.734
1.88
107.610
31.14
21.45
0.306
178.96
37.60
34.08
Number of Packets Flow Duration [s]
max min average
2 643.73
455.32
20.20
22.40
0
0
0
0
6.09
33.87
3.29
2.36
2 643.05
453.71
20.20
22.403
0.379
16.06
0.05
0.030
0.379
339.58
15.65
15.52
For TCP traffic
Flow size = 6 kB
Packet count = 14 21
22. There is no threshold that can
unambiguously classify HH flows
Significance of network and traffic
characteristics
Flow size = 6 kB
Packet count = 14
(Duque-Torres et al., 2019)
Evaluating the Discovered Knowledge
22
23. Open challenges
The time required to collect enough flow details for
reliable classification requires further research
Investigate the usefulness of per-flow packet size
distributions for accurate classification
Develop a threshold-less approach for HH
classification
Evaluating the Discovered Knowledge
23
25. Supervised Learning
Learning approach Training dataset Problems aimed
Supervised Labeled
Classification and regression
Semi-supervised Incomplete labels
Classification Regression
(Boutaba et al., 2018) 25
26. Data Center Networks (DCNs)
Significant bandwidth capacity
Dynamic traffic
Coexistence of many kind of flows
Small, short-lived flows mice
Few large, long-lived flows elephants
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center
Networks
26
27. Equal-Cost MultiPath (ECMP)
The most prevalent multipath routing mechanism in DCNs
Each switch splits flows over the output ports applying a
function on packet headers
E.g., a hash function on the 5-tuple header
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center
Networks
27
28. Input Port
Output 1
Output 2
ABCD
hash
AD
BC
Hot-spot
A + D << B + C
Flows
(Dixit et al., 2013) (Jain et al., 2009)
ECMP causes hot-spots in DCNs low throughput and high
delay
Multipath Routing in Software-Defined Data Center
Networks
28
29. Software-Defined Networking (SDN) for facing ECMP
limitations
A global view of the network
A logically centralized controller to dynamically make and
install routing decisions
A DCN using SDN SDDCN
(Boutaba et al., 2018)(Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center
Networks
29
30. Reactive SDN-based multipath routing
Dynamically rescheduling of elephant flows
Mouse flows are handle by default routing (e.g., ECMP)
Discriminate flows using static thresholds
Hot-spots may occur before elephants are detected
(Boutaba et al., 2018)(Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center
Networks
30
31. Proactive SDN/ML-based multicast routing
ML-based methods at the controller for proactively identifying
elephants
Central collection
Per-flow data
Sampling-based data
(Boutaba et al., 2018)(Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center
Networks
31
32. Proactive SDN/ML-based multicast routing - controller side
Central collection
Per-flow data
Heavy traffic overhead and poor scalability
Sampling-based data
Delayed and inaccurate flow information
(Boutaba et al., 2018)(Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center
Networks
32
33. NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
Incremental learning at
server-side
Proactive, accurately and timely
elephant detection
Adaptation to varying traffic
conditions
Low control overhead
Constantly updated model with
limited resources
NELLY
33
34. (Estrada-Solano et al., 2019)
NELLY (Network Elephant Learner and anaLYzer)
Architecture - High level view
34
37. NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
Packet traces UNI1 UNI2
Duration 65 min 158 min
Packets 19.85 M 100 M
IPv4 % of total traffic 98.98% (mostly TCP) 99.9% (mostly UDP)
IPv4 flows 1.02 M (TCP and UDP) 1.04 M (mostly UDP)
Datasets
37
41. Comparison
NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
Works
Learning
approach
Elephants
detection
False
elephants
Table
occupancy
Control
overhead
Detection
time
Network
modifications*
Poupart et al., 2016 Incremental Very high Very low Very high High Very short None
Tang et al., 2017 Batch High Very low Medium Medium Medium
Hardware of
switches
Chao et al., 2016
Batch and
incremental
Very high High Low Medium Medium None
Curtis et al., 2011 None Perfect None None Very low Long
Software in
servers
NELLY Incremental Very high Very low None Very low Very short
Software in
servers
* Assuming OpenFlow-based networks
41
42. Open challenges
NELLY implementation as an in-kernel software
Evaluate impact cost in servers
Processing and memory requirements
Threshold definition
NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
42
43. Open challenges
ML-based elephant rescheduler at the Controller
Predict the rate and duration of elephants
Regression or multi-classification?
How to define the classes?
Reschedule elephants based on prediction
Algorithm for allocating elephants to links
NELLY (Network Elephant Learner and anaLYzer)
43
46. Network traffic routing is fundamental in networking
Selecting a path for packet transmission
Selection criteria
Cost minimization
Maximization of link utilization
Quality of Service (QoS) provisioning
Network Routing Problem
46
47. The efficient selection of the best path through a network is still
an important issue to be addressed and improved
STATIC and DYNAMIC routing
Dynamic routing
Distance vector protocols
Link state protocols
Network Routing Problem
47
48. Distance vector protocols
Routing information is not covered by the router itself
Delay in routing information
Network overhead
Link state protocols
More memory and processing power
Understanding its operation is critical
Network Routing Problem
48
49. Dynamic routing
More complexity
Overhead on the network
Very rigid
Doesnt learn from past decisions
Network Routing Problem
49
53. Routing based on Knowledge-Defined Networking
(Casas, 2019)
Q-learning
Architecture Operation
53
54. Routing based on Knowledge-Defined Networking
Open challenges
The convergence time
Testing in real large networks
Multi-controllers
Probing of statistics collection based on RL
Collaborative learning for considering sub-networks?
Demo
54
56. 5G Scaling Problem
Mobile traffic is expected to grow exponentially
The complexity, heterogeneity and scale of networks have grown
far beyond the limits of manual administration
NFV is a 5G foundation
Scaling VNFs pertaining to Network Services and Network Slices
is a big challenge
56
57. Scaling in NFV brings cost reduction compared to oversize
capacity for workload peak
Horizontal scaling dynamically increases/reduces the number of
VNF instances
Vertical scaling increases/reduces the number of resources
allocated to one or more VNFs
Most existing solutions scale NFs assuming they operate
isolatedly
5G Scaling Problem
57
58. Work Title Technique
Carella et al.
(2016)
An extensible autoscaling engine for software-based
network functions
Threshold rules
Dutta et al. (2016) QoE-Aware elasticity support in cloud-native 5G systems Threshold rules
Bilal et al. (2016) Dynamic cloud resource scheduling in virtualized 5G
mobile systems
Time series
Alawe et al. (2018) Improving traffic forecasting for 5G core network
scalability: A machine learning approach
Neural networks
Tang et al. (2015) Efficient auto-scaling approach in the telco cloud using
self-learning algorithm
Q-Learning
Alawe et al. (2018) On the scalability of 5G core network Control theory
5G Scaling Problem
Solutions focused on scaling NFs individually -(Demo)
58
59. Solutions that consider scaling NFs as part of compound network
services
Work Title Observation
Wang et al. (2016) Online VNF scaling in data centers It assumes that all NFs must
be scaled
Zhang et al. (2016) Co-scaler: cooperative scaling of software-defined
VNF service function chain
It focuses on flow migration
and scales all NFs
Rankothge et al.
(2017)
Optimizing resource allocation for virtualized
network functions in a cloud center using genetic
algorithms
It scales only NF that is
chosen randomly
Prados-Garzon et al.
(2018)
A queuing based dynamic auto scaling algorithm for
the LTE EPC control plane
It is specific for LTE vEPC
5G Scaling Problem
59
63. Monitoring link
Commercia
l
office
Commercial
office
Data processing
center
Data link
Performance degradation
(e.g., delays, congestion, and loss)
Monitoring performance parameters of the
network as link usage, packet loss and delay
To assure the QoS
Accurate and timely statistics
Service provider
Up-to-date view of network
Probing Problem