�ݺ�ߣ

Walking on AI/ML for Networking

Walking on AI/ML for
Networking
Ph.D Oscar Mauricio Caicedo
Grupo de Ingeniería Telemática
Universidad del Cauca
Colombia

Outline
● Unsupervised Learning for Heavy-Hitters Threshold Estimation
● Supervised Learning for Multipath Routing in Software-Defined Data Center
Networks
● Reinforcement Learning for Dynamic and Intelligent Routing
● Research Direction
3

Unsupervised Learning for Heavy-Hitters Threshold
Estimation
4

Unsupervised Learning
Learning approach Training dataset Problems aimed
Unsupervised Unlabeled Clustering
Clustering
(Boutaba et al., 2018)
5

● Duration – time between the first and last packet of the flow
● Size – total number of bytes transmitted in the flow
● Rate – given by the size divided by the duration
● Burstiness – characterization of burst in a flow over time t
(Duque-Torres et al., 2019)
Heavy-Hitters Classification Schemes
6

● Outlier and anomaly detection
● Load balancing
● Alleviating network congestion
Need for Heavy-Hitter Flows Detection
● HH detections is far from being
obvious
● Threshold estimation is a major
challenge
7

Knowledge Discovery Process
Threshold Estimation based on Knowledge Discovery
Process
8

● Can the Knowledge Discovery Process (KDP) to shed new light
on threshold estimation in HH detection?
● Can we determine an optimal threshold or set of thresholds
that would optimally separate the flows into HHs and non-HHs
in a variety of network and traffic conditions?
Threshold Estimation based on Knowledge Discovery
Process
9

● Duration → tortoises and dragonflies
● Sizes → elephants and mice
● Rate → cheetahs and snails
● Burstiness → porcupines and stingrays
Understanding the Problem
10

11

Work Citations Year Appeared In Threshold Value
Curtis et al. 187 2011 IEEE INFOCOM Size
128KB, 1MB and
100MB
Chiesa et al. 29 2016 IEEE/ACM Trans. Netw. Bandwidth 10%
Moshref et al. 47 2013 ACM HotSDN Bandwidth 10%
Liu et al. 19 2013 IEEE Commun. Lett. Rate 1-10 Mbps
Trestain et al. 14 2013 IFIP/IEEE IM Rate Not specified
Lin et al. 9 2014 IEEE GLOBECOM Size 100 MB
Bi et al. 8 2013 IEEE GLOBECOM Size Adaptive
12

TCP
Understanding the Data
Dataset Count Average
UNIV1
UNIV2
CAIDA2016
CAIDA2018
6886055
6092
9017007
8225731
710.74
198.84
517.31
1021.20
UDP
Count Average
1453303
9869 978
920417
1618524
390.74
758.24
433.62
674.91
TOTAL
Count Average
9999617
9 99956
9994197
9982640
582.60
749.34
509.13
959
Datasets
13

Feature name Feature description Direction
min_ps Packet with minimum size in the flow F & B
max_ps Packet with maximum size in the flow F & B
avg_ps Average packet size F & B
std_ps Standard deviation of packet size F & B
min_piat Minimum packet inter-arrival time F & B
max_piat Maximum packet inter-arrival time F & B
avg_piat Average packet inter-arrival time F & B
std_piat Stand. dev. of packet inter-arrival time F & B
octTotCount Number of transmitted bytes in the flow F & B
pktTotCount Number of transmitted pkts in the flow F & B
flowDur Duration of the flow (in seconds)
Data Preparation
14

● K-Means was used due to its simplicity, speed, and accuracy
● The optimal number of clusters was defined by Silhouette method
Data Mining
Value Interpretation
Si = 1 The samples are very well clustered
0.5 < Si < 0.7 The samples have a reasonable structure for clustering
Si = 0.5 The samples are probably placed in the wrong cluster
0.26 < Si < 0.5 The samples do not have a substantial structure for clustering
Si = 0 The samples lies between two cluster
Si < 0 The samples are placed in the wrong cluster
15

Data Mining
Silhouette Analysis
16

Data Mining
● K-means along the
two first principal
components
● k = 2
Dataset Class I Class II Total
UNIV1
UNIV2
CAIDA2016
CAIDA2018
293 962
18 056
451 696
563 869
8
1
9
3
293 970
18 057
451 705
563 872
17

Data Mining
● K-means along the
two first principal
components
● k = 4
18

max min average
11 394
18 180
33 043
14 258
1
2
1
1
16.09
103.81
16.78
12.4
281 168
28 4 648
28 465
44 251
149 047
108 230
20775
14 784
219 806
194 204
25 357
23 489
85 848
118 060
66 743
14 658
14 987
32 590
47 373
2 513
29 761
61 128
57 058
5 162
79 661
40 486
24 140
190 733
1 418
8 256
904
98 909
4.42
13.10
3.52
191.33
Flow Size [MB]
K Dataset flows
I
UNIV1
UNIV2
CAIDA2016
CAIDA2018
293 709
17 979
1451 051
563 472
II
UNIV1
UNIV2
CAIDA2016
CAIDA2018
6
6
7
21
III
UNIV1
UNIV2
CAIDA2016
CAIDA2018
26
32
2
376
IV
UNIV1
UNIV2
CAIDA2016
CAIDA2018
229
36
644
3
max min average
1.88
6.50
1.34
3.73
0.000032
0.000064
0.000028
0.000029
0.00847
0.00212
0.00421
0.00948
194.11
250.30
42.65
66.22
145.39
107.610
31.14
21.45
167.37
178.96
37.60
34.08
86.80
95.63
100.05
19.81
18.15
32.53
70.93
3.76
30.35
57.94
85.49
7.46
17.64
28.87
22.66
277.62
1.93
7.081
1.35
143.45
4.42
13.10
3.52
191.33
Number of Packets Flow Duration [s]
max min average
2 643.73
455.32
20.202
22.40
0
0
0
0
6.76
33.87
3.23
2.39
2 643.13
455.29
20.19
22.39
151.97
393.25
15.8
7.04
1 061.68
428.88
15.86
19.45
2 641.51
454.52
20.20
22.40
2.27
48.34
10.19
0.26
182.62
402.615
20.19
17.25
2 643.89
454.52
20.202
22.29
1.39
48.34
0.028
21.09
507.30
402.61
15.21
21.89
Evaluating the Discovered Knowledge
19

Silhouette Analysis - Ambiguous Flows
(Class I, Ambiguous Flows) (Duque-Torres et al., 2019)
20

Ambiguous Flows (Class I)
max min average
11 394
18 180
7 601
14 258
1
2
1
1
13.05
108.81
14.7
13.53
281 168
28 4 648
28 465
44 251
149 047
108 230
20775
14 784
1 275
17 376
834
1 356.28
Flow Size [MB]
K Dataset flows
I
UNIV1
UNIV2
CAIDA2016
CAIDA2018
293 002
17 979
449 904
562 273
II
UNIV1
UNIV2
CAIDA2016
CAIDA2018
707
36
1 147
1 199
max min average
0.456
6.5
0.324
0.922
0.000032
0.000064
0.000028
0.000029
0.00629
0.0214
0.00660
0.00558
145.39
28.87
1.34
3.734
1.88
107.610
31.14
21.45
0.306
178.96
37.60
34.08
Number of Packets Flow Duration [s]
max min average
2 643.73
455.32
20.20
22.40
0
0
0
0
6.09
33.87
3.29
2.36
2 643.05
453.71
20.20
22.403
0.379
16.06
0.05
0.030
0.379
339.58
15.65
15.52
For TCP traffic
○ Flow size = 6 kB
○ Packet count = 14 21

● There is no threshold that can
unambiguously classify HH flows
● Significance of network and traffic
characteristics
○ Flow size = 6 kB
○ Packet count = 14
22

● Open challenges
○ The time required to collect enough flow details for
reliable classification requires further research
■ Investigate the usefulness of per-flow packet size
distributions for accurate classification
■ Develop a threshold-less approach for HH
classification
23

Supervised Learning for Multipath Routing in
Software-Defined Data Center Networks
24

Supervised Learning
Learning approach Training dataset Problems aimed
Supervised Labeled
Classification and regression
Semi-supervised Incomplete labels
Classification Regression
(Boutaba et al., 2018) 25

● Data Center Networks (DCNs)
○ Significant bandwidth capacity
○ Dynamic traffic
○ Coexistence of many kind of flows
■ Small, short-lived flows → mice
■ Few large, long-lived flows → elephants
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center
Networks
26

● Equal-Cost MultiPath (ECMP)
○ The most prevalent multipath routing mechanism in DCNs
○ Each switch splits flows over the output ports applying a
function on packet headers
■ E.g., a hash function on the 5-tuple header
Networks
27

Input Port
Output 1
Output 2
ABCD
hash
AD
BC
Hot-spot
A + D << B + C
Flows
● ECMP causes hot-spots in DCNs ⟶ low throughput and high
delay
Networks
28

● Software-Defined Networking (SDN) for facing ECMP
limitations
○ A global view of the network
○ A logically centralized controller to dynamically make and
install routing decisions
● A DCN using SDN → SDDCN
(Boutaba et al., 2018)(Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Networks
29

● Reactive SDN-based multipath routing
○ Dynamically rescheduling of elephant flows
○ Mouse flows are handle by default routing (e.g., ECMP)
○ Discriminate flows using static thresholds
○ Hot-spots may occur before elephants are detected
Networks
30

● Proactive SDN/ML-based multicast routing
○ ML-based methods at the controller for proactively identifying
elephants
○ Central collection
■ Per-flow data
■ Sampling-based data
Networks
31

● Proactive SDN/ML-based multicast routing - controller side
○ Central collection
■ Per-flow data
✗ Heavy traffic overhead and poor scalability
■ Sampling-based data
✗ Delayed and inaccurate flow information
Networks
32

NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
● Incremental learning at
server-side
● Proactive, accurately and timely
elephant detection
● Adaptation to varying traffic
conditions
● Low control overhead
● Constantly updated model with
limited resources
NELLY
33

Architecture - High level view
34

Analyzer
35

Learner
36

Packet traces UNI1 UNI2
Duration 65 min 158 min
Packets 19.85 M 100 M
IPv4 % of total traffic 98.98% (mostly TCP) 99.9% (mostly UDP)
IPv4 flows 1.02 M (TCP and UDP) 1.04 M (mostly UDP)
Datasets
37

● 50 incremental learning algorithms
● Metrics
○ True positive rate (TPR)
○ False positive rate (FPR)
○ Matthews Correlation Coefficient (MCC)
38

Algorithms
UNI1 UNI2
TPR (%) FPR (%) MCC TC
(µs)
Header
type
TPR (%) FPR (%) MCC TC
(µs)
Header
type
AHOT 85.97 35.52 0.327 4.07 BinNom 60.16 28.58 0.304 10.17 BinNom
ARF 82.39 28.82 0.359 12.01 BinNom 68.65 21.33 0.460 17.39 BinNom
Hoeffding tree 86.79 36.38 0.326 3.18 BinNom 57.92 28.46 0.284 4.64 BinNom
NB 74.76 35.69 0.254 4.76 BinNom 49.74 23.18 0.267 4.82 BinNom
OAUE 86.79 33.63 0.347 25.58 BinNom 63.28 28.65 0.332 33.06 BinNom
OzaBag 87.78 37.11 0.327 23.98 BinNom 64.17 31.13 0.314 36.61 BinNom
OzaBoost 75.88 29.93 0.307 11.56 BinNom 64.62 32.22 0.307 16.82 BinNom
SGD-SVM 16.76 10.21 0.067 0.81 Num 38.69 30.99 0.076 0.8 Num
In bold the top five results of TPR and MCC, and the TC
results shorter than 17.5 µs, for both UNI1 and UNI2
Accuracy and classification time with various incremental learning algorithms
39

Accuracy varying the
elephants weight
40

Comparison
Works
Learning
approach
Elephants
detection
False
elephants
Table
occupancy
Control
overhead
Detection
time
Network
modifications*
Poupart et al., 2016 Incremental Very high Very low Very high High Very short None
Tang et al., 2017 Batch High Very low Medium Medium Medium
Hardware of
switches
Chao et al., 2016
Batch and
incremental
Very high High Low Medium Medium None
Curtis et al., 2011 None Perfect None None Very low Long
Software in
servers
NELLY Incremental Very high Very low None Very low Very short
Software in
servers
* Assuming OpenFlow-based networks
41

● Open challenges
○ NELLY implementation as an in-kernel software
○ Evaluate impact cost in servers
○ Processing and memory requirements
○ Threshold definition
42

● Open challenges
○ ML-based elephant rescheduler at the Controller
■ Predict the rate and duration of elephants
■ Regression or multi-classification?
■ How to define the classes?
○ Reschedule elephants based on prediction
○ Algorithm for allocating elephants to links
43

Reinforcement Learning for Dynamic and Intelligent
Routing
44

Reinforcement Learning
(Mammeri, 2019)
RL Concept
45

● Network traffic routing is fundamental in networking
○ Selecting a path for packet transmission
● Selection criteria
○ Cost minimization
○ Maximization of link utilization
○ Quality of Service (QoS) provisioning
Network Routing Problem
46

● The efficient selection of the best path through a network is still
an important issue to be addressed and improved
STATIC and DYNAMIC routing
● Dynamic routing
○ Distance vector protocols
○ Link state protocols
47

● Distance vector protocols
○ Routing information is not covered by the router itself
○ Delay in routing information
○ Network overhead
● Link state protocols
○ More memory and processing power
○ Understanding its operation is critical
48

● Dynamic routing
○ More complexity
○ Overhead on the network
○ Very rigid
○ Doesn’t learn from past decisions
49

Motivating scenario 50

Motivating scenario 51

(Casas, 2019)
Routing based on Knowledge-Defined Networking
● Architecture based
on RL and SDN
52

(Casas, 2019)
Q-learning
Architecture Operation
53

● Open challenges
○ The convergence time
○ Testing in real large networks
○ Multi-controllers
○ Probing of statistics collection based on RL
○ Collaborative learning for considering sub-networks?
Demo
54

5G Scaling Problem
● Mobile traffic is expected to grow exponentially
● The complexity, heterogeneity and scale of networks have grown
far beyond the limits of manual administration
● NFV is a 5G foundation
● Scaling VNFs pertaining to Network Services and Network Slices
is a big challenge
56

● Scaling in NFV brings cost reduction compared to oversize
capacity for workload peak
● Horizontal scaling dynamically increases/reduces the number of
VNF instances
● Vertical scaling increases/reduces the number of resources
allocated to one or more VNFs
● Most existing solutions scale NFs assuming they operate
isolatedly
5G Scaling Problem
57

Work Title Technique
Carella et al.
(2016)
An extensible autoscaling engine for software-based
network functions
Threshold rules
Dutta et al. (2016) QoE-Aware elasticity support in cloud-native 5G systems Threshold rules
Bilal et al. (2016) Dynamic cloud resource scheduling in virtualized 5G
mobile systems
Time series
Alawe et al. (2018) Improving traffic forecasting for 5G core network
scalability: A machine learning approach
Neural networks
Tang et al. (2015) Efficient auto-scaling approach in the telco cloud using
self-learning algorithm
Q-Learning
Alawe et al. (2018) On the scalability of 5G core network Control theory
5G Scaling Problem
Solutions focused on scaling NFs individually -(Demo)
58

Solutions that consider scaling NFs as part of compound network
services
Work Title Observation
Wang et al. (2016) Online VNF scaling in data centers It assumes that all NFs must
be scaled
Zhang et al. (2016) Co-scaler: cooperative scaling of software-defined
VNF service function chain
It focuses on flow migration
and scales all NFs
Rankothge et al.
(2017)
Optimizing resource allocation for virtualized
network functions in a cloud center using genetic
algorithms
It scales only NF that is
chosen randomly
Prados-Garzon et al.
(2018)
A queuing based dynamic auto scaling algorithm for
the LTE EPC control plane
It is specific for LTE vEPC
5G Scaling Problem
59

Collaborative Learning for 5G Scaling
60

Acknowledgements
Ph.D. Armando Ordoñez
Ph.D. Student - Felipe Estrada
Ph.D. Student - Carlos Tobar
Master Student - Alejandra Duque
Master Student - Daniela Casas
Undergraduate Student - Juan Daza
Undergraduate Student - Andrés Maca

Monitoring link
Commercia
l
office
Commercial
office
Data processing
center
Data link
Performance degradation
(e.g., delays, congestion, and loss)
Monitoring performance parameters of the
network as link usage, packet loss and delay
To assure the QoS
“Accurate and timely statistics”
Service provider
Up-to-date view of network
Probing Problem

Monitoring link
Commercia
l
office
Commercial
office
Data processing
center
Data link
Performance degradation
(e.g., delays, congestion, and loss)
● It is necessary the accurate and timely collection of the
statistics of flows
Service provider
Up-to-date view of network
Probing Problem

Probing based on Knowledge-Defined Networking
● Architecture based
on RL and KDN
65

�ݺ�ߣ

Walking on AI/ML for Networking

More Related Content

Walking on AI/ML for Networking