Time Series Data Mining - from PhD to Startup
Peter Laurinec
October 27, 2018
Highlights
Time series data mining - from PhD to start-up:
• Problems and solutions when working with large numbers of long time series (TS),
• TS data mining methods,
• PhD thesis - combining and developing TS data mining methods,
• TSrepr R package - TS representations,
• Work after PhD - energy start-up,
• Differences and my thoughts,
• What we do there...
Time Series Data in the Energy Sector
Smart metering
• Measuring electricity consumption or production (photovoltaic panels) from every consumer or producer (together: prosumer) every 5, 15, or 30 minutes,
• This creates a large amount of time series data,
• 3 years of 15-minute data from one consumer: 96*365*3 = 105120 rows; from 10 thousand consumers: > 1 billion rows with multiple columns,
• Smart grid - a set of consumers and producers.
Characteristics:
• High dimensionality,
• Multiple seasonalities (daily, weekly, yearly),
• Many stochastic factors, such as weather, holidays, blackouts, market changes, etc.
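The row count quoted on this slide can be checked with a few lines of R. This is only a back-of-the-envelope sketch: it assumes a 15-minute metering interval (which gives the 96 readings per day used in the slide's formula), and all variable names are illustrative.

```r
# Back-of-the-envelope size of a smart-meter data set.
# Assumption: 15-minute metering interval, i.e. 96 readings per day.
interval_min      <- 15
readings_per_day  <- 24 * 60 / interval_min        # 96
rows_per_consumer <- readings_per_day * 365 * 3    # 3 years of data
n_consumers       <- 10000
total_rows        <- rows_per_consumer * n_consumers

print(rows_per_consumer)                   # 105120
print(format(total_rows, big.mark = ","))  # total for all consumers
```

With a 5-minute interval the same calculation gives three times as many rows, which is why dimensionality reduction becomes important later in the talk.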
Examples of Consumers TS
[Figure: electricity consumption of four example consumers, one panel each. x-axis: Time (3 weeks); y-axis: Electricity Consumption (kW)]
Typical Use Cases
• Forecasting electricity consumption or production - market planning, blackout prevention, etc.,
• Extracting typical consumption profiles - changing tariffs, creating new ones, etc.,
• Optimizing the electricity consumption of a single consumer,
• Optimizing the whole smart grid,
• Monitoring the smart grid,
• Anomaly detection.
TS Data Mining Methods
• Methods for working with TS:
  • TS representations,
  • TS distance measures,
• Tasks:
  • TS classification,
  • TS clustering,
  • TS forecasting,
  • TS anomaly detection,
  • TS indexing.
PhD Thesis Goals
• The broader goal of the thesis was to investigate how time series data mining (analysis) methods can improve the predictive performance of machine learning methods and their combinations.
• More specifically, the goal was to investigate how various representations of seasonal time series, together with clustering and forecasting methods, can improve the accuracy of electricity consumption forecasting.
Approach Overview
I. Time Series Representations
What can we do to solve problems with high-dimensional TS?
• Use time series representations!
They are an excellent way to:
• Reduce memory load.
• Accelerate subsequent machine learning algorithms.
• Implicitly remove noise from the data.
• Emphasize the essential characteristics of the data.
• Help to find patterns (motifs) in the data.
[Figure: a load time series (x-axis: Time; y-axis: Load) and two shorter representations of it (x-axis: Length; y-axis: Load)]
[Figure: a load time series (x-axis: Time; y-axis: Load) and two further representations of it (x-axis: Length; y-axis: Load)]
I. Time Series Representations [1]
I used TS representations for:
• Dimensionality reduction (curse of dimensionality),
• Emphasising the main characteristics of the data,
• More accurate clustering of consumers' TS, to create more predictable (forecastable) groups of aggregated electricity consumption TS.
[1] Laurinec P., Lucká M., Lecture Notes in Engineering and Computer Science: Proceedings of The World Congress on Engineering and Computer Science 2016.
Clustered TS Representations
[Figure: 20 panels, one per cluster (1-20), showing the clustered representations. x-axis: Length; y-axis: Regression Coefficients]
Groups of Aggregated TS
[Figure: 20 panels, one per group (1-20), showing the aggregated time series. x-axis: Time; y-axis: Normalized Load]
TSrepr
TSrepr - CRAN [2], GitHub [3]
• R package for computing time series representations
• A large number of methods are implemented
• Several useful support functions are also included
• Easy to use and to extend
    library(TSrepr)
    data <- rnorm(1000)
    # PAA with the median as aggregation function and window length q = 10
    repr_paa(data, func = median, q = 10)
[2] https://CRAN.R-project.org/package=TSrepr
[3] https://github.com/PetoLau/TSrepr/
The following time series representation methods are implemented so far:
• PAA - Piecewise Aggregate Approximation ( repr_paa )
• DWT - Discrete Wavelet Transform ( repr_dwt )
• DFT - Discrete Fourier Transform ( repr_dft )
• DCT - Discrete Cosine Transform ( repr_dct )
• PIP - Perceptually Important Points ( repr_pip )
• SAX - Symbolic Aggregate Approximation ( repr_sax )
• PLA - Piecewise Linear Approximation ( repr_pla )
• Mean seasonal profile ( repr_seas_profile )
• Model-based seasonal representations based on a linear model ( repr_lm )
• FeaClip - Feature extraction from the clipped representation ( repr_feaclip )
Additional useful functions:
• Windowing ( repr_windowing )
• Matrix of representations ( repr_matrix )
• Normalisation functions - z-score ( norm_z ), min-max ( norm_min_max )
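Usage of TSrepr
    library(TSrepr)
    # mat: numeric matrix with one time series per row.
    # Synthetic stand-in (assumption): 100 series, 3 weeks of half-hourly data.
    mat <- matrix(rnorm(100 * 48 * 21), nrow = 100)
    # Option 1: model-based representation (robust linear model,
    # daily and weekly seasonality) on z-score normalised series
    mat_reprs <- repr_matrix(mat, func = repr_lm,
                             args = list(method = "rlm", freq = c(48, 48*7)),
                             normalise = TRUE, func_norm = norm_z)
    # Option 2: FeaClip features computed per window of one day
    mat_reprs <- repr_matrix(mat, func = repr_feaclip,
                             windowing = TRUE, win_size = 48)
    # Cluster the representations into 20 groups
    clustering <- kmeans(mat_reprs, 20)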
Simple Extensibility of TSrepr
Example #1: any aggregation function can be plugged into PAA
    library(TSrepr)
    library(moments)
    data <- rnorm(1000)
    data_ts_skew <- repr_paa(data, q = 48, func = skewness)
Example #2: a custom feature extractor applied per window
    repr_fea_extract <- function(x) {
      c(mean(x), median(x), max(x), min(x), sd(x))
    }
    data_fea <- repr_windowing(data,
                               win_size = 100, func = repr_fea_extract)
II. Time Series Clustering
II. Clustering Multiple Data Streams [4]
Motivation:
• Deal with the velocity of incoming data,
• Dynamic change of the number of clusters,
• Automatic anomaly detection (anomalous consumers),
• Automatic change detection.
Approach:
• Take advantage of the incremental nature of the clipped representation (windowing),
• Fast detection of anomalous consumers using features extracted from the clipped representation,
• Change detection with the Anderson-Darling test.
[4] https://github.com/PetoLau/ClipStream/
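The clipped representation mentioned in the approach can be illustrated with a short base R sketch. A window is turned into a bit string (1 where the value is above the window mean, 0 otherwise) and summarised by run-length features. The helper `clip_features()` and its particular feature set are assumptions for illustration, not the ClipStream or TSrepr code.

```r
# Clipped (bit-level) representation and a few run-length features.
# A sketch only: the feature set and names are assumptions.
clip_features <- function(x) {
  bits <- as.integer(x > mean(x))   # 1 above the window mean, 0 otherwise
  runs <- rle(bits)                 # run-length encoding of the bit string
  ones  <- runs$lengths[runs$values == 1]
  zeros <- runs$lengths[runs$values == 0]
  c(max_1 = if (length(ones))  max(ones)  else 0,   # longest run above the mean
    sum_1 = sum(ones),                              # total time above the mean
    max_0 = if (length(zeros)) max(zeros) else 0,   # longest run below the mean
    cross = length(runs$lengths) - 1)               # number of mean crossings
}

x <- c(1, 1, 5, 6, 7, 1, 1, 1, 6, 1)
clip_features(x)

# Windowing: each window is summarised on its own, so when a new window
# arrives from the stream, earlier windows do not need to be recomputed.
sapply(split(x, ceiling(seq_along(x) / 5)), clip_features)
```

Because every window is processed independently, the representation of a stream can be extended incrementally, which is the property the approach relies on.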
III. Time Series Forecasting
A large number of methods are suitable for forecasting:
• Time series analysis methods:
  • ARIMA,
  • Exponential smoothing,
  • Theta,
• Regression methods:
  • Linear regression, GAM,
  • SVR, Gaussian process,
  • Regression trees, Bagging, Random Forest, Boosting,
  • Artificial Neural Networks.
III. Time Series Forecasting [5]
Finding the most suitable forecasting methods to use with clustering...
• STL+ARIMA, Exponential smoothing, Tree-based methods, Advanced ANNs (S2S + LSTM nets).
The problem: choosing the most suitable method from the set of methods...
Solution:
• Ensemble learning - combining forecasts.
[5] https://github.com/PetoLau/TSMedianBasedEnsembleLearning/,
https://github.com/PetoLau/UnsupervisedEnsembles/,
https://github.com/PetoLau/DensityEnsembles/
Life after PhD
• I was happy to be hired by the start-up PowereX.
• We solve problems closely related to my thesis.
PowereX
• P2P energy sharing - commodity and also capacity,
• Analysis of consumers' smart meter data,
• Forecasting and modelling maximal load (hourly, daily, etc.).
Differences between PhD and Business
PhD:
• Strong focus on accuracy measures - the Mean Absolute Percentage Error (MAPE, in %), or internal validation indexes for clustering...
• Often working with poor academic datasets.
Business:
• Finding real value for customers,
• Accuracy is not that important,
• Working with real, rich data.
But... they are also related and need each other...
Conclusions
TS data mining:
• TS representations are our friends in clustering, forecasting, classification, etc.,
• Implemented in the TSrepr package,
• PhD study is great practice before work.
Questions: Peter Laurinec laurinec.peter@gmail.com
Code: https://github.com/PetoLau/
More research: https://petolau.github.io/research
Blog: https://petolau.github.io
