Reactive Data Analysis with Vert.x
GERALD MÜCKE, @GMUECKE
About me
 IT Consultant & Java Specialist at DevCon5 (CH)
 Focal Areas
 Tool-assisted quality assurance
 Performance (testing, analysis, tooling)
 Operational Topics (APM, Monitoring)
 Twitter: @gmuecke
What is Big Data?
 Volume
 The quantity of generated and stored data. The size of the data determines the value
and potential insight, and whether it can actually be considered big data or not.
 Variety
 The type and nature of the data. This helps people who analyze it to effectively use
the resulting insight.
 Velocity
 In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
 Variability
 Inconsistency of the data set can hamper processes to handle and manage it.
 Veracity
 The data quality of captured data can vary greatly, affecting the accurate analysis.
https://en.wikipedia.org/wiki/Big_Data#Characteristics
Velocity
the speed at which the data is generated and processed to
meet the demands and challenges that lie in the path of
growth and development.
Fast Data Processing
 Database
 File
 Network (Stream)
The Starting Point
 The customer stores and keeps response time measurements of test runs
in a MongoDB database
 Lots of Data
 Timestamp & Value
 No Proper Visualization
What are timeseries data?
 A set of datapoints, each with a timestamp and a value
(Figure: chart of value over time)
What is MongoDB?
 MongoDB
 NoSQL database with focus on scale
 JSON as data representation
 No HTTP endpoint (TCP based Wire Protocol)
 Aggregation framework for complex queries
 Provides an Async Driver
What is Grafana?
 A Service for Visualizing Time Series Data
 Open Source
 Backend written in Go
 Frontend based on Angular
 Dashboards & Alerts
Grafana Architecture
Grafana Server
 Implemented in Go
 Persistence for Settings and Dashboards
 Offers Proxy for Service Calls
(Diagram: Browser with Angular UI and Data Source Plugins -> Proxy -> DBs)
Datasources for Grafana
Grafana Server
 Implemented in Go
 Persistence for Settings and Dashboards
 Offers Proxy for Service Calls
Data Source Plugin
 Angular
 JavaScript
(Diagram: Browser with Angular UI and Data Source Plugin -> HTTP -> Datasource)
Connect Angular Directly to Mongo?
From 2 Tier to 3 Tier
(Diagram, 2 tier: Grafana (Angular) -> MongoDB)
(Diagram, 3 tier: Grafana (Angular) -> HTTP -> Datasource Service -> Mongo Wire Protocol -> MongoDB)
Start Simple
SimpleJsonDatasource (Plugin)
3 Service Endpoints
 /search: Labels (names of available timeseries)
 /annotations: Annotations (textual markers)
 /query: Query (actual time series data)
https://github.com/grafana/simple-json-datasource
/search Format
Request
{
"target" : "select metric",
"refId" : "E"
}
Response
[
"Metric Name 1",
"Metric Name 2"
]
An array of strings
/annotations Format
Request
{ "annotation" : {
"name" : "Test",
"iconColor" : "rgba(255, 96, 96, 1)",
"datasource" : "Simple Example DS",
"enable" : true,
"query" : "{\"name\":\"Timeseries A\"}" },
"range" : {
"from" : "2016-06-13T12:23:47.387Z",
"to" : "2016-06-13T12:24:19.217Z" },
"rangeRaw" : {
"from" : "2016-06-13T12:23:47.387Z",
"to" : "2016-06-13T12:24:19.217Z"
} }
Response
[ { "annotation": {
"name": "Test",
"iconColor": "rgba(255, 96, 96, 1)",
"datasource": "Simple Example DS",
"enable": true,
"query": "{\"name\":\"Timeseries A\"}" },
"time": 1465820629774,
"title": "Marker",
"tags": [
"Tag 1",
"Tag 2" ] } ]
/query Format
Request
{ "panelId" : 1,
"maxDataPoints" : 1904,
"format" : "json",
"range" : {
"from" : "2016-06-13T12:23:47.387Z",
"to" : "2016-06-13T12:24:19.217Z" },
"rangeRaw" : {
"from" : "2016-06-13T12:23:47.387Z",
"to" : "2016-06-13T12:24:19.217Z" },
"interval" : "20ms",
"targets" : [ {
"target" : "Timeseries A",
"refId" : "A" } ] }
Response
[ { "target":"Timeseries A",
"datapoints":[
[1936,1465820629774],
[2105,1465820632673],
[4187,1465820635570],
[30001,1465820645243] ] },
{ "target":"Timeseries B",
"datapoints":[ ] }
]
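The response shape above can be produced with plain string building. The following is a minimal sketch, not the code from the talk: the class name QueryResponse and the method toJson are hypothetical, and it assumes target names need no JSON escaping. Each datapoint is written as [value, timestamp], as in the sample response.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that renders time series in the /query response shape. */
final class QueryResponse {

    /**
     * Renders each entry as {"target": name, "datapoints": [[value, timestamp], ...]}.
     * Every datapoint is a two-element array: {value, timestamp in epoch millis}.
     */
    static String toJson(Map<String, long[][]> series) {
        StringJoiner out = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, long[][]> e : series.entrySet()) {
            StringJoiner points = new StringJoiner(",", "[", "]");
            for (long[] p : e.getValue()) {
                points.add("[" + p[0] + "," + p[1] + "]");
            }
            // Assumption: the target name contains no characters that need escaping.
            out.add("{\"target\":\"" + e.getKey() + "\",\"datapoints\":" + points + "}");
        }
        return out.toString();
    }
}
```

Passing a LinkedHashMap keeps the series in insertion order, which is convenient when the response should list targets in the order they were requested.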
Structure of the Source Data
{
"_id" : ObjectId("56375bc54f3c4caedfe68aca"),
"t" : {
"eDesc" : "some description",
"eId" : "56375ae24f3c4caedfe68a07",
"name" : "some name",
"profile" : "I01",
"rnId" : "56375b694f3c4caedfe68aa0",
"rnStatus" : "PASSED",
"uId" : "anonymous"
},
"n" : {
"begin" : NumberLong("1446468494689"),
"value" : NumberLong(283)
}
}
Custom Datasource
 Should be
 Lightweight
 Fast / Performant
 Simple
Microservice?
 Options for implementation
 Java EE Microservice (e.g. WildFly Swarm)
 Spring Boot Microservice
 Vert.x Microservice
 Node.js
 ...
The Alternative Options
Node.js
 Single Threaded
 Child Worker Processes
 JavaScript Only
 Not the best choice for heavy computation
Spring / Java EE
 Multithreaded
 Clusterable
 Java Only
 Solid Workhorses, cumbersome at times
Why Vert.x?
 High Performance
 Low Memory Footprint
 Few Dependencies
 Polyglot
 Scalable
But first, some basics
Vert.x is a Library for
 Asynchronous
 Non-Blocking
 Reactive
 Polyglot
 Microservices
Asynchronous vs. Synchronous
(Photo: © Jason Lee / Reuters)
Non-blocking vs. Blocking
(Photos: © Fritz Geller-Grimm, © Dontworry)
Reactive vs. Non-Reactive
 Responsive
 Resilient
 Elastic
 Message Driven
Polyglot vs. Monoglot
(Photos: © Kjp993, © Jacquie Wingate)
Microservice vs. Monolith
(Images: The weaver; https://www.amazon.com/Wenger-16999-Swiss-Knife-Giant/dp/B001DZTJRQ)
Vert.x Concepts
Verticles
 Contain your processing code
 Provide actor-like concurrency
 Send/Receive messages
 Verticles are the unit of deployment
Event Loop
(Diagram: an event loop dispatching Events and I/O to three Verticles)
Event Loop
(Photo: Andreas Praefcke)
Event Loop and Verticles
(Photo: RokerHRO)
3rd Floor, Verticle A
2nd Floor, Verticle B
1st Floor, Verticle C
Event Bus
(Diagram: three Verticles exchanging Messages over the Eventbus)
Event Bus
 https://www.youtube.com/watch?v=Kr_4yLhIJ_I
Disclaimer: I am not affiliated with Heineken. I simply liked the commercial. Nevertheless: Drink responsibly!
Multi-Reactor
(Diagram: a CPU with four Cores and Verticles; the Eventbus connects them to another Vert.x Instance and to the Browser)
Event & Worker Verticles
(Diagram: Event Driven Verticles and Worker Verticles, each group backed by a Thread Pool)
Implementing the datasource
 HTTP Verticle
 Routing requests & sending responses
 Verticles querying the DB
 Searching timeseries labels
 Annotations
 Timeseries data points
 Optional Verticles for Post Processing
What is the challenge?
 Optimization
 Queries can be optimized
 Large datasets have to be searched, read and transported
 Source data cannot be modified vs. data redundancy
 Sizing
 How to size the analysis system without knowing the query-times?
 How to size thread pools or database pools if most of the queries will take 100 ms to 30 s?
Analysing Data from a Database
Datasource Architecture
(Diagram: HTTP Request -> HTTP Service -> Eventbus -> Timeseries, Labels and Annotations Verticles -> DB; the HTTP Service returns the HTTP Response)
Step 1 - The naive approach
 Find all datapoints within range
Datasource Architecture
(Diagram: HTTP Request -> HTTP Service -> Eventbus -> Query Database Verticle -> DB; the HTTP Service returns the HTTP Response)
Step 2 - Split Request
 Split request into chunks (#chunks = #cores)
 Use multiple Verticle instances in parallel (#instances = #cores)?
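Splitting the requested time range into one chunk per core could look like the sketch below. It is an illustration only: the class RangeSplitter and its split method are hypothetical names, and the ranges are assumed to be half-open [start, end) intervals in epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that cuts a time range into contiguous chunks. */
final class RangeSplitter {

    /**
     * Splits the half-open range [from, to) into the given number of chunks.
     * Each chunk is returned as {start, end}; chunks are contiguous and cover the range.
     */
    static List<long[]> split(long from, long to, int chunks) {
        if (chunks < 1 || to < from) {
            throw new IllegalArgumentException("need chunks >= 1 and from <= to");
        }
        List<long[]> result = new ArrayList<>(chunks);
        long length = to - from;
        for (int i = 0; i < chunks; i++) {
            // Integer arithmetic keeps neighbouring chunks gap-free:
            // the end of chunk i is computed exactly like the start of chunk i + 1.
            long start = from + length * i / chunks;
            long end = from + length * (i + 1) / chunks;
            result.add(new long[] {start, end});
        }
        return result;
    }
}
```

With #chunks = #cores, the chunk count would typically come from Runtime.getRuntime().availableProcessors(), and each chunk would be sent to one query Verticle instance.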
Datasource Architecture
(Diagram: HTTP Request -> HTTP Service -> Split/Merge Request -> Eventbus -> four parallel Query Database Verticles -> DB; the HTTP Service returns the HTTP Response)
Step 3 - Aggregate Datapoints
 Use Mongo Aggregation Pipeline
 Reduce the number of datapoints returned to the service
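The talk pushes this reduction into the Mongo aggregation pipeline. To illustrate the idea only, the sketch below performs a comparable reduction in memory in plain Java: datapoints are grouped into fixed time buckets and each bucket is replaced by its mean value. The class Downsampler, the choice of the mean as the aggregate, and the {value, timestamp} layout are assumptions for this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of reducing datapoints per time bucket. */
final class Downsampler {

    /**
     * Groups {value, timestamp} datapoints into buckets of intervalMs milliseconds
     * and returns bucket start -> mean value, ordered by time.
     */
    static TreeMap<Long, Double> averagePerBucket(long[][] datapoints, long intervalMs) {
        TreeMap<Long, long[]> sums = new TreeMap<>(); // bucket start -> {sum, count}
        for (long[] p : datapoints) {
            long bucket = Math.floorDiv(p[1], intervalMs) * intervalMs;
            long[] acc = sums.computeIfAbsent(bucket, k -> new long[2]);
            acc[0] += p[0];
            acc[1]++;
        }
        TreeMap<Long, Double> means = new TreeMap<>();
        for (Map.Entry<Long, long[]> e : sums.entrySet()) {
            means.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

Doing the same grouping inside the database has the advantage named on the slide: only one value per bucket has to be transported to the service.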
Datasource Architecture
(Diagram: HTTP Request -> HTTP Service -> Split/Merge Request -> Eventbus -> four parallel Query Database Verticles -> DB; the HTTP Service returns the HTTP Response)
Step 4 - Percentiles (CPU)
 Fetch all data
 Calculate percentiles in service
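Calculating a percentile in the service, once all values are fetched, can be sketched as follows. The slides do not say which percentile definition is used, so this example assumes the nearest-rank method; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper computing percentiles with the nearest-rank method. */
final class Percentiles {

    /**
     * Returns the smallest value such that at least p percent of all values
     * are less than or equal to it (nearest-rank definition), for 0 < p <= 100.
     */
    static long percentile(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = values.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }
}
```

Sorting makes this O(n log n) per request and requires every datapoint to be transferred to the service first, which is the cost this CPU variant of Step 4 accepts.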
Step 4 - Percentiles (DB)
 Build aggregation pipeline to calculate percentiles
 Algorithm, see http://www.dummies.com/education/math/statistics/how-to-calculate-percentiles-in-statistics/
Datasource Architecture
(Diagram: HTTP Request -> HTTP Service -> Split/Merge Request -> Eventbus -> four parallel Query Database Verticles -> DB; the HTTP Service returns the HTTP Response)
Step 5 - Postprocessing
 Apply additional computation on the result from the database
Datasource Architecture (final)
(Diagram: HTTP Request -> HTTP Service -> Split Request -> Eventbus -> four parallel Query Database Verticles reading from the DB -> four parallel Post Process Verticles -> Eventbus -> Merge Result -> HTTP Response)
Adding more stats & calculation
 Push Calculation to DB if possible
 Add more workers / nodes for complex (post-)processing
 Aggregate results before post-processing
 DB performance is king
Analysing Data from a File
Let's read a large data file
 Datafile is large (> 1GB)
 Every line of the file is a datapoint
 The first 10 characters are a timestamp
 The dataset is sorted
 The datapoints are not equally distributed
 Grafana reads ~1900 datapoints per chart request
The Challenges (pick one)
 How to randomly access 1900 datapoints without reading the entire file into memory?
  -> Index + Lazy refinement
 How to read a huge file efficiently into memory?
  -> Index + Lazy load
Let's build an index
 Indexes can be built using a tree data structure
 Node: Timestamp
 Leaf: offset position in file or the datapoint
 Red-Black Trees provide fast access
 Read/Insert O(log n)
 Space O(n)
(Figure: red-black tree, © Cburnett, Wikipedia)
 java.util.TreeMap is a red-black tree based implementation*
 TreeMap<Long,Long> index = new TreeMap<>();
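A TreeMap used this way answers the two questions the datasource needs: where is the nearest indexed datapoint at or before a timestamp, and which indexed entries fall into a time range. The wrapper class TimestampIndex below is a hypothetical sketch; only the TreeMap<Long,Long> mapping of timestamp to file offset comes from the slide.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical wrapper around the timestamp -> file offset index. */
final class TimestampIndex {

    private final TreeMap<Long, Long> index = new TreeMap<>();

    /** Registers the byte offset of the datapoint with the given timestamp. */
    void put(long timestamp, long offset) {
        index.put(timestamp, offset);
    }

    /** Offset of the last indexed datapoint at or before the timestamp, or -1 if none. */
    long offsetAtOrBefore(long timestamp) {
        Map.Entry<Long, Long> e = index.floorEntry(timestamp); // O(log n) lookup
        return e == null ? -1L : e.getValue();
    }

    /** Indexed entries with from <= timestamp < to, in ascending time order. */
    SortedMap<Long, Long> range(long from, long to) {
        return index.subMap(from, true, to, false);
    }
}
```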
How to build an index (fast)?
 Read datapoints from offset positions
 Build a partial index
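One way to build such a partial index is to probe the file at evenly spaced byte offsets and record the first complete line found after each probe. The sketch below is an assumption-laden illustration, not the talk's implementation: PartialIndexBuilder is a hypothetical name, and the 10-character timestamp is assumed to be a decimal number (for example epoch seconds) at the start of each line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.TreeMap;

/** Hypothetical builder for a sparse timestamp -> file offset index. */
final class PartialIndexBuilder {

    /** Probes the file at evenly spaced offsets and indexes one line per probe. */
    static TreeMap<Long, Long> build(Path file, int samples) throws IOException {
        TreeMap<Long, Long> index = new TreeMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            for (int i = 0; i < samples; i++) {
                long probe = length * i / samples;
                raf.seek(probe);
                if (probe > 0) {
                    // We probably landed mid-line: skip ahead to the next line start.
                    raf.readLine();
                }
                long lineStart = raf.getFilePointer();
                String line = raf.readLine();
                if (line == null || line.length() < 10) {
                    continue; // end of file or malformed line: nothing to index
                }
                // Assumption: the first 10 characters are a decimal timestamp.
                index.put(Long.parseLong(line.substring(0, 10)), lineStart);
            }
        }
        return index;
    }
}
```

Only one line per probe is read, so the startup cost depends on the number of samples rather than on the file size.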
On next query
 Locate Block
 Refine Block
 Update Index
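The three steps above can be sketched as a loop: locate the block between the two neighbouring index entries, probe its middle, add the datapoint found there to the index, and repeat until the block is small enough to scan. This is one possible reading of "lazy refinement", not necessarily the author's algorithm; IndexRefiner, locateBlock and the maxBlockBytes threshold are hypothetical, and lines are assumed to start with a 10-digit decimal timestamp in a sorted file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical lazy refinement of a sparse timestamp -> offset index. */
final class IndexRefiner {

    /**
     * Returns the start offset of a block of at most maxBlockBytes that contains
     * the timestamp, adding index entries on the way (bisection over the file).
     */
    static long locateBlock(RandomAccessFile raf, TreeMap<Long, Long> index,
                            long timestamp, long maxBlockBytes) throws IOException {
        while (true) {
            // Locate block: neighbouring index entries around the timestamp.
            Map.Entry<Long, Long> lo = index.floorEntry(timestamp);
            Map.Entry<Long, Long> hi = index.higherEntry(timestamp);
            long start = lo == null ? 0L : lo.getValue();
            long end = hi == null ? raf.length() : hi.getValue();
            if (end - start <= maxBlockBytes) {
                return start;
            }
            // Refine block: read the first full line after the block's midpoint.
            raf.seek(start + (end - start) / 2);
            raf.readLine(); // skip the (possibly partial) line we landed in
            long lineStart = raf.getFilePointer();
            String line = raf.readLine();
            if (line == null || line.length() < 10 || lineStart >= end) {
                return start; // block cannot be narrowed any further
            }
            // Update index: the block shrinks on the next iteration.
            index.put(Long.parseLong(line.substring(0, 10)), lineStart);
        }
    }
}
```

Entries added during one query stay in the index, so later queries for nearby ranges need fewer file probes. The threshold corresponds to the block size listed among the tradeoffs.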
Datasource Architecture (again)
(Diagram: HTTP Request -> HTTP Service -> Split Request -> Eventbus -> four parallel Read File Verticles reading from the Dataset -> four parallel Post Process Verticles -> Eventbus -> Merge Result -> HTTP Response)
Tradeoffs
 Block Size
 Index Size
 Startup Time
 Heap Size
 Request Size
Takeaways
 Vert.x is
 Reactive, Non-Blocking, Asynchronous, Scalable
 Running on the JVM
 Polyglot
 Fun
 Valid Choice for Data Stream Processing
Source code on:
https://github.com/gmuecke/grafana-vertx-datasource
Thank you!
FEEDBACK APPRECIATED!