This document discusses building a custom data source plugin for Grafana using Vert.x to analyze timeseries data stored in a MongoDB database or large data files. It describes the challenges of efficiently querying large datasets and processing the results. Various approaches are presented, starting with a naive approach, and improving through techniques like splitting queries, aggregating results, using database aggregation pipelines, lazy loading of file indexes, and post-processing. The final architecture employs Vert.x concepts like verticles, event loops and the event bus to asynchronously process queries in parallel and aggregate the results.
2. @gmuecke
About me
IT Consultant & Java Specialist at DevCon5 (CH)
Focal Areas
Tool-assisted quality assurance
Performance (-testing, -analysis, -tooling)
Operational Topics (APM, Monitoring)
Twitter: @gmuecke
3. @gmuecke
What is Big Data?
Volume
The quantity of generated and stored data. The size of the data determines the value
and potential insight, and whether it can actually be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to effectively use
the resulting insight.
Velocity
In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
Variability
Inconsistency of the data set can hamper processes to handle and manage it.
Veracity
The data quality of captured data can vary greatly, affecting the accurate analysis.
https://en.wikipedia.org/wiki/Big_Data#Characteristics
Velocity
the speed at which the data is generated and processed to
meet the demands and challenges that lie in the path of
growth and development.
5. @gmuecke
The Starting Point
Customer stores response time measurements of test runs
in MongoDB
Lots of Data
Timestamp & Value
No Proper Visualization
7. @gmuecke
What is MongoDB?
MongoDB
NoSQL database with focus on scale
JSON as data representation
No HTTP endpoint (TCP based Wire Protocol)
Aggregation framework for complex queries
Provides an Async Driver
8. @gmuecke
What is Grafana?
A Service for Visualizing Time Series Data
Open Source
Backend written in Go
Frontend based on Angular
Dashboards & Alerts
9. @gmuecke
Grafana Architecture
Grafana Server
Implemented in Go
Persistence for Settings and Dashboards
Offers Proxy for Service Calls
Browser
Angular UI with Data Source Plugins
[Diagram: Browser, Grafana Server with Proxy, two databases (DB)]
10. @gmuecke
Datasources for Grafana
Grafana Server
Implemented in Go
Persistence for Settings and Dashboards
Offers Proxy for Service Calls
Browser
Angular UI
Data Source Plugin (Angular, JavaScript)
Datasource
[Diagram: Data Source Plugin and Datasource connected via HTTP]
12. @gmuecke
From 2 Tier to 3 Tier
2 Tier: Grafana (Angular) -> MongoDB
3 Tier: Grafana (Angular) -> HTTP -> Datasource Service -> Mongo Wire Protocol -> MongoDB
13. @gmuecke
Start Simple
SimpleJsonDatasource (Plugin)
3 Service Endpoints
/search: Labels (names of available timeseries)
/annotations: Annotations (textual markers)
/query: Query (actual time series data)
https://github.com/grafana/simple-json-datasource
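To make the /query endpoint more concrete, the sketch below renders a series in the JSON shape described in the simple-json-datasource README: an array of objects, each with a target name and a datapoints list of [value, timestamp in ms] pairs. The class and method names are hypothetical, and the JSON is assembled by hand only to keep the example free of dependencies; a real service would use a JSON library.

```java
import java.util.List;
import java.util.stream.Collectors;

/** One measurement: a value and its timestamp in epoch milliseconds. */
final class DataPoint {
    final double value;
    final long timestampMs;

    DataPoint(double value, long timestampMs) {
        this.value = value;
        this.timestampMs = timestampMs;
    }
}

/** Hypothetical helper that renders series for a /query response. */
final class QueryResponse {

    /**
     * Renders one series as {"target":"...","datapoints":[[value,ts],...]}.
     * Assumption: the target name contains no characters that need JSON escaping.
     */
    static String series(String target, List<DataPoint> points) {
        String datapoints = points.stream()
                .map(p -> "[" + p.value + "," + p.timestampMs + "]")
                .collect(Collectors.joining(","));
        return "{\"target\":\"" + target + "\",\"datapoints\":[" + datapoints + "]}";
    }

    /** Wraps already rendered series in the top-level JSON array. */
    static String body(List<String> renderedSeries) {
        return "[" + String.join(",", renderedSeries) + "]";
    }
}
```

Each requested target becomes one entry of the top-level array, so a dashboard panel with several targets receives several series in one response.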
19. @gmuecke
Microservice?
Options for implementation
Java EE Microservice (e.g. WildFly Swarm)
Spring Boot Microservice
Vert.x Microservice
Node.js
...
20. @gmuecke
The Alternative Options
Node.js
Single Threaded
Child Worker Processes
JavaScript Only
Not the best choice for heavy computation
Spring / Java EE
Multithreaded
Clusterable
Java Only
Solid workhorses, cumbersome at times
23. @gmuecke
Vert.x is a
Library for
Asynchronous
Non-Blocking
Reactive
Polyglot
Microservices
37. @gmuecke
Event Bus
https://www.youtube.com/watch?v=Kr_4yLhIJ_I
Disclaimer: I am not affiliated with Heineken. I simply liked the commercial.
Nevertheless: Drink responsibly!
41. @gmuecke
Implementing the datasource
HTTP Verticle
Routing requests & sending responses
Verticles querying the DB
Searching timeseries labels
Annotations
Timeseries data points
Optional Verticles for Post Processing
42. @gmuecke
What is the challenge?
Optimization
Queries can be optimized
Large datasets have to be searched, read and transported
Source data cannot be modified vs. data redundancy
Sizing
How to size the analysis system without knowing the query times?
How to size thread pools or database pools if most of the queries will
take between 100 ms and 30 s?
54. @gmuecke
Step 5 - Postprocessing
Apply additional computation on the result from the database
55. @gmuecke
Datasource Architecture (final)
[Diagram components: HTTP Request, HTTP Service, Split Request, Eventbus, Query Database (x4, against the DB), Post Process (x4, on CPU), Eventbus, Merge Result, HTTP Response]
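The split, query and merge steps of this architecture can be sketched in plain Java. This is not Vert.x code: CompletableFuture and a thread pool stand in for verticles and the event bus, and the RangeQuery interface is a hypothetical placeholder for a database query over one time slice. All names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a database query over the half-open range [from, to). */
interface RangeQuery {
    List<long[]> fetch(long from, long to); // each entry: {timestamp, value}
}

final class SplitMerge {

    /** Splits [from, to) into at most `parts` contiguous, non-overlapping sub-ranges. */
    static List<long[]> split(long from, long to, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long step = Math.max(1, (to - from + parts - 1) / parts); // ceiling division
        for (long start = from; start < to; start += step) {
            ranges.add(new long[] {start, Math.min(start + step, to)});
        }
        return ranges;
    }

    /** Runs one query per sub-range in parallel and concatenates the results in time order. */
    static List<long[]> query(RangeQuery db, long from, long to, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<CompletableFuture<List<long[]>>> futures = new ArrayList<>();
            for (long[] r : split(from, to, parts)) {
                futures.add(CompletableFuture.supplyAsync(() -> db.fetch(r[0], r[1]), pool));
            }
            List<long[]> merged = new ArrayList<>();
            for (CompletableFuture<List<long[]>> f : futures) {
                merged.addAll(f.join()); // sub-ranges are ordered, so merging is a concatenation
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the sub-ranges do not overlap and are joined in order, the merged result stays sorted by time without an extra sort step. In a Vert.x implementation the blocking join would be replaced by asynchronous replies over the event bus.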
56. @gmuecke
Adding more stats & calculation
Push Calculation to DB if possible
Add more workers / node for complex (post-) processing
Aggregate results before post-processing
DB performance is king
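One way to aggregate results before post-processing is to average them into fixed-width time buckets, so that later stages handle fewer points. The sketch below is an assumption about what such an aggregation could look like, not the implementation from the talk; the class name and the bucket-average strategy are chosen for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

final class Downsampler {

    /**
     * Averages values into buckets of `bucketMs` width.
     * timestamps[i] belongs to values[i]; the result maps each non-empty
     * bucket's start timestamp to the average of the values that fell into it.
     */
    static TreeMap<Long, Double> average(long[] timestamps, double[] values, long bucketMs) {
        TreeMap<Long, double[]> acc = new TreeMap<>(); // bucket start -> {sum, count}
        for (int i = 0; i < timestamps.length; i++) {
            long bucket = Math.floorDiv(timestamps[i], bucketMs) * bucketMs;
            double[] sumAndCount = acc.computeIfAbsent(bucket, k -> new double[2]);
            sumAndCount[0] += values[i];
            sumAndCount[1] += 1;
        }
        TreeMap<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }
}
```

Where the database can do this grouping itself, pushing it there follows the first bullet above; the in-process variant is only useful for data that has already left the database.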
58. @gmuecke
Let's read a large data file
Data file is large (> 1 GB)
Every line of the file is a datapoint
The first 10 characters are a timestamp
The dataset is sorted
The datapoints are not equally distributed
Grafana reads ~1900 datapoints per chart request
59. @gmuecke
The Challenges (pick one)
How to randomly access 1900 datapoints without reading the entire file into memory?
Index + Lazy refinement
How to read a huge file efficiently into memory?
Index + Lazy load
60. @gmuecke
Let's build an index
Indexes can be built using a tree data structure
Node: Timestamp
Leaf: offset position in file or the datapoint
Red-Black Trees provide fast access
Read/Insert O(log n)
Space O(n)
© Cburnett, Wikipedia
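A sparse index of this kind can be sketched with java.util.TreeMap, which the JDK implements as a red-black tree. The sketch assumes that the 10-character timestamp is a decimal number (for example epoch seconds) and that every line is well formed; the class name, the every-Nth-line sampling and the method names are hypothetical and not taken from the talk's source code.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FileIndex {
    private final TreeMap<Long, Long> offsets = new TreeMap<>(); // timestamp -> byte offset of its line
    private final String path;

    /** Scans the file once and remembers the offset of every Nth line. */
    FileIndex(String path, int everyNthLine) throws IOException {
        this.path = path;
        // Note: RandomAccessFile.readLine is unbuffered; a real implementation would read in blocks.
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long offset = 0;
            long lineNo = 0;
            String line;
            while ((line = file.readLine()) != null) {
                if (lineNo % everyNthLine == 0) {
                    offsets.putIfAbsent(timestampOf(line), offset);
                }
                offset = file.getFilePointer();
                lineNo++;
            }
        }
    }

    /** Number of index entries. */
    int size() {
        return offsets.size();
    }

    /** Returns all lines whose timestamp lies in [from, to] without reading the whole file. */
    List<String> read(long from, long to) throws IOException {
        List<String> result = new ArrayList<>();
        // Closest indexed line strictly before `from`, so equal timestamps are not skipped.
        Map.Entry<Long, Long> start = offsets.lowerEntry(from);
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(start == null ? 0 : start.getValue());
            String line;
            while ((line = file.readLine()) != null) {
                long ts = timestampOf(line);
                if (ts > to) break;            // file is sorted, nothing further can match
                if (ts >= from) result.add(line);
            }
        }
        return result;
    }

    private static long timestampOf(String line) {
        return Long.parseLong(line.substring(0, 10));
    }
}
```

The sampling interval trades index size against the number of lines scanned per lookup. Lazy refinement, as named on the previous slide, would start with a coarse interval and add entries for regions that are actually queried.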
65. @gmuecke
Datasource Architecture (again)
[Diagram components: HTTP Request, HTTP Service, Split Request, Eventbus, Read File (x4, against the Dataset), Post Process (x4, on CPU), Eventbus, Merge Result, HTTP Response]
67. @gmuecke
Takeaways
Vert.x is
Reactive, Non-Blocking, Asynchronous, Scalable
Running on JVM
Polyglot
Fun
Valid Choice for Data Stream Processing
Source code on:
https://github.com/gmuecke/grafana-vertx-datasource