Analytics is the divination of signal from apparent noise, the discovery of behavior and patterns. Analytics is currently used as the input to operational databases for real-time decision making. This is commonly done with MapReduce and/or Complex Event Processing. This presentation is an introduction to analytics and discusses real-time analytics using Aerospike's indexed MapReduce.
Real Time Analytics
1. IN-MEMORY NOSQL
REAL-TIME ANALYTICS ON A
HIGH PERFORMANCE
DATABASE PLATFORM
PETER MILNE
DIRECTOR OF APPLICATION ENGINEERING
BIG DATA STRATEGY
VILNIUS
MAY 2014
Aerospike aer·o·spike [air-oh-spahyk]
noun, 1. tip of a rocket that enhances speed and stability
© 2014 Aerospike, Inc. All rights reserved. Confidential. | Big Data Strategy, Vilnius May 2014 | 1
2. Where is the wisdom we have lost in
knowledge? Where is the knowledge we
have lost in information?
- T. S. Eliot, 1934
What is: Analytics?
Analytics is:
Finding Knowledge in Information
Finding Signal in Noise
3. Signal in Noise
We see/hear noise
But it's all signal
4. How do we find Signal in Noise?
Smoothing & Filtering
Classification
Similarity
Dimension Reduction
Aggregation
Voodoo + Alchemy (just joking)
5. Smoothing & Filtering
6. Smoothing
Moving average - each point in
the signal is replaced with the
average of m adjacent points,
where m is a positive integer
called the smooth width
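The moving average described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `moving_average` is hypothetical, and it assumes that only full windows are averaged, so the output is shorter than the input by m - 1 points.

```python
def moving_average(signal, m):
    """Replace each point with the mean of m adjacent points (smooth width m).

    Assumption: only complete windows are used, so len(result) == len(signal) - m + 1.
    """
    if m < 1:
        raise ValueError("smooth width m must be a positive integer")
    if m > len(signal):
        return []
    window_sum = sum(signal[:m])
    smoothed = [window_sum / m]
    for i in range(m, len(signal)):
        # Slide the window: add the new point, drop the oldest one.
        window_sum += signal[i] - signal[i - m]
        smoothed.append(window_sum / m)
    return smoothed


print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

A larger m removes more noise but also blurs real changes in the signal.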
7. Filtering
Pass Filters - High, Low, Band,
Butterworth, etc.
Recursive Filters - Kalman Filter
(also known as linear quadratic estimation)
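As an illustration of a recursive filter, here is a sketch of a scalar Kalman filter in Python. It assumes the simplest possible model (a roughly constant value observed with noise); the function name `kalman_1d` and the default variances are hypothetical choices for the example, not values from the talk.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5):
    """Scalar Kalman filter for a nearly constant value (assumed model)."""
    if not measurements:
        return []
    estimate = measurements[0]   # start from the first observation
    error_var = 1.0              # assumed initial uncertainty
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the value is assumed unchanged, uncertainty grows a little.
        error_var += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


print(kalman_1d([1.0, 3.0, 1.0, 3.0]))
```

Each new estimate depends only on the previous estimate and the latest measurement, which is what makes the filter recursive.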
8. Classification
9. Cluster Analysis
Cluster analysis or clustering is the task of
grouping a set of objects in such a way
that objects in the same group (called a
cluster) are more similar (in some sense
or another) to each other than to those in
other groups (clusters).
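The slide does not name a particular clustering algorithm. As one common example, the sketch below uses k-means on one-dimensional values; the function name `kmeans_1d`, the starting centers, and the fixed iteration count are assumptions made for the illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group values around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest center ("more similar" = smaller distance).
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```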
10. Naïve Bayes Classifier
A naive Bayes classifier is a simple
probabilistic classifier based on applying
Bayes' theorem with strong (naive)
independence assumptions. - Wikipedia
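A minimal sketch of a naive Bayes classifier for categorical features follows. The function names, the add-one (Laplace) smoothing, and the tiny weather/time-of-day data set are all hypothetical and chosen only to show how the independence assumption turns into a product of per-feature probabilities.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (features_tuple, label) pairs."""
    label_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    vocab = defaultdict(set)               # position -> distinct values seen
    for features, label in samples:
        for pos, value in enumerate(features):
            feature_counts[(label, pos)][value] += 1
            vocab[pos].add(value)
    return label_counts, feature_counts, vocab


def predict(model, features):
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label) + sum of log P(feature | label): the "naive" independence assumption.
        score = math.log(count / total)
        for pos, value in enumerate(features):
            seen = feature_counts[(label, pos)][value]
            score += math.log((seen + 1) / (count + len(vocab[pos])))  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (weather, time of day) -> flight outcome.
model = train([
    (("storm", "evening"), "late"),
    (("storm", "morning"), "late"),
    (("clear", "morning"), "on_time"),
    (("clear", "evening"), "on_time"),
    (("clear", "morning"), "on_time"),
])
print(predict(model, ("storm", "morning")))  # late
print(predict(model, ("clear", "morning")))  # on_time
```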
11. Similarity
12. Cosine Similarity
Cosine similarity is a measure of
similarity between two vectors of
an inner product space that measures
the cosine of the angle between
them. ... Cosine similarity is
particularly used in positive space,
where the outcome is neatly
bounded in [0,1]. - Wikipedia
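The definition above is the dot product of the two vectors divided by the product of their lengths. A short Python sketch (the function name `cosine_similarity` is illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # close to 1.0 (same direction)
```

Because only the angle matters, two vectors with the same direction but different magnitudes are treated as fully similar.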
13. Dynamic Time Warping - DTW
In time series analysis, dynamic time
warping (DTW) is an algorithm for
measuring similarity between two
temporal sequences which may vary in
time or speed.
A well known application has been
automatic speech recognition, to cope
with different speaking speeds.
Other applications include speaker
recognition and online signature
recognition. Also it is seen that it can be
used in partial shape matching
application. - Wikipedia
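DTW is usually computed with dynamic programming. The sketch below is the textbook formulation with absolute difference as the local cost; the function name `dtw_distance` and the choice of cost function are assumptions for the example.

```python
def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best alignment cost of s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of: stretch s, stretch t, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0 (same shape, different speed)
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

The first pair differs only in timing, so the warped distance is zero even though the sequences have different lengths.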
14. Dimension Reduction
15. Principal Component Analysis - PCA
The PCA method finds the directions with
the greatest variance in the data, called
principal components.
Eigenfaces - Facial recognition, OpenCV
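To make "direction of greatest variance" concrete, the sketch below finds the first principal component with power iteration on the covariance matrix, in plain Python. The function name and the use of power iteration are assumptions for the illustration; production code would normally use a linear algebra library. Power iteration can also fail if the starting vector happens to be orthogonal to the answer.

```python
import math


def first_principal_component(points, iterations=100):
    """Unit vector along the direction of greatest variance (power iteration)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - means[k] for k in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim  # assumed starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Points lying on the line y = 2x: the principal direction is proportional to (1, 2).
pc = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(pc)
```

Projecting each point onto this vector reduces the two-dimensional data to one dimension while keeping all of its variance in this example.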
16. Linear Discriminant Analysis - LDA
Linear discriminant analysis (LDA) and
the related Fisher's linear discriminant
are methods used in statistics, pattern
recognition and machine learning to find
a linear combination of features which
characterizes or separates two or more
classes of objects or events. The resulting
combination may be used as a linear
classifier, or, more commonly, for
dimensionality reduction before later
classification. - Wikipedia
Bankruptcy prediction
Facial recognition
Marketing
17. My Brain is Full
18. Technologies
19. Complex Event Processing (CEP)
Storm
20. Map Reduce
Distributed Database Cluster
21. Hadoop
Large
Powerful
Capable
Methodical
Batch
Input: HDFS - Petabytes of raw data
Output: NoSQL - Signal from Noise
22. HOT ANALYTICS In Real-time
23. Key Challenges
Handle extremely high rates of read/write transactions with
concurrent real-time analytics
Avoid hot spots
On a node
An index
A key
Pre-qualify data to be processed in Map Reduce
Maximize parallelism
Minimize programmer complexity
In Realtime
24. Aerospike Architecture
1) Shared Nothing Architecture, every node identical
2) No hotspots - DHT with RIPEMD160
3) Single row ACID - synch replication within cluster
4) Real-time prioritization of transactions + long running tasks
5) Smart Cluster - Zero touch auto fail-over, rebalancing, rolling upgrades
6) Smart Client - 1 hop to data, no load balancers
25. Queries + User Defined Functions = Real-time Analytics
STREAM
AGGREGATIONS
(INDEXED MAP-REDUCE)
Pipe Query results
through UDFs
Filter, Transform,
Aggregate.. Map,
Reduce
User Defined
Functions (UDFs)
for real-time analytics
and aggregations
26. Conceptual Stream Processing
Output of a query is a Stream
Stream flows through
Filter
Mapper
Aggregator
Reducer
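The conceptual pipeline above can be mimicked in Python with generators. This is only a model of the idea, not Aerospike's API: the function names `run_on_node` and `run_query` are hypothetical, and each inner list stands in for the query results held on one node.

```python
from functools import reduce


def run_on_node(records, keep, mapper, aggregator, initial):
    """Filter -> map -> aggregate over one node's share of the query results."""
    stream = (r for r in records if keep(r))   # Filter
    stream = (mapper(r) for r in stream)       # Mapper
    return reduce(aggregator, stream, initial) # Aggregator


def run_query(node_results, keep, mapper, aggregator, initial, reducer):
    """Run the stream on every node, then reduce the partial results."""
    partials = [run_on_node(records, keep, mapper, aggregator, initial)
                for records in node_results]
    return reduce(reducer, partials) if partials else initial  # Reducer


# Sum of squares of the even numbers, spread over two simulated nodes.
total = run_query(
    node_results=[[1, 2, 3, 4], [5, 6]],
    keep=lambda r: r % 2 == 0,
    mapper=lambda r: r * r,
    aggregator=lambda acc, x: acc + x,
    initial=0,
    reducer=lambda a, b: a + b,
)
print(total)  # 56
```

The point of the split is that filtering, mapping and aggregating happen next to the data, so only small partial results travel to the final reduce step.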
27. Hot Analytics Scenario - Airline Late Flights
Data
Airline flights in the USA January 2012
1,050,000 flight records
Task
On a specific date
Which Airline had late flights?
How many flights?
How many were late?
Percentage late flights?
Performance Requirements
Results in < 1 Sec
No impact on production transaction performance (300K TPS)
GitHub Repo - https://github.com/aerospike/flights-analytics
28. Solution
Index the flight records by Date
Aggregate (Map) late flight data on node
Reduce flight data from each node in the
client
User Defined Functions (UDFs) written in
Lua
Registered with the Aerospike Cluster
Invoked as part of a secondary index
query
Indexed Map Reduce
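The talk implements this with Lua UDFs invoked by a secondary index query. The Python sketch below only simulates the data flow: per-node aggregation followed by a reduce in the client. The field names `carrier` and `arr_delay_min`, the rule that a flight is late when its arrival delay is greater than zero, and the sample records are all hypothetical; the records are assumed to be already selected by date through the index.

```python
from functools import reduce


def map_late_flights(records):
    """Per-node aggregation: airline -> counts of flights and late flights."""
    stats = {}
    for rec in records:
        entry = stats.setdefault(rec["carrier"], {"flights": 0, "late": 0})
        entry["flights"] += 1
        if rec["arr_delay_min"] > 0:  # assumed definition of "late"
            entry["late"] += 1
    return stats


def reduce_stats(a, b):
    """Client-side reduce: merge the per-airline counts from two nodes."""
    merged = {carrier: dict(counts) for carrier, counts in a.items()}
    for carrier, counts in b.items():
        entry = merged.setdefault(carrier, {"flights": 0, "late": 0})
        entry["flights"] += counts["flights"]
        entry["late"] += counts["late"]
    return merged


def percent_late(stats):
    return {c: 100.0 * s["late"] / s["flights"] for c, s in stats.items()}


# Hypothetical records for one date, as held on two nodes.
node_a = [{"carrier": "AA", "arr_delay_min": 12},
          {"carrier": "AA", "arr_delay_min": 0},
          {"carrier": "DL", "arr_delay_min": -3}]
node_b = [{"carrier": "AA", "arr_delay_min": 5},
          {"carrier": "DL", "arr_delay_min": 20}]

totals = reduce(reduce_stats, [map_late_flights(node_a), map_late_flights(node_b)])
print(totals)
print(percent_late(totals))
```

Each node returns one small table per airline rather than its raw flight records, which keeps the client-side reduce cheap.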
29. Prepare and execute a Query
30. Aggregation (Map) function
31. Reduce function
32. Stream Function (StreamUDF)
33. Operations (300k TPS) + Analytics (Indexed Map/Reduce)
Java App calculates
% of late flights by Airline
300k TPS Operations +
Process 1 Million
records
Indexed Map/Reduce
Aggregations
Distributed Queries + UDF
Runs in 0.5 seconds
34. Operational + Analytics + Adding servers and Re-balancing
300k TPS Operations +
Process 1 Million
records
Runs in 0.5 seconds
Add servers,
auto-rebalance
while running query
Cluster
3 Nodes
Hex core 3.4 GHz
RAM 24GB
2x Micron p420m SSDs
10GB network
35. Books
36. Software
Aerospike
http://www.aerospike.com/free-aerospike-3-community-edition/
Tools
Eclipse - http://www.eclipse.org/
Lua Plugin - http://www.eclipse.org/koneki/ldt/
Aerospike Plugin - https://github.com/aerospike/eclipse-tools
Example
Flight Analytics - https://github.com/aerospike/flights-analytics