IN-MEMORY NOSQL 
REAL-TIME ANALYTICS ON A 
HIGH PERFORMANCE 
DATABASE PLATFORM 
PETER MILNE 
DIRECTOR OF APPLICATION ENGINEERING 
BIG DATA STRATEGY 
VILNIUS 
MAY 2014 
Aerospike aer·o·spike [air-oh-spahyk]
noun, 1. tip of a rocket that enhances speed and stability 
© 2014 Aerospike, Inc. All rights reserved. Confidential. | Big Data Strategy, Vilnius, May 2014
Where is the wisdom we have lost in 
knowledge? Where is the knowledge we 
have lost in information? 
- T. S. Eliot, 1934
What is: Analytics? 
Analytics is: 
 Finding Knowledge in Information 
 Finding Signal in Noise
Signal in Noise 
We see/hear noise 
But it's all signal
How do we find Signal in Noise 
 Smoothing & Filtering 
 Classification 
 Similarity 
 Dimension Reduction 
 Aggregation 
 Voodoo + Alchemy (just joking) 
Smoothing & Filtering 
Smoothing 
 Moving average - each point in 
the signal is replaced with the 
average of m adjacent points, 
where m is a positive integer 
called the smooth width
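The moving average described above can be sketched in a few lines of Python. This is only an illustration: the function name `moving_average`, the restriction to odd `m`, and the choice to shrink the window at the edges of the signal are assumptions, not something the slides specify.

```python
def moving_average(signal, m):
    """Replace each point with the mean of the m points centred on it.

    Assumption: m is odd so the window is symmetric, and the window is
    simply truncated at the start and end of the signal.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("this sketch expects m to be a positive odd integer")
    half = m // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


noisy = [1.0, 9.0, 2.0, 8.0, 3.0]
print(moving_average(noisy, 3))
```

A larger smooth width removes more noise but also flattens genuine short-lived features in the signal.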
Filtering 
 Pass Filters: High, Low, Band, Butterworth, etc.
 Recursive Filters: Kalman Filter (linear quadratic estimation)
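As a rough illustration of a recursive filter, here is a scalar Kalman filter that tracks a slowly changing value from noisy readings. The function name `kalman_1d` and the default variances are assumptions chosen for the example, and a real application would model its own process and measurement noise.

```python
def kalman_1d(measurements, process_var=1e-5, measurement_var=0.1):
    """Scalar Kalman filter for a (nearly) constant hidden value.

    Assumptions: the first measurement seeds the estimate, and the
    initial estimate error is 1.0.
    """
    estimate = measurements[0]
    error = 1.0
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the value is assumed constant, uncertainty grows slightly.
        error += process_var
        # Update: blend the prediction with the new measurement.
        gain = error / (error + measurement_var)
        estimate += gain * (z - estimate)
        error *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


print(kalman_1d([1.0, 1.2, 0.8, 1.1]))
```

Each step needs only the previous estimate and its error, which is what makes the filter recursive and cheap to run on a live stream.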
Classification 
Cluster Analysis 
Cluster analysis or clustering is the task of 
grouping a set of objects in such a way 
that objects in the same group (called a 
cluster) are more similar (in some sense 
or another) to each other than to those in 
other groups (clusters). 
Naïve Bayes Classifier
A naive Bayes classifier is a simple 
probabilistic classifier based on applying 
Bayes' theorem with strong (naive) 
independence assumptions. - Wikipedia 
Similarity
Cosine Similarity 
Cosine similarity is a measure of 
similarity between two vectors of 
an inner product space that measures 
the cosine of the angle between 
them. Cosine similarity is 
particularly used in positive space, 
where the outcome is neatly 
bounded in [0,1]. - Wikipedia
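The definition above translates directly into code: the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity` and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors -> 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, close to 1
```

Because only the angle matters, two vectors that point the same way score as similar even when their magnitudes differ.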
Dynamic Time Warping - DTW 
In time series analysis, dynamic time 
warping (DTW) is an algorithm for 
measuring similarity between two 
temporal sequences which may vary in 
time or speed. 
 
A well known application has been 
automatic speech recognition, to cope 
with different speaking speeds. 
 
Other applications include speaker 
recognition and online signature 
recognition. Also it is seen that it can be 
used in partial shape matching 
application. - Wikipedia 
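A minimal sketch of the classic dynamic-programming form of DTW follows. It assumes one-dimensional numeric sequences and absolute difference as the local cost; the function name `dtw_distance` is illustrative.

```python
def dtw_distance(s, t):
    """Cumulative cost of the cheapest alignment between sequences s and t."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i items of s with the first j of t
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # s advances, t repeats
                cost[i][j - 1],      # t advances, s repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same shape played at half speed aligns with zero cost.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```

This version costs O(n*m) time and memory, so long series are usually handled with a window constraint on how far the alignment may drift.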
Dimension Reduction 
Principal Component Analysis - PCA 
The PCA method finds the directions with 
the greatest variance in the data, called 
principal components. 
Eigenfaces - Facial recognition, OpenCV 
Linear Discriminant Analysis - LDA 
Linear discriminant analysis (LDA) and 
the related Fisher's linear discriminant 
are methods used in statistics, pattern 
recognition and machine learning to find 
a linear combination of features which 
characterizes or separates two or more 
classes of objects or events. The resulting 
combination may be used as a linear 
classifier, or, more commonly, for 
dimensionality reduction before later 
classification. - Wikipedia 
Bankruptcy prediction 
Facial recognition 
Marketing
My Brain is Full 
Technologies 
Complex Event Processing (CEP) 
Storm
Map Reduce 
Distributed Database Cluster 
Hadoop 
Large 
Powerful 
Capable 
Methodical 
Batch 
Input: HDFS - petabytes of raw data 
Output: NoSQL - Signal from Noise
HOT ANALYTICS - In Real-time 
Key Challenges 
 Handle extremely high rates of read/write transactions with 
concurrent real-time analytics 
 Avoid hot spots 
 On a node 
 An index 
 A key 
 Pre-qualify data to be processed in Map Reduce 
 Maximize parallelism 
 Minimize programmer complexity 
 In real time
Aerospike Architecture
1) Shared-nothing architecture, every node identical 
2) No hotspots - DHT with RIPEMD160 
3) Single-row ACID - synchronous replication within cluster 
4) Real-time prioritization of transactions + long-running tasks 
5) Smart Cluster - zero-touch auto fail-over, rebalancing, rolling upgrades 
6) Smart Client - 1 hop to data, no load balancers 
Queries + User Defined Functions = Real-time Analytics 
STREAM 
AGGREGATIONS 
(INDEXED MAP-REDUCE) 
Pipe Query results 
through UDFs 
 Filter, Transform, Aggregate ... Map, Reduce 
User Defined 
Functions (UDFs) 
for real-time analytics 
and aggregations
Conceptual Stream Processing 
 Output of a query is a Stream 
 Stream flows through 
Filter 
Mapper 
Aggregator 
Reducer 
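The stages listed above can be pictured as a chain of lazy steps over each node's query output, followed by a merge of the per-node results. The sketch below is plain Python, not Aerospike's UDF API (the deck's UDFs are written in Lua); the helper `node_stream` and the sample `(sensor, reading)` tuples are hypothetical.

```python
from functools import reduce


def node_stream(records, keep, mapper, aggregator, start):
    """Filter -> map -> aggregate over one node's query results, lazily."""
    filtered = (r for r in records if keep(r))
    mapped = (mapper(r) for r in filtered)
    return reduce(aggregator, mapped, start)


# Hypothetical query output from two nodes: (sensor, reading) tuples.
node_a = [("s1", 4), ("s2", -1), ("s1", 6)]
node_b = [("s2", 10), ("s1", -3)]

partials = [
    node_stream(
        node,
        keep=lambda r: r[1] >= 0,           # filter: drop negative readings
        mapper=lambda r: r[1],              # map: keep only the reading
        aggregator=lambda acc, x: acc + x,  # aggregate: running sum per node
        start=0,
    )
    for node in (node_a, node_b)
]

# Reducer: merge the per-node aggregates into one result.
total = reduce(lambda a, b: a + b, partials)
print(partials, total)
```

Only the small per-node aggregates travel to the reducer, which is the point of pushing the earlier stages close to the data.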
Hot Analytics Scenario - Airline Late Flights 
Data 
Airline flights in the USA January 2012 
~1,050,000 flight records 
Task 
On a specific date 
 Which Airline had late flights? 
 How many flights? 
 How many were late? 
 Percentage late flights? 
Performance Requirements 
Results in < 1 Sec 
No impact on production transaction performance (300K TPS) 
GitHub Repo - https://github.com/aerospike/flights-analytics
Solution 
 Index the flight records by Date 
 Aggregate (Map) late flight data on node 
 Reduce flight data from each node in the 
client 
 User Defined Functions (UDFs) written in 
Lua 
 Registered with the Aerospike Cluster 
 Invoked as part of a secondary index 
query 
Indexed Map Reduce
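The shape of the solution above (aggregate on each node, reduce in the client) can be mimicked in plain Python. This is a conceptual sketch only: it is not the Lua UDF code from the flights-analytics repository, the field names `date`, `carrier` and `arr_delay` are hypothetical, treating a positive arrival delay as "late" is an assumption, and the date check stands in for the secondary index query.

```python
from collections import defaultdict


def aggregate_on_node(flights, date):
    """Per-node step: count flights and late flights per carrier for one date."""
    stats = defaultdict(lambda: {"flights": 0, "late": 0})
    for f in flights:
        if f["date"] != date:       # stand-in for the index lookup by date
            continue
        s = stats[f["carrier"]]
        s["flights"] += 1
        if f["arr_delay"] > 0:      # assumption: any positive delay is "late"
            s["late"] += 1
    return dict(stats)


def reduce_in_client(partials):
    """Client step: merge per-node counts and compute the percentage late."""
    merged = defaultdict(lambda: {"flights": 0, "late": 0})
    for partial in partials:
        for carrier, s in partial.items():
            merged[carrier]["flights"] += s["flights"]
            merged[carrier]["late"] += s["late"]
    return {
        carrier: {**s, "pct_late": 100.0 * s["late"] / s["flights"]}
        for carrier, s in merged.items()
    }


node1 = [
    {"date": "2012-01-15", "carrier": "AA", "arr_delay": 12},
    {"date": "2012-01-15", "carrier": "AA", "arr_delay": -3},
    {"date": "2012-01-16", "carrier": "DL", "arr_delay": 40},
]
node2 = [
    {"date": "2012-01-15", "carrier": "AA", "arr_delay": 5},
    {"date": "2012-01-15", "carrier": "DL", "arr_delay": 0},
]

partials = [aggregate_on_node(n, "2012-01-15") for n in (node1, node2)]
result = reduce_in_client(partials)
print(result)
```

Counts are merged rather than percentages because percentages from different nodes cannot be combined without knowing how many flights each one covers.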
Prepare and execute a Query 
Aggregation (Map) function 
Reduce function 
Stream Function (StreamUDF) 
Operations (300k TPS) + Analytics (Indexed Map/Reduce) 
 Java App calculates 
% of late flights by Airline 
 300k TPS Operations + 
Process 1 Million 
records 
 Indexed Map/Reduce 
 Aggregations 
 Distributed Queries + UDF 
 Runs in 0.5 seconds
Operational + Analytics + Adding servers and Re-balancing 
 300k TPS Operations + 
Process 1 Million 
records 
 Runs in 0.5 seconds 
 Add servers, 
auto-rebalance 
while running query 
Cluster 
 3 Nodes 
 Hex-core 3.4 GHz 
 24 GB RAM 
 2x Micron p420m SSDs 
 10GB network
Books
Software 
 Aerospike 
http://www.aerospike.com/free-aerospike-3-community-edition/ 
 Tools 
Eclipse - http://www.eclipse.org/ 
Lua Plugin - http://www.eclipse.org/koneki/ldt/ 
Aerospike Plugin - https://github.com/aerospike/eclipse-tools 
 Example 
Flight Analytics - https://github.com/aerospike/flights-analytics
QUESTIONS? 
info@aerospike.com 
www.aerospike.com 


Editor's Notes

  • #16: OpenCV Facial recognition