This deck covers analyzing streaming data with streaming algorithms. It shows how to compute the mean incrementally as new data points arrive, by keeping a running sum and a running count of items, and asks whether the median can be updated the same way. It illustrates how pre-aggregated counters answer range queries under the hood, sketches the realtime tradeoffs between high volume, high velocity, and ad-hoc queries, and concludes that big data is also about doing small things quickly and making the results accessible.
7. Get ahead of the curve
[Figure: stream of data points labeled Normal vs Noise — J Gama, University of Porto]
8. Get ahead of the curve
[Figure: stream of data points labeled Normal, Noise, Concept drift, New concept — J Gama, University of Porto]
"Big Data is much more likely to catch the black swan as it swoops in" - Norman Nie, Revolution Analytics
18. Under the hood
Query: where time 21:00 - 23:00, count(*)

Under the hood, the query is answered from pre-aggregated counters:

By hour (with per-minute buckets):
21:00: all = 1345, :00 = 45, :01 = 62, ...
22:00: all = 3221, :00 = 22, :01 = 19, ...
...

By country (with per-user buckets):
UK: all = 228, user01 = 1, user14 = 12, ...
US: all = 354, user01 = 15, user14 = 0, ...
MY: all = 28, user01 = 0, user02 = 0, ...
...
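A minimal sketch of how such counters could be maintained, assuming events arrive one at a time; the names (ingest, count_between, the bucket keys) are hypothetical, not from the deck. Each event bumps every counter it belongs to, so a range count becomes a few lookups instead of a scan:

from collections import defaultdict

# hour -> {"all": n, minute: n, ...} and country -> {"all": n, user: n, ...}
by_hour = defaultdict(lambda: defaultdict(int))
by_country = defaultdict(lambda: defaultdict(int))

def ingest(hour, minute, country, user):
    # bump every pre-aggregated counter this event belongs to
    by_hour[hour]["all"] += 1
    by_hour[hour][minute] += 1
    by_country[country]["all"] += 1
    by_country[country][user] += 1

def count_between(start_hour, end_hour):
    # count(*) where time start_hour - end_hour, answered from stored totals
    return sum(by_hour[h]["all"] for h in range(start_hour, end_hour))

# e.g. ingest(21, 0, "UK", "user14"); then count_between(21, 23)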
19. Streaming algorithms
A = [a1, a2, a3, a4, a5]
mean(A) = sum it up / number of things
20. Streaming algorithms
A = [a1, a2, a3, a4, a5]
mean(A) = sum it up / number of things
now add another item a6...???
21. Streaming algorithms
A = [a1, a2, a3, a4, a5]
mean(A) = sum it up / number of things
now add another item a6...???
sum = sum + a6
inc(number of things)
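The update on slide 21 translates directly to code. A minimal sketch (the RunningMean name is my own):

class RunningMean:
    # incremental mean: O(1) memory, one update per new item
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def add(self, x):
        self.sum += x     # sum = sum + a6
        self.count += 1   # inc(number of things)

    def mean(self):
        return self.sum / self.count

m = RunningMean()
for a in [1, 2, 3, 4, 5]:
    m.add(a)
m.add(6)          # now add another item a6
print(m.mean())   # 3.5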
22. Streaming algorithms
A = [a1, a2, a3, a4, a5]
mean(A) = sum it up / number of things
now add another item a6...???
sum = sum + a6
inc(number of things)
try this with median?
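The deck leaves the median as an open question, and for good reason: it has no exact constant-space update like the mean's. One standard answer (a two-heap technique, not from the slides) keeps the lower half of the items in a max-heap and the upper half in a min-heap:

import heapq

class RunningMedian:
    # streaming median via two heaps: lo is a max-heap (stored negated)
    # holding the lower half, hi is a min-heap holding the upper half
    def __init__(self):
        self.lo = []
        self.hi = []

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # keep max(lo) <= min(hi)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # lo holds the extra element when the count is odd
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

This still stores every item; in big-data settings the usual compromise is an approximate quantile sketch rather than an exact streaming median.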
23. Realtime tradeoffs
[Diagram: tradeoffs between High-velocity, Ad-hoc, and High-volume]
24. Conclusion
Big Data is also about the Little Things, done fast.
The devil is in the details.
Make it accessible.