AUA Data Science Meetup
DAVID GEVORKYAN
@davidgev
davidgevorkyan
GRADUATED AUA IN 2008
WHAT IS BIG DATA?
FASHIONABLE TERM?
80% OF DATA EXISTING IN ANY ENTERPRISE IS UNSTRUCTURED DATA
STRUCTURED DATA
SEMI-STRUCTURED
UNSTRUCTURED DATA
RDBMS Data Warehousing
90% OF THE DATA IN THE WORLD TODAY HAS BEEN CREATED IN THE LAST TWO YEARS ALONE
Source: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
4 V’S OF BIG DATA
VOLUME (large amount of data)
VARIETY (sensors, video, audio, email, social)
VELOCITY (speed of data generation)
VERACITY (authenticity and/or accuracy)
SOLUTIONS REQUIRED
forces you to change the way you
• COLLECT
• TRANSPORT
• STORE
• MANAGE
• ANALYZE
• VISUALIZE
WHAT IS DATA SCIENCE?
DATA SCIENCE != STATISTICAL ANALYSIS
IT IS SCIENCE AND “ART” OF…
• EXPLORING THE UNKNOWN ABOUT DATA
“make discoveries while swimming in the data”
• REFINING THE RESULTS FOR ACCURACY
• DERIVING ACTIONABLE INSIGHT
• CREATING DATA-DRIVEN PRODUCTS
WHO ARE DATA SCIENTISTS?
Drew Conway, 2010
BIG DATA SCIENCE TOOLS?
• Scala, Java, Python, R… (bonus: Clojure, Haskell, Erlang)
• Hadoop, HDFS, MapReduce… (bonus: Spark, Storm, Tez)
• Scalding, HBase, Pig, Hive… (bonus: Shark, Titan, Giraph)
• Flume, Sqoop, ETL, Web scrapers… (bonus: Hume)
• SQL, RDBMS, DW, OLAP… (bonus: SOLR, ElasticSearch)
• Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, Pandas)
• D3.js, Kibana, ggplot2, Tableau… (bonus: Shiny, Flare, Datameer)
• SPSS, Matlab, SAS… (the enterprise man)
• NoSQL, MongoDB, Cassandra, CouchDB
• And Yes! … MS-Excel: the most used, most underrated DS tool
GOAL?
• Revenue, revenue, revenue
• Improve the customer experience
• Increase operational efficiency
• GE: Optimize maintenance intervals for industrial products
• Google: Refine search and ad-serving algorithms
• Zynga: Optimize the game experience for both long-term engagement and revenue
• Netflix: Movie recommendations
• Kaplan: Uncover effective learning strategies
• eHarmony: Create happy relationships
WHO ARE WE?
TRADITIONAL METHODS DO NOT WORK ANYMORE…
EHARMONY CREATES THE HAPPIEST, MOST PASSIONATE AND MOST FULFILLING RELATIONSHIPS*
*ACCORDING TO A RECENT STUDY
438 MARRIAGES PER DAY
THE DIFFERENCE?
Compatibility Matching System®
• COMPATIBILITY MATCHING
• AFFINITY MATCHING
• MATCH DISTRIBUTION
UNIDIRECTIONAL USER DEFINED CRITERIA
Nicolette
BIDIRECTIONAL
Leo, Ian, Steve, Nicolette
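
One possible reading of the unidirectional and bidirectional labels, sketched in Python. This is a minimal illustration and not eHarmony's implementation: the `User` fields, the age-range criterion, and all ages are hypothetical values chosen for the example. Only the first names come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical fields, for illustration only.
    name: str
    age: int
    min_age: int  # youngest partner this user accepts
    max_age: int  # oldest partner this user accepts


def accepts(a: User, b: User) -> bool:
    """Unidirectional check: does b satisfy a's criteria?"""
    return a.min_age <= b.age <= a.max_age


def bidirectional(a: User, b: User) -> bool:
    """Bidirectional check: each user must satisfy the other's criteria."""
    return accepts(a, b) and accepts(b, a)


# Made-up ages and ranges; only the first names appear on the slides.
nicolette = User("Nicolette", 30, 28, 38)
candidates = [
    User("Leo", 35, 25, 32),
    User("Ian", 33, 34, 45),
    User("Steve", 45, 25, 35),
]

one_way = [c.name for c in candidates if accepts(nicolette, c)]
two_way = [c.name for c in candidates if bidirectional(nicolette, c)]
```

With these made-up values, the one-way filter keeps everyone Nicolette would accept, while the two-way filter additionally drops candidates whose own criteria exclude her.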
150 questions
Personality
Values
Attributes
Beliefs
Intellect
Energy
Sociability
Ambition
Kindness
Curiosity
Humor
Spirituality
COMPATIBILITY MATCHING
USER DEFINED CRITERIA
COMPATIBILITY MODELS
MONGODB
VOLDEMORT
MONGODB DATA STORE NEEDS
• POWERFUL INDEXING MODELS
• FAST MULTI-ATTRIBUTE SEARCHES
• EASY TO MAINTAIN
• 60M+ QUERIES per day
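
A multi-attribute search of the kind listed above could look roughly like the following filter document. This is a sketch under stated assumptions: the field names (`age`, `height_cm`, `smokes`) are hypothetical and do not come from the talk. The query operators `$gte`, `$lte`, and `$in` are standard MongoDB operators, and the code only builds plain Python structures, so it runs without a database.

```python
def build_criteria_filter(min_age, max_age, min_height_cm, smoking_ok):
    """Build a MongoDB filter document for a multi-attribute search.

    All field names here are hypothetical examples.
    """
    return {
        "age": {"$gte": min_age, "$lte": max_age},
        "height_cm": {"$gte": min_height_cm},
        "smokes": {"$in": list(smoking_ok)},
    }


# Key specification for a compound index over the same (hypothetical) fields,
# in the list-of-(field, direction) form that PyMongo accepts; 1 = ascending.
CRITERIA_INDEX = [("age", 1), ("height_cm", 1), ("smokes", 1)]

query = build_criteria_filter(28, 38, 170, [False])
```

With PyMongo, `query` would be passed to `Collection.find` and `CRITERIA_INDEX` to `Collection.create_index`. Whether one compound index suits a real workload depends on the query mix, which the slides do not describe.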
MONGODB WINS
• AUTO SCALING
• BUILT-IN SHARDING
• AUTO BALANCING
• MMS
VOLDEMORT?
THAT NAME SOUNDS FAMILIAR
VOLDEMORT DATA STORE NEEDS
• CRUD OPERATIONS
• VARIED TRANSACTION SIZES
• BILLION+ POTENTIAL MATCHES per day
VOLDEMORT WINS
• AUTO REPLICATION
• AUTO PARTITIONING
• PLUGGABLE SERIALIZATION
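
The idea behind automatic partitioning and replication in a key-value store can be illustrated with a simplified hash ring. This is a generic sketch and not Voldemort's actual routing code: the partition layout, the key format, and the function names are assumptions made for the example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it (stable across runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def replica_nodes(key, partition_to_node, replication_factor):
    """Walk the ring from the key's partition, collecting distinct nodes."""
    n = len(partition_to_node)
    start = partition_for(key, n)
    nodes = []
    for step in range(n):
        node = partition_to_node[(start + step) % n]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes


# Hypothetical layout: 8 partitions spread over 3 nodes (ids 0, 1, 2).
ring = [0, 1, 2, 0, 1, 2, 0, 1]
owners = replica_nodes("user:42|user:97", ring, replication_factor=2)
```

Because the hash is deterministic, every client computes the same owners for a key, so no central lookup is needed to find where a record and its replica live.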
AFFINITY MATCHING
65 30
3000 miles
Comm probability (PROB) vs. distance in miles (axis ticks: 0, 1, 3, 7, 15, 63, 255, 1023, 4095)
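
A curve like this can be estimated by bucketing match pairs by distance and taking the share of pairs in each bucket that communicated. The sketch below assumes the axis ticks are upper bucket edges, which is an interpretation of the slide. The sample observations are invented for the example and are not eHarmony data.

```python
from bisect import bisect_left
from collections import defaultdict

# Bucket edges taken from the axis ticks on the slide (assumed upper bounds).
EDGES = [0, 1, 3, 7, 15, 63, 255, 1023, 4095]


def comm_probability_by_distance(observations):
    """observations: iterable of (distance_miles, communicated) pairs."""
    sent = defaultdict(int)
    total = defaultdict(int)
    for distance, communicated in observations:
        # Index of the first edge >= distance; clamp very long distances
        # into the last bucket.
        idx = min(bisect_left(EDGES, distance), len(EDGES) - 1)
        bucket = EDGES[idx]
        total[bucket] += 1
        sent[bucket] += int(communicated)
    return {b: sent[b] / total[b] for b in sorted(total)}


# Invented sample data, for illustration only.
sample = [
    (0, True), (2, True), (2, False),
    (50, False), (3000, False), (9000, True),
]
probs = comm_probability_by_distance(sample)
```

The roughly doubling bucket widths keep short distances finely resolved while still covering cross-country pairs in a handful of buckets.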
Comm probability (PROB) vs. height difference in cm (axis from -29 to 56; annotated range: 4-8 in)
WORDS TO USE
SOME INSIGHT
DATA NEEDS FOR AFFINITY
• 50M+ REGISTERED USERS
• 103 ATTRIBUTES
• 107 DAILY MATCHES
• 250M+ PHOTOS
• 4B+ QUESTIONNAIRES ANSWERED
COMMUNICATION AGGREGATES
EVENT LISTENER SERVICE
USER ACTIVITY SERVICE
~5 ms RESPONSE TIMES
10K EVENTS PER SECOND
USER SERVICE
HOURLY, DAILY, TOTAL
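
Hourly, daily, and total aggregates over a stream of user events can be kept as simple counters. The sketch below is an in-memory illustration only: the class name, the `on_event` method, and the event shape are hypothetical, and a service handling the event rates quoted above would need a persistent, concurrent store.

```python
from collections import Counter
from datetime import datetime


class CommunicationAggregates:
    """In-memory hourly, daily, and total event counts per user."""

    def __init__(self):
        self.hourly = Counter()
        self.daily = Counter()
        self.total = Counter()

    def on_event(self, user_id: str, timestamp: datetime) -> None:
        # One incoming event updates all three granularities.
        self.hourly[(user_id, timestamp.strftime("%Y-%m-%dT%H"))] += 1
        self.daily[(user_id, timestamp.strftime("%Y-%m-%d"))] += 1
        self.total[user_id] += 1


agg = CommunicationAggregates()
for ts in (
    datetime(2014, 1, 1, 10, 5),
    datetime(2014, 1, 1, 10, 40),
    datetime(2014, 1, 1, 11, 0),
    datetime(2014, 1, 2, 9, 0),
):
    agg.on_event("u1", ts)
```

Updating the aggregates at write time keeps reads cheap, since a lookup is a single key fetch rather than a scan over raw events.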
OFFLINE BATCH JOBS
USER SERVICE
MAP-SIDE JOINS (TB)
SCORING
1+GB Compressed Protocol Buffers
PAIRINGS SERVICE
750M Compressed Protocol Buffers
BILLION+ POTENTIAL MATCHES
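
A map-side join avoids the shuffle phase by loading the smaller dataset into memory in each mapper and streaming the larger one past it. The sketch below shows the idea in plain Python rather than in a Hadoop job. The record fields and sample rows are hypothetical.

```python
def map_side_join(small_table, large_records, key):
    """Join a stream of records against an in-memory lookup table.

    small_table must fit in memory; large_records can be any iterable
    and is consumed lazily, one record at a time.
    """
    lookup = {row[key]: row for row in small_table}
    for record in large_records:
        match = lookup.get(record[key])
        if match is not None:
            # Inner join: emit the merged record only when the key is known.
            yield {**match, **record}


# Hypothetical sample rows, for illustration only.
users = [{"user_id": 1, "age": 30}, {"user_id": 2, "age": 41}]
pairs = [
    {"user_id": 1, "candidate_id": 7},
    {"user_id": 3, "candidate_id": 8},
    {"user_id": 2, "candidate_id": 9},
]
joined = list(map_side_join(users, pairs, "user_id"))
```

The trade-off is memory: the approach only works while one side of the join stays small enough to hold in each mapper.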
AMAZON EMR
AWS DIRECT CONNECT
256 NODES
50TB STORAGE
IN-HOUSE SEAMICRO
DATA RETRIEVAL LATENCY
LOW OPERATIONAL COST
LOW POWER CONSUMPTION
PREDICTABLE COMPLETION TIMES
MODEL RETRAINING
distcp
Protocol Buffers from Offline Jobs
MATCH DISTRIBUTION
Delivering the right matches at the right time to as many people as possible across the entire network
THANK YOU
QUESTIONS?
CREDITS:
Visual Elements From The Noun Project
http://thenounproject.com
