際際滷

際際滷Share a Scribd company logo
Cloud based low cost, low
maintenance, scalable data platform
Apoorva Gaurav
Why hunt elephant to sell shoes ?
Why hunt elephant to sell shoes ?
WHOM
HOW
WHAT
Use case : List products based on CTR
 Take all impressions of a product and action
performed
 Some products are more attractive than
others
 Give benefit to such products
Use case : List products based on CTR
 select product_id, sum(clicked)/sum
(appeared) as ctr from tbl_prod_log group
by product_id order by ctr desc
 >100K products, > 500 million impressions
a day --- DIFFICULT TO SCALE
Use case : User segmentation
 Different users have
different browsing
patterns
 Segment them based
on their history
 Provide them different
experience
Use case : User segmentation
 select depth, count
(cookie_id), group by
depth from user_log
 > 1m users daily,
multiple browsers,
devices
 DIFFICULT TO SCALE
Use case : Recommend similar products
 Compute score of
products based on
various attributes
 Compute score of a
user based on
products (s)he
browses
 Recommend similar
products
Use case : Recommend similar products
 select id, (w1.att1 + w2.
att2 + ... wN.attN) as
score from products
 select userid, (v1.
score1 + v2.score2 + ...
+ vN.scoreN)
 >1m user >100K
products DIFFICULT TO
COMPUTE
Constraints
 Fast paced
 Tangible results
 Limited budget
 Low engineering bandwidth
Design goals
 Solution should be able to scale up and
down
 Record data now, ask questions later
 Generic data model
 Segregate reads from writes
 Low running cost
 Low maintenance overhead
Cloud computing
Pros
 No setup cost
 Pay as you use
 Scaling is a breeze
 Managed services
Cons
 Performance
 Reliability
 Data security
 Control
A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down
easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap
Architecture Diagram
Architecture Diagram Hadoop on cloud
Easy to scale up
and down
Pay as you use
Infinite capacity
11 nines of
durability
Flat file storage
Cheap
Persistent
distributed Q
100K msg/sec
Events can be
played back
Highly concurrent
server
Very easy to use
Flexible
Much easier to
introduce HA,
reliability etc
Both server and
client side data
Segregate and
upload events to
S3
Scales horizontally
Distributed
config mgmt
Fault tolerant
Some numbers
 ~20 million events getting logged daily
 Corresponds to ~800 million data points
 & ~25GB
 Close to a 100 jobs a day
 The biggest job has footprints of ~2
billion events
 Platform costs ~20$ daily; jobs ~15$ daily
 One can code in english (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
 Need not be in C or Erlang to be performant (Kafka)
 Can search without index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
 Spot EMR clusters effeciently
 m1.small are not small
 awk + grep = awesome
 Apache mailing lists SUCK!!!
Some key learnings
Thank you!!
& we are hiring
apoorva.gaurav@myntra.com

More Related Content

Myntra.com's Big Data Platform

  • 1. Cloud based low cost, low maintenance, scalable data platform Apoorva Gaurav
  • 2. Why hunt elephant to sell shoes ?
  • 3. Why hunt elephant to sell shoes ? WHOM HOW WHAT
  • 4. Use case : List products based on CTR Take all impressions of a product and action performed Some products are more attractive than others Give benefit to such products
  • 5. Use case : List products based on CTR select product_id, sum(clicked)/sum (appeared) as ctr from tbl_prod_log group by product_id order by ctr desc >100K products, > 500 million impressions a day --- DIFFICULT TO SCALE
  • 6. Use case : User segmentation Different users have different browsing patterns Segment them based on their history Provide them different experience
  • 7. Use case : User segmentation select depth, count (cookie_id), group by depth from user_log > 1m users daily, multiple browsers, devices DIFFICULT TO SCALE
  • 8. Use case : Recommend similar products Compute score of products based on various attributes Compute score of a user based on products (s)he browses Recommend similar products
  • 9. Use case : Recommend similar products select id, (w1.att1 + w2. att2 + ... wN.attN) as score from products select userid, (v1. score1 + v2.score2 + ... + vN.scoreN) >1m user >100K products DIFFICULT TO COMPUTE
  • 10. Constraints Fast paced Tangible results Limited budget Low engineering bandwidth
  • 11. Design goals Solution should be able to scale up and down Record data now, ask questions later Generic data model Segregate reads from writes Low running cost Low maintenance overhead
  • 12. Cloud computing Pros No setup cost Pay as you use Scaling is a breeze Managed services Cons Performance Reliability Data security Control
  • 13. A very basic Big Data system Highly available Very low latency Initial filtering Storage agnostic Scale up and down easily Essentially distributed Very easy to use Highly reliable Huge capacity Cater to any data model Cheap
  • 15. Architecture Diagram Hadoop on cloud Easy to scale up and down Pay as you use Infinite capacity 11 nines of durability Flat file storage Cheap Persistent distributed Q 100K msg/sec Events can be played back Highly concurrent server Very easy to use Flexible Much easier to introduce HA, reliability etc Both server and client side data Segregate and upload events to S3 Scales horizontally Distributed config mgmt Fault tolerant
  • 16. Some numbers ~20 million events getting logged daily Corresponds to ~800 million data points & ~25GB Close to a 100 jobs a day The biggest job has footprints of ~2 billion events Platform costs ~20$ daily; jobs ~15$ daily
  • 17. One can code in english (Finagle) myService = handleExceptions andThen recordInKafka andThen respond Need not be in C or Erlang to be performant (Kafka) Can search without index s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30 s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30 Spot EMR clusters effeciently m1.small are not small awk + grep = awesome Apache mailing lists SUCK!!! Some key learnings
  • 18. Thank you!! & we are hiring apoorva.gaurav@myntra.com