�ݺ�ߣ

Cloud based low cost, low
maintenance, scalable data platform
Apoorva Gaurav

Why hunt elephant to sell shoes ?

Why hunt elephant to sell shoes ?
WHOM
HOW
WHAT

Use case : List products based on CTR
● Take all impressions of a product and action
performed
● Some products are more attractive than
others
● Give benefit to such products

Use case : List products based on CTR
● select product_id, sum(clicked)/sum
(appeared) as ctr from tbl_prod_log group
by product_id order by ctr desc
● >100K products, > 500 million impressions
a day --- DIFFICULT TO SCALE

Use case : User segmentation
● Different users have
different browsing
patterns
● Segment them based
on their history
● Provide them different
experience

Use case : User segmentation
● select depth, count
(cookie_id), group by
depth from user_log
● > 1m users daily,
multiple browsers,
devices
● DIFFICULT TO SCALE

Use case : Recommend similar products
● Compute score of
products based on
various attributes
● Compute score of a
user based on
products (s)he
browses
● Recommend similar
products

Use case : Recommend similar products
● select id, (w1.att1 + w2.
att2 + ... wN.attN) as
score from products
● select userid, (v1.
score1 + v2.score2 + ...
+ vN.scoreN)
● >1m user >100K
products DIFFICULT TO
COMPUTE

Constraints
● Fast paced
● Tangible results
● Limited budget
● Low engineering bandwidth

Design goals
● Solution should be able to scale up and
down
● Record data now, ask questions later
● Generic data model
● Segregate reads from writes
● Low running cost
● Low maintenance overhead

Cloud computing
Pros
● No setup cost
● Pay as you use
● Scaling is a breeze
● Managed services
Cons
● Performance
● Reliability
● Data security
● Control

A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down
easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap

Architecture Diagram Hadoop on cloud
Easy to scale up
and down
Pay as you use
Infinite capacity
11 nines of
durability
Flat file storage
Cheap
Persistent
distributed Q
100K msg/sec
Events can be
played back
Highly concurrent
server
Very easy to use
Flexible
Much easier to
introduce HA,
reliability etc
Both server and
client side data
Segregate and
upload events to
S3
Scales horizontally
Distributed
config mgmt
Fault tolerant

Some numbers
● ~20 million events getting logged daily
● Corresponds to ~800 million data points
● & ~25GB
● Close to a 100 jobs a day
● The biggest job has footprints of ~2
billion events
● Platform costs ~20$ daily; jobs ~15$ daily

● One can code in english (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
● Need not be in C or Erlang to be performant (Kafka)
● Can search without index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
● Spot EMR clusters effeciently
● m1.small are not small
● awk + grep = awesome
● Apache mailing lists SUCK!!!
Some key learnings

Thank you!!
& we are hiring
apoorva.gaurav@myntra.com

�ݺ�ߣ

Myntra.com's Big Data Platform

More Related Content

Myntra.com's Big Data Platform