This is the presentation given in Fifth Elephant Conference 2013. It talks about how we've created a cloud based big data which is low on maintenance and running cost. Key technologies used here are Twitter Finagle, Apache Kafka, Apache Zookeeper, Amazon S3 and Amazon EMR.
1 of 18
Downloaded 110 times
More Related Content
Myntra.com's Big Data Platform
1. Cloud based low cost, low
maintenance, scalable data platform
Apoorva Gaurav
4. Use case : List products based on CTR
Take all impressions of a product and action
performed
Some products are more attractive than
others
Give benefit to such products
5. Use case : List products based on CTR
select product_id, sum(clicked)/sum
(appeared) as ctr from tbl_prod_log group
by product_id order by ctr desc
>100K products, > 500 million impressions
a day --- DIFFICULT TO SCALE
6. Use case : User segmentation
Different users have
different browsing
patterns
Segment them based
on their history
Provide them different
experience
7. Use case : User segmentation
select depth, count
(cookie_id), group by
depth from user_log
> 1m users daily,
multiple browsers,
devices
DIFFICULT TO SCALE
8. Use case : Recommend similar products
Compute score of
products based on
various attributes
Compute score of a
user based on
products (s)he
browses
Recommend similar
products
9. Use case : Recommend similar products
select id, (w1.att1 + w2.
att2 + ... wN.attN) as
score from products
select userid, (v1.
score1 + v2.score2 + ...
+ vN.scoreN)
>1m user >100K
products DIFFICULT TO
COMPUTE
11. Design goals
Solution should be able to scale up and
down
Record data now, ask questions later
Generic data model
Segregate reads from writes
Low running cost
Low maintenance overhead
12. Cloud computing
Pros
No setup cost
Pay as you use
Scaling is a breeze
Managed services
Cons
Performance
Reliability
Data security
Control
13. A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down
easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap
15. Architecture Diagram Hadoop on cloud
Easy to scale up
and down
Pay as you use
Infinite capacity
11 nines of
durability
Flat file storage
Cheap
Persistent
distributed Q
100K msg/sec
Events can be
played back
Highly concurrent
server
Very easy to use
Flexible
Much easier to
introduce HA,
reliability etc
Both server and
client side data
Segregate and
upload events to
S3
Scales horizontally
Distributed
config mgmt
Fault tolerant
16. Some numbers
~20 million events getting logged daily
Corresponds to ~800 million data points
& ~25GB
Close to a 100 jobs a day
The biggest job has footprints of ~2
billion events
Platform costs ~20$ daily; jobs ~15$ daily
17. One can code in english (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
Need not be in C or Erlang to be performant (Kafka)
Can search without index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
Spot EMR clusters effeciently
m1.small are not small
awk + grep = awesome
Apache mailing lists SUCK!!!
Some key learnings