AdamCloud: A cloud infrastructure for a genomics project. The AdamCloud project aims to simplify the installation of the AMPLab genomics pipeline (Snap, Adam, Avocado).
The results of the first iteration (part II) were presented here:
http://www.slideshare.net/davidonlaptop/bdm32-adam-cloud-part-2-43514904
BDM29: AdamCloud Project - Part I
1. AdamCloud: a cloud infrastructure for a genomics project
David Lauzon & Sébastien Bonami
Presented at Big Data Montreal #29 on October 7th 2014
2. Plan
Project
Use Cases
Requirements
Technologies
Environments
Planning
Challenges
Conclusion
3. Project
Final project / Projet de fin d'études (PFE)
Sébastien Bonami
Student in IT Engineering at École de technologie supérieure (ÉTS)
Goal: build a proof of concept of a new genomics platform and optimize the infrastructure for portability
5. Use Cases: UC1 Genome ETL
About 98% of a human's DNA is similar to that of every other human
ETL
o Workload: CPU & RAM & HDD intensive
o Data size (per patient sample)
Input: 100 - 200 GB
Output: 10 MB
o Process Duration: currently takes 2-3 weeks
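To put the UC1 data sizes in perspective, the arithmetic below works out how much the ETL step shrinks each patient sample. It is a minimal sketch: the function name is hypothetical, and decimal units (1 GB = 1000 MB) are an assumption, since the slide does not say which convention it uses.

```python
# Sizes come from the slide; decimal units (1 GB = 1000 MB) are an assumption.
MB_PER_GB = 1000


def reduction_factor(input_gb: float, output_mb: float) -> float:
    """Return how many times smaller the ETL output is than its input."""
    return (input_gb * MB_PER_GB) / output_mb


if __name__ == "__main__":
    # Lower and upper bounds of the per-sample input size quoted on the slide.
    for input_gb in (100, 200):
        factor = reduction_factor(input_gb, output_mb=10)
        print(f"{input_gb} GB in, 10 MB out: about {factor:,.0f}x smaller")
```

Under these assumptions the output is roughly 10,000 to 20,000 times smaller than the input, which is consistent with UC2 being described as "not really big data".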
6. Use Cases: UC2 Genome Data Mining
Not really big data
Researchers use the output of UC1 for data mining
7. Requirements
Infrastructure portability
o from local workstations to the Cloud
RAD
o Ease of development
o IT students focus on infrastructure
o Developer students focus on development
8. Requirements
Demo to hospitals and conferences
o Avoid firewall / bureaucratic issues
Knowledge/Project Transfer
o For next student who picks up the project
o Quick startup
9. Technologies
UC1 Genome ETL
o Apache Spark
o Berkeley Genomics Stack (based on Spark)
Snap
Adam
Avocado
10. Technologies
UC2 Genome Data Mining
o Backend
Adam / Spark
HDFS
Play! Framework (for REST API)
...
o Frontend
HTML5 / Bootstrap / Backbone / jQuery
...
11. Environments
Local
o For UC1 & UC2 developer
Cluster of Mini PCs
o For testing
ÉTS servers
o For private data
Amazon AWS
o For public data
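The mapping between purposes and environments listed on this slide could be captured in a small lookup, for example to keep private data off the public cloud. This is only an illustrative sketch: the purpose keys and the function name are hypothetical and do not come from the project.

```python
# Purpose -> environment, as listed on the slide.
# The key names are hypothetical labels chosen for this sketch.
ENVIRONMENTS = {
    "development": "local workstation",
    "testing": "cluster of Mini PCs",
    "private-data": "ÉTS servers",
    "public-data": "Amazon AWS",
}


def pick_environment(purpose: str) -> str:
    """Return the environment intended for the given purpose."""
    try:
        return ENVIRONMENTS[purpose]
    except KeyError:
        known = ", ".join(sorted(ENVIRONMENTS))
        raise ValueError(f"unknown purpose {purpose!r}; expected one of: {known}") from None
```

For example, `pick_environment("private-data")` returns `"ÉTS servers"`, while an unknown purpose raises a `ValueError` instead of silently falling back to a default.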
12. Infrastructure centralization?
e.g. management layer
Berkeley Genomics (Snap / Adam / Avocado)
Spark
Docker (1 container per service)
VMWare / Boot2Docker
Amazon AMI / Linux Bare Metal / Mac Mini (w/ Linux) / Mac OS X / Windows
13. Planning
1. Run the Stack in 1 container on 1 node
2. Then, in multiple containers on 1 node
3. Then, on multiple nodes
4. Then, on a different hardware infrastructure
5. Then, in the Cloud
6. Monitor the environment
7. Document everything
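Step 2 of the plan above (multiple containers on one node) can be sketched as a script that builds the `docker run` commands for one Spark master and several workers. This is a rough illustration, not the project's actual setup: the image name, network name, container names, and the Spark install path are all assumptions, and the commands are only printed, not executed.

```python
import shlex
from typing import List

IMAGE = "example/spark"                     # hypothetical image name
NETWORK = "adamcloud-net"                   # hypothetical user-defined Docker network
SPARK_CLASS = "/opt/spark/bin/spark-class"  # assumed Spark install path in the image
MASTER_NAME = "spark-master"                # hypothetical container name


def master_command() -> List[str]:
    """Build the command that starts the Spark standalone master container."""
    return [
        "docker", "run", "-d",
        "--name", MASTER_NAME, "--hostname", MASTER_NAME,
        "--network", NETWORK,
        "-p", "8080:8080",  # expose the master web UI on the host
        IMAGE, SPARK_CLASS, "org.apache.spark.deploy.master.Master",
    ]


def worker_command(index: int) -> List[str]:
    """Build the command that starts one worker and points it at the master."""
    return [
        "docker", "run", "-d",
        "--name", f"spark-worker-{index}",
        "--network", NETWORK,
        IMAGE, SPARK_CLASS, "org.apache.spark.deploy.worker.Worker",
        f"spark://{MASTER_NAME}:7077",  # 7077 is the default standalone master port
    ]


def plan_commands(workers: int) -> List[str]:
    """Return shell-ready commands: the master first, then each worker."""
    commands = [master_command()]
    commands += [worker_command(i) for i in range(1, workers + 1)]
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    for line in plan_commands(workers=2):
        print(line)
```

Keeping one service per container matches the "1 container per service" layer of the stack. Moving to step 3 (multiple nodes) would mainly change how the containers reach each other over the network, which this sketch does not cover.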
14. Challenges
Understanding genomics concepts
o What's a genome?
Understanding the Berkeley Genomics Stack
o What does each tool do?
o How can they integrate with each other?
Managing a distributed system and a cluster