AdamCloud: a cloud infrastructure for a genomics project
David Lauzon & Sébastien Bonami
Presented at Big Data Montreal #29 on October 7th 2014
Plan
 Project
 Use Cases
 Requirements
 Technologies
 Environments
 Planning
 Challenges
 Conclusion
Project
Final project / Projet de fin d'études (PFE)
Sébastien Bonami
Student in IT Engineering at École de technologie supérieure (ÉTS)
Goal: build a proof of concept of a new genomics platform
and optimize the infrastructure for portability
Use Cases
 Genome ETL
 Genome Data Mining
Use Cases: UC1 Genome ETL
 About 98% of a human's DNA is similar to that of every other human
 ETL
o Workload: CPU & RAM & HDD intensive
o Data size (per patient sample)
 Input: 100 - 200 GB
 Output: 10 MB
o Process Duration: currently takes 2-3 weeks
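
To put the UC1 figures in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch based on the numbers on this slide; decimal units (1 GB = 1000 MB) and the function names are assumptions made for illustration.

```python
# Back-of-the-envelope figures for UC1, using the numbers on this slide.
# Assumption: decimal units (1 GB = 1000 MB).


def reduction_ratio(input_gb: float, output_mb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return input_gb * 1000 / output_mb


def throughput_gb_per_day(input_gb: float, weeks: float) -> float:
    """Return the average amount of input processed per day."""
    return input_gb / (weeks * 7)


if __name__ == "__main__":
    for input_gb in (100, 200):
        ratio = reduction_ratio(input_gb, 10)
        print(f"{input_gb} GB -> 10 MB: {ratio:,.0f}x smaller")
    # Fastest case: largest input in the shortest time, and the reverse.
    print(f"fastest: {throughput_gb_per_day(200, 2):.1f} GB/day")
    print(f"slowest: {throughput_gb_per_day(100, 3):.1f} GB/day")
```

Under these assumptions the ETL shrinks each sample by a factor of 10,000 to 20,000, while averaging roughly 5 to 14 GB of input per day per sample.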
Use Cases: UC2 Genome Data Mining
 Not really big data
 Researchers use the output of UC1 for data mining
Requirements
 Infrastructure portability
o from local workstations to the Cloud
 RAD (rapid application development)
o Ease of development
o IT students focus on infrastructure
o Developer students focus on development
Requirements
 Demo to hospitals and conferences
o Avoid firewall / bureaucratic issues
 Knowledge/Project Transfer
o For next student who picks up the project
o Quick startup
Technologies
 UC1 Genome ETL
o Apache Spark
o Berkeley Genomics Stack (based on Spark)
 Snap
 Adam
 Avocado
Technologies
 UC2 Genome Data Mining
o Backend
 Adam / Spark
 HDFS
 Play! Framework (for REST API)
 ...
o Frontend
 HTML5 / Bootstrap / Backbone / jQuery
 ...
Environments
 Local
o For UC1 & UC2 developers
 Cluster of Mini PCs
o For testing
 ÉTS servers
o For private data
 Amazon AWS
o For public data
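
The mapping above can be read as a simple routing rule: each purpose has exactly one target environment, and private data never goes to the public cloud. A minimal sketch, assuming a hypothetical helper `pick_environment` and purpose labels that are not part of the project itself:

```python
# Hypothetical lookup table built from the slide; the keys are
# illustrative labels, not names used by the AdamCloud project.
ENVIRONMENTS = {
    "development": "Local",
    "testing": "Cluster of Mini PCs",
    "private data": "ÉTS servers",
    "public data": "Amazon AWS",
}


def pick_environment(purpose: str) -> str:
    """Return the environment the slide assigns to a given purpose."""
    try:
        return ENVIRONMENTS[purpose]
    except KeyError:
        # Fail loudly rather than silently falling back to the cloud.
        raise ValueError(f"unknown purpose: {purpose!r}") from None


if __name__ == "__main__":
    for purpose in ENVIRONMENTS:
        print(f"{purpose}: {pick_environment(purpose)}")
```

Raising an error on an unknown purpose, instead of defaulting to one environment, keeps unclassified data from ending up on Amazon AWS by accident.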
Infrastructure centralization? (e.g. management layer)
[Diagram: layered stack, top to bottom]
 Berkeley Genomics (Snap / Adam / Avocado)
 Spark
 Docker (1 container per service)
 Hosts: VMware / Boot2Docker, Amazon AMI, Linux bare metal, Mac Mini (w/ Linux), Mac OS X, Windows
Planning
1. Run the Stack in 1 container on 1 node
2. Then, in multiple containers on 1 node
3. Then, on multiple nodes
4. Then, on a different hardware infrastructure
5. Then, in the Cloud
6. Monitor the environment
7. Document everything
Challenges
 Understanding genomics concepts
o What's a genome?
 Understanding the Berkeley Genomics Stack
o What does each tool do?
o How can they integrate with each other?
 Managing a distributed system and a cluster
Conclusion
Any comments?

BDM29: AdamCloud Project - Part I
