Genomics in Big Data
Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors
Chang Bum Hong
Today's topics
Genomics 企殊磯 
Cloud Computing
Cloud in Genomics
襦襦 危エ覲企
HPC Cloud for Big Data Analysis
1,588覈 旧 螳企
54%螳  蟆 覿
覲願 矩り 
Nature, 2010
蠍一郁規  襭
覿殊 螳   譴蟆企,
5-10 螳 襷豢れ 覲危ク
蟆, 讌襷 貉危螻 語
覿譟煙  蟇碁朱
Nature, 2010
Genomics and BigData - case study
Data Size
Need for Public Data
Various Software
http://seqanswers.com/wiki/Software
Complicated Pipelines
Computing resources
Fusaro VA, Patil P, Gafni E, Wall DP, Tonellato PJ (2011) Biomedical Cloud Computing With Amazon Web Services. PLoS Comput Biol 7(8): e1002147
Linux skill
Nature Biotech (2006)
To the Clinic - License, HIPAA
Cloud Computing and HPC
(High Performance Computing)
貉危一 る誤 蠍壱, 讌// 螳,   螻旧
磯Μ螳  螻 螳 企殊磯
る 伎手鍵 企殊磯
Journal of Biomedical Informatics (2013)
伎 螳訖襷 , れ
觜る 
Amazon Web Services 觜 覈襦
伎 螳訖襷 , れ
觜る 
Google 觜 覈襦
伎 螳襦 覿 視 蟆
 OS襯  螳ロ
襷豢 OS  螳ロ
 (reproducibility) 覲
 覿 蟆 蟲豢
 覿 蟆 + 一危 + ろ襴渚
 覿 蟆 + 一危 + ろ襴渚呉
覦襦 朱れ るジ り骸 螻旧
伎 螳襦 覿 視 蟆
伎譴 覯 ろ 豢螳
ろ 伎 企語 螻旧
伎 螳襦 覿 視 蟆
  覯襯
 襷 煙 螳ロ
伎 螳襦 覿 視 蟆
 るジ ろ襴讌 觜
Object Storage
http://whatis.techtarget.com/reference/Object-storage-Fast-Guide
 覈蟆 覦襦 Cloud Software
-IaaS 蟲豢 螳ロ ろ 企殊磯 
OpenStack, Eucalyptus, OpenNebula, CloudStack
OpenStack Architecture
HPC Clustering
Cloud in Genomics
Genomics 蟲豌伎 cloud 襦
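Category | Service | Description | KT
IaaS | Amazon Web Services | PaaS and SaaS services; public datasets | ucloud biz
IaaS | Google Cloud | Compute and storage | ucloud biz
IaaS | NeCTAR Research Cloud | OpenStack-based private cloud for research | ucloud biz
SaaS | DNAnexus | NGS data analysis pipelines | g-Analysis
SaaS | Seven Bridges Genomics | NGS data analysis pipelines | g-Analysis
SaaS | GotCloud | NGS data analysis pipelines | g-Analysis
SaaS | Globus Genomics | Galaxy on AWS | g-Galaxy
SaaS | GenomeSpace | Storage-based bioinformatics services | g-Storage
PaaS | CloudMan | Storage-based bioinformatics support | -
PaaS | SeqWare | Platform for genome analysis | -
PaaS | StarCluster | AWS-based HPC cluster computing environment | -
PaaS | CycleComputing | AWS-based HPC cluster computing environment | -
PaaS | Google Genomics, BigQuery | APIs for NGS data analysis | g-Insight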
襦襦 危エ覲企
HPC Cloud for Big Data Analysis
http://seqware.github.io/docs/
HPC Cloud  - Sequencers
http://seqware.github.io/docs/
[Architecture diagram labels: LIMS; Object Storage; High Speed File Transfer; IaaS; HPC; Private Cloud; OpenStack; Job Schedule; Bioinformatics; Linux; Hadoop and Database]
LIMS for NGS
Clarity LIMS from GenoLogics
Galaxy LIMS for NGS
Bioinformatics (2013)
LIMS for NGS
Galaxy LIMS
https://bitbucket.org/jelle/galaxy-central-tron-lims/
Data Upload - online
Data Upload - offline Import/Export
HPC Cloud  - Storage
http://www.genomespace.org/
GenomeSpace is a cloud-based interoperability framework to support
integrative genomics analysis through an easy-to-use Web interface.
HPC Cloud  - HPC
Before Cloud Computing
蟲襷 豌 -> 蟆->覯 覲>伎れ-> 語
 れ -> 豕譬 ろ ->蟲=I=蟲=一危
NeCTAR
StarCluster and CloudMan
http://star.mit.edu/cluster/
https://wiki.galaxyproject.org/CloudMan/AWS/GettingStarted
HPC Cloud  - Pipeline
DNAnexus and HGSC
HPC Cloud  - Web Service
Google Genomics
https://developers.google.com/genomics
Google Genomics API
http://googleresearch.blogspot.co.uk/2014/02/google-joins-global-alliance-for.html
Interoperability: One API, Many Apps
Google Genomics Examples
Using the API to extract specific read information
Google Genomics Examples
A Genome Browser built with the API and JavaScript
HPC Cloud  - Query Engine
Google BigQuery
# Compute the Ti/Tv ratio for BRCA1.
SELECT
  transitions,
  transversions,
  transitions/transversions AS titv
FROM (
  SELECT
    SUM(IF(mutation IN ('A->G', 'G->A', 'C->T', 'T->C'),
        INTEGER(num_snps),
        INTEGER(0))) AS transitions,
    SUM(IF(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                        'A->T', 'T->A', 'C->G', 'G->C'),
        INTEGER(num_snps),
        INTEGER(0))) AS transversions,
  FROM (
    SELECT
      CONCAT(reference_bases,
             CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) AS num_snps,
    FROM
      [google.com:biggene:1000genomes.variants1kG]
    WHERE
      contig = '17'
      AND position BETWEEN 41196312 AND 41277500
      AND vt = 'SNP'
    GROUP BY
      mutation
    ORDER BY
      mutation));
Google BigQuery with plot
result <- query_exec(project = "google.com:biggene",
                     dataset = "1000genomes",
                     query = sql,
                     billing = billing_project)
Ti/Tv ratio in BRCA1
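The classification that the Ti/Tv query performs can be sketched outside BigQuery as well. The following is a minimal illustration in Python, not the presenter's code: the function name `titv_ratio` and the toy input list are assumptions made for this example. Like the query, it counts only single-base substitutions and ignores anything else.

```python
# Transitions swap A<->G or C<->T; the other eight single-base
# substitutions are transversions (same lists as in the SQL query).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
TRANSVERSIONS = {
    ("A", "C"), ("C", "A"), ("G", "T"), ("T", "G"),
    ("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"),
}


def titv_ratio(snps):
    """Return (transitions, transversions, ratio) for (ref, alt) pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        elif (ref, alt) in TRANSVERSIONS:
            transversions += 1
        # Any other pair (for example multi-base alleles) is skipped.
    ratio = transitions / transversions if transversions else float("nan")
    return transitions, transversions, ratio


# Hypothetical toy input, not taken from the 1000 Genomes table.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(titv_ratio(example_snps))
```

With the toy input, three pairs are transitions and two are transversions, so the ratio is 1.5.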
# Count the variation for each sample including phenotypic traits
SELECT
  samples.genotype.sample_id AS sample_id,
  gender,
  population,
  super_population,
  COUNT(samples.genotype.sample_id) AS num_variants_for_sample,
  SUM(IF(samples.af >= 0.05,
      INTEGER(1),
      INTEGER(0))) AS common_variant,
  SUM(IF(samples.af < 0.05 AND samples.af > 0.005,
      INTEGER(1),
      INTEGER(0))) AS middle_variant,
  SUM(IF(samples.af <= 0.005 AND samples.af > 0.001,
      INTEGER(1),
      INTEGER(0))) AS rare_variant,
  SUM(IF(samples.af <= 0.001,
      INTEGER(1),
      INTEGER(0))) AS very_rare_variant,
FROM
  FLATTEN([google.com:biggene:1000genomes.variants1kG], genotype) AS samples
JOIN
  [google.com:biggene:1000genomes.sample_info] p
ON
  samples.genotype.sample_id = p.sample
WHERE
  samples.vt = 'SNP'
  AND (samples.genotype.first_allele > 0
       OR samples.genotype.second_allele > 0)
GROUP BY
  sample_id,
  gender,
  population,
  super_population
ORDER BY
  sample_id;
Google BigQuery with R
ggplot(result, aes(x = population, y = common_variant, fill = super_population)) +
  geom_boxplot() +
  ylab("Count of common variants per sample") +
  ggtitle("Common Variants (Minimum Allelic Frequency 5%)")
Variant type
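The allele-frequency bins used in the per-sample query can be written as a small function, which makes the boundary handling easier to check. This is a hedged sketch in Python: the names `frequency_class` and `count_classes` and the sample frequencies are assumptions for illustration, while the thresholds (0.05, 0.005, 0.001) and class names come from the query.

```python
from collections import Counter


def frequency_class(af):
    """Map an allele frequency to the class names used in the query."""
    if af >= 0.05:
        return "common_variant"
    if af > 0.005:
        return "middle_variant"
    if af > 0.001:
        return "rare_variant"
    return "very_rare_variant"


def count_classes(allele_frequencies):
    """Count how many variants fall into each frequency class."""
    return Counter(frequency_class(af) for af in allele_frequencies)


# Hypothetical frequencies chosen to hit each boundary.
example_afs = [0.2, 0.05, 0.01, 0.005, 0.002, 0.001, 0.0004]
print(count_classes(example_afs))
```

Note that a frequency of exactly 0.005 falls in the rare class and exactly 0.001 in the very rare class, matching the `<=` comparisons in the query.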
Google BigQuery with RStudio
Markdown/Knit HTML
Publish your R code with git and RPubs
https://github.com/
http://www.rpubs.com/
Data storage: $0.026 (per GB/month)
Data query: $0.005 / GB
1000 genomes data
Data storage: $0.026 (per GB/month) * 5,500 GB = $143 / month
Data query (allele frequency query): $0.005 / GB * 647 GB = about $3.2
Cost
Allele frequencies for all of the 1000 Genomes data were computed in 30 seconds.
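The cost figures follow from two multiplications, sketched here in Python. The unit prices and data sizes are the ones quoted on the slide; the constant and function names are assumptions for this example, and current BigQuery pricing may differ.

```python
# Unit prices as quoted on the slide (USD).
STORAGE_USD_PER_GB_MONTH = 0.026
QUERY_USD_PER_GB = 0.005


def monthly_storage_cost(stored_gb):
    """Storage cost in USD for one month."""
    return STORAGE_USD_PER_GB_MONTH * stored_gb


def query_cost(scanned_gb):
    """Cost in USD of one query that scans the given amount of data."""
    return QUERY_USD_PER_GB * scanned_gb


# Sizes quoted for the 1000 Genomes data: 5,500 GB stored,
# 647 GB scanned by the allele frequency query.
print(round(monthly_storage_cost(5500), 2))
print(round(query_cost(647), 3))
```

Storing 5,500 GB comes to $143 per month, and a query that scans 647 GB comes to roughly $3.2.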
Conclusion
企 讌蠍 伎螳蟾讌
譬覓殊碁 .
襷 螻殊 れ伎.
   襷 螻殊襯 伎.
