www.bina.com 
A highly efficient and scalable compute platform for massive variant annotation and rapid genome interpretation
James Warren1, Emre Colak1, Amirhossein Kiani1, Jian Li2, Aparna Chhibber2, Sanchita Bhattacharya3, Narges Bani Asadi1,2, Sharon Barr1, Atul Butte3, Garry Nolan4, Rong Chen5, Wing H. Wong6,7, and Hugo Y.K. Lam2 (corresponding author)
MOTIVATION 
After obtaining variants from next-generation sequencing data, researchers and clinicians still face the task of interpreting them. Although many public databases are available, using them together is arduous because the data are inconsistent and heterogeneous, exist in multiple versions, and come in nonstandard formats. Moreover, even after the data are aggregated and the variants annotated, identifying the causative variants for the disease in question remains laborious.
APPROACH 
We have developed a highly efficient data processing pipeline that leverages big data technologies to integrate annotations from a wide range of biological databases. The 
pipeline takes variant call sets, annotates all samples, and indexes the variants for analysis. Users can perform real-time searches and analytical queries against the 
annotated results to rapidly identify variants for further study. 
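As a rough illustration of the annotation step, the sketch below joins variants with annotations on a shared key in three phases (map, shuffle, reduce). It is a single-process stand-in written under our own assumptions, not the platform's code: the record layout, the key `(chrom, pos, ref, alt)`, and the source names `source_a` and `source_b` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: a call set and annotations from two made-up sources,
# all keyed by (chrom, pos, ref, alt).
variants = [
    {"sample": "S1", "key": ("1", 12345, "A", "G")},
    {"sample": "S1", "key": ("X", 67890, "C", "T")},
]
annotations = [
    {"source": "source_a", "key": ("1", 12345, "A", "G"), "fields": {"known_id": "id-001"}},
    {"source": "source_b", "key": ("X", 67890, "C", "T"), "fields": {"prediction": "damaging"}},
    {"source": "source_b", "key": ("2", 11111, "G", "A"), "fields": {"prediction": "benign"}},
]


def map_phase(variants, annotations):
    """Emit (key, tagged_record) pairs so both inputs share one key space."""
    for v in variants:
        yield v["key"], ("variant", v)
    for a in annotations:
        yield a["key"], ("annotation", a)


def shuffle(pairs):
    """Group emitted records by key, as a MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """For each key, attach every matching annotation to each variant."""
    for tagged in groups.values():
        merged = {}
        for tag, rec in tagged:
            if tag == "annotation":
                for name, value in rec["fields"].items():
                    # Prefix with the source so fields from different
                    # databases cannot collide.
                    merged[f'{rec["source"]}.{name}'] = value
        for tag, rec in tagged:
            if tag == "variant":
                yield {**rec, "annotations": merged}


annotated = list(reduce_phase(shuffle(map_phase(variants, annotations))))
for record in annotated:
    print(record["key"], record["annotations"])
```

Annotations whose key matches no variant (the chromosome 2 record here) are simply dropped in the reduce phase, so the output contains one record per input variant.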
CHALLENGES
• Heterogeneous data. An annotation platform must integrate diverse datasets of variants, genes, diseases, transcripts, and functional predictions.
• No standardization. The platform must account for differences between datasets, such as different reference genomes or schemas that change between versions.
• Real-time interaction. A user must be able to interact with the annotation results in real time. Such interaction allows rapid identification of relevant variants while also supporting undirected investigation.
• Contextual interface. The system cannot assume the user is familiar with the underlying data sources. It must instead support contextual queries, such as "find all predicted damaging variants of high quality associated with a given disease."
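One way to picture a contextual query is as a set of user-facing terms, each hiding a source-specific field and threshold. The sketch below is only an illustration under that assumption: the field names (`qual`, `pred_score`, `diseases`), the term names, and the cutoffs are hypothetical placeholders, not values used by the platform.

```python
# Hypothetical annotated records; field names are invented for illustration.
records = [
    {"id": "v1", "qual": 85.0, "pred_score": 0.97, "diseases": {"disease_x"}},
    {"id": "v2", "qual": 12.0, "pred_score": 0.99, "diseases": {"disease_x"}},
    {"id": "v3", "qual": 90.0, "pred_score": 0.10, "diseases": {"disease_x"}},
    {"id": "v4", "qual": 95.0, "pred_score": 0.95, "diseases": {"disease_y"}},
]

# Each user-facing term hides the underlying field and threshold.
# The thresholds are placeholders, not recommended cutoffs.
CONTEXT_TERMS = {
    "predicted_damaging": lambda r: r["pred_score"] >= 0.9,
    "high_quality": lambda r: r["qual"] >= 30.0,
}


def associated_with(disease):
    """Build a predicate that keeps records linked to the given disease."""
    return lambda r: disease in r["diseases"]


def contextual_query(records, terms, disease):
    """Return ids of records that satisfy every named term and the disease link."""
    predicates = [CONTEXT_TERMS[t] for t in terms] + [associated_with(disease)]
    return [r["id"] for r in records if all(p(r) for p in predicates)]


# "Find all predicted damaging variants of high quality associated with disease_x."
hits = contextual_query(records, ["predicted_damaging", "high_quality"], "disease_x")
print(hits)
```

The user only names the terms; which database field backs each term stays inside `CONTEXT_TERMS`, so a schema change in a source would touch one entry rather than every query.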
Affiliations
1. Department of Engineering, Bina Technologies, Redwood City, CA 94065.
2. Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065.
3. Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305.
4. Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305.
5. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029.
6. Department of Statistics, Stanford University, Stanford, CA 94305.
7. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA 94305.
Correspondence should be addressed to Hugo Y.K. Lam.
Contact Us
rd@bina.com
[Diagram label: Hadoop / Cascalog]
METHOD
During the annotation process, the pipeline:
• constructs indices that can be efficiently composed to support an effectively unlimited number of queries
• uses Hadoop MapReduce to associate variants with relevant annotations
• stores the annotated output and indices in HBase, a NoSQL database
On a 5-node Hadoop cluster, the pipeline can annotate and index a whole exome sequencing sample in 30 minutes and a whole genome sequencing sample in under an hour.
As a variant set passes through the data pipeline:
• its variants are linked with over 140 annotation classes from more than 20 databases/datasets
• annotating a sample and indexing its variants are computationally demanding steps, but these are one-time costs
After the process is complete:
• users can interact with the results via an intuitive web interface.
[Figure: pipeline overview in two stages, Pre-Computation and Real-Time Interaction. Labeled components: External Data Sources*, Genomic Variants, SnpEff, Variants with Predicted Effects, Fully Annotated Variants, Indices / Functional Filters, NoSQL Datastore (HBase), REST / API.]
* Data sources include: 1000 Genomes, Cancer Gene Census, ClinVar, dbNSFP, dbSNP, DGV, ENSEMBL, ESP, GWAS, HGMD, PGMD, PROTEOME, RefSeq, RepeatMasker, Segmental Duplications, TRANSFAC.
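The composable indices described under METHOD can be pictured as inverted indices that are built once and then combined with set operations at query time. The sketch below assumes exactly that and nothing more: it keeps the indices in memory as Python sets rather than in HBase, and the field names, values, and helper names (`build_indices`, `all_of`, `any_of`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated variants; fields and values are invented.
annotated = {
    "v1": {"effect": "missense", "chrom": "X", "in_known_db": False},
    "v2": {"effect": "synonymous", "chrom": "X", "in_known_db": True},
    "v3": {"effect": "missense", "chrom": "7", "in_known_db": True},
    "v4": {"effect": "frameshift", "chrom": "X", "in_known_db": False},
}


def build_indices(annotated):
    """One-time cost: map each (field, value) pair to the set of variant ids."""
    index = defaultdict(set)
    for variant_id, fields in annotated.items():
        for field, value in fields.items():
            index[(field, value)].add(variant_id)
    return index


def all_of(*id_sets):
    """Variants present in every given index (logical AND)."""
    return set.intersection(*id_sets)


def any_of(*id_sets):
    """Variants present in at least one given index (logical OR)."""
    return set.union(*id_sets)


index = build_indices(annotated)

# Compose small indices into a larger query without rescanning the variants.
coding_change = any_of(index[("effect", "missense")], index[("effect", "frameshift")])
result = all_of(coding_change, index[("chrom", "X")], index[("in_known_db", False)])
print(sorted(result))
```

Because every query is a combination of precomputed sets, the number of distinct queries grows with the number of ways the indices can be combined, while the indexing cost is paid only once.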
EXAMPLE APPLICATIONS
Ohtahara syndrome (early infantile epileptic encephalopathy with suppression bursts) is a rare form of epilepsy that presents in early infancy and occurs slightly more often in males. We analyzed a whole-genome-sequenced family trio with two unaffected parents and an affected son. Using the Bina Annotation Platform, we were able to filter from over 6.5 million variants in this family down to one X-linked non-synonymous variant in the gene AGTR2 potentially associated with the syndrome in the proband.
For another application of the Bina Annotation Platform, we analyzed the WGS data from the DNA of Ata [1], the skeletal remains of a 6-inch human found in the Atacama Desert, Chile. The annotation platform identified 4,000+ exonic non-synonymous SNVs, 400+ frame-shift or codon-change indels in genes previously associated with disease, and 1,000+ structural variations. Fourteen of these variants were located in genes known to be associated with dwarfism and skeletal dysplasia, of which one was not in dbSNP. These results were taken forward for further investigation [2].
CITATIONS
[1] http://news.sciencemag.org/health/2013/05/bizarre-6-inch-skeleton-shown-be-human
[2] S. Bhattacharya, J. Li, H. Lam, R. Lachman, N. Asadi, A. Butte, G. Nolan. Whole genome sequencing of mummy DNA shows significant association with human disease phenotype. Poster 2914S at ASHG 2014.
CONCLUSION AND FUTURE WORK
The Bina Annotation Platform has proven to be a powerful tool for variant interpretation in both single- and multi-sample analyses. In future releases, the platform will support additional workflows such as case-control and cohort studies, and will allow users to upload custom databases.
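A multi-sample analysis such as the family trio example can be sketched as a sequence of filters applied across the three samples. The poster does not state which filters or inheritance model were used, so everything below is an assumption made for illustration: the genotype encoding (count of alternate alleles), the filter list, and the toy variants are all hypothetical.

```python
# Hypothetical trio data: "gt" holds the number of alternate alleles per sample.
variants = [
    {"id": "v1", "chrom": "X", "effect": "non_synonymous", "gt": {"father": 0, "mother": 1, "son": 1}},
    {"id": "v2", "chrom": "X", "effect": "synonymous", "gt": {"father": 0, "mother": 1, "son": 1}},
    {"id": "v3", "chrom": "X", "effect": "non_synonymous", "gt": {"father": 1, "mother": 1, "son": 1}},
    {"id": "v4", "chrom": "3", "effect": "non_synonymous", "gt": {"father": 0, "mother": 1, "son": 1}},
    {"id": "v5", "chrom": "X", "effect": "non_synonymous", "gt": {"father": 0, "mother": 2, "son": 1}},
    {"id": "v6", "chrom": "X", "effect": "non_synonymous", "gt": {"father": 0, "mother": 1, "son": 0}},
]

# Assumed filters for an X-linked variant in an affected son of unaffected parents.
FILTERS = [
    ("X-linked", lambda v: v["chrom"] == "X"),
    ("non-synonymous", lambda v: v["effect"] == "non_synonymous"),
    ("present in affected son", lambda v: v["gt"]["son"] > 0),
    ("absent from unaffected father", lambda v: v["gt"]["father"] == 0),
    ("unaffected mother not homozygous", lambda v: v["gt"]["mother"] < 2),
]


def apply_filters(variants, filters):
    """Apply filters in order and record how many variants survive each step."""
    remaining = list(variants)
    counts = [("all variants", len(remaining))]
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append((name, len(remaining)))
    return remaining, counts


survivors, counts = apply_filters(variants, FILTERS)
for name, n in counts:
    print(f"{name}: {n}")
print([v["id"] for v in survivors])
```

Recording the count after each step makes it easy to see which filter removes the most candidates, which is useful when deciding how strict each criterion should be.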

ASHG_2014_AP