www.bina.com 
A highly efficient and scalable compute platform for massive variant annotation 
and rapid genome interpretation 
James Warren1, Emre Colak1, Amirhossein Kiani1, Jian Li2, Aparna Chhibber2, Sanchita Bhattacharya3, Narges Bani 
Asadi1,2, Sharon Barr1, Atul Butte3, Garry Nolan4, Rong Chen5, Wing H. Wong6,7, and Hugo Y.K. Lam2,†
MOTIVATION 
After obtaining variants from next-generation sequencing data, researchers and clinicians still face the task of interpreting the results. Despite the availability of a
multitude of public databases, using this collective information is an arduous task due to inconsistent and heterogeneous data, multiple versions, and nonstandard formats.
Moreover, after aggregating the data and annotating the variants, it remains a laborious exercise to identify the causative variants associated with the disease in question.
APPROACH 
We have developed a highly efficient data processing pipeline that leverages big data technologies to integrate annotations from a wide range of biological databases. The 
pipeline takes variant call sets, annotates all samples, and indexes the variants for analysis. Users can perform real-time searches and analytical queries against the 
annotated results to rapidly identify variants for further study. 
CHALLENGES
• Heterogeneous data. An annotation platform must integrate diverse datasets of variants, genes, diseases, transcripts, and functional predictions.
• No standardization. The platform must account for differences between datasets, such as different reference genomes or schemas that change between versions.
• Real-time interaction. A user must be able to interact with the annotation results in real time. Such interaction allows rapid identification of relevant variants while also supporting undirected investigation.
• Contextual interface. The system cannot assume the user is familiar with the underlying data sources. Instead, it must support contextual queries, such as "find all predicted damaging variants of high quality associated with a given disease."
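The first two challenges can be illustrated with a minimal sketch of record normalization. This is an assumption-laden illustration, not the platform's implementation: the schema names (`source_a_v1`, `source_a_v2`) and their field names are hypothetical, and the sketch only reconciles field names and chromosome naming. It does not convert coordinates between reference genome builds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantKey:
    """Common key used to match records from different sources."""
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical mappings from each source schema's field names to common names.
FIELD_MAPS = {
    "source_a_v1": {"CHROM": "chrom", "POS": "pos", "REF": "ref", "ALT": "alt"},
    "source_a_v2": {
        "chromosome": "chrom",
        "position": "pos",
        "ref_allele": "ref",
        "alt_allele": "alt",
    },
}


def normalize_chrom(name: str) -> str:
    """Map 'chr1', 'Chr1', and '1' to '1', and 'chrM' or 'M' to 'MT'."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    return "MT" if name == "M" else name


def normalize_record(record: dict, schema: str) -> VariantKey:
    """Translate one source record into the common key, ignoring unmapped fields."""
    mapping = FIELD_MAPS[schema]
    common = {mapping[k]: v for k, v in record.items() if k in mapping}
    return VariantKey(
        chrom=normalize_chrom(str(common["chrom"])),
        pos=int(common["pos"]),
        ref=str(common["ref"]).upper(),
        alt=str(common["alt"]).upper(),
    )
```

With a shared key of this kind, two versions of the same dataset that name their fields differently still resolve to the same variant, which is what lets annotations from many sources be attached to one record.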
Affiliations
1. Department of Engineering, Bina Technologies, Redwood City, CA 94065.
2. Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065.
3. Division of Systems Medicine, Department of Pediatrics, Stanford University School of
Medicine, Stanford, CA 94305.
4. Department of Microbiology and Immunology, Stanford University School of Medicine,
Stanford, CA 94305.
5. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai,
New York, NY 10029.
6. Department of Statistics, Stanford University, Stanford, CA 94305.
7. Department of Health Research and Policy, Stanford University School of Medicine,
Stanford, CA 94305.
† To whom correspondence should be addressed.
[Pipeline diagram label: Hadoop / Cascalog]
Contact Us: rd@bina.com
METHOD
During the annotation process, the pipeline:
• constructs indices that can be efficiently composed to support an effectively unlimited number of queries
• uses Hadoop MapReduce to associate variants with relevant annotations
• stores the annotated output and indices in HBase, a NoSQL database
On a 5-node Hadoop cluster, the pipeline can annotate and index a whole exome sequencing sample in 30 minutes and a whole genome sequencing sample in under an hour.
As a variant set passes through the data pipeline:
• each variant is linked with over 140 annotation classes from more than 20 databases/datasets
• annotating a sample and indexing its variants are computationally demanding steps, but these are one-time costs
After the process is complete:
• users can interact with the results via an intuitive web interface.
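The idea of composable indices can be sketched in a few lines of plain Python. This is an illustrative stand-in under stated assumptions, not the platform's code: in-memory dictionaries take the place of Hadoop MapReduce and HBase, the function names (`annotate`, `build_indices`, `query`) and the field names are hypothetical, and each field is assumed to hold a single hashable value.

```python
from collections import defaultdict


def annotate(variants, annotation_sources):
    """Attach annotation fields to each observed variant, keyed by
    (chrom, pos, ref, alt), as a reduce step would after grouping by key."""
    annotated = {key: {} for key in variants}
    for source in annotation_sources:
        for key, fields in source:  # each source yields (key, fields) pairs
            if key in annotated:  # keep annotations for observed variants only
                annotated[key].update(fields)
    return annotated


def build_indices(annotated):
    """Build an inverted index: (field, value) -> set of variant keys."""
    index = defaultdict(set)
    for key, fields in annotated.items():
        for field, value in fields.items():
            index[(field, value)].add(key)
    return index


def query(index, *terms):
    """AND the given (field, value) terms by intersecting their key sets."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# Toy data; all values are made up for illustration.
variants = [("X", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")]
predictions = [
    (("X", 100, "A", "G"), {"prediction": "damaging"}),
    (("1", 200, "C", "T"), {"prediction": "damaging"}),
    (("2", 300, "G", "A"), {"prediction": "benign"}),
]
qualities = [
    (("X", 100, "A", "G"), {"quality": "high"}),
    (("1", 200, "C", "T"), {"quality": "low"}),
    (("2", 300, "G", "A"), {"quality": "high"}),
]
diseases = [(("X", 100, "A", "G"), {"disease": "example disease"})]

index = build_indices(annotate(variants, [predictions, qualities, diseases]))
hits = query(
    index,
    ("prediction", "damaging"),
    ("quality", "high"),
    ("disease", "example disease"),
)
# hits == {("X", 100, "A", "G")}
```

Because each index entry is just a set of variant keys, building the index is the one-time cost and any combination of terms can then be answered by set intersection, which is one way a query such as "predicted damaging, high quality, associated with a given disease" could be served interactively.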
[Figure: data pipeline diagram. Stage labels: External Data Sources*, Genomic Variants, SnpEff, Variants with Predicted Effects, Fully Annotated Variants, Indices / Functional Filters, NoSQL Datastore (HBase), REST / API. Phase labels: Pre-Computation, Real-Time Interaction.]
* Data sources include: 
1000 Genomes 
Cancer Gene Census 
ClinVar 
dbNSFP 
dbSNP 
DGV 
ENSEMBL 
ESP 
GWAS 
HGMD 
PGMD 
PROTEOME 
RefSeq 
RepeatMasker 
Segmental Duplications 
TRANSFAC 
EXAMPLE APPLICATIONS
Ohtahara syndrome (early infantile epileptic encephalopathy with suppression
bursts) is a rare form of epilepsy that presents in early infancy and occurs slightly
more often in males.
We analyzed a whole-genome-sequenced family trio
with two unaffected parents and an affected son.
Using the Bina Annotation Platform, we were able to
filter more than 6.5 million variants in this family
down to one X-linked non-synonymous variant in
the gene AGTR2 that is potentially associated with the
syndrome in the proband.
For another application of the Bina Annotation Platform, we analyzed the WGS
data from the DNA of Ata [1], the skeletal remains of a 6-inch human found in the
Atacama Desert, Chile.
The annotation platform discovered 4,000+ exonic non-synonymous SNVs, 400+
frame-shift or codon-change indels in genes previously associated with disease,
and 1,000+ structural variations. Fourteen of these variants were located in genes
known to be associated with dwarfism and skeletal dysplasia, of which one was
not in dbSNP. The results were scientifically interesting and were taken up for further
investigation [2].
CONCLUSION AND FUTURE WORK
The Bina Annotation Platform has proven to be a powerful tool for variant
interpretation in both single- and multi-sample analyses. In future releases, the
platform will support additional workflows such as case-control and cohort
studies, and will allow users to upload custom databases.
CITATIONS
[1] http://news.sciencemag.org/health/2013/05/bizarre-6-inch-skeleton-shown-be-human
[2] S. Bhattacharya, J. Li, H. Lam, R. Lachman, N. Asadi, A. Butte, G. Nolan, Whole genome
sequencing of mummy DNA shows significant association with human disease phenotype.
Poster 2914S at ASHG 2014.
