ADAM is a fast, scalable genome analysis platform built on Apache Spark and data formats such as Avro and Parquet. It provides tools for read processing, variant calling, and multi-sample analysis on whole-genome, high-coverage data. The platform is designed to be easy for developers to use while leveraging existing open-source systems and deploying on both local and cloud infrastructure.
1. ADAM: Fast, Scalable
Genome Analysis
http://bigdatagenomics.github.io
Matt Massie
Twitter: @matt_massie
Email: massie@berkeley.edu
University of California, Berkeley
http://amplab.cs.berkeley.edu
2. Design
Create a platform with an easy programming environment for developers
Provide both single- and multi-sample methods that are fast and scalable for whole-genome, high-coverage data
Allow for multiple views of the same data, e.g. SQL/Table, Graph Analysis, Iterator on Records, Resilient Distributed Datasets
Leverage existing open-source systems and plug into current Big Data ecosystems
Deployable on an in-house cluster or any cloud vendor, e.g. Amazon EC2, Google Compute Engine or Microsoft Azure
Everything is a file - bulk data transfer only requires standard tools like rsync, scp, distcp, S3sync, etc.
3. Implementation
Accelerated work began in September 2013
Built using the Apache Spark execution engine and Apache Avro and Parquet for file formats
20K lines of Scala code
Nine contributors from Mt. Sinai, GenomeBridge, The Broad Institute and others
Apache-licensed open-source
4. Read Pre-Processing
Pipeline: Raw Reads → Mapping → Sorted Mapping → Local Alignment → Mark Duplicates → Base Quality Score Recalibration → Calling-Ready Reads
(Features, e.g. known variant sites, are a second input to the pipeline)
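The mark-duplicates stage of the pipeline can be sketched in a few lines. This is a minimal illustration, not ADAM's actual implementation: the field names and the grouping key are assumptions, and a real implementation must also handle mate pairs and soft clipping.

```python
# Sketch of mark duplicates: reads mapping to the same (reference, start,
# strand) are grouped, and all but the highest-quality read in each group
# are flagged as duplicates. Field names here are illustrative.
from collections import defaultdict

def mark_duplicates(reads):
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        groups[key].append(read)
    for group in groups.values():
        # Keep the read with the best mapping quality; mark the rest.
        best = max(group, key=lambda r: r["mapq"])
        for read in group:
            read["duplicate"] = read is not best
    return reads

reads = [
    {"chrom": "chr20", "start": 100, "strand": "+", "mapq": 60},
    {"chrom": "chr20", "start": 100, "strand": "+", "mapq": 30},
    {"chrom": "chr20", "start": 250, "strand": "-", "mapq": 50},
]
marked = mark_duplicates(reads)
print([r["duplicate"] for r in marked])  # → [False, True, False]
```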
5. ADAM
Read pre-processing: sort, mark dups, BQSR
Read comparison across multiple covariates
Converters between legacy and ADAM formats
Avocado - a variant caller
Distributed SNP caller
Fully configurable pipeline via a config file
Local assembler
Support for integrating aligners in M/R frameworks
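To give a flavor of what a SNP caller like Avocado's does at each position, here is a toy pileup-based call. The 20% allele-fraction threshold and the function shape are assumptions for illustration only; Avocado's real caller is statistical and far more careful.

```python
# Toy SNP call from a pileup: at one reference position, count the bases
# from all overlapping reads and report an alternate allele when the most
# common non-reference base clears an (assumed) allele-fraction threshold.
from collections import Counter

def call_snp(ref_base, pileup_bases, min_fraction=0.2):
    counts = Counter(b for b in pileup_bases if b != ref_base)
    if not counts:
        return None  # all reads agree with the reference
    alt, n = counts.most_common(1)[0]
    return alt if n / len(pileup_bases) >= min_fraction else None

print(call_snp("A", "AAAATTTA"))  # → T (3/8 of reads support T)
print(call_snp("A", "AAAAAAAT"))  # → None (1/8 is below threshold)
```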
6. Avro
Serialization system similar to Google Protobuf and Apache Thrift
Data formats are fully described with a schema
The data file format is self-describing and record-oriented
Bindings for Java, C, C++, C#, JavaScript, Python, Ruby, PHP and Perl (R in the works)
Provides schema evolution, resolution and projection
Numerous conversion utilities to print Avro as JSON, extract a schema from JAXB, or turn XSD/XML into Avro
12. Parquet
Based on the Google Dremel design
Columnar file format
Created by Twitter and Cloudera with contributions from dozens of open-source developers
Limits I/O to only the data that is needed
Fast scans - load only the columns you need, e.g. scan a read flag on a whole-genome, high-coverage file in less than a minute
Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data
Integrates easily with Avro, Hadoop, Hive, Shark, Impala, Pig, Jackson/JSON, Scrooge and others
13. Read Data Example
chrom20 TCGA
chrom20 GAAT
4M1D
chrom20 CCGAT
Projection
Predicate
4M
5M
Row Oriented
chrom20
TCGA
4M
chrom20
GAAT
4M1D
chrom20 CCGAT
5M
Column Oriented
chrom20 chrom20 chrom20
TCGA
GAAT
CCGAT
4M
4M1D
5M
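The row- versus column-oriented layouts above can be made concrete with the same three reads. This is a plain-Python analogy for what Parquet does on disk, not Parquet itself:

```python
# Row store vs. column store for the three example reads. In a columnar
# layout, projecting one field (e.g. CIGAR) reads one contiguous list and
# skips the sequence data entirely.
rows = [
    ("chrom20", "TCGA",  "4M"),
    ("chrom20", "GAAT",  "4M1D"),
    ("chrom20", "CCGAT", "5M"),
]

# Column-oriented: one list per field, like Parquet's on-disk layout.
columns = {name: list(vals)
           for name, vals in zip(("chrom", "sequence", "cigar"), zip(*rows))}

# Projecting the CIGAR column touches a single list...
print(columns["cigar"])                 # → ['4M', '4M1D', '5M']
# ...while a row store must visit every record to extract one field.
print([cigar for _, _, cigar in rows])  # → ['4M', '4M1D', '5M']
```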
15. Apache Spark
Grew out of Berkeley AMPLab research - now a top-level Apache project, commercially supported
Ease of use - Spark offers over 80 high-level operators that make it easy to build parallel apps using Scala, Java, Python or R
Easy to test code in local mode
Speed - Spark has an advanced DAG execution engine that is 10-100x faster than Hadoop M/R
Runs well on in-house clusters, Amazon EC2 and Google Compute Engine
Can be used interactively for ad-hoc analysis from the Scala, Python and R shells or via an IPython notebook
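The "high-level operators" style the slide describes can be shown in miniature with a tiny local stand-in for an RDD. This is only an analogy; a real Spark job uses `SparkContext` and distributes the data across a cluster.

```python
# A toy, purely local imitation of Spark's RDD operator chaining:
# transformations (map, filter) build new datasets, and an action
# (reduce) produces a result.
from functools import reduce

class LocalRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def filter(self, f):
        return LocalRDD(x for x in self.data if f(x))

    def reduce(self, f):
        return reduce(f, self.data)

# Count reads whose (hypothetical) SAM flag has the duplicate bit (1024) set.
flags = LocalRDD([0, 1024, 0, 1024, 1024])
dup_count = (flags
             .filter(lambda f: f & 1024)
             .map(lambda f: 1)
             .reduce(lambda a, b: a + b))
print(dup_count)  # → 3
```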
16. Performance as Proof
Sort
24
Hours
20
Mark Duplicates
BQSR
20.37
17.73
16
12
8.93
8
4
0.33 0.47 0.75
0
Picard
ADAM Single Node
ADAM 100 EC2 Nodes
1000g NA12878 Whole Genome, 60x Coverage
For comparison, Bina Technologies quotes .94 hours for
BQSR at only 37x coverage
17. Summary
Schema-driven design allows developers to think at the logical layer
Well-designed execution systems allow developers to focus on science and algorithms instead of implementation details
Modern data formats enable distributed, fast computation and easier integration
Moving computation to the data reduces transfers and improves performance
24. Hadoop Distributed File System (HDFS)
Based on GoogleFS
Single namespace across the entire cluster
Uses commodity hardware - JBOD
Files are broken into blocks (e.g. 128 MB)
Blocks are replicated for durability and performance
Write-once, read-many access pattern
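The block arithmetic behind HDFS storage is simple to sketch. The numbers below (a 1 GiB file, 128 MiB blocks, replication factor 3) are example values; block size and replication are configurable per cluster.

```python
# How HDFS lays out a file: split into fixed-size blocks, each block
# replicated across the cluster for durability and read performance.
import math

def hdfs_layout(file_bytes, block_bytes=128 * 1024 * 1024, replication=3):
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, blocks * replication

blocks, replicas = hdfs_layout(1 * 1024**3)  # a 1 GiB file
print(blocks, replicas)  # → 8 24
```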
26. $ adam
[ASCII-art ADAM banner]
Choose one of the following commands:
transform         : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
print_tags        : Prints the values and counts of all tags in a set of records
flagstat          : Print statistics on reads in an ADAM file (similar to samtools flagstat)
reads2ref         : Convert an ADAM read-oriented file to an ADAM reference-oriented file
mpileup           : Output the samtools mpileup text from ADAM reference-oriented data
print             : Print an ADAM formatted file
aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
listdict          : Print the contents of an ADAM sequence dictionary
compare           : Compare two ADAM files based on read name
compute_variants  : Compute variant data from genotypes
bam2adam          : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
adam2vcf          : Convert an ADAM variant to the VCF ADAM format
vcf2adam          : Convert a VCF file to the corresponding ADAM format
findreads         : Find reads that match particular individual or comparative criteria
fasta2adam        : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences
plugin            : Executes an AdamPlugin