Describes early work on PacMin, a ploidy aware overlap/string based approach for assembling genomes from long reads.
1 of 18
Download to read offline
More Related Content
PacMin @ AMPLab All-Hands
1. PacMin: rethinking genome
analysis with long reads
Frank Austin Nothaft, AMPLab
Joint work with Adam Bloniarz
10/14/2014
2. Note:
This talk is mostly speculative.
I.e., the methods well talk about are
partially* implemented.
This means you have an opportunity to steer the
direction of this work!
* Im being generous to myself.
3. Sequencing 101
Most sequence data today comes from Illumina
machines, which perform sequencing-by-synthesis
!
!
!
We get short (100-250 bp) reads, with high accuracy
Reads are (usually) paired
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png
4. Current Pipelines are
Reference Based
Map subsequences to a reference genome
Compute variants (diffs) against the reference
From GATK Best Practices, https://www.broadinstitute.org/gatk/guide/best-practices
5. An aside: What is the
reference genome?
Pool together n individuals, and assemble their genomes
together
A few problems:
How does the reference genome handle polymorphisms?
What about structural rearrangements?
Subpopulation specific alternate haplotypes?
It has gaps. 14 years after the first human reference
genome was released, it is still incomplete.*
* This problem is Hard.
6. The Sequencing Abstraction
It was the best of times, it was the worst of times
It was the
the best of
times, it was
worst of times
the worst of
Sample poisson distributed substrings from a
larger string
Reads are more or less unique and correct
Metaphor borrowed from Michael Schatz
best of times was the worst
7. is a leaky abstraction
We frequently encounter gaps in the sequence
Ross et al, Genome Biology 2013
8. is a leakier abstraction
We preferentially sequence from biased regions:
Ross et al, Genome Biology 2013
9. A very leaky abstraction!
Reads arent actually correct
>2% error (expect 0.1% variation)
Error probability estimates are cruddy
Reads arent actually unique
>7% of the genome is not unique (K. Curtis, SiRen)
10. The State of Analysis
Were really good at calling SNPs!
But, were still pretty bad at calling INDELs, and SVs
And were also bad at expressing diffs
Hence, SMaSH! But really, reference + diff format need to be burnt to the
ground and redesigned.
And, its slow. 2 weeks to sequence, 1 week to
analyze. Not fast enough for practical clinical use.
11. Opportunities
New read technologies are available
Provide much longer reads (250bp vs. >10kbp)
Different error model (15% INDEL errors, vs. 2%
SNP errors)
Generally, lower sequence specific bias
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
12. If long reads are available
We can use conventional methods:
Carneiro et al, Genome Biology 2012
13. But!
Why not make raw assemblies out of the reads?
Find overlapping reads Find consensus sequence
for all pairs of reads (i,j):
i j
=?
ACACTGCGACTCATCGACTC
Problems:
1. Overlapping is O(n
2
) and single evaluation is expensive anyways
2. Typical algorithms find a single consensus sequence; what if weve got
polymorphisms?
14. Fast Overlapping with
MinHashing
Wonderful realization by Berlin et al1: overlapping is
similar to document similarity problem
Use MinHashing to approximate similarity:
1: Berlin et al, bioRxiv 2014
Per document/read,
compute signature:!
!
1. Cut into shingles
2. Apply random
hashes to shingles
3. Take min over all
random hashes
Hash into buckets:!
!
Signatures of length l
can be hashed into b
buckets, so we expect
to compare all elements
with similarity
(1/b)^(b/l)
Compare:!
!
For two documents with
signatures of length l,
Jaccard similarity is
estimated by
(# equal hashes) / l
!
Easy to implement in Spark: map, groupBy, map, filter
15. Overlaps to Assemblies
Finding pairwise overlaps gives us a directed
graph between reads (lots of edges!)
16. Transitive Reduction
We can find a consensus between clique members
Or, we can reduce down:
Via two iterations of Pregel!
17. Actually Making Calls
From here, we need to call copy number per edge
Probably via Newton-Raphson based on coverage; were not sure yet.
Then, per position in each edge, call alleles:
Notes:!
Equation is from Li, Bioinformatics 2011
g = genotype state
m = ploidy
= probability allele was erroneously observed
k = number of reads observed
l = number of reads observed matching reference allele
TBD: equation assumes biallelic observations at site and reference allele; we wont have either of those conveniences
18. Output
Current assemblers emit FASTA contigs
In laypersons speak: long strings
Well emit multigs, which well map back to reference
graph
Multig = multi-allelic (polymorphic) contig
Working with UCSC, whove done some really neat work1
deriving formalisms & building software for mapping
between sequence graphs, and GA4GH ref. variation team
1. Paten et al, Mapping to a Reference Genome Structure, arXiv 2014.