PacMin: rethinking genome analysis with long reads
Frank Austin Nothaft, AMPLab
Joint work with Adam Bloniarz
10/14/2014
Note:
• This talk is mostly speculative.
• I.e., the methods we'll talk about are partially* implemented.
• This means you have an opportunity to steer the direction of this work!
* I'm being generous to myself.
Sequencing 101
• Most sequence data today comes from Illumina machines, which perform sequencing-by-synthesis
• We get short (100-250 bp) reads, with high accuracy
• Reads are (usually) paired
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png
Current Pipelines are Reference Based
• Map subsequences to a reference genome
• Compute variants (diffs) against the reference
From GATK Best Practices, https://www.broadinstitute.org/gatk/guide/best-practices
An aside: What is the reference genome?
• Pool together n individuals, and assemble their genomes together
• A few problems:
  • How does the reference genome handle polymorphisms?
  • What about structural rearrangements?
  • Subpopulation-specific alternate haplotypes?
  • It has gaps. 14 years after the first human reference genome was released, it is still incomplete.*
* This problem is Hard.
The Sequencing Abstraction
It was the best of times, it was the worst of times
Example reads sampled from the sentence: "It was the", "the best of", "best of times", "times, it was", "was the worst", "the worst of", "worst of times"
• Sample Poisson-distributed substrings from a larger string (see the sketch below)
• Reads are more or less unique and correct
Metaphor borrowed from Michael Schatz
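To make the abstraction concrete, here is a tiny sketch (not from the original deck; read length and coverage are made-up values) that samples fixed-length substrings from the sentence at roughly Poisson-distributed per-base coverage:

// Minimal sketch: sample "reads" (substrings) from a string at random start
// positions; with a fixed read count and uniform starts, per-base coverage is
// approximately Poisson. Read length and coverage are illustrative only.
import scala.util.Random

object ShotgunToy {
  def sampleReads(genome: String, readLen: Int, meanCoverage: Double,
                  rng: Random = new Random(42)): Seq[String] = {
    val numReads = math.round(meanCoverage * genome.length / readLen).toInt
    Seq.fill(numReads) {
      val start = rng.nextInt(genome.length - readLen + 1)
      genome.substring(start, start + readLen)
    }
  }

  def main(args: Array[String]): Unit = {
    val sentence = "It was the best of times, it was the worst of times"
    sampleReads(sentence, readLen = 11, meanCoverage = 3.0).foreach(println)
  }
}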
…is a leaky abstraction
• We frequently encounter gaps in the sequence
Ross et al, Genome Biology 2013
…is a leakier abstraction
• We preferentially sequence from biased regions:
Ross et al, Genome Biology 2013
A very leaky abstraction!
• Reads aren't actually correct
  • >2% error (expect 0.1% variation)
  • Error probability estimates are cruddy
• Reads aren't actually unique
  • >7% of the genome is not unique (K. Curtis, SiRen)
The State of Analysis
• We're really good at calling SNPs!
• But, we're still pretty bad at calling INDELs and SVs
• And we're also bad at expressing diffs
  • Hence, SMaSH! But really, the reference + diff format needs to be burnt to the ground and redesigned.
• And, it's slow. 2 weeks to sequence, 1 week to analyze. Not fast enough for practical clinical use.
Opportunities
• New read technologies are available
  • Provide much longer reads (>10 kbp, vs. 250 bp for short reads)
  • Different error model (~15% INDEL errors, vs. ~2% SNP errors)
  • Generally, lower sequence-specific bias
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available
• We can use conventional methods:
Carneiro et al, Genome Biology 2012
But!
• Why not make raw assemblies out of the reads?
  • Find overlapping reads: for all pairs of reads (i, j), test whether i overlaps j
  • Find a consensus sequence, e.g. ACACTGCGACTCATCGACTC
• Problems:
  1. Overlapping is O(n²), and a single overlap evaluation is expensive anyway (see the sketch below)
  2. Typical algorithms find a single consensus sequence; what if we've got polymorphisms?
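To make problem 1 concrete, here is a naive all-pairs sketch (illustrative only; the exact suffix/prefix check and minimum overlap length are assumptions, and real overlap evaluation with sequencing errors is far costlier than this):

// Naive O(n^2) overlap detection: compare every ordered pair of reads and check
// whether a suffix of one exactly matches a prefix of the other.
object NaiveOverlap {
  def overlaps(a: String, b: String, minOverlap: Int): Boolean =
    (minOverlap to math.min(a.length, b.length)).exists(len => a.endsWith(b.take(len)))

  def allPairs(reads: IndexedSeq[String], minOverlap: Int = 20): Seq[(Int, Int)] =
    for {
      i <- reads.indices
      j <- reads.indices
      if i != j && overlaps(reads(i), reads(j), minOverlap)
    } yield (i, j)
}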
Fast Overlapping with MinHashing
• Wonderful realization by Berlin et al.1: overlapping is similar to the document similarity problem
• Use MinHashing to approximate similarity:
  • Per document/read, compute a signature:
    1. Cut into shingles
    2. Apply random hashes to the shingles
    3. Take the min over all random hashes
  • Hash into buckets: signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
  • Compare: for two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
• Easy to implement in Spark: map, groupBy, map, filter (sketched below)
1: Berlin et al, bioRxiv 2014
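As a rough illustration of that map/groupBy/map/filter pattern, here is a minimal Spark (Scala) sketch. The RDD shape, k-mer length, signature length, band count, and threshold are all assumptions for illustration, not the PacMin implementation:

// MinHash signatures + LSH banding over reads, expressed as plain Spark RDD ops.
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3

object MinHashOverlapSketch {
  val k = 16        // shingle (k-mer) length -- illustrative
  val sigLen = 64   // signature length l
  val bands = 16    // number of bands b; rows per band = sigLen / bands
  val seeds: Array[Int] = (0 until sigLen).toArray

  // Per read: 1. cut into shingles, 2. apply a seeded hash per signature slot,
  // 3. take the min over all shingles for each slot.
  def signature(seq: String): Array[Int] = {
    val shingles = seq.sliding(k).toArray
    seeds.map(seed => shingles.map(sh => MurmurHash3.stringHash(sh, seed)).min)
  }

  // Jaccard similarity estimate: (# equal hashes) / l.
  def estJaccard(a: Array[Int], b: Array[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / sigLen

  // map -> groupBy (band hash) -> map (pairs in a bucket) -> filter (threshold).
  def candidatePairs(reads: RDD[(Long, String)], minSim: Double): RDD[(Long, Long)] = {
    reads.mapValues(signature)
      .flatMap { case (id, sig) =>
        sig.grouped(sigLen / bands).zipWithIndex.map { case (band, i) =>
          ((i, band.toSeq), (id, sig))
        }
      }
      .groupByKey()
      .flatMap { case (_, bucket) =>
        val members = bucket.toArray
        for {
          i <- members.indices.iterator
          j <- (i + 1) until members.length
          if estJaccard(members(i)._2, members(j)._2) >= minSim
        } yield {
          val (a, b) = (members(i)._1, members(j)._1)
          if (a < b) (a, b) else (b, a)
        }
      }
      .distinct()
  }
}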
Overlaps to Assemblies
• Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Via two iterations of Pregel! (an RDD-join stand-in is sketched below)
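The deck doesn't show the Pregel formulation, so as a stand-in here is the same idea, dropping an overlap edge u→w whenever a two-hop path u→v→w exists, expressed with plain RDD joins. Names and types are illustrative, not the PacMin code:

// Stand-in sketch for transitive reduction of the overlap graph: remove an
// edge u -> w whenever some read v gives a two-hop path u -> v -> w.
import org.apache.spark.rdd.RDD

object TransitiveReductionSketch {
  def reduce(edges: RDD[(Long, Long)]): RDD[(Long, Long)] = {
    // Key the first hop by its destination so it can meet edges leaving that node.
    val byDst = edges.map { case (u, v) => (v, u) }
    // (v, (u, w)) pairs encode two-hop paths u -> v -> w.
    val implied = byDst.join(edges)
      .map { case (_, (u, w)) => (u, w) }
      .filter { case (u, w) => u != w }
      .distinct()
    // Keep only edges that are not implied by a longer path.
    edges.subtract(implied)
  }
}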
Actually Making Calls
• From here, we need to call copy number per edge
  • Probably via Newton-Raphson based on coverage; we're not sure yet.
• Then, per position in each edge, call alleles:
Notes:
The equation is from Li, Bioinformatics 2011
g = genotype state
m = ploidy
ε = probability that an allele was erroneously observed
k = number of reads observed
l = number of reads observed matching the reference allele
TBD: the equation assumes biallelic observations at the site and a reference allele; we won't have either of those conveniences
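For reference, a sketch of the biallelic likelihood those notes describe (our reading of Li, Bioinformatics 2011, with g counted as reference alleles in an m-ploid genotype; illustrative, not the PacMin code):

// L(g) = (1/m^k) * prod over the l reference-matching reads of [g(1-eps_j) + (m-g)eps_j]
//                * prod over the remaining k-l reads of [(m-g)(1-eps_j) + g*eps_j]
def genotypeLikelihood(g: Int, m: Int, eps: Array[Double], l: Int): Double = {
  val k = eps.length
  val refTerm = (0 until l).map(j => g * (1.0 - eps(j)) + (m - g) * eps(j)).product
  val altTerm = (l until k).map(j => (m - g) * (1.0 - eps(j)) + g * eps(j)).product
  refTerm * altTerm / math.pow(m, k)
}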
Output
• Current assemblers emit FASTA contigs
  • In layperson's speak: long strings
• We'll emit multigs, which we'll map back to a reference graph
  • Multig = multi-allelic (polymorphic) contig
• Working with UCSC, who've done some really neat work1 deriving formalisms & building software for mapping between sequence graphs, and with the GA4GH ref. variation team
1. Paten et al, Mapping to a Reference Genome Structure, arXiv 2014.