
PacMin: rethinking genome
analysis with long reads
Frank Austin Nothaft, AMPLab
Joint work with Adam Bloniarz
10/14/2014

Note:
• This talk is mostly speculative.
• I.e., the methods we’ll talk about are
partially* implemented.
• This means you have an opportunity to steer the
direction of this work!
* I’m being generous to myself.

Sequencing 101
• Most sequence data today comes from Illumina
machines, which perform sequencing-by-synthesis
• We get short (100-250 bp) reads, with high accuracy
• Reads are (usually) paired
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png

Current Pipelines are
Reference Based
• Map subsequences to a “reference genome”
• Compute variants (diffs) against the reference
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices

An aside: What is the
reference genome?
• Pool together n individuals, and assemble their genomes
together
• A few problems:
• How does the reference genome handle polymorphisms?
• What about structural rearrangements?
• Subpopulation specific alternate haplotypes?
• It has gaps. 14 years after the first human reference
genome was released, it is still incomplete.*
* This problem is Hard.

The Sequencing Abstraction
[Figure: the string “It was the best of times, it was the worst of times…” with sampled reads such as “It was the”, “the best of”, “times, it was”, “worst of times”, “the worst of”, “best of times”, “was the worst”]
• Sample Poisson-distributed substrings from a larger string
• Reads are more or less unique and correct
Metaphor borrowed from Michael Schatz
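The sampling abstraction can be sketched in a few lines. This is a minimal illustration, assuming fixed-length reads whose start positions are drawn uniformly at random, which makes per-position depth approximately Poisson when reads are short relative to the string. The names `sample_reads` and `coverage` are hypothetical and not part of PacMin.

```python
import random


def sample_reads(text, read_len, n_reads, seed=0):
    """Draw n_reads fixed-length substrings with uniformly random starts."""
    rng = random.Random(seed)
    last_start = len(text) - read_len
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [(start, text[start:start + read_len]) for start in starts]


def coverage(text_len, reads):
    """Count how many sampled reads cover each position of the string."""
    depth = [0] * text_len
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth


text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, read_len=10, n_reads=20)
depth = coverage(len(text), reads)
```

Positions with zero depth in `depth` are the toy analogue of the gaps that the following slides describe.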

…is a leaky abstraction
• We frequently encounter “gaps” in the sequence
Ross et al, Genome Biology 2013

…is a leakier abstraction
• We preferentially sequence from “biased” regions:
Ross et al, Genome Biology 2013

A very leaky abstraction!
• Reads aren’t actually correct
• >2% error (expect 0.1% variation)
• Error probability estimates are cruddy
• Reads aren’t actually unique
• >7% of the genome is not unique (K. Curtis, SiRen)

The State of Analysis
• We’re really good at calling SNPs!
• But we’re still pretty bad at calling INDELs and SVs
• And we’re also bad at expressing diffs
• Hence, SMaSH! But really, reference + diff format need to be burnt to the
ground and redesigned.
• And it’s slow: 2 weeks to sequence, 1 week to
analyze. Not fast enough for practical clinical use.

Opportunities
• New read technologies are available
• Provide much longer reads (>10kbp vs. 250bp)
• Different error model… (15% INDEL errors, vs. 2%
SNP errors)
• Generally, lower sequence specific bias
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/

If long reads are available…
• We can use conventional methods:
Carneiro et al, Genome Biology 2012

But!
• Why not make raw assemblies out of the reads?
[Figure: two steps. “Find overlapping reads”: for all pairs of reads (i, j), does i overlap j? “Find consensus sequence”: …ACACTGCGACTCATCGACTC…]
• Problems:
1. Overlapping is O(n^2) and single evaluation is expensive anyways
2. Typical algorithms find a single consensus sequence; what if we’ve got
polymorphisms?
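To make the quadratic cost concrete, here is a minimal sketch of naive all-pairs overlapping. It assumes error-free reads and exact suffix/prefix matching, which real long reads do not satisfy (a real overlapper would need an alignment per pair, which is the expensive single evaluation mentioned above). The function names and the `min_len` parameter are hypothetical.

```python
from itertools import combinations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_pairs_overlaps(reads, min_len=3):
    """Check every ordered pair of reads: O(n^2) overlap evaluations."""
    overlaps = {}
    for i, j in combinations(range(len(reads)), 2):
        for src, dst in ((i, j), (j, i)):
            n = suffix_prefix_overlap(reads[src], reads[dst], min_len)
            if n:
                overlaps[(src, dst)] = n
    return overlaps


reads = ["ACACTGCG", "TGCGACTC", "ACTCATCG"]
overlaps = all_pairs_overlaps(reads)
```

Each key `(src, dst)` in `overlaps` is a directed edge from one read to a read that extends it.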

Fast Overlapping with
MinHashing
• Wonderful realization by Berlin et al.¹: overlapping is
similar to the document similarity problem
• Use MinHashing to approximate similarity:
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. For each random hash, take the min over all shingles
Hash into buckets:
Signatures of length l can be hashed into b buckets, so we expect
to compare all elements with similarity ≥ (1/b)^(b/l)
Compare:
For two documents with signatures of length l, Jaccard similarity is
estimated by (# equal hashes) / l
• Easy to implement in Spark: map, groupBy, map, filter
1: Berlin et al, bioRxiv 2014
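The three steps can be sketched in plain Python rather than Spark. This is a minimal illustration under stated assumptions: the “b buckets” are read as b bands of l/b signature entries each (the usual LSH banding scheme), the random hashes are seeded BLAKE2b digests, and the shingle size k = 4 is arbitrary. All function names are hypothetical, and reads must be at least k characters long.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(read, k=4):
    """Cut a read into its set of overlapping k-character shingles."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def seeded_hash(shingle, seed):
    """One member of a family of hash functions, selected by seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode(), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "little")


def signature(read, l=32, k=4):
    """MinHash signature: per hash function, the min over all shingles."""
    parts = shingles(read, k)
    return [min(seeded_hash(s, seed) for s in parts) for seed in range(l)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as (# equal hashes) / l."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, b=8):
    """Bucket each band of each signature; reads sharing a bucket pair up."""
    rows = len(signatures[0]) // b
    buckets = defaultdict(set)
    for read_id, sig in enumerate(signatures):
        for band in range(b):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(read_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACACTGCGACTCATCGACTC", "ACACTGCGACTCATCGACTC", "GGGGGGGGGGTTTTTTTTTT"]
sigs = [signature(r) for r in reads]
pairs = candidate_pairs(sigs)
```

With l = 32 and b = 8, the threshold (1/b)^(b/l) works out to about 0.59. In Spark terms, building the band keys is the first map, bucketing is the groupBy, emitting pairs is the second map, and keeping pairs whose `estimate_jaccard` clears a cutoff is the filter.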

Overlaps to Assemblies
• Finding pairwise overlaps gives us a directed
graph between reads (lots of edges!)

Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Via two iterations of Pregel!
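For intuition, transitive reduction on a small overlap graph can be sketched on a single machine. This is not the Pregel formulation: it assumes the graph is acyclic and small enough to search from every node, and `transitive_reduction` is a hypothetical name.

```python
def transitive_reduction(graph):
    """Drop edge u -> v whenever v is reachable through another successor of u.

    graph maps each node to the set of its direct successors and is
    assumed to be a directed acyclic graph.
    """
    def reachable(start):
        # All nodes reachable from start by following one or more edges.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = {}
    for u, succs in graph.items():
        indirect = set()
        for v in succs:
            indirect |= reachable(v)
        reduced[u] = {v for v in succs if v not in indirect}
    return reduced


# Read a overlaps b and c, and b overlaps c, so the edge a -> c is redundant.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
reduced = transitive_reduction(graph)
```

The reduced graph keeps only the edges a -> b and b -> c, which is the chain an assembler would walk.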

Actually Making Calls
• From here, we need to call copy number per edge
• Probably via Newton-Raphson based on coverage; we’re not sure yet.
• Then, per position in each edge, call alleles:
Notes:
Equation is from Li, Bioinformatics 2011
g = genotype state
m = ploidy
ϵ = probability allele was erroneously observed
k = number of reads observed
l = number of reads observed matching “reference” allele
TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
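A minimal sketch of the per-site call, assuming the likelihood takes the biallelic form L(g) = (1/m^k) · [(m − g)ϵ + gϵ′]^l · [(m − g)ϵ′ + gϵ]^(k − l) with ϵ′ = 1 − ϵ. Two further assumptions: g is read as the number of reference alleles in the genotype, and a single error rate ϵ is shared by all reads (per-read error rates would turn each power into a product). The function names are hypothetical, and the sketch inherits the biallelic and reference-allele limitations flagged in the TBD above.

```python
def genotype_likelihood(g, m, k, l, eps):
    """Likelihood of genotype g (number of reference alleles, 0..m).

    m is ploidy, k the number of reads observed, l the number of reads
    matching the reference allele, and eps the per-read error probability.
    """
    ref_term = ((m - g) * eps + g * (1 - eps)) ** l
    alt_term = ((m - g) * (1 - eps) + g * eps) ** (k - l)
    return ref_term * alt_term / m ** k


def call_genotype(m, k, l, eps):
    """Pick the maximum-likelihood genotype state."""
    return max(range(m + 1), key=lambda g: genotype_likelihood(g, m, k, l, eps))


# Diploid site with 10 reads and a 1% error rate.
hom_ref = call_genotype(m=2, k=10, l=10, eps=0.01)
het = call_genotype(m=2, k=10, l=5, eps=0.01)
hom_alt = call_genotype(m=2, k=10, l=0, eps=0.01)
```

With all reads matching the reference the call is g = 2, with an even split it is g = 1, and with no matching reads it is g = 0.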

Output
• Current assemblers emit FASTA contigs
• In layperson’s speak: long strings
• We’ll emit “multigs”, which we’ll map back to reference
graph
• Multig = multi-allelic (polymorphic) contig
• Working with UCSC, who’ve done some really neat work¹
deriving formalisms & building software for mapping
between sequence graphs, and GA4GH ref. variation team
1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.

