DNA sequencing is the process of determining the precise order of nucleotides in a DNA molecule. There are two main historical methods, both published in 1977: the Maxam-Gilbert chemical method and the Sanger dideoxy chain-termination method, which is still commonly used. Next generation sequencing uses massively parallel sequencing to produce millions of short DNA fragments at once, reducing costs significantly. However, it produces huge amounts of data, which presents challenges for storage, analysis, and interpretation.
2. DNA SEQUENCING
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule.
It includes any method or technology used to determine the order of the four bases (adenine, guanine, cytosine, and thymine) in a strand of DNA.
4. MAXAM & GILBERT METHOD
A. M. Maxam and W. Gilbert, 1977
Chemical sequencing: treatment of DNA with certain chemicals → DNA is cut into fragments → the sequence is read from the fragments
6. SANGER METHOD
The most common approach used for DNA sequencing.
Invented by Frederick Sanger, 1977 (Nobel Prize, 1980).
Also termed the chain-termination or dideoxy method.
7. SANGER METHOD
The chain termination reaction
Dideoxynucleotide triphosphates (ddNTPs) act as chain terminators: they have an H on the 3' carbon of the ribose sugar where dNTPs normally have an OH.
ssDNA + dNTPs → elongation
ssDNA + ddNTPs → elongation stops
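To make the termination step concrete, here is a minimal Python sketch (my own illustration, not part of the slides; the 5% ddNTP incorporation probability and the template string are arbitrary choices). Copying many template molecules under these conditions yields terminated fragments of many different lengths, which is what the sequencing read-out exploits.

```python
# Minimal sketch (illustrative only) of the chain-termination idea:
# dNTPs keep elongation going, but once a ddNTP (3'-H instead of 3'-OH)
# is incorporated the strand can no longer be extended.
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend(template: str, ddntp_fraction: float = 0.05) -> str:
    """Synthesize the complementary strand until a ddNTP terminates it."""
    product = []
    for base in template:
        product.append(COMPLEMENT[base])
        # With some probability a ddNTP rather than a dNTP was incorporated;
        # lacking a 3'-OH, the chain cannot be extended any further.
        if random.random() < ddntp_fraction:
            break
    return "".join(product)

random.seed(1)
template = "ATGCCGTA" * 5
# Copying many template molecules yields terminated fragments of many lengths.
lengths = sorted({len(extend(template)) for _ in range(5_000)})
print(lengths)
```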
12. Fluorescent Dyes
Fluorescent dyes are multicyclic molecules that absorb and emit fluorescent light at specific wavelengths.
Examples are fluorescein and rhodamine derivatives.
For sequencing applications, these molecules can be covalently linked to nucleotides.
13. Dye Terminator Sequencing
A distinct dye or color is used for each of the four ddNTPs.
Since the terminating nucleotides can be distinguished by color, all four reactions can be performed in a single tube.
The fragments are distinguished by size and color.
[Figure: color-coded bases A, C, G, T read from fragments separated by size]
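A minimal sketch of the read-out step, assuming the idealized case where termination produces one fragment ending at every position (the helper names and the example template are invented for illustration): sorting the fragments by size and reading each one's dye-labelled terminal base spells out the complementary sequence.

```python
# Illustrative sketch of dye-terminator read-out: each terminated fragment
# carries a colored label identifying its final base, so ordering the
# fragments by size and reading the terminal labels recovers the sequence.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def terminated_fragments(template: str) -> list[str]:
    """All possible ddNTP-terminated copies of the template (every prefix)."""
    synthesized = "".join(COMPLEMENT[b] for b in template)
    return [synthesized[: i + 1] for i in range(len(synthesized))]

def read_by_size_and_color(fragments: list[str]) -> str:
    """Sort fragments by size and read each one's dye-labelled terminal base."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

template = "ATGCCGTA"
print(read_by_size_and_color(terminated_fragments(template)))  # -> TACGGCAT
```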
18. The Human Genome Project
First draft of the human genome in 2001, final version in 2004
Estimated cost $3 billion, time 13 years
Used Sanger sequencing
Today:
Illumina: 1 week, $9,500
Exome: 6 weeks*, $1,000
Towards the $1,000 genome
19. The Human Genome Project
The draft sequence of the HGP was imperfect because of incomplete coverage of many regions and a huge number of gaps.
The IHGSC published a finished version of the human genome sequence in 2004, and the HGP was then deemed to be complete.
20. The Human Genome Project
This finished version of the genome achieved almost complete coverage of all the regions and also significantly reduced the number of gaps, to 341 from the initial hundreds of thousands.
It initiated a new era in the study of genetic variation and the functional characterization of the human genome.
21. Next (second) Generation Sequencing
New technologies allow the massive production of tens of millions of short sequencing fragments; thus it is also called massively parallel sequencing.
These techniques can be used to address problems similar to those tackled with microarrays, but also many others.
They raised the promise of personalized medicine.
22. NGS
The advent of high-throughput sequencing technologies has initiated the personal genome sequencing era for both normal and cancer genomes.
It has enabled large-scale international projects such as the 1000 Genomes Project and the International Cancer Genome Consortium.
23. NGS
NGS technologies have been on the market only since 2004.
They have now largely replaced Sanger sequencing technologies (owing to ultra-high-throughput production of hundreds of gigabases).
Their ability to simultaneously sequence millions of DNA fragments is why they are called massively parallel sequencing technologies.
24. NGS
Reduced sequencing costs significantly, making large-scale or WGS studies much more affordable
32. NGS Challenges
The highest cost is (almost) not the sequencing itself but storage and analysis.
A standard human whole genome sequencing run (30-40x coverage) creates about 100 Gb of data.
Extreme data size causes problems:
Just transferring and storing the data is difficult
Standard all-vs-all comparisons fail (N*N scaling)
Standard tools cannot be used
Think in terms of fast, parallel programs
33. Bioinformatics Challenges of NGS
Need for large amounts of CPU power:
- Informatics groups must manage compute clusters
- Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment (see the sketch after this slide)
- Another level of software complexity and challenges to interoperability
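As a hedged illustration of the parallelization point above (not a tool from the slides), the sketch below spreads an independent per-read computation over worker processes with Python's multiprocessing; the read count, read length, and number of workers are arbitrary example values. On a cluster the same split would be distributed across nodes.

```python
# Toy example: per-read computations are independent, so they parallelize
# trivially across processes (or cluster nodes).
from multiprocessing import Pool
import random

def gc_content(read: str) -> float:
    """Fraction of G/C bases in a read; stands in for any per-read task."""
    return (read.count("G") + read.count("C")) / len(read)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in for a lane of short reads (100k reads of 50 bp).
    reads = ["".join(random.choices("ACGT", k=50)) for _ in range(100_000)]
    with Pool(processes=4) as pool:
        # Each chunk of reads is processed by a separate worker process.
        gc = pool.map(gc_content, reads, chunksize=5_000)
    print(f"mean GC content: {sum(gc) / len(gc):.3f}")
```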
34. Bioinformatics Challenges of NGS
VERY large text files (~10 million lines long)
- Can't do "business as usual" with familiar tools such as Perl/Python
- Naive approaches hit impossible memory usage and execution times, and the files are impossible to browse for problems
Need sequence quality filtering (a streaming sketch follows this slide)
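One common way around the memory problem is to stream such files record by record rather than loading them whole. The sketch below is an illustrative example only: it assumes four-line FASTQ records with Phred+33 quality encoding, and the mean-quality threshold of 20 is an arbitrary choice, not a value from the slides.

```python
# Stream a FASTQ file record by record so memory use stays constant,
# keeping only reads whose mean Phred quality passes a threshold.
import io

def mean_quality(qual_line: str) -> float:
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    return sum(ord(c) - 33 for c in qual_line) / len(qual_line)

def filter_fastq(handle, min_mean_q: float = 20.0):
    """Yield (header, seq, qual) for four-line FASTQ records that pass."""
    while True:
        header = handle.readline().rstrip()
        if not header:          # end of file
            return
        seq = handle.readline().rstrip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().rstrip()
        if mean_quality(qual) >= min_mean_q:
            yield header, seq, qual

# Tiny in-memory example standing in for a multi-gigabyte file on disk.
fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"   # mean quality 40: kept
    "@read2\nACGTACGT\n+\n!!!!!!!!\n"   # mean quality 0: filtered out
)
for header, seq, qual in filter_fastq(fastq):
    print(header, seq)
```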
35. Data Management Issues
Raw data are large. How long should they be kept?
Processed data are manageable for most people:
20 million reads (50 bp) ~ 1 Gb
More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM
Certain studies are much more data intensive than others:
Whole genome sequencing
30X coverage genome pair (tumor/normal) ~ 500 GB
50 genome pairs ~ 25 TB
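A back-of-the-envelope check of the figures quoted on this slide; the one-byte-per-base estimate is a rough illustrative assumption, not a measured value.

```python
# Rough storage arithmetic matching the slide's estimates.
reads = 20_000_000                      # 20 million reads
read_len = 50                           # 50 bp each
approx_gb = reads * read_len * 1 / 1e9  # assume ~1 byte per base
print(f"{approx_gb:.1f} GB")            # ~1, consistent with the slide's ~1 Gb

pair_gb = 500                           # 30X tumor/normal genome pair (slide)
print(f"{pair_gb * 50 / 1000:.0f} TB")  # 50 pairs -> 25 TB
```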
36. Data Management
Primary data usually discarded soon after the run
Secondary and tertiary data maintained on fast-access disk during analysis, then moved to slower-access disk afterward
38. Big Collaboration
Collaborative expertise (human intelligence and intuition) is required for meaning and interpretation (Bergeron 2002)
Including on-demand communication & sharing of protocols, electronic resources, data, and findings among the stakeholders
Collaboration with other Big Data sources: national registers, BPJS, hospitals, etc.
39. Summary
Challenges:
Still expensive
Lack of infrastructure (in developing countries)
Lack of skilled personnel in bioinformatics
Need for (large-scale) collaborations
Integrating different technologies and systems
Making it all clinically relevant