Comparison of Genomic DNA to
cDNA Alignment Methods
Miguel Galves and Zanoni Dias
Institute of Computing, Unicamp, Campinas, SP, Brazil
{miguel.galves,zanoni}@ic.unicamp.br
Scylla Bioinformatics, Campinas, SP, Brazil
{miguel,zanoni}@scylla.com.br
Agenda
 Introduction
 Problem
 Aligners
 Data set
 Subsets
 Evaluation Methods
 Results: Exact Alignments
 Results: EST Alignments
 Running Time Comparison
 Conclusions
Introduction
 Identifying genes in uncharacterized DNA sequences is one of the greatest challenges in genomics
 Expressed sequence tag (EST)-to-DNA alignment is one of the most common methods
 ESTs are key to understanding the inner workings of an organism
 Humans have between 30000 and 35000 genes
 Alternative splicing plays an important role in diversity
Problem
[Figure: mRNA and mature mRNA sequences, with intron and exon segments labeled.]
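As a toy illustration of the intron/exon structure behind the problem, the sketch below removes introns from a transcript to obtain the mature mRNA. The sequence and the exon coordinates are made up for illustration and are not taken from the slide; `splice` is a hypothetical helper name.

```python
def splice(pre_mrna, exons):
    """Concatenate the exon segments, given as half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exons in upper case, introns in lower case.
pre_mrna = "CCCGGGAAAC" + "guaaguuuag" + "CCUCUCACC" + "guccag" + "CUUGG"
exons = [(0, 10), (20, 29), (35, 40)]

mature = splice(pre_mrna, exons)
print(mature)
```

Aligning the mature sequence back to the genomic one therefore requires long gaps wherever an intron was removed, which is what makes the alignment problem hard.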
Problem: How to solve it?
 Classic algorithms
 Dynamic programming
 Heuristic-based algorithms
 Multi-step
 Based on other tools such as BLAST and local alignments
Aligners
 Java implementations of global and semi-global aligners
 Affine gap penalty function
 Linear space
 Global algorithm by Miller and Myers (1988)
 Semi-global based on global algorithm
 Heuristic-based algorithms
 sim4, Spidey and est_genome
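A minimal sketch of a semi-global alignment score with an affine gap penalty, to make the classic approach concrete. It is not the authors' Java code and not the linear-space Miller and Myers algorithm: it is a plain three-state dynamic program with rolling rows that returns only the score, not the alignment. Several points are assumptions: the four-number score systems used later in the deck are read as (match, mismatch, gap open, gap extension), a gap of length k is taken to cost `gap_open + k * gap_extend`, and genomic bases left unaligned before and after the cDNA are taken to be free.

```python
NEG_INF = float("-inf")


def semi_global_score(genomic, cdna, match=1, mismatch=-2, gap_open=-1, gap_extend=0):
    """Best score aligning all of `cdna` inside `genomic` (assumed semantics)."""
    m = len(cdna)
    # Three states per cell: M = base aligned to base,
    # X = genomic base against a gap, Y = cDNA base against a gap.
    prev_m = [0] + [NEG_INF] * m
    prev_x = [NEG_INF] * (m + 1)
    # Row 0: a cDNA prefix with no genomic base is a penalized gap.
    prev_y = [NEG_INF] + [gap_open + j * gap_extend for j in range(1, m + 1)]
    best = max(prev_m[m], prev_x[m], prev_y[m])

    for i in range(1, len(genomic) + 1):
        cur_m = [NEG_INF] * (m + 1)
        cur_x = [NEG_INF] * (m + 1)
        cur_y = [NEG_INF] * (m + 1)
        cur_x[0] = 0  # leading genomic bases are skipped for free
        for j in range(1, m + 1):
            s = match if genomic[i - 1] == cdna[j - 1] else mismatch
            cur_m[j] = s + max(prev_m[j - 1], prev_x[j - 1], prev_y[j - 1])
            cur_x[j] = max(prev_m[j] + gap_open + gap_extend,
                           prev_x[j] + gap_extend,
                           prev_y[j] + gap_open + gap_extend)
            cur_y[j] = max(cur_m[j - 1] + gap_open + gap_extend,
                           cur_y[j - 1] + gap_extend,
                           cur_x[j - 1] + gap_open + gap_extend)
        # Trailing genomic bases are free: take the best score in the last column.
        best = max(best, cur_m[m], cur_x[m], cur_y[m])
        prev_m, prev_x, prev_y = cur_m, cur_x, cur_y
    return best


# Made-up example: two exons separated by an 11-base intron.
genomic = "TT" + "ACGTG" + "GTAAGTTTCAG" + "ATC" + "GG"
cdna = "ACGTG" + "ATC"
print(semi_global_score(genomic, cdna))
```

With a zero extension cost, a long intron costs the same as a short gap, so the eight matching bases and a single gap open decide the score. The sketch uses O(n*m) time and O(m) memory for the score alone; recovering the alignment itself in linear space is what the Miller and Myers approach addresses.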
Data Set
 Human genome database
 Based on FASTA and GenBank flat-format files from the NCBI repository
 Filtering criteria (entries removed)
 Genes, mRNAs and CDSs with the /pseudo tag
 mRNAs without any CDS
 Genes without any mRNA
 CDSs matching wrong patterns
 23124 genes and 27448 mRNAs stored in the database
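The first three filtering criteria can be sketched as below. The record layout (dictionaries with `pseudo`, `mrnas` and `cds` keys) is purely hypothetical and stands in for whatever the parsed GenBank records looked like; the "wrong patterns" check on CDSs is left out because the slide does not say what those patterns are.

```python
def filter_genes(genes):
    """Drop pseudo entries, mRNAs without a CDS, and genes without an mRNA."""
    kept = []
    for gene in genes:
        if gene.get("pseudo"):
            continue
        mrnas = []
        for mrna in gene.get("mrnas", []):
            if mrna.get("pseudo"):
                continue
            cds = [c for c in mrna.get("cds", []) if not c.get("pseudo")]
            if cds:  # an mRNA with no usable CDS is discarded
                mrnas.append({**mrna, "cds": cds})
        if mrnas:  # a gene with no usable mRNA is discarded
            kept.append({**gene, "mrnas": mrnas})
    return kept


# Hypothetical records, one per filtering outcome.
example = [
    {"name": "g1", "mrnas": [{"cds": [{"pseudo": False}]}]},
    {"name": "g2", "pseudo": True, "mrnas": [{"cds": [{}]}]},
    {"name": "g3", "mrnas": [{"cds": []}]},
    {"name": "g4", "mrnas": []},
]
kept = filter_genes(example)
print([g["name"] for g in kept])
```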
Subsets
 Subset 1: 66 genes from chromosome Y with fewer than 100000 bases
 Subset 2: 50 complete genes from chromosome Y with fewer than 100000 bases
 Subset 3: 8056 complete genes from all chromosomes with fewer than 100000 bases
 Subset 4: 493 artificial ESTs based on complete genes from chromosome 6 with fewer than 100000 bases
Evaluation methods
 Number of gaps introduced in the aligned
gene sequence
 Delta exons
 Base similarity percentage
 Mismatch percentage
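The slides name these four measures without defining them, so the sketch below is only one plausible reading, not the authors' definitions. It assumes an alignment is given as two equal-length gapped rows, counts gap runs in the gene row, takes the number of exons found to be the number of ungapped blocks in the cDNA row, defines delta exons as exons found minus annotated exons, and expresses similarity and mismatch as percentages of the cDNA bases. The function name and the dictionary keys are hypothetical.

```python
import re


def alignment_metrics(gene_row, cdna_row, annotated_exons):
    """Compute the four measures under the assumptions stated above."""
    if len(gene_row) != len(cdna_row):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(gene_row, cdna_row))
    matches = sum(1 for g, c in columns if g == c and g != "-")
    mismatches = sum(1 for g, c in columns if g != c and "-" not in (g, c))
    cdna_bases = sum(1 for c in cdna_row if c != "-")
    found_exons = len(re.findall(r"[^-]+", cdna_row))
    return {
        "gene_gaps": len(re.findall(r"-+", gene_row)),
        "delta_exons": found_exons - annotated_exons,
        "similarity_pct": 100.0 * matches / cdna_bases if cdna_bases else 0.0,
        "mismatch_pct": 100.0 * mismatches / cdna_bases if cdna_bases else 0.0,
    }


# Made-up alignment: two exons around an 11-base intron, no errors.
gene_row = "ACGTG" + "GTAAGTTTCAG" + "ATC"
cdna_row = "ACGTG" + "-" * 11 + "ATC"
metrics = alignment_metrics(gene_row, cdna_row, annotated_exons=2)
print(metrics)
```

Under this reading, a perfect alignment has no gaps in the gene row, a delta of zero, 100% similarity and 0% mismatches.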
Experimental method
 Two score systems, out of 15 previously defined, and an alignment strategy were chosen using subsets 1 and 2:
 Semi-global aligner
 (1,-2,-1,0) and (1,-2,-10,0) score systems
 The classic semi-global aligner was compared to sim4, Spidey and est_genome on both subsets 3 and 4
Results: Exact Alignments
Extra Gap
Strategy Avg SD %Score 0
SG(1, -2, -1, 0) 0.00 0.00 100.00%
SG(1, -2, -10, 0) 0.00 0.00 100.00%
sim4 1.11 1.63 54.56%
est_genome 16.99 21.49 27.84%
Spidey 0.15 1.39 97.43%
Results: Exact Alignments
Delta Exons
Strategy Avg SD %Score 0
SG(1, -2, -1, 0) 0.00 0.00 100.00%
SG(1, -2, -10, 0) 0.01 0.07 99.91%
sim4 -0.01 0.20 97.46%
est_genome -0.14 0.30 76.79%
Spidey -4.04 3.10 0.00%
Results: Exact Alignments
Base Similarity
Strategy Avg SD %Scr. 100%
SG(1, -2, -1, 0) 99.89% 0.49% 53.56%
SG(1, -2, -10, 0) 99.89% 0.49% 53.49%
sim4 99.39% 1.34% 22.79%
est_genome 53.83% 35.00% 18.11%
Spidey 80.34% 36.49% 44.25%
Results: Exact Alignments
Mismatch Percentage
Strategy Avg SD %Scr. 100%
SG(1, -2, -1, 0) 0.00% 0.00% 100.00%
SG(1, -2, -10, 0) 0.01% 0.03% 99.47%
sim4 0.17% 0.21% 36.68%
est_genome 1.19% 1.26% 21.55%
Spidey 0.15% 0.98% 90.65%
Results: EST Alignments
Running Time Comparison
Strategy EST-to-DNA (sec/alignment) mRNA-to-DNA (sec/alignment)
sim4 0.013 0.170
Spidey 0.066 0.140
est_genome 0.640 3.400
Semi-global 0.670 5.170
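To put the running times in proportion, the short sketch below divides each aligner's time by that of sim4. The numbers are copied from the table; the `slowdown` helper and the choice of sim4 as the baseline are illustrative.

```python
# Seconds per alignment: (EST-to-DNA, mRNA-to-DNA), as reported in the table.
times = {
    "sim4": (0.013, 0.170),
    "Spidey": (0.066, 0.140),
    "est_genome": (0.640, 3.400),
    "Semi-global": (0.670, 5.170),
}


def slowdown(times, baseline):
    """Return each aligner's time divided by the baseline's, per column."""
    base = times[baseline]
    return {name: tuple(t / b for t, b in zip(ts, base)) for name, ts in times.items()}


ratios = slowdown(times, "sim4")
for name, (est, mrna) in ratios.items():
    print(f"{name}: {est:.1f}x on ESTs, {mrna:.1f}x on mRNAs")
```

On these figures the semi-global aligner is roughly 50 times slower than sim4 on ESTs and roughly 30 times slower on mRNAs, while Spidey is slightly faster than sim4 on mRNAs.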
Conclusions
 The classic semi-global algorithm produces good results
 Running time is a problem, although it can be improved
 sim4 produces the best results among the external software tested
Thanks