DextMP: Deep dive into Text for predicting Moonlighting Proteins
Ishita K. Khan (1), Mansurul Bhuiyan (3) & Daisuke Kihara (1,2)
(1) Department of Computer Sciences, (2) Department of Biological Sciences, Purdue University, IN, USA
(3) Department of Computer Science, Indiana University-Purdue University Indianapolis, IN, USA
Bioinformatics (2017) 33 (14): i83-i91
Moonlighting Proteins
• Proteins that perform more than one mechanistically distinct, independent cellular function.
• The two distinct functions are not due to splice variants, gene fusions, or pleiotropism (the same function in different pathways).
• An ancestral protein possessed a single function but acquired an additional function over the course of evolution.
• The most common primary function of moonlighting proteins is enzymatic catalysis; secondary functions include signal transduction, transcriptional regulation, apoptosis, motility, etc.
Examples of Mechanisms of Moonlighting Proteins
[Figure: examples of moonlighting mechanisms (Jeffery CJ, TIBS 1999)]
Examples of Moonlighting Proteins
Protein | ID | # Domains | Function 1 | Function 2 | Cause
Aconitase | Q99798 | 2 | TCA cycle enzyme | Iron homeostasis | Fe concentration
Fructose-bisphosphate aldolase | Q968V9 | 1 | Glycolytic enzyme | Host-cell invasion | independent functions
Phosphopantothenoylcysteine decarboxylase subunit VHS3 | Q08438 | 1 | halotolerance determinant | coenzyme A biosynthesis | independent functions
cAMP-dependent transcription factor ATF-2 | P15336 | 1 | transcription factor | DNA damage response | radiation stress
Dihydrolipoyl dehydrogenase, mitochondrial, DLD | P09622 | 4 | energy metabolism | Protease | pH in mitochondrial matrix
Vacuolar protein-sorting-associated protein 25 | Q7JXV9 | 1 | endosomal protein sorting | bicoid mRNA | independent functions
Glutamate racemase | D3FPC2 | 1 | glutamate racemase | DNA gyrase inhibitor | independent functions
STAT3 | Q99ML3 | 0 | transcription factor | Electron transport chain | mutation and phosphorylation
Galactokinase | P09608 | 3 | galactose catabolism enzyme | Induction of galactose genes | presence of galactose
Databases of Moonlighting Proteins
• MoonProt DB (Jeffery Lab): manual curation
• Multitasking Protein DB (E. Querol et al.): from review articles; keywords from PubMed
• MoonDB (Brun Lab): human MPs; literature; network-based prediction
How to Identify Moonlighting Proteins?
• From currently available annotations (UniProt)
  - Most moonlighting proteins are not labeled with terms such as "moonlighting", "dual function", or "multitasking"
  1. Are current GO annotations useful for finding novel moonlighting proteins?
  2. By text mining?
• From large-scale omics data
  - Without GO annotations
  - Do moonlighting proteins have characteristic features in protein-protein interactions, co-expressed genes, phylogenetic profiles, genetic interactions, etc.?
GO-Based Identification Applied to the E. coli Genome
• E. coli proteins with GO term annotation: 4146 proteins
• Clustering profile
  - Moonlighting proteins (MP: 140 proteins): (1) > 8 GO terms; (2) > 2 clusters at 0.1 score; (3) > 4 clusters at 0.4 score
  - Non-moonlighting proteins (Non-MP: 150 proteins): (1) > 8 GO terms; (2) 1 cluster at 0.1 score; (3) 1 cluster at 0.4 score
• Literature survey: 43 proteins; 33 proteins with dual functions that do not originate from multiple domains
(Khan et al., Biology Direct, 2014)
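
The clustering criteria on this slide can be read as a simple decision rule. The sketch below is only an illustration of that reading: it assumes that the three conditions in each list must all hold, that the number of GO-term clusters at the 0.1 and 0.4 score cutoffs has already been computed elsewhere, and the function name and returned labels are hypothetical.

```python
def label_by_go_clusters(n_go_terms, clusters_at_01, clusters_at_04):
    """Label a protein from its GO-term count and GO-cluster counts.

    Assumption: all three slide criteria must hold together.
    Returns "MP", "non-MP", or "unassigned" (hypothetical labels).
    """
    if n_go_terms <= 8:
        # Both lists on the slide require more than 8 GO terms.
        return "unassigned"
    if clusters_at_01 > 2 and clusters_at_04 > 4:
        return "MP"
    if clusters_at_01 == 1 and clusters_at_04 == 1:
        return "non-MP"
    return "unassigned"


if __name__ == "__main__":
    print(label_by_go_clusters(12, 3, 5))
    print(label_by_go_clusters(12, 1, 1))
```

Proteins that satisfy neither list are left unassigned here, which is a design choice of this sketch and not something the slide states.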
Features Considered:
• GO annotations (GO)
• PPI network (PPI)
• gene expression profiles (GE)
• phylogenetic profiles (PE)
• genetic interactions (GI)
• disordered protein regions (DOR)
• graph properties of PPI (NET)
Dataset for DextMP
• Moonlighting proteins (MPs): from the MoonProt DB
• Non-MPs: selected by applying the non-MP criteria to human, E. coli, yeast, and mouse proteins
• Text information taken from UniProt
Khan, Bhuiyan, & Kihara, Bioinformatics (2017) 33 (14): i83-i91
The Number of Abstracts Available to MPs and non-MPs
Workflow of DextMP
[Figure: workflow with text-level prediction and protein-level prediction stages]
3 Language Models
• Bag-of-Words: Term Frequency-Inverse Document Frequency (TFIDF)
  - N-dimensional vector (N: dictionary size of a corpus)
  - TFIDF(w) = TF(w) * IDF(w)
  - TF(w) = (# of w in a text) / (total # of words in the text)
  - IDF(w), Inverse Document Frequency = log(total # of texts in the corpus / # of texts with w)
• Latent Dirichlet Allocation (LDA)
  - A text is characterized by a set of latent topics, each of which has a distribution over words
  - Dirichlet-multinomial distributions map documents to topics and topics to words
• Deep learning
  - Constructs feature vectors so that similar texts appear close together
  - DEEP: trained on texts from MPs and non-MPs
  - PDEEP: pre-trained on all texts in UniProt (1,060,520 titles and 551,056 function descriptions)
  - paragraph2vec
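
The TFIDF formulas on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the DextMP implementation: whitespace tokenization, lowercasing, and the function name `tfidf_vectors` are assumptions made here for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors): one N-dimensional TFIDF vector per text."""
    # Assumption: lowercase, whitespace-separated tokens.
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_texts = len(tokenized)

    # Number of texts that contain each word.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # IDF(w) = log(total # of texts / # of texts with w)
    idf = {w: math.log(n_texts / doc_freq[w]) for w in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # TF(w) = (# of w in the text) / (total # of words in the text)
        vectors.append(
            [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]
        )
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(
        ["enzyme binds iron", "enzyme regulates transcription"]
    )
    for word, weight in zip(vocab, vectors[0]):
        print(word, round(weight, 3))
```

A word that occurs in every text, such as "enzyme" in the toy corpus above, gets IDF = log(1) = 0 and therefore contributes nothing to the vector, which is the intended effect of the IDF term.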
Hyper-Parameter Tuning
• LDA
  - # of topics: 10-100 with a step of 10
• DEEP
  - Minimum count: 1-5
  - Convolution window size: 2-8
  - Dimension size (feature vector length): 20-200 with a step of 20
  - PDEEP used the same parameters as DEEP
• LR, SVM
  - Regularization, a cost parameter
  - Kernel (linear, radial)
• RF
  - # of trees
• GBM
  - Learning rate
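
To give a sense of the size of the search space, the language-model ranges listed on this slide can be enumerated as a grid. The sketch below assumes that the ranges are inclusive and that minimum count and window size advance in steps of 1; the dictionary keys are hypothetical names. The classifier grids are left out because the slide does not list their values.

```python
from itertools import product

# Ranges taken from the slide, assumed inclusive.
lda_topics = list(range(10, 101, 10))        # 10, 20, ..., 100
deep_min_count = list(range(1, 6))           # 1, 2, ..., 5
deep_window = list(range(2, 9))              # 2, 3, ..., 8
deep_dimension = list(range(20, 201, 20))    # 20, 40, ..., 200

# Every combination of the three DEEP hyper-parameters.
deep_grid = [
    {"min_count": m, "window": w, "dimension": d}
    for m, w, d in product(deep_min_count, deep_window, deep_dimension)
]

if __name__ == "__main__":
    print(len(lda_topics), "LDA settings")
    print(len(deep_grid), "DEEP settings")
```

Under these assumptions there are 10 LDA settings and 5 * 7 * 10 = 350 DEEP settings, each of which would then be paired with every classifier setting.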
5-Fold Cross Validation
• 3 subsets: training
  - Under a hyper-parameter set
  - Combinations of hyper-parameters from a language model & a classifier
• 1 subset: validation, to decide the best hyper-parameter set
• 1 subset: testing
• F1-score reported: average over the 5 testing sets, weighted for MPs and non-MPs
• F1 = 2(precision * recall)/(precision + recall)
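
The 3/1/1 split and the F1 formula above can be sketched as follows. Two points are assumptions of this example and are not stated on the slide: the validation fold is taken to be the fold after the test fold, and "weighted" is read as weighting each class's F1 by its number of true examples. All function names are hypothetical.

```python
def rotate_folds(n_folds=5):
    """Yield (train_folds, validation_fold, test_fold) for each rotation."""
    for test in range(n_folds):
        # Assumption: the fold after the test fold is used for validation.
        validation = (test + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, validation)]
        yield train, validation, test


def f1_for_class(y_true, y_pred, positive):
    """F1 = 2(precision * recall)/(precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in sorted(set(y_true))
    )


if __name__ == "__main__":
    for train, validation, test in rotate_folds():
        print("train", train, "validation", validation, "test", test)
    print(weighted_f1(["MP", "MP", "non-MP", "non-MP"],
                      ["MP", "non-MP", "non-MP", "non-MP"]))
```

In use, each hyper-parameter combination would be fit on the three training folds and scored on the validation fold, and only the best combination would be scored on the test fold; the five test-fold scores are then averaged.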
Text-Level Accuracy
Protein-Level Accuracy
[Figure annotation: used for genome-scale prediction]
Genome-Scale MP Prediction
MPFit: 10.97% 7.82%
Summary
• DextMP, a machine learning approach for identifying moonlighting proteins from text information, is presented.
• DextMP can help filter UniProt entries for potential moonlighting proteins, which can later be examined manually.
• Estimated fraction of moonlighting proteins in a genome:
  - Human: ~10-20% of proteins
  - Yeast: ~10-30% of proteins
  - Xenopus: ~5% of proteins
• Prediction relies on literature information, so there may be more moonlighting proteins in each genome.
Acknowledgements
Mansurul Bhuiyan
Ishita Khan
http://kiharalab.org
@kiharalab


Editor's Notes

  • #5: For P15336 (ATF2), the sources mention stress in terms of two things: ionizing radiation (IR) and UV-induced lesions. They also mention that the second function (DNA damage response) is due to ATF2's association with a member of a chromatin remodeling complex. "Independent functions" are the ones for which no "switch" between the two functions has been identified. In most cases, the second function was found accidentally (termed "serendipitous" in the literature), and the papers claim that the two functions are independent.