DextMP: Deep dive into Text for predicting Moonlighting Proteins
Ishita K. Khan (1), Mansurul Bhuiyan (3) & Daisuke Kihara (1,2)
(1) Department of Computer Sciences, (2) Department of Biological Sciences, Purdue University, IN, USA
(3) Department of Computer Science, Indiana University-Purdue University Indianapolis, IN, USA
Bioinformatics (2017) 33 (14): i83-i91
Moonlighting Proteins
 Proteins that are involved in more than one mechanistically different, independent cellular function.
 The two distinct functions are not due to splice variants, gene fusions, or pleiotropism (same function in different pathways).
 An ancestral protein possessed a single function but developed an additional function through the course of evolution.
 The most common primary function of moonlighting proteins is enzymatic catalysis; secondary functions include signal transduction, transcriptional regulation, apoptosis, motility, etc.
Examples of Mechanisms of Moonlighting Proteins (Jeffery CJ, TIBS 1999)
Examples of Moonlighting Proteins
Protein | ID | # Domains | Function 1 | Function 2 | Cause
Aconitase | Q99798 | 2 | TCA cycle enzyme | Iron homeostasis | Fe concentration
fructose-bisphosphate aldolase | Q968V9 | 1 | Glycolytic enzyme | Host-cell invasion | independent functions
Phosphopantothenoylcysteine decarboxylase subunit VHS3 | Q08438 | 1 | halotolerance determinant | coenzyme A biosynthesis | independent functions
cAMP-dependent transcription factor ATF-2 | P15336 | 1 | transcription factor | DNA damage response | radiation stress
Dihydrolipoyl dehydrogenase, mitochondrial, DLD | P09622 | 4 | energy metabolism | Protease | pH in mitochondrial matrix
Vacuolar protein-sorting-associated protein 25 | Q7JXV9 | 1 | endosomal protein sorting | bicoid mRNA | independent functions
glutamate racemase | D3FPC2 | 1 | glutamate racemase | DNA gyrase inhibitor | independent functions
STAT3 | Q99ML3 | 0 | transcription factor | Electron transport chain | mutation and phosphorylation
galactokinase | P09608 | 3 | galactose catabolism enzyme | Induction of galactose genes | presence of galactose
Databases of Moonlighting Proteins
 MOONPROT DB: Jeffery Lab; manual curation
 Multitasking Protein DB: E. Querol et al.; from review articles; keywords from PubMed
 MOON DB: Brun Lab; human MPs; literature; network-based prediction
How to Identify Moonlighting Proteins?
 From currently available annotations (UniProt)
   Most moonlighting proteins are not labeled with terms such as "moonlighting", "dual function", or "multitasking"
   1. Are current GO annotations useful for finding novel moonlighting proteins?
   2. By text mining?
 From large-scale omics data
   Without GO annotations
   Do moonlighting proteins have any characteristic features in protein-protein interactions, co-expressed genes, phylogenetic profiles, genetic interactions, etc.?
GO-Based Identification Applied to the E. coli Genome
(Khan et al., Biology Direct, 2014)
E. coli proteins with GO term annotation: 4146 proteins
Clustering Profile
 MP: 140 proteins
 Non-MP: 150 proteins
Moonlighting Proteins
 1. > 8 GO terms
 2. > 2 Clusters at 0.1 Score
 3. > 4 Clusters at 0.4 Score
Non-Moonlighting Proteins
 1. > 8 GO terms
 2. 1 Cluster at 0.1 Score
 3. 1 Cluster at 0.4 Score
Literature Survey
43 proteins
33 proteins
Dual functions that do not originate from multiple domains
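The clustering criteria on this slide can be read as a simple rule. The sketch below is only an illustration: the function name and its three inputs are hypothetical, the ">" signs are taken literally as strict inequalities, and the computation of the GO term clusters themselves is not shown.

```python
def classify_by_go_clusters(n_go_terms, n_clusters_at_01, n_clusters_at_04):
    """Return 'MP', 'non-MP', or None using the thresholds listed on the slide.

    All three arguments are hypothetical inputs: the number of GO terms of a
    protein and the number of GO term clusters found at scores 0.1 and 0.4.
    """
    if n_go_terms <= 8:
        return None  # both groups require more than 8 GO terms
    if n_clusters_at_01 > 2 and n_clusters_at_04 > 4:
        return "MP"
    if n_clusters_at_01 == 1 and n_clusters_at_04 == 1:
        return "non-MP"
    return None  # the protein falls between the two selection criteria


print(classify_by_go_clusters(12, 3, 5))
print(classify_by_go_clusters(12, 1, 1))
```

Proteins that satisfy neither rule are left unlabeled here, which matches the slide only in the sense that it lists criteria for the two groups and nothing else.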
Features Considered:
 GO annotations (GO)
 PPI network (PPI)
 gene expression profiles (GE)
 phylogenetic profiles (PE)
 genetic interactions (GI)
 disordered protein regions (DOR)
 graph properties of PPI (NET)
Dataset for DextMP
 Moonlighting Proteins (MPs): from the MoonProt DB
 Non-MPs: selected by applying the non-MP criteria to human, E. coli, yeast, and mouse proteins
 Text information taken from UniProt
Khan, Bhuiyan, & Kihara, Bioinformatics (2017) 33 (14): i83-i91
The Number of Abstracts Available for MPs and non-MPs
Workflow of DextMP
 Text-Level Prediction
 Protein-Level Prediction
3 Language Models
 Bag-of-Words: Term Frequency-Inverse Document Frequency (TFIDF)
   N-dimensional vector (N: dictionary size of a corpus)
   TFIDF(w) = TF(w) * IDF(w)
   TF(w) = (# of w in a text)/(total # of words in the text)
   IDF(w), Inverse Document Frequency = log((total # of texts in the corpus)/(# of texts with w))
 Latent Dirichlet Allocation (LDA)
   A text is characterized by a set of latent topics, each of which has a distribution over words
   Dirichlet-multinomial distributions map documents to topics and topics to words
 Deep learning
   Constructs feature vectors so that similar texts appear close together
   DEEP: trained on texts from MPs and non-MPs
   PDEEP: pre-trained on all texts in UniProt: 1,060,520 titles and 551,056 function descriptions
   paragraph2vec
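The TFIDF definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the DextMP implementation: the whitespace tokenizer, the function name `tfidf_vectors`, and the three toy texts are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return the vocabulary and one TFIDF vector per text.

    TF(w)  = (# of w in a text) / (total # of words in the text)
    IDF(w) = log(total # of texts / # of texts with w)
    """
    tokenized = [t.lower().split() for t in texts]  # assumed tokenizer
    vocab = sorted({w for words in tokenized for w in words})
    n_texts = len(tokenized)
    # Number of texts that contain each word at least once.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words) or 1  # avoid division by zero for an empty text
        vectors.append([
            (counts[w] / total) * math.log(n_texts / doc_freq[w])
            for w in vocab
        ])
    return vocab, vectors


# Toy texts, made up for illustration only.
texts = [
    "aconitase is a tca cycle enzyme",
    "aconitase regulates iron homeostasis",
    "galactokinase is a galactose catabolism enzyme",
]
vocab, vectors = tfidf_vectors(texts)
print(len(vocab), len(vectors))
```

Each text becomes an N-dimensional vector, where N is the dictionary size of the corpus. A word that occurs in every text gets IDF = log(1) = 0 and therefore contributes nothing to any vector.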
Hyper-Parameter Tuning
 LDA
   # of topics: 10-100 with a step of 10
 DEEP
   Minimum count: 1-5
   Convolution window size: 2-8
   Dimension size (feature vector length): 20-200 with a step of 20
 PDEEP used the same parameters as DEEP
 LR, SVM
   Regularization, a cost parameter
   kernel (linear, radial)
 RF
   # of trees
 GBM
   Learning rate
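To give a sense of the size of the search space, the ranges listed for LDA and DEEP can be enumerated as below. This is a sketch under assumptions: the dictionary keys and the helper `expand_grid` are hypothetical names, and the classifier parameters are left out because the slide gives no ranges for them.

```python
from itertools import product

# Ranges as listed on the slide.
lda_topics = list(range(10, 101, 10))        # 10-100 with a step of 10
deep_grid = {
    "min_count": list(range(1, 6)),          # 1-5
    "window": list(range(2, 9)),             # 2-8
    "dimension": list(range(20, 201, 20)),   # 20-200 with a step of 20
}


def expand_grid(grid):
    """Return every combination of values in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


deep_settings = expand_grid(deep_grid)
print(len(lda_topics), len(deep_settings))
```

Under these ranges there are 10 LDA settings and 5 x 7 x 10 = 350 DEEP settings, each of which would then be paired with the settings of a classifier.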
5-Fold Cross Validation
 3 subsets: training
   Under a hyper-parameter set
   Combinations of hyper-parameters from a language model & a classifier
 1 subset: validation, to decide the best hyper-parameter set
 1 subset: testing
 F1-score reported: averaged over the 5 testing sets and weighted for MPs and non-MPs
   F1 = 2(precision * recall)/(precision + recall)
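The 3/1/1 split and the F1 formula can be sketched as follows. Two points are assumptions rather than statements from the slides: the way the validation subset rotates (here, the subset after the testing one), and the reading of "weighted" as a per-class F1 averaged with weights proportional to class size. All function names are hypothetical.

```python
def fold_roles(n_folds=5):
    """Assign subsets to roles for each round: 1 testing, 1 validation, rest training."""
    rounds = []
    for i in range(n_folds):
        test = i
        validation = (i + 1) % n_folds  # assumed rotation scheme
        training = [f for f in range(n_folds) if f not in (test, validation)]
        rounds.append({"training": training, "validation": validation, "testing": test})
    return rounds


def f1(precision, recall):
    """F1 = 2(precision * recall)/(precision + recall), with 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred, labels=("MP", "non-MP")):
    """Per-class F1 averaged with weights proportional to class size (assumed weighting)."""
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score += ((tp + fn) / len(pairs)) * f1(precision, recall)
    return score


# Toy labels, made up for illustration only.
y_true = ["MP", "MP", "non-MP", "non-MP"]
y_pred = ["MP", "non-MP", "non-MP", "non-MP"]
print(fold_roles()[0], weighted_f1(y_true, y_pred))
```

In each round the best hyper-parameter set would be chosen on the validation subset, and the weighted F1 on the testing subset would be recorded; the reported value is the average over the five rounds.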
Text-Level Accuracy
Protein-Level Accuracy
 Used for genome-scale prediction
Genome-Scale MP Prediction
MPFit: 10.97% 7.82%
Summary
 DextMP, a machine learning approach for identifying moonlighting proteins from text information, is presented.
 DextMP can help filter UniProt entries for potential moonlighting proteins, which can later be examined manually.
 Estimated moonlighting proteins in a genome:
   Human: ~10-20% of proteins
   Yeast: ~10-30% of proteins
   Xenopus: ~5% of proteins
 Prediction relies on literature information, so there may be more moonlighting proteins in each genome.
Acknowledgements
http://kiharalab.org
@kiharalab
Mansurul Bhuiyan
Ishita Khan

Editor's Notes

  • #5: For P15336 (ATF2), the papers describe stress in terms of two things: ionizing radiation (IR) and UV-induced lesions. They also mention that the second function (DNA damage response) is due in part to ATF2's association with a member of a chromatin remodeling complex. "Independent functions" are the ones for which no "switch" between the two functions has been identified. In most cases, the second function was found accidentally (termed 'serendipitous' in the literature), and the papers claim that the two functions are independent.