
DextMP: Deep dive into Text for
predicting Moonlighting Proteins
Ishita K. Khan (1), Mansurul Bhuiyan (3) & Daisuke Kihara (1,2)
(1) Department of Computer Sciences, (2) Department of Biological Sciences, Purdue University, IN, USA
(3) Department of Computer Science, Indiana University-Purdue University Indianapolis, IN, USA
Bioinformatics (2017) 33 (14): i83-i91

Moonlighting Proteins
• Proteins that perform more than one mechanistically distinct, independent cellular function.
• The distinct functions are not due to splice variants, gene fusions, or pleiotropism (the same function acting in different pathways).
• The ancestral protein had a single function and acquired an additional function in the course of evolution.
• The most common primary function is enzymatic catalysis; secondary functions include signal transduction, transcriptional regulation, apoptosis, motility, etc.

Examples of Mechanisms of Moonlighting Proteins
(Jeffery CJ, TIBS 1999)

Examples of Moonlighting Proteins
Protein | ID | # Domains | Function 1 | Function 2 | Cause
Aconitase | Q99798 | 2 | TCA cycle enzyme | Iron homeostasis | Fe concentration
Fructose-bisphosphate aldolase | Q968V9 | 1 | Glycolytic enzyme | Host-cell invasion | Independent functions
Phosphopantothenoylcysteine decarboxylase subunit VHS3 | Q08438 | 1 | Halotolerance determinant | Coenzyme A biosynthesis | Independent functions
cAMP-dependent transcription factor ATF-2 | P15336 | 1 | Transcription factor | DNA damage response | Radiation stress
Dihydrolipoyl dehydrogenase, mitochondrial (DLD) | P09622 | 4 | Energy metabolism | Protease | pH in mitochondrial matrix
Vacuolar protein-sorting-associated protein 25 | Q7JXV9 | 1 | Endosomal protein sorting | bicoid mRNA | Independent functions
Glutamate racemase | D3FPC2 | 1 | Glutamate racemase | DNA gyrase inhibitor | Independent functions
STAT3 | Q99ML3 | 0 | Transcription factor | Electron transport chain | Mutation and phosphorylation
Galactokinase | P09608 | 3 | Galactose catabolism enzyme | Induction of galactose genes | Presence of galactose

Databases of Moonlighting Proteins
• MoonProt DB (Jeffery Lab): manual curation
• Multitasking Protein DB (E. Querol et al.): from review articles; keywords from PubMed
• MoonDB (Brun Lab): human MPs; literature; network-based prediction

How to Identify Moonlighting Proteins?
• From currently available annotations (UniProt)
  - Most moonlighting proteins are not labeled with terms such as “moonlighting”, “dual function”, or “multitasking”
    1. Are current GO annotations useful for finding novel moonlighting proteins?
    2. Can text mining find them?
• From large-scale omics data
  - Without GO annotations
  - Do moonlighting proteins have characteristic features in protein-protein interactions, co-expressed genes, phylogenetic profiles, genetic interactions, etc.?

GO-Based Identification Applied to
the E. coli Genome
(Khan et al., Biology Direct, 2014)
E. coli proteins with GO term annotation: 4146 proteins
Clustering profile:
• Moonlighting proteins (MP): 140 proteins
  1. > 8 GO terms
  2. > 2 clusters at 0.1 score
  3. > 4 clusters at 0.4 score
• Non-moonlighting proteins (non-MP): 150 proteins
  1. > 8 GO terms
  2. 1 cluster at 0.1 score
  3. 1 cluster at 0.4 score
Literature survey: 43 proteins
33 proteins: dual functions that do not originate from multiple domains
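The selection criteria on this slide can be read as a simple rule over three counts per protein. The sketch below is only an illustration of that reading: the function name and arguments are hypothetical, and it assumes the number of GO-term clusters at each score threshold has already been computed elsewhere (the clustering itself is not shown on the slide).

```python
def classify_by_go_clusters(n_go_terms, n_clusters_at_0_1, n_clusters_at_0_4):
    """Apply the slide's thresholds; return 'MP', 'non-MP', or None.

    Hypothetical helper: the inputs are counts assumed to come from a
    separate GO-term clustering step that is not reproduced here.
    """
    # Both classes require more than 8 GO terms.
    if n_go_terms <= 8:
        return None
    # MP: more than 2 clusters at score 0.1 and more than 4 at score 0.4.
    if n_clusters_at_0_1 > 2 and n_clusters_at_0_4 > 4:
        return "MP"
    # Non-MP: exactly one cluster at both score thresholds.
    if n_clusters_at_0_1 == 1 and n_clusters_at_0_4 == 1:
        return "non-MP"
    # Everything else is left unlabeled.
    return None
```

Proteins that satisfy neither rule are left unlabeled in this sketch, which matches the slide's numbers in spirit (140 MPs and 150 non-MPs out of 4146 annotated proteins).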

Features Considered:
• GO annotations (GO)
• PPI network (PPI)
• gene expression profiles (GE)
• phylogenetic profiles (PE)
• genetic interactions (GI)
• disordered protein regions (DOR)
• graph properties of PPI (NET)

Dataset for DextMP
• Moonlighting Proteins (MPs): from the MoonProt DB
• Non-MPs: selected by applying the non-MP criteria to human, E. coli, yeast, and mouse proteins
• Text information taken from UniProt
Khan, Bhuiyan, & Kihara, Bioinformatics (2017) 33 (14): i83-i91

The Number of Abstracts Available for MPs and non-MPs

Workflow of DextMP
[Figure: workflow stages labeled "Text Level Prediction" and "Protein Level Prediction"]

3 Language Models
• Bag-of-Words: Term Frequency-Inverse Document Frequency (TFIDF)
  - N-dimensional vector (N: dictionary size of a corpus)
  - TFIDF(w) = TF(w) * IDF(w)
  - TF(w) = (# of occurrences of w in a text) / (total # of words in the text)
  - IDF(w), Inverse Document Frequency = log((total # of texts in the corpus) / (# of texts containing w))
• Latent Dirichlet Allocation (LDA)
  - A text is characterized by a set of latent topics, each of which is a distribution over words
  - Dirichlet-multinomial distributions map documents to topics and topics to words
• Deep learning
  - Constructs feature vectors so that similar texts appear close together
  - DEEP: trained on texts from MPs and non-MPs
  - PDEEP: pre-trained on all texts in UniProt (1,060,520 titles and 551,056 function descriptions)
  - paragraph2vec
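As a worked illustration of the TFIDF definitions above, here is a minimal sketch using only the Python standard library. The function name, whitespace tokenization, and lowercasing are assumptions made for the example; they are not taken from the DextMP implementation.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with one N-dimensional vector per text."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_texts = len(tokenized)

    # Number of texts that contain each word at least once.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    # IDF(w) = log(total # of texts / # of texts containing w)
    idf = {word: math.log(n_texts / doc_freq[word]) for word in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        # TFIDF(w) = TF(w) * IDF(w), with TF(w) = count of w / words in the text.
        vectors.append(
            [counts[word] / total * idf[word] if total else 0.0 for word in vocab]
        )
    return vocab, vectors
```

With this definition, a word that occurs in every text gets IDF = log(1) = 0, so it contributes nothing to any vector; words specific to a few texts carry the weight.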

Hyper-Parameter Tuning
• LDA
  - # of topics: 10-100 with a step of 10
• DEEP
  - Minimum count: 1-5
  - Convolution window size: 2-8
  - Dimension size (feature vector length): 20-200 with a step of 20
• PDEEP used the same parameters as DEEP
• LR, SVM
  - Regularization, a cost parameter
  - Kernel (linear, radial)
• RF
  - # of trees
• GBM
  - Learning rate
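The language-model ranges listed above can be enumerated as explicit grids. This is a sketch under stated assumptions: the ranges are treated as inclusive, minimum count and window size are assumed to step by 1, and the dictionary keys are hypothetical names. The classifier grids are omitted because the slide does not give their values.

```python
from itertools import product

# LDA: number of topics from 10 to 100 in steps of 10 (assumed inclusive).
lda_grid = [{"n_topics": k} for k in range(10, 101, 10)]

# DEEP: minimum count 1-5, window size 2-8 (assumed step of 1),
# feature vector length 20-200 in steps of 20.
deep_grid = [
    {"min_count": m, "window": w, "dim": d}
    for m, w, d in product(range(1, 6), range(2, 9), range(20, 201, 20))
]

print(len(lda_grid), len(deep_grid))
```

Under these assumptions there are 10 LDA settings and 5 * 7 * 10 = 350 DEEP settings, each of which would then be paired with the classifier hyper-parameters.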

5-Fold Cross Validation
• 3 subsets: training
  - Models are trained under each hyper-parameter set
  - A hyper-parameter set combines the hyper-parameters of a language model and a classifier
• 1 subset: validation, to decide the best hyper-parameter set
• 1 subset: testing
• F1-score reported: averaged over the 5 testing sets, weighted for MPs and non-MPs
• F1 = 2(precision * recall)/(precision + recall)
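The 3/1/1 split and the F1 formula above can be sketched as follows. Two points are assumptions rather than details from the slide: which fold serves as validation in each rotation, and that "weighted" means a support-weighted average of the per-class F1 scores. All function names are hypothetical.

```python
def rotate_folds(n_folds=5):
    """Yield (train_folds, validation_fold, test_fold) for each rotation."""
    for i in range(n_folds):
        test = i
        # Assumption: the fold after the test fold is used for validation.
        val = (i + 1) % n_folds
        train = [j for j in range(n_folds) if j not in (test, val)]
        yield train, val, test


def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred, labels=("MP", "non-MP")):
    """Per-class F1 averaged with weights proportional to class size."""
    total = len(y_true)
    if total == 0:
        return 0.0
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Support (tp + fn) is the number of true members of this class.
        score += (tp + fn) / total * f1(precision, recall)
    return score
```

The reported number would then be the mean of `weighted_f1` over the five test folds, each evaluated with the hyper-parameter set chosen on that rotation's validation fold.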

Protein-Level Accuracy
[Figure annotation: "Used for genome-scale prediction"]

Genome-Scale MP Prediction
MPFit: 10.97% 7.82%

Summary
• DextMP, a machine learning approach for identifying moonlighting proteins from text information, is presented.
• DextMP can help filter UniProt entries for potential moonlighting proteins, which can then be examined manually.
• Estimated moonlighting proteins in a genome:
  - Human: ~10-20% of proteins
  - Yeast: ~10-30% of proteins
  - Xenopus: ~5% of proteins
• Prediction relies on literature information, so there may be more moonlighting proteins in each genome.

Acknowledgements
Mansurul Bhuiyan
Ishita Khan
http://kiharalab.org
@kiharalab
