Slides presented at ISMB 2017 in Prague on "DextMP: deep dive in text for predicting moonlighting proteins" by Ishita K. Khan, Mansurul Bhuiyan, & Daisuke Kihara. ISMB Proceedings talk, published in Bioinformatics: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx231
DextMP: Text mining for finding moonlighting proteins
1. DextMP: Deep dive into Text for
predicting Moonlighting Proteins
Ishita K. Khan¹, Mansurul Bhuiyan³ & Daisuke Kihara¹,²
¹Department of Computer Sciences, ²Department of Biological Sciences, Purdue University, IN, USA
³Department of Computer Science, Indiana University-Purdue University Indianapolis, IN, USA
Bioinformatics (2017) 33 (14): i83-i91
2. Moonlighting Proteins
Proteins that perform more than one mechanistically different, independent cellular function.
The two distinct functions are not due to splice variants, gene fusions, or pleiotropism (same function in different pathways).
An ancestral protein possessed a single function but developed an additional function through the course of evolution.
The most common primary moonlighting function is enzymatic catalysis; secondary functions include signal transduction, transcriptional regulation, apoptosis, motility, etc.
4. Examples of Moonlighting Proteins
Protein | ID | # Domains | Function 1 | Function 2 | Cause
Aconitase | Q99798 | 2 | TCA cycle enzyme | Iron homeostasis | Fe concentration
fructose-bisphosphate aldolase | Q968V9 | 1 | Glycolytic enzyme | Host-cell invasion | independent functions
Phosphopantothenoylcysteine decarboxylase subunit VHS3 | Q08438 | 1 | halotolerance determinant | coenzyme A biosynthesis | independent functions
cAMP-dependent transcription factor ATF-2 | P15336 | 1 | transcription factor | DNA damage response | radiation stress
Dihydrolipoyl dehydrogenase, mitochondrial, DLD | P09622 | 4 | energy metabolism | Protease | pH in mitochondrial matrix
Vacuolar protein-sorting-associated protein 25 | Q7JXV9 | 1 | endosomal protein sorting | bicoid mRNA | independent functions
glutamate racemase | D3FPC2 | 1 | glutamate racemase | DNA gyrase inhibitor | independent functions
STAT3 | Q99ML3 | 0 | transcription factor | Electron transport chain | mutation and phosphorylation
galactokinase | P09608 | 3 | galactose catabolism enzyme | Induction of galactose genes | presence of galactose
5. Databases of Moonlighting Proteins
MOONPROT DB: Jeffery Lab; manual curation
Multitasking Protein DB: E. Querol et al.; from review articles; keywords from PubMed
MOON DB: Brun Lab; human MPs; literature; network-based prediction
6. How to Identify Moonlighting Proteins?
From currently available annotations (UniProt)
Most moonlighting proteins are not labeled with terms such as moonlighting, dual function, or multitasking
1. Are current GO annotations useful for finding novel moonlighting proteins?
2. By text mining?
From large-scale omics data
Without GO annotations
Do moonlighting proteins have any characteristic features in protein-protein interactions, co-expressed genes, phylogenetic profiles, genetic interactions, etc.?
7. GO-Based Identification Applied to
the E. coli Genome
E. coli proteins with GO term annotation: 4146 proteins
Clustering Profile
MP: 140 proteins
Non-MP: 150 proteins
Moonlighting Proteins
1. > 8 GO terms
2. > 2 Clusters at 0.1 Score
3. > 4 Clusters at 0.4 Score
Non-Moonlighting Proteins
1. > 8 GO terms
2. 1 Cluster at 0.1 Score
3. 1 Cluster at 0.4 Score
Literature Survey: 43 proteins
(Khan et al., Biology Direct, 2014)
33 proteins: dual functions that do not originate from multiple domains
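The GO-based selection criteria on this slide can be read as a simple rule over three counts per protein. The sketch below is one hedged reading, not the authors' code: the function name `label_by_go_clusters` and its arguments are hypothetical, the three conditions in each group are assumed to be combined with AND, and the number of GO-term clusters at each score cutoff is assumed to be computed elsewhere.

```python
def label_by_go_clusters(n_go_terms, n_clusters_at_01, n_clusters_at_04):
    """Label a protein from its GO-term count and GO-cluster counts.

    Hypothetical helper: returns "MP", "non-MP", or None when the
    protein satisfies neither set of criteria from the slide.
    """
    # Both groups require more than 8 GO terms.
    if n_go_terms <= 8:
        return None
    # Moonlighting candidate: > 2 clusters at 0.1 and > 4 clusters at 0.4.
    if n_clusters_at_01 > 2 and n_clusters_at_04 > 4:
        return "MP"
    # Non-moonlighting candidate: a single cluster at both cutoffs.
    if n_clusters_at_01 == 1 and n_clusters_at_04 == 1:
        return "non-MP"
    return None
```

Proteins that fall between the two rule sets are left unlabeled (`None`) in this sketch, which matches the slide only insofar as the slide defines no label for them.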
8. Features Considered:
GO annotations (GO)
PPI network (PPI)
gene expression profiles (GE)
phylogenetic profiles (PE)
genetic interactions (GI)
disordered protein regions (DOR)
graph properties of PPI (NET)
9. Dataset for DextMP
Moonlighting Proteins (MPs): from the MoonProt DB
Non-MPs: selected by applying the GO-based non-MP criteria to human, E. coli, yeast, and mouse proteins
Text information taken from UniProt
Khan, Bhuiyan, & Kihara, Bioinformatics (2017) 33 (14): i83-i91
10. The Number of Abstracts Available to
MPs and non-MPs
12. 3 Language Models
Bag-of-Words: Term Frequency-Inverse Document Frequency (TFIDF)
N-dimensional vector (N: dictionary size of a corpus)
TFIDF(w) = TF(w) * IDF(w)
TF(w) = (# of w in a text) / (total # of words in the text)
IDF(w), Inverse Document Frequency = log(total # of texts in the corpus / # of texts with w)
Latent Dirichlet Allocation (LDA)
A text is characterized by a set of latent topics, which have a distribution of
words
Dirichlet multinomial distributions for mapping documents to topics, topics to
words
Deep learning
Constructs feature vectors so that similar texts appear close to each other
DEEP: texts were from MPs and non-MPs
PDEEP: pre-trained on all texts in UniProt: 1,060,520 titles and 551,056 function descriptions
paragraph2vec
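As a small illustration of the TFIDF definition on this slide, the sketch below builds one N-dimensional vector per text using only the Python standard library. It is not the DextMP implementation: the function name `tfidf_vectors` is hypothetical, tokenization is assumed to be lowercase whitespace splitting, and the logarithm is assumed to be natural, since the slide does not state a base.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with one TF-IDF vector per text."""
    # Assumed tokenization: lowercase, split on whitespace.
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each word.
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        row = []
        for w in vocab:
            tf = counts[w] / total if total else 0.0
            idf = math.log(n_docs / df[w])
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors


vocab, vectors = tfidf_vectors(["enzyme binds iron", "enzyme enzyme kinase"])
```

A word that occurs in every text (here `enzyme`) gets an IDF of log(1) = 0 and therefore contributes nothing to any vector, which is the intended down-weighting of uninformative terms.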
13. Hyper-Parameter Tuning
LDA
# of topics: 10-100 with a step of 10
DEEP
Minimum count: 1-5
Convolution window size: 2-8
Dimension size (feature vector length): 20-200 with a step of
20
PDEEP used the same parameters as DEEP
LR, SVM
Regularization, a cost parameter
kernel (linear, radial)
RF
# of trees
GBM
Learning rate
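The language-model ranges listed on this slide can be enumerated as a grid. The sketch below is an assumption-laden illustration: the dictionary keys and the helper `expand` are hypothetical names, range endpoints are assumed to be inclusive, and a step of 1 is assumed wherever the slide gives none. Classifier hyper-parameters are omitted because the slide does not list their values.

```python
from itertools import product

# LDA: number of topics, 10-100 with a step of 10 (endpoints assumed inclusive).
lda_topics = list(range(10, 101, 10))

# DEEP: ranges from the slide; a step of 1 is assumed for the first two.
deep_grid = {
    "min_count": list(range(1, 6)),    # 1-5
    "window": list(range(2, 9)),       # 2-8
    "dim": list(range(20, 201, 20)),   # 20-200 with a step of 20
}


def expand(grid):
    """Return every combination of values in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


deep_settings = expand(deep_grid)
```

Under these assumptions the DEEP grid has 5 x 7 x 10 = 350 settings and the LDA grid has 10, before being combined with any classifier settings.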
14. 5-Fold Cross Validation
3 subsets: training
Under a hyper-parameter set
Combinations of hyper-parameters from a
language-model & a classifier
1 subset: validation, to decide the best
hyper-parameter set
1 subset: testing
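The 3/1/1 split can be sketched as a rotation over five folds. This is a minimal illustration, not the authors' procedure: the function name `three_one_one_splits` is hypothetical, items are assigned to folds round-robin by index, and the validation fold is assumed to be the one that follows the testing fold.

```python
def three_one_one_splits(n_items, n_folds=5):
    """Yield (train, validation, test) index lists, one triple per round."""
    # Assumed fold assignment: item i goes to fold i % n_folds.
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for i in range(n_folds):
        val_fold = (i + 1) % n_folds  # assumed choice of validation fold
        test = folds[i]
        val = folds[val_fold]
        train = [idx
                 for j, fold in enumerate(folds)
                 if j not in (i, val_fold)
                 for idx in fold]
        yield train, val, test


splits = list(three_one_one_splits(10))
```

In each round the hyper-parameter set would be chosen on the validation indices and the score reported on the testing indices, so every item is used for testing exactly once across the five rounds.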
F1-score reported. Average over 5 testing
sets. Weighted for MPs and non-MPs.
F1 = 2(precision * recall)/(precision + recall)
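The F1 formula and the class weighting can be illustrated as follows. The sketch assumes that "weighted" means each class's F1 is weighted by that class's share of the true labels; the slide does not spell out the weighting, and the function names are hypothetical.

```python
def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred, labels=("MP", "non-MP")):
    """Average per-class F1, weighted by class frequency in y_true (assumed)."""
    total = len(y_true)
    if total == 0:
        return 0.0
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # tp + fn is the number of true examples of this class.
        score += (tp + fn) / total * f1(precision, recall)
    return score


y_true = ["MP", "MP", "non-MP", "non-MP"]
y_pred = ["MP", "non-MP", "non-MP", "non-MP"]
example_score = weighted_f1(y_true, y_pred)
```

The reported figure would then be the mean of this score over the five testing sets.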
19. Summary
DextMP, a machine learning approach for identifying moonlighting proteins from text information, is presented.
DextMP can help filter UniProt entries for potential moonlighting proteins, which can later be examined manually.
Estimated moonlighting proteins in a genome:
Human: ~10-20% of proteins
Yeast: ~10-30% of proteins
Xenopus: ~5% of proteins
Prediction relies on literature information, thus there may be more moonlighting proteins in each genome
#5: For P15336 (ATF2), they mentioned stress in terms of two things: ionizing radiation (IR) and UV-induced lesions. In addition, they mentioned that the second function (DNA damage response) is also due to ATF2's association with a member of a chromatin remodeling complex.
"Independent functions" are the ones that don't have a "switch" identified between the two functions. In most cases, the second function was found accidentally (termed 'serendipitous' in the literature), and the papers claim that the two functions are independent.