The document summarizes a research paper on inferring gene regulatory networks using only positive examples. It discusses the challenges of the "positive only" problem and different approaches to address it, including negative selection heuristics based on network motifs. Specifically, it proposes considering connections not present in known network motifs as potential negative examples to improve supervised learning performance when labeled negative examples are unavailable.
Learning gene regulations with only positive examples
1. On Learning Gene
Regulatory Networks with
Only Positive Examples
Luigi Cerulo, University of Sannio, Italy
Michele Ceccarelli, University of Sannio, Italy
Tuesday, October 12, 2010
2. Outline
Supervised inference of gene regulatory
networks
The positive only problem
Negative selection approaches
Effect on prediction accuracy
Conclusions and future directions
3. Gene Regulatory
Network (GRN)
The network of transcriptional dependencies among the genes of an organism, mediated by transcription factors (TFs) and their binding sites.
[Diagram: a TF protein produced by gene A binds near gene B.]
4. Gene Regulatory
Network (GRN)
A gene regulatory network can be represented as a graph G = (Vertices, Edges), where Vertices = genes and Edges = interactions.
[Diagram: example network over genes G1-G8.]
5. Inference of Gene
regulatory networks
[Diagram: the network over G1-G8, where each gene Gi is described by a data vector Gi = {e1, e2, e3, ..., en}.]
7. GRN
Supervised Inference
Part of the network is known in advance from public databases (e.g. RegulonDB).
[Diagram: network over G1-G8 with the known edges highlighted.]
10. SIRENE approach
trains an SVM classifier for each gene and predicts which genes are regulated by that gene
combines all predicted regulations to obtain the full regulatory network
[Diagram: one per-gene prediction step at a time over G1-G8, combined into the full network.]
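The scheme above can be sketched in code. This is a minimal illustration, not the actual SIRENE implementation: one SVM per regulator, trained on expression profiles, with its decision values used as edge confidence scores. All function and variable names here are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def sirene_predict(expr, known_targets, genes):
    """expr: dict gene -> 1-D expression profile;
    known_targets: dict regulator -> set of genes it is known to regulate."""
    network = {}
    for regulator, targets in known_targets.items():
        X = np.array([expr[g] for g in genes])
        y = np.array([1 if g in targets else 0 for g in genes])
        clf = SVC(kernel="linear").fit(X, y)
        # decision_function gives a confidence score for each candidate edge
        scores = clf.decision_function(X)
        network[regulator] = dict(zip(genes, scores))
    return network
```

Combining the per-regulator score dictionaries yields a ranked list of candidate edges for the whole network.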
11. [Figure: ROC curves (ratio of true positives vs. ratio of false positives) and precision-recall curves for CLR, SIRENE, and SIRENEBias.]
Method Recall at 60% of Precision Recall at 80% of Precision
SIRENE 44.5% 17.6%
CLR 7.5% 5.5%
Relevance networks 4.7% 3.3%
ARACNe 1% 0%
Bayesian network 1% 0%
Compared with unsupervised methods (Mordelet and Vert, 2008)
12. [Figure: true positives among the top N predictions, supervised (SIRENE) vs. unsupervised (ARACNE).]
prediction of new c-Myc regulations
True positives are validated with IPA
(www.ingenuity.com)
22. Supervised learning of
gene regulatory networks
[Diagram: network with known regulations labeled +1; for each remaining candidate edge the question is: is this a negative example?]
23. Training set
[Diagram: training examples split into labeled positives P and unlabeled examples, the latter a mix of unknown positives Q and true negatives N.]
% of known positives = |P| / |P ∪ Q|
24. [Figure: AUROC vs. % of known positives.]
Effect of PU-learning
E.coli dataset [J.J. Faith et al., 2007]
28. Reliable negative
selection in text mining
B. Liu et al., Building Text Classifiers Using Positive and Unlabeled Examples, in ICDM 2003
Yu et al., PEBL: Positive Example Based Learning for Web Page Classification Using SVM, in KDD 2002
Denis et al., Text Classification from Positive and Unlabeled Examples, in IPMU 2002
29. Methods based on
reliable negative selection
[Diagram: a negative selection heuristic maps the original training set (labeled P; unlabeled Q and N) to a new training set of positives P and reliable negatives RN.]
30. Quality of RN
RN could be contaminated with positives embedded in the unlabeled data.
The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives |Q|.
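The contamination measure just defined is straightforward to compute when ground truth is known, as in the experiments below. A minimal helper (names are mine, for illustration):

```python
def positive_contamination(rn, q):
    """rn: set of examples a heuristic selected as reliable negatives;
    q: set of unlabeled examples that are in fact positive."""
    return len(rn & q) / len(q) if q else 0.0

# Example: two of four unknown positives ended up in RN.
print(positive_contamination({1, 2, 3}, {2, 3, 4, 5}))  # -> 0.5
```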
31. [Figure: AUROC vs. % of known positives, for positive contamination = 0 and positive contamination = 1 (PU-learning).]
Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]
32. [Figure: F-Measure vs. % of known positives, for positive contamination = 0 and positive contamination = 1 (PU-learning).]
Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]
34. Network motifs
Network motifs are small connected subnetworks that occur in a network with significantly higher or lower frequency than would be expected by chance.
[Diagram: two example motifs rooted at a node A.]
35. B. Goemann, E. Wingender, and A. P. Potapov, An approach to evaluate the topological significance of motifs and other patterns in regulatory networks. BMC Systems Biology, vol. 3, no. 53, May 2009.
S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Network motifs in the transcriptional regulation network of Escherichia coli, Nature Genetics, vol. 31, no. 1, pp. 64-68, May 2002.
36. Network Motifs
Heuristic
For each three-gene subnetwork T:
if T matches a network motif M, then
consider all connections not present in M
as negatives
[Diagram: candidate edges among genes A, B, C.]
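The heuristic above can be sketched as follows, a rough illustration rather than the authors' code. A motif is encoded as a set of directed edges over placeholder nodes 0, 1, 2; the feed-forward loop serves as the example motif, and all names are assumptions of this sketch.

```python
from itertools import combinations, permutations

FEED_FORWARD_LOOP = {(0, 1), (0, 2), (1, 2)}  # example motif

def motif_negatives(edges, genes, motifs=(FEED_FORWARD_LOOP,)):
    edges = set(edges)
    negatives = set()
    for triple in combinations(genes, 3):
        for perm in permutations(triple):        # every assignment of genes to motif nodes
            mapping = dict(enumerate(perm))
            for motif in motifs:
                concrete = {(mapping[a], mapping[b]) for a, b in motif}
                if concrete <= edges:            # the triple matches the motif
                    all_pairs = {(u, v) for u in perm for v in perm if u != v}
                    negatives |= all_pairs - concrete
    return negatives - edges                     # never contradict known positives
```

On the toy network A→B, A→C, B→C this marks the three reversed edges (B,A), (C,A), (C,B) as reliable negatives.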
40. [Figure: AUROC vs. % of known positives, for positive contamination = 0 and positive contamination = 1 (PU-learning).]
MOTIF selection performance
E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]
41. [Figure: the same AUROC plot with the MOTIF selection curve added.]
MOTIF selection performance
E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]
42. [Figure: F-Measure vs. % of known positives, for positive contamination = 0 and positive contamination = 1 (PU-learning).]
Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]
43. [Figure: the same F-Measure plot with the MOTIF selection curve added.]
Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]
44. Scale free networks
Albert-László Barabási and Zoltán N. Oltvai
Network biology: understanding the cell's functional organization
Nature Reviews Genetics 5, 101-113 (2004)
45. Hierarchical networks
Hong-Wu Ma, Jan Buer, and An-Ping Zeng
Hierarchical structure and modules in the Escherichia coli transcriptional
regulatory network revealed by a new top-down approach
BMC Bioinformatics 2004 5:199
46. Experimental data
445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E.coli [J.J. Faith et al., 2007]
Data were standardized (i.e. zero mean, unit standard deviation)
Regulations extracted from RegulonDB (v. 5) between 154 transcription factors and 1211 genes
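The standardization step described above amounts to per-gene z-scoring of the expression matrix. A minimal sketch (the guard against constant profiles is my addition, not stated in the slides):

```python
import numpy as np

def standardize(expr):
    """expr: (genes x arrays) matrix; returns a row-standardized copy
    with zero mean and unit standard deviation per gene."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # guard: leave constant profiles at zero
    return (expr - mean) / std
```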
47. Summary and
conclusions
Learning gene regulations is affected by the problem of learning from positive-only data, at least for E.coli
The study of positive contamination shows that there is room for new heuristics
Topology-based heuristics (e.g. motifs) have shown promising results
Open issues arise in higher organisms, where gene interactions are more complex
48. Re weighting strategy
(PosOnly)
Cerulo et al. BMC Bioinformatics 2010, 11:228
http://www.biomedcentral.com/1471-2105/11/228
RESEARCH ARTICLE - Open Access
Learning gene regulatory networks from only positive and unlabeled data
Luigi Cerulo, Charles Elkan and Michele Ceccarelli
Abstract
Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.
Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement.
49. PosOnly: How it works
[Diagram: labeled examples (s=1) are the positives P; unlabeled examples (s=0) are the unknown positives Q and the negatives N.]
Let x be a random example:
s=1 iff x is labeled, s=0 iff x is unlabeled
y=1 iff x is positive, y=0 iff x is negative
If s=1 then y=1 (the converse is not always true!)
p(s=1|x,y=0) = 0
50. PosOnly: How it works
The goal is to learn a classifier such that f(x) = p(y=1|x).
It is easy to see that (Elkan and Noto, 2008):
f(x) = p(s=1|x) / p(s=1|y=1)
     = p(s=1 and y=1|x) / p(s=1|y=1)
     = p(s=1|y=1,x) p(y=1|x) / p(s=1|y=1)
     = p(y=1|x)
where the last step uses the assumption that labeled positives are selected completely at random, i.e. p(s=1|y=1,x) = p(s=1|y=1).
51. PosOnly: How it works
f(x) = p(s=1|x) / p(s=1|y=1)
p(s=1|x): a binary classifier trained with labeled and unlabeled examples
p(s=1|y=1): an unknown constant, estimated empirically in a number of ways
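A minimal sketch of this re-weighting on synthetic data: train a probabilistic classifier g(x) ≈ p(s=1|x) to separate labeled from unlabeled examples, estimate the constant c = p(s=1|y=1) as the mean of g over held-out labeled positives (one of the estimators suggested by Elkan and Noto, 2008), and rescale. The logistic-regression classifier and all names here are assumptions of this sketch, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def posonly_scores(X, s, X_holdout_pos):
    """X: examples; s: 1 if labeled (hence positive), 0 if unlabeled;
    X_holdout_pos: held-out labeled positives used to estimate c."""
    g = LogisticRegression().fit(X, s)                 # g(x) ~ p(s=1|x)
    c = g.predict_proba(X_holdout_pos)[:, 1].mean()    # c ~ p(s=1|y=1)
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)  # ~ p(y=1|x)
```

The rescaled scores can then be thresholded or ranked exactly like the output of a classifier trained with real negatives.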
52. Results: experimental data
[Figure: mean F-Measure vs. % of known positives.]