Boruta is a feature selection algorithm built around random forests. For each feature it generates a shadow feature by permutation and compares the importance of the original feature to the maximum importance among the shadow features; a feature counts as important in a run if it exceeds that maximum. The process repeats iteratively until every feature is either confirmed as important or rejected as unimportant (or an iteration limit is reached). The presentation works through an example of applying Boruta to detect k-mers in DNA sequences that are indicative of aptamers, and finds that it performs well at selecting biologically meaningful features.
1. Boruta A System for Feature Selection
Boruta A System for Feature Selection
Miron B. Kursa, Aleksander Jankowski, Witold R. Rudnicki
Brandon Sherman
December 1, 2016
2. Boruta A System for Feature Selection
Background
Outline
1 Background
2 Boruta
3 Example: Detecting Aptamer Sequences
4 Questions?
3. Boruta A System for Feature Selection
Background
Red deer; Białowieża Forest - Podlaskie Voivodeship, Poland
4. Boruta A System for Feature Selection
Background
How do random forests work?
A random forest is a machine learning classifier that works as follows:
1 For a dataset (X, Y) with observations {x_1, x_2, . . . , x_N} and responses {y_1, y_2, . . . , y_N} (y_i ∈ {0, 1}), bootstrap B datasets of size N.
Bootstrapping: for a dataset (X, Y) with N observations, generate another dataset (X_b, Y_b) with N observations by sampling from (X, Y) with replacement.
2 For b = 1, . . . , B, train a decision tree on dataset (X_b, Y_b).
3 Predict a new observation X_new by running it through every decision tree and taking the majority vote over all of the trees.
4 Calculate the out-of-bag error, which is the mean prediction error of each decision tree's predictions on the observations not included in its bootstrap sample.
5 Calculate the importance of feature X_j by randomly permuting X_j (yielding X_j^(s)), evaluating the fitted trees on the permuted data, and comparing the resulting out-of-bag error to the out-of-bag error for the original X_j (this permute-and-compare idea is sketched in R below).
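A minimal sketch of steps 4 and 5, assuming the randomForest R package and a made-up toy data frame dat (the data, column names, and ntree value are assumptions, not from the slides); type = 1 importance is the package's permutation-based mean decrease in accuracy, which corresponds to the permute-and-compare-OOB idea described above:

```r
## Toy data and model; dat, its columns, and ntree = 500 are assumptions.
library(randomForest)

set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- factor(ifelse(dat$x1 + rnorm(200) > 0, 1, 0))

## importance = TRUE computes permutation importance over the out-of-bag
## samples (loss of accuracy when X_j is permuted).
rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)

rf$err.rate[rf$ntree, "OOB"]   # out-of-bag error of the whole forest
importance(rf, type = 1)       # type 1 = mean decrease in accuracy per feature
```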
5. Boruta A System for Feature Selection
Background
Problems with Feature Importances
"Feature importances can't tell you the emotional story. They can give you the exact mathematical design, but what's missing is the eyebrows." - Frank Zappa (heavily paraphrased)
6. Boruta A System for Feature Selection
Background
Problems with Feature Importances
A vague, nebulous notion of "how important is this feature?"
Feature importances are relative, so there is no sense of how important a feature is on its own.
No idea how many features are needed.
Is a feature with small importance unimportant, or just slightly important?
The Breiman assumption, which states that feature importance is normally distributed, is false.
7. Boruta A System for Feature Selection
Boruta
Outline
1 Background
2 Boruta
3 Example: Detecting Aptamer Sequences
4 Questions?
8. Boruta A System for Feature Selection
Boruta
Boruta (Slavic forest spirit); artist's depiction
9. Boruta A System for Feature Selection
Boruta
A Quick Diversion
Fire stockbroker Dan Aykroyd and hire untrained homeless man Eddie Murphy.
Can Eddie Murphy be as good a stockbroker as Dan Aykroyd?
If so, then there's nothing inherent about Dan Aykroyd that makes him a good stockbroker.
10. Boruta A System for Feature Selection
Boruta
Boruta Algorithm
For each feature X_j, randomly permute it to generate a shadow feature (random attribute) X_j^(s).
Fit a random forest to the original and the shadow features.
Calculate feature importances for the original and shadow features.
A feature is important for a single run if its importance is higher than the maximum importance of all shadow features (MIRA, the maximal importance of the random attributes); one such run is sketched in R below.
Eliminate all features whose importances across the runs are significantly lower than would be expected by chance. Confirm all features whose importances across the runs are significantly higher.
Repeat from the beginning with all remaining tentative features.
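To make the run-level comparison concrete, here is a rough R sketch of one such run, assuming the randomForest package, a predictor data frame X, and a factor response y (X, y, and the helper name one_boruta_run are hypothetical; the Boruta package implements a refined version of this internally):

```r
## One Boruta-style run: build shadows, fit a forest, compare importances.
## X (data frame of predictors) and y (factor response) are assumed to exist.
library(randomForest)

one_boruta_run <- function(X, y) {
  ## Shadow features: each original column, independently permuted.
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))

  rf  <- randomForest(cbind(X, shadows), y, importance = TRUE)
  imp <- importance(rf, type = 1)[, 1]     # permutation importances, named

  mira <- max(imp[names(shadows)])         # maximal importance of random attributes
  imp[names(X)] > mira                     # TRUE = a hit for this run
}

## hits_this_run <- one_boruta_run(X, y)   # hypothetical usage
```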
11. Boruta A System for Feature Selection
Boruta
Boruta: Now with more math! (Part 1)
Iterate the following procedure N times for all original features {X_1, . . . , X_p}:
1 Create a random forest from the original features and newly generated shadow features.
2 Calculate every Imp(X_j) and MIRA.
3 If a particular Imp(X_j) > MIRA, then increment the hit count H_j and call X_j important for that run.
12. Boruta A System for Feature Selection
Boruta
Boruta: Now with more math! (Part 2)
Once {H_1, . . . , H_p} have been calculated:
1 Perform the statistical test H_0: H_i = E(H) vs. H_1: H_i ≠ E(H).
Because hits follow a binomial distribution, under H_0 approximately H_i ∼ N(0.5N, (√(0.25N))²), i.e. mean 0.5N and variance 0.25N; this hit-count test is sketched in R below.
2 If H_i is significantly greater than E(H), then we say the feature is important.
3 If H_i is significantly lower than E(H), then we say the feature is unimportant.
4 Finish the procedure after some number of iterations, or once all features have been rejected or deemed important. Otherwise, repeat the procedure from the beginning on all tentative features.
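A base-R sketch of the resulting hit-count decision, assuming hits holds the H_j values accumulated over N runs (hits, N, and decide are hypothetical names; the Boruta package applies a two-sided, multiplicity-corrected version of this test internally):

```r
## Under H0 a feature beats the best shadow by chance, so H_j ~ Binomial(N, 0.5).
N <- 100                                               # number of runs (assumed)
## hits <- rowSums(replicate(N, one_boruta_run(X, y))) # using the earlier sketch

decide <- function(H, N, alpha = 0.01) {
  p_hi <- pbinom(H - 1, N, 0.5, lower.tail = FALSE)    # P(hits >= H) under H0
  p_lo <- pbinom(H, N, 0.5)                            # P(hits <= H) under H0
  ifelse(p_hi < alpha, "Confirmed",
         ifelse(p_lo < alpha, "Rejected", "Tentative"))
}

## decide(c(ATG = 97, GGC = 12, TTA = 55), N)   # -> "Confirmed" "Rejected" "Tentative"
```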
13. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
Outline
1 Background
2 Boruta
3 Example: Detecting Aptamer Sequences
4 Questions?
14. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
The Big Question
Which DNA sequences are indicative of aptamers?
15. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
16. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
The Aptamer Problem
An aptamer is an RNA or DNA chain that strongly binds to
various molecular targets.
An aptamer is represented as a genetic sequence that contains
certain k-mers.
A, GC, TGA, and AGAC are a 1-, 2-, 3-, and 4-mer,
respectively.
Each row of the dataset corresponds to one sequence: its features record, for each of p possible k-mers, whether that k-mer is present in the sequence (marked 1) or absent (marked 0), and the response records whether or not the sequence is an aptamer (building such a matrix is sketched in R below).
Very few sequences in the dataset (small n).
Many possible k-mers (large p).
How do we know which k-mers make up aptamers?
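As an illustration of this encoding, here is a hypothetical base-R helper that turns a character vector of DNA sequences into a 0/1 k-mer presence/absence matrix (the function name kmer_matrix and the toy sequences are assumptions, not from the slides):

```r
## Build a presence/absence matrix over all 4^k possible DNA k-mers.
kmer_matrix <- function(seqs, k = 3) {
  kmers <- apply(expand.grid(rep(list(c("A", "C", "G", "T")), k)),
                 1, paste0, collapse = "")
  ## 1 if the k-mer occurs anywhere in the sequence, 0 otherwise.
  m <- sapply(kmers, function(km) as.integer(grepl(km, seqs, fixed = TRUE)))
  rownames(m) <- seqs
  m
}

seqs <- c("ATGCGT", "GGCATA")     # toy sequences
kmer_matrix(seqs, k = 2)[, 1:6]   # first few of the 16 possible 2-mers
```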
17. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
Boruta and the Aptamer Problem
For a dataset consisting of n genetic sequences, p 3-, 4-, and
5-mers, and whether or not the sequences make up an aptamer:
1 Create a shadow feature for each of the p k-mers.
2 Run Boruta on the combined dataset.
3 Build a random forest on all k-mers selected by Boruta.
4 Calculate the out-of-bag (OOB) error of the new random forest model; 30% is the maximum acceptable OOB error (this pipeline is sketched in R below).
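A sketch of this pipeline, assuming the Boruta and randomForest R packages, a data frame X of 0/1 k-mer indicators (for instance from the hypothetical kmer_matrix() helper sketched earlier), and a factor y marking aptamer vs. non-aptamer; note that Boruta() generates and manages the shadow attributes itself, so step 1 does not have to be done by hand:

```r
## Boruta feature selection, then a random forest on the selected k-mers.
## X (data frame of 0/1 k-mer indicators) and y (factor) are assumed to exist.
library(Boruta)
library(randomForest)

set.seed(1)
bor <- Boruta(X, y, maxRuns = 100)            # shadow features handled internally
selected <- getSelectedAttributes(bor, withTentative = FALSE)

rf  <- randomForest(X[, selected, drop = FALSE], y)
oob <- rf$err.rate[rf$ntree, "OOB"]           # OOB error of the reduced model
oob <= 0.30                                   # the 30% acceptance threshold
```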
18. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
Example: ATP Binding Sites
See 2-Boruta.R
19. Boruta A System for Feature Selection
Example: Detecting Aptamer Sequences
Aptamer Problem Results
Out of 23 genetic sequence datasets:
2 had OOB error greater than 30% and didn't select any features.
1 had its OOB error increase from 11% to 38% after Boruta.
20 had an average out-of-bag error of 11% and, on average, selected 18 out of 1170 k-mers.
The k-mers selected were known to be important based on
past biological knowledge.
Boruta does pretty darn well!
20. Boruta A System for Feature Selection
Questions?
Outline
1 Background
2 Boruta
3 Example: Detecting Aptamer Sequences
4 Questions?
21. Boruta A System for Feature Selection
Questions?
Any questions?