Feature selection (FS) is essential for the analysis of datasets with millions of features. In such a context, Big Data tools are paramount, but standard ML models are of limited use for data with such low instance-to-feature ratios. Since Apache Spark 2.0 is unable to cope with our dataset, which contains approximately 0.74 million features, we propose here a pipeline that solves this problem using both vertical and horizontal partitioning strategies.
Big Data Analytics for Obesity Prediction
Ahsan Bilal (ahsan.bilal@est.fib.upc.edu)
Alfredo Vellido (avellido@cs.upc.edu)
Vicent Ribas (vicent.ribas@eurecat.org)
INTRODUCTION
Conclusion
• Vertical and horizontal data partitioning techniques enabled us to interpret the data seamlessly.
• Through feature engineering and FS-based dimensionality reduction (DR), the original bulk of 736,990 SNPs was reduced to an extremely lean selection of 3,040 SNPs.
• The pipeline provides quite accurate obesity prediction:
  • 0.965 AUC for females
  • 0.971 AUC for males
Existing Solution
Previous solutions rely solely on a single node.
• The existing pipeline cannot scale to process the potential future influx of much larger datasets.
Challenges
The data consist of:
• 4,988 patients (samples)
• 736,990 SNPs (variants)
A scalable solution is required to process upcoming larger data in a distributed environment on multiple nodes.
Obesity is a chronic disease determined by genetic and environmental factors.
• An estimated 112,000 excess deaths per year in the U.S. alone.
• The U.S. spends approx. $190 billion per year.
Obesity puts individuals at risk of suffering more than 30 chronic health conditions, including:
• Type 2 diabetes
• Heart diseases
• High blood pressure
• High cholesterol
• Birth defects
• Miscarriages
• Numerous cancers
Objectives
The objectives of this paper are three-fold and can be briefly
summarized as follows:
• Design and implementation of a data pipeline to process and analyze feature-oriented genetic datasets with millions of items;
• Finding the most relevant features (SNPs) and their ranking for obesity, for each chromosome; and
• Forecasting obesity among males and females separately.
Science for Dialysis2: Artificial Intelligence and Machine
Learning for achieving safety in Artificial Kidney
Hospital Universitari de Bellvitge
II Reunió de ciència i diàlisi: intel·ligència artificial
PROPOSED DATA PIPELINE
[Figure: pipeline diagram with two stages, Phase 1 and Phase 2]
Sampling (Classifier)   Gender   Test AUC   CV AUC
Weight (LR)             Male     0.971      0.962
Weight (LR)             Female   0.965      0.948
No Sampling (LR)        Male     0.963      0.941
No Sampling (LR)        Female   0.925      0.923
Down-sampling (RF)      Male     0.782      0.784
Down-sampling (RF)      Female   0.632      0.678
No Sampling (RF)        Male     0.500      0.501
No Sampling (RF)        Female   0.500      0.500
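As a reading aid for the AUC columns above, the following minimal sketch computes AUC as the fraction of (positive, negative) pairs that a scorer ranks correctly. The function name `auc` and the toy labels and scores are hypothetical and are not the poster's data. Under this definition, a model that gives every case the same score obtains exactly 0.5, which is one possible reading of the RF rows without sampling.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not taken from the poster).
labels = [1, 1, 1, 0, 0, 0]
informative_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
constant_scores = [0.5] * 6  # e.g. a model that always predicts the same class

auc_informative = auc(labels, informative_scores)  # 8 of 9 pairs ranked correctly
auc_constant = auc(labels, constant_scores)        # chance level
print(round(auc_informative, 3), auc_constant)
```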
Data Partitioning and Data Transposition

Original layout:
id   P1   P2   …   PN
S1   1    0    …   2
S2   0    1    …   0
S3   2    0    …   0
…    1    0    …   1
SN   0    0    …   1

Transposed layout (736,990 features, 4,988 samples):
id   S1   S2   S3   S4   S5   S6   S7   …   SN
P1   1    1    1    1    0    2    0    …   1
P2   0    0    2    0    1    0    2    …   1
…    1    1    2    0    0    2    0    …   0
PN   2    0    0    2    0    1    1    …   2
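The transposition and partitioning idea can be sketched on a toy matrix in plain Python. This is only an illustration under assumptions: the function names, the block sizes, and the order in which horizontal and vertical splits are applied are hypothetical, and the poster's pipeline runs on Spark rather than on in-memory lists.

```python
def transpose(rows):
    """Swap rows and columns (feature-major layout -> sample-major layout)."""
    return [list(col) for col in zip(*rows)]


def horizontal_partitions(matrix, height):
    """Split a matrix into consecutive blocks of at most `height` rows."""
    return [matrix[i:i + height] for i in range(0, len(matrix), height)]


def vertical_partitions(matrix, width):
    """Split a matrix into consecutive blocks of at most `width` columns."""
    n_cols = len(matrix[0])
    return [[row[i:i + width] for row in matrix]
            for i in range(0, n_cols, width)]


# Toy original layout: 5 feature rows (S1..S5) x 3 patient columns (P1..P3).
original = [
    [1, 0, 2],
    [0, 1, 0],
    [2, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]

transposed = transpose(original)                        # 3 patients x 5 features
row_chunks = horizontal_partitions(original, height=2)  # chunks of feature rows
feature_blocks = vertical_partitions(transposed, width=2)

print(transposed[0])       # feature values of the first patient
print(len(feature_blocks)) # number of feature blocks
```

Each feature block keeps all patients but only a slice of the features, which is what allows feature selection to run on one block at a time.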
• The top 20 features, by ranking, were taken from each partition of every chromosome.
• Collectively, 3,040 SNP variants were selected from all 22 chromosomes, i.e., 0.41% of the total SNPs.
• mRMR was extremely slow for 5,000 features; it took us several days to process all the partitions.
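To make the selection criterion concrete, here is a minimal greedy mRMR sketch for discrete features in plain Python. It uses the difference form (relevance minus mean redundancy) and is an assumption-laden illustration, not the distributed implementation used for the poster, whose details are not given here. The names `mutual_information`, `mrmr`, and the toy SNP columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())


def mrmr(features, target, k):
    """Greedy mRMR: repeatedly pick the feature maximising
    relevance(feature, target) - mean redundancy with already selected ones."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, candidates = [], set(features)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: snp_b duplicates snp_a, snp_c is weaker but not redundant.
target = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp_b": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp_c": [0, 0, 1, 1, 0, 0, 1, 1],
}

top = mrmr(features, target, k=2)
print(top)
```

In this toy case the duplicate `snp_b` is skipped in favour of `snp_c`, because its redundancy with the already selected `snp_a` outweighs its relevance. The pairwise redundancy term is also what makes the cost grow quickly with the number of features per partition.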
Feature Selection
Gender    Time per partition (Avg)   No. of files   Hours
Females   18 min                     152            45.6
Males     14 min                     152            35.5
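The figures in the timing table and the selection counts are mutually consistent, as the short arithmetic check below shows. It assumes that each of the 152 files corresponds to one partition, which the poster does not state explicitly; the variable names are ours.

```python
files = 152                      # from the "No. of files" column
avg_minutes = {"Females": 18, "Males": 14}

# Total hours = average minutes per partition x number of partitions / 60.
total_hours = {gender: minutes * files / 60
               for gender, minutes in avg_minutes.items()}

top_per_partition = 20
selected_snps = top_per_partition * files        # 20 x 152 = 3,040
total_snps = 736_990
selected_share = 100 * selected_snps / total_snps

for gender, hours in total_hours.items():
    print(gender, round(hours, 1))
print(selected_snps, round(selected_share, 2))
```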
mRMR Classifier
• 5-fold cross-validation (CV)
• Imbalanced binary class distribution
• Spark ML was unable to handle this imbalance in the random forest (RF) classifier
• Down-sampling: cases of the majority class '0' are randomly removed until the ratio between the '1' and '0' classes is 1:1.5.
• Alternatively, Spark can handle imbalanced binary classification through instance weights in the logistic regression (LR) method.
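Both ways of handling the imbalance can be sketched in plain Python. The helper names, the seed, and the particular balancing formula for the weights are assumptions for illustration; the poster does not say how its weights were computed. In Spark ML, per-row weights of this kind can be supplied to `LogisticRegression` through its `weightCol` parameter.

```python
import random


def downsample_majority(labels, ratio=1.5, seed=0):
    """Keep every '1' case and a random subset of '0' cases so that the
    kept classes stand at roughly 1:ratio. Returns the kept row indices."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(round(ratio * len(minority))))
    return sorted(minority + rng.sample(majority, keep))


def balancing_weights(labels):
    """One weight per row so that both classes carry the same total weight
    (an assumed scheme, not necessarily the one used on the poster)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    per_class = {1: n / (2 * n_pos), 0: n / (2 * n_neg)}
    return [per_class[y] for y in labels]


# Hypothetical toy labels: 4 cases of class 1, 16 cases of class 0.
labels = [1] * 4 + [0] * 16

kept = downsample_majority(labels)
kept_labels = [labels[i] for i in kept]
weights = balancing_weights(labels)

print(kept_labels.count(1), kept_labels.count(0))
print(weights[0], weights[-1])
```

Down-sampling discards data, while weighting keeps every row and changes only its contribution to the loss.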
Observations
Chromosome mRMR (Male) Accuracy (Male) mRMR (Female) Accuracy (Female)
1 0.6256 0.755 0.5113 0.737
2 0.6663 0.762 0.5318 0.738
3 0.6061 0.733 0.4681 0.697
4 0.5782 0.765 0.3942 0.692
5 0.4956 0.727 0.4159 0.705
6 0.5658 0.752 0.4457 0.715
7 0.4219 0.682 0.3537 0.682
8 0.4656 0.742 0.3704 0.666
9 0.3848 0.710 0.3177 0.685
10 0.2697 0.592 0.2672 0.698
11 0.4154 0.696 0.3514 0.685
12 0.4088 0.732 0.3185 0.669
13 0.3262 0.658 0.2543 0.691
14 0.2813 0.662 0.2318 0.650
15 0.2723 0.677 0.2273 0.646
16 0.2904 0.692 0.2258 0.636
17 0.2314 0.657 0.1828 0.627
18 0.2757 0.685 0.2292 0.651
19 0.1781 0.661 0.1357 0.628
20 0.2107 0.650 0.1778 0.635
21 0.1224 0.621 0.0939 0.627
22 0.1098 0.578 0.0981 0.586
• Per-chromosome accuracy tends to be higher for chromosomes with a larger number of variants (SNPs); e.g., chromosome 22 shows the poorest performance for both males and females.
• The highest mutual relevance information (~67% for males and ~53% for females) is contributed by chromosome 2 alone, whereas the other chromosomes provide less information. Thus, it appears that the top 220 features selected from chromosome 2 are the most relevant for predicting obesity.
• The graphical illustration shows that the first six chromosomes already raise the performance of the final model above the 90% threshold for both females and males; adding more chromosomes increases performance only marginally.
• We have found preliminary evidence that, by combining the features of just 6 chromosomes (a very parsimonious selection compared with the complete original dataset), obesity can be predicted quite accurately.
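As a small check on the observations above, the mRMR columns of the table can be ranked directly. The sketch below copies the scores from the table; the variable and function names are ours. Whether "the first six chromosomes" means chromosomes 1 to 6 or the six highest-ranked ones is not stated, but with these scores the two readings coincide for both genders.

```python
# mRMR scores per chromosome, copied from the Observations table.
mrmr_male = {
    1: 0.6256, 2: 0.6663, 3: 0.6061, 4: 0.5782, 5: 0.4956, 6: 0.5658,
    7: 0.4219, 8: 0.4656, 9: 0.3848, 10: 0.2697, 11: 0.4154, 12: 0.4088,
    13: 0.3262, 14: 0.2813, 15: 0.2723, 16: 0.2904, 17: 0.2314, 18: 0.2757,
    19: 0.1781, 20: 0.2107, 21: 0.1224, 22: 0.1098,
}
mrmr_female = {
    1: 0.5113, 2: 0.5318, 3: 0.4681, 4: 0.3942, 5: 0.4159, 6: 0.4457,
    7: 0.3537, 8: 0.3704, 9: 0.3177, 10: 0.2672, 11: 0.3514, 12: 0.3185,
    13: 0.2543, 14: 0.2318, 15: 0.2273, 16: 0.2258, 17: 0.1828, 18: 0.2292,
    19: 0.1357, 20: 0.1778, 21: 0.0939, 22: 0.0981,
}


def top_chromosomes(scores, k):
    """Return the k chromosome numbers with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


top6_male = top_chromosomes(mrmr_male, 6)
top6_female = top_chromosomes(mrmr_female, 6)

print(top6_male)
print(top6_female)
```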