Ahsan Bilal (ahsan.bilal@est.fib.upc.edu)
Alfredo Vellido (avellido@cs.upc.edu)
Vicent Ribas (vicent.ribas@eurecat.org)
Big Data Analytics for Obesity Prediction
[1] Li, J., and Liu, H. Challenges of feature selection for big data analytics. IEEE Intelligent Systems 32(2), 2017: 9-15.
[2] Ramírez-Gallego, S., et al. Fast-mRMR: Fast Minimum Redundancy Maximum Relevance algorithm for high-dimensional Big Data. International Journal of Intelligent Systems 32, 2017: 134-152.
[3] Ding, C., and Peng, H. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology 3(2), 2005: 185-205.
[4] Severinsen et al. Genetic susceptibility, smoking, obesity and risk of venous thromboembolism. British Journal of Haematology 149(2), 2010: 273-279.
INTRODUCTION
Conclusion
• Vertical and horizontal data partitioning techniques allowed us to process and interpret the data seamlessly.
• Through feature engineering and FS-based DR, we reduced the original bulk of 736,990 SNPs to an extremely lean selection of 3,040 SNPs.
• The resulting model provides a quite accurate obesity prediction:
  • 0.965 AUC for females
  • 0.971 AUC for males
Existing Solution
Previous solutions rely solely on a single node.
• The existing pipeline cannot scale to process the potential future influx of much larger datasets.
Challenges
The data consist of:
• 4,988 patients (samples)
• 736,990 SNPs (variants)
A scalable solution is required to process the larger datasets expected in the future, in a distributed environment on multiple nodes.
Obesity is a chronic disease shaped by genetic and environmental factors.
• It causes an estimated 112,000 excess deaths per year in the U.S. alone.
• It costs the U.S. approximately $190 billion per year.
Obesity puts individuals at risk of suffering more than 30 chronic health conditions, including:
• Type 2 diabetes
• Heart disease
• High blood pressure
• High cholesterol
• Birth defects
• Miscarriages
• Numerous cancers
Objectives
The objectives of this paper are three-fold and can be briefly summarized as follows:
• Design and implementation of a data pipeline to process and analyze feature-oriented genetic datasets with millions of items;
• Finding the most relevant features (SNPs) and their ranking for obesity, for each chromosome; and
• Forecasting obesity among males and females separately.
Science for Dialysis2: Artificial Intelligence and Machine
Learning for achieving safety in Artificial Kidney
Hospital Universitari de Bellvitge
II Reunió de ciència i diàlisi: intel·ligència artificial
PROPOSED DATA PIPELINE
[Figure: pipeline diagram with two stages, Phase 1 and Phase 2]
Sampling (Classifier)   Gender   Test AUC   CV AUC
Weight (LR)             Male     0.971      0.962
Weight (LR)             Female   0.965      0.948
No Sampling (LR)        Male     0.963      0.941
No Sampling (LR)        Female   0.925      0.923
Down-sampling (RF)      Male     0.782      0.784
Down-sampling (RF)      Female   0.632      0.678
No Sampling (RF)        Male     0.500      0.501
No Sampling (RF)        Female   0.500      0.500
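As a reading aid for the AUC columns, here is a minimal sketch of the rank-based AUC definition (the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as one half). The function name `roc_auc` and the toy scores are illustrative assumptions, not the poster's evaluation code. The example also shows that a model giving every patient the same score lands at exactly 0.5, which is one possible, unconfirmed reading of the 0.500 rows.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (assumption): class 1 is the minority, as in an imbalanced dataset.
labels = [0, 0, 0, 1, 0, 1]
constant_scores = [0.1] * 6                      # same score for every patient
ranked_scores = [0.1, 0.2, 0.3, 0.9, 0.4, 0.8]   # positives always ranked on top

print(roc_auc(labels, constant_scores))  # 0.5
print(roc_auc(labels, ranked_scores))    # 1.0
```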
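Data Partitioning and Data Transposition

Before transposition (rows: SNPs S1 … SN; columns: patients P1 … PN):
id   P1   P2   …   PN
S1   1    0    …   2
S2   0    1    …   0
S3   2    0    …   0
…    1    0    …   1
SN   0    0    …   1

After transposition (rows: patients; columns: SNPs):
id   S1   S2   S3   S4   S5   S6   S7   …   SN
P1   1    1    1    1    0    2    0    …   1
P2   0    0    2    0    1    0    2    …   1
…    1    1    2    0    0    2    0    …   0
PN   2    0    0    2    0    1    1    …   2

736,990 features
4,988 samples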
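The transposition and partitioning steps can be sketched on a toy matrix as follows. This is a single-machine illustration only: the helper names `transpose` and `vertical_partitions` are hypothetical, and treating "vertical" partitioning as splitting the SNP columns into fixed-width blocks is an assumption about the poster's terminology.

```python
def transpose(matrix):
    """Turn a SNP-by-patient matrix into a patient-by-SNP matrix."""
    return [list(column) for column in zip(*matrix)]


def vertical_partitions(matrix, width):
    """Split a patient-by-SNP matrix into column blocks of at most `width` SNPs."""
    n_cols = len(matrix[0])
    return [[row[start:start + width] for row in matrix]
            for start in range(0, n_cols, width)]


# Toy genotype matrix coded 0/1/2: rows are SNPs S1..S5, columns are patients P1..P3.
snp_by_patient = [
    [1, 0, 2],
    [0, 1, 0],
    [2, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]

patient_by_snp = transpose(snp_by_patient)              # 3 patients x 5 SNPs
blocks = vertical_partitions(patient_by_snp, width=2)   # feature blocks of <= 2 SNPs

print(len(patient_by_snp), len(patient_by_snp[0]))  # 3 5
print([len(block[0]) for block in blocks])          # [2, 2, 1]
```

After transposition each patient is one sample and each SNP one feature, which is the layout a feature selector or classifier expects.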
• The top 20 features, by ranking, were taken from each partition of every chromosome.
• Collectively, 3,040 SNP variants were selected from all 22 chromosomes, i.e. 0.41% of the total SNPs.
• mRMR was extremely slow on partitions of 5,000 features; it took us several days to process all the partitions.
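For readers unfamiliar with mRMR, the following is a minimal single-machine sketch of the greedy selection idea from [3]: at each step, pick the feature with the highest mutual information with the target minus its mean mutual information with the features already chosen. It is not the distributed implementation used for the poster; the function names, the difference-form criterion and the toy genotypes are all illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr(features, target, k):
    """Greedy mRMR, difference form: relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max((n for n in features if n not in selected), key=score)
        selected.append(best)
    return selected


# Toy data (assumption): snp_b duplicates snp_a, snp_c is only weakly relevant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "snp_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "snp_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "snp_c": [0, 0, 1, 1, 0, 1, 1, 1],
}

print(mrmr(features, target, k=2))  # ['snp_a', 'snp_c']
```

The duplicate `snp_b` is as relevant as `snp_a` but fully redundant with it, so the weaker yet less redundant `snp_c` is picked second. Each greedy step rescans all remaining candidates against the selected set, which gives some intuition for why the method is slow on wide partitions.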
Feature Selection (mRMR)
Gender    Time per partition (avg)   No. of files   Hours
Females   18 min                     152            45.6
Males     14 min                     152            35.5
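The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes that each of the 152 files corresponds to one partition contributing its top 20 features; that correspondence is an inference from the numbers, not something the poster states outright.

```python
PARTITIONS = 152        # number of files per gender, from the timing table
TOP_K = 20              # features kept per partition
TOTAL_SNPS = 736_990    # features in the full dataset


def total_hours(partitions, minutes_per_partition):
    """Total sequential processing time in hours, rounded to one decimal."""
    return round(partitions * minutes_per_partition / 60, 1)


selected = PARTITIONS * TOP_K
share_pct = round(100 * selected / TOTAL_SNPS, 2)

print(selected)                      # 3040
print(share_pct)                     # 0.41
print(total_hours(PARTITIONS, 18))   # 45.6 (females)
print(total_hours(PARTITIONS, 14))   # 35.5 (males)
```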
Classifier
• 5-fold CV
• Imbalanced binary class distribution
• SparkML was unable to handle this imbalance in RF
• Down-sampling randomly reduces the cases in the majority class "0" relative to the minority class, to a ratio of 1:1.5 for the "1" and "0" classes respectively.
• Alternatively, Spark can handle imbalanced binary classification through instance weights in the LR method.
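The two imbalance strategies listed above can be sketched in plain Python as follows. The 1:1.5 ratio comes from the poster; everything else (function names, the fixed seed, and the particular weighting formula that gives both classes equal total weight) is an assumption for illustration. In Spark ML, per-row weights like these are typically supplied to `LogisticRegression` through its `weightCol` parameter.

```python
import random


def downsample_majority(labels, ratio=1.5, seed=0):
    """Keep every class-1 row and at most ratio * n1 randomly chosen class-0 rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(ratio * len(minority)))
    return sorted(minority + rng.sample(majority, keep))


def balancing_weights(labels):
    """Per-row weights chosen so that both classes carry the same total weight."""
    n = len(labels)
    n_pos = sum(labels)
    w_pos = n / (2 * n_pos)
    w_neg = n / (2 * (n - n_pos))
    return [w_pos if y == 1 else w_neg for y in labels]


# Toy labels (assumption): 20 cases of class "1" and 80 of class "0".
labels = [1] * 20 + [0] * 80

kept = [labels[i] for i in downsample_majority(labels)]
print(kept.count(1), kept.count(0))   # 20 30

weights = balancing_weights(labels)
print(weights[0], weights[-1])        # 2.5 0.625
```

Down-sampling discards majority-class rows, whereas weighting keeps every patient and only changes how much each row counts in the loss.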
Observations
Chromosome mRMR (Male) Accuracy (Male) mRMR (Female) Accuracy (Female)
1 0.6256 0.755 0.5113 0.737
2 0.6663 0.762 0.5318 0.738
3 0.6061 0.733 0.4681 0.697
4 0.5782 0.765 0.3942 0.692
5 0.4956 0.727 0.4159 0.705
6 0.5658 0.752 0.4457 0.715
7 0.4219 0.682 0.3537 0.682
8 0.4656 0.742 0.3704 0.666
9 0.3848 0.710 0.3177 0.685
10 0.2697 0.592 0.2672 0.698
11 0.4154 0.696 0.3514 0.685
12 0.4088 0.732 0.3185 0.669
13 0.3262 0.658 0.2543 0.691
14 0.2813 0.662 0.2318 0.650
15 0.2723 0.677 0.2273 0.646
16 0.2904 0.692 0.2258 0.636
17 0.2314 0.657 0.1828 0.627
18 0.2757 0.685 0.2292 0.651
19 0.1781 0.661 0.1357 0.628
20 0.2107 0.650 0.1778 0.635
21 0.1224 0.621 0.0939 0.627
22 0.1098 0.578 0.0981 0.586
• Per-chromosome accuracy tends to be higher for chromosomes with a larger number of variants (SNPs); e.g., chromosome 22 shows the poorest performance for both males and females.
• The highest mutual relevance information (~67% for males and ~53% for females) is contributed by chromosome 2 alone, whereas the other chromosomes provide less. Thus, it appears that the top 220 features selected from chromosome 2 are more relevant for predicting obesity.
• The graphical illustration shows that the first six chromosomes already lift the performance of the final model above the 90% threshold for both females and males, and it can be concluded that adding more chromosomes increases performance only marginally.
• We have found preliminary evidence that, by combining the features of just 6 chromosomes (a very parsimonious selection compared with the complete original dataset), obesity can be predicted quite accurately.
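Two of these observations can be checked directly against the per-chromosome table. The sketch below simply re-enters the table values and looks up the extremes; the tuple layout and the helper name `chromosome_with` are illustrative choices.

```python
# (chromosome, mRMR male, accuracy male, mRMR female, accuracy female)
ROWS = [
    (1, 0.6256, 0.755, 0.5113, 0.737), (2, 0.6663, 0.762, 0.5318, 0.738),
    (3, 0.6061, 0.733, 0.4681, 0.697), (4, 0.5782, 0.765, 0.3942, 0.692),
    (5, 0.4956, 0.727, 0.4159, 0.705), (6, 0.5658, 0.752, 0.4457, 0.715),
    (7, 0.4219, 0.682, 0.3537, 0.682), (8, 0.4656, 0.742, 0.3704, 0.666),
    (9, 0.3848, 0.710, 0.3177, 0.685), (10, 0.2697, 0.592, 0.2672, 0.698),
    (11, 0.4154, 0.696, 0.3514, 0.685), (12, 0.4088, 0.732, 0.3185, 0.669),
    (13, 0.3262, 0.658, 0.2543, 0.691), (14, 0.2813, 0.662, 0.2318, 0.650),
    (15, 0.2723, 0.677, 0.2273, 0.646), (16, 0.2904, 0.692, 0.2258, 0.636),
    (17, 0.2314, 0.657, 0.1828, 0.627), (18, 0.2757, 0.685, 0.2292, 0.651),
    (19, 0.1781, 0.661, 0.1357, 0.628), (20, 0.2107, 0.650, 0.1778, 0.635),
    (21, 0.1224, 0.621, 0.0939, 0.627), (22, 0.1098, 0.578, 0.0981, 0.586),
]


def chromosome_with(pick, column):
    """Return the chromosome whose value in `column` is selected by `pick` (max or min)."""
    return pick(ROWS, key=lambda row: row[column])[0]


# Highest mRMR score, males and females: chromosome 2 in both cases.
print(chromosome_with(max, 1), chromosome_with(max, 3))  # 2 2
# Lowest accuracy, males and females: chromosome 22 in both cases.
print(chromosome_with(min, 2), chromosome_with(min, 4))  # 22 22
```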
