A brief introduction to 
epistasis detection in GWAS 
2014. 01. 27. 
Hyun-hwan Jeong
Agenda
• Introduction
• Problem definition
• Computational detection methods
• Challenges
Introduction
Single Nucleotide Polymorphism 
• A single-letter change in the DNA sequence
• DNA sequences are 99.9% identical between individuals
• A common type of genetic variation
• Occurs in ≥ 1% of the general population
…ATTCGCCGGCTGCAACGTTAGA…
…ATTCGCCGGCTGCAGCGTTAGA…
…ATTCGCCGGCTGCATCGTTAGA…
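As a small illustration of the definition above, the following Python sketch lines up the three example sequences from the slide and reports the positions where they differ. The function name `variable_sites` is made up for this example, and positions are 0-based by assumption.

```python
# The three example sequences from the slide (ellipses dropped).
SEQUENCES = [
    "ATTCGCCGGCTGCAACGTTAGA",
    "ATTCGCCGGCTGCAGCGTTAGA",
    "ATTCGCCGGCTGCATCGTTAGA",
]


def variable_sites(sequences):
    """Return (position, observed letters) for every column that is not constant."""
    sites = []
    for position, column in enumerate(zip(*sequences)):
        if len(set(column)) > 1:
            sites.append((position, "".join(column)))
    return sites


sites = variable_sites(SEQUENCES)
print(sites)
```

Only one column varies across the three sequences, which is what makes it a single nucleotide polymorphism rather than a larger rearrangement.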
Genotype, phenotype and allele
Figure labels: phenotype, genotype, allele (source: http://en.wikipedia.org/wiki/Phenotype)
Genome-Wide Association Study
Tests the relation between each single SNP and a disease
Figure: Manhattan plot of the GWAS of the discovery cohort comprising 2,346 SSc cases and 5,193 healthy controls. Nature Genetics 42, 426–429 (2010)
Why is detecting epistasis needed in GWAS?
Figure: an illustration of an interaction pattern between two SNPs with no marginal effect. Bioinformatics 26, 30-37 (2010)
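To make "no marginal effect" concrete, here is a minimal Python sketch on a made-up, perfectly balanced cohort (not data from the cited paper). As simplifying assumptions, each SNP is coded 0/1 and a subject is a case only when exactly one of the two SNPs carries the variant. The helper `case_rate` is hypothetical.

```python
from collections import defaultdict

# Hypothetical cohort: 100 subjects per genotype pair; label = XOR of the two SNPs.
subjects = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(100)]


def case_rate(subjects, key):
    """Fraction of cases within each group defined by key(snp1, snp2)."""
    cases = defaultdict(int)
    totals = defaultdict(int)
    for snp1, snp2, label in subjects:
        group = key(snp1, snp2)
        totals[group] += 1
        cases[group] += label
    return {group: cases[group] / totals[group] for group in totals}


marginal_snp1 = case_rate(subjects, lambda a, b: a)   # looks at SNP 1 alone
marginal_snp2 = case_rate(subjects, lambda a, b: b)   # looks at SNP 2 alone
joint = case_rate(subjects, lambda a, b: (a, b))      # looks at the pair

print(marginal_snp1, marginal_snp2, joint)
```

In this toy cohort each SNP on its own shows a 50% case rate for every genotype, so a single-SNP scan sees nothing, while the pair separates cases from controls completely.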
Problem Definition 
Problem definition
Epistasis detection problem
• Objective
  – Detect the causative SNPs for a disease
  – Find the SNP combination with the maximum value of a defined measure
• Dataset
  – 0.5M to 1M SNPs
  – 4,000 to 5,000 subjects
  – Binary disease status (case/control)
  – 100 MB to 1 GB genotype data file
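A quick back-of-the-envelope calculation shows why the dataset sizes above make the problem hard. The Python sketch below counts how many SNP combinations an exhaustive scan would have to score; the helper name `n_combinations` is chosen for this example only.

```python
from math import comb


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP sets of the given size (order of interaction)."""
    return comb(n_snps, order)


for n_snps in (500_000, 1_000_000):
    for order in (2, 3):
        print(f"{n_snps:>9,} SNPs, order {order}: {n_combinations(n_snps, order):,} combinations")
```

With 1M SNPs there are about 5 × 10^11 pairs and about 1.7 × 10^17 triples, each of which needs its own contingency table over several thousand subjects.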
Problem definition – Data format
Each row is one subject: ten SNP genotype columns followed by the CLASS column.
1 1 0 0 0 0 1 0 1 1   1
0 0 1 0 0 0 1 1 0 2   1
0 0 0 0 0 0 1 0 0 0   1
1 1 0 0 0 0 0 1 0 2   1
0 0 0 0 0 0 0 1 0 0   1
0 0 0 0 0 0 0 0 0 1   0
1 1 0 1 0 0 0 1 1 1   0
0 0 0 1 1 1 0 1 1 1   0
0 1 0 2 0 0 0 1 0 1   0
0 0 0 1 0 0 1 2 1 0   0
• SNP columns take 3 values: 0 (AA), 1 (Aa/aA), 2 (aa)
• CLASS is binary: 1 (case/affected subject), 0 (control/normal)
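The measures on the following slides are all computed from counts of subjects per class and genotype combination. A minimal Python sketch of that counting step, using the ten example rows from this slide, might look like the following. The function name `contingency_table` and the 0-based column indices are assumptions of this example.

```python
from collections import Counter

# The ten example rows from the slide: ten SNP genotypes, then CLASS.
ROWS = [
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0],
]


def contingency_table(rows, snp_a, snp_b):
    """Count subjects per (class, genotype of snp_a, genotype of snp_b)."""
    counts = Counter()
    for row in rows:
        label = row[-1]  # CLASS is the last column
        counts[(label, row[snp_a], row[snp_b])] += 1
    return counts


table = contingency_table(ROWS, 0, 1)  # first two SNP columns
for key in sorted(table):
    print(key, table[key])
```

Genotype combinations that never occur are simply absent from the counter, which keeps the table small when higher-order combinations are counted.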
Problem definition – measure (1/3)
• Computed on a contingency table
• Popular measures in epistasis detection
  – χ² test
  – Mutual information
(g1 to g9 stand for the nine genotype combinations of a SNP pair)
Genotype   g1   g2   g3   g4   g5   g6   g7   g8   g9    sum
Case       39   91   95   92   14   31   63    4   71    500
Control   100   15   55    5   22  150   50   93   10    500
sum       139  106  150   97   36  181  113   97   81   1000
Problem definition – measure (2/3)
• χ² test, based on the χ² distribution
• $H_0$: no association between SNPs and disease status
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
Genotype   g1   g2   g3   g4   g5   g6   g7   g8   g9    sum
Case       39   91   95   92   14   31   63    4   71    500
Control   100   15   55    5   22  150   50   93   10    500
sum       139  106  150   97   36  181  113   97   81   1000
Expected contingency table
Genotype    g1    g2    g3    g4    g5    g6    g7    g8    g9    sum
Case      69.5    53    75  48.5    18  90.5  56.5  48.5  40.5    500
Control   69.5    53    75  48.5    18  90.5  56.5  48.5  40.5    500
sum        139   106   150    97    36   181   113    97    81   1000
χ² value: 379.07, p-value = $2.76 \times 10^{-77}$
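The χ² value on this slide can be reproduced from the observed counts alone. The following Python sketch derives the expected counts from the row and column totals and sums the per-cell contributions; the function name `chi_square` is an assumption of this example, and the p-value is deliberately not computed here.

```python
# Observed counts from the slide: one row per class, one column per genotype combination.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under H0: independence of class and genotype.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_square([CASE, CONTROL])
print(f"chi-square = {statistic:.2f}, d.f. = {dof}")
```

Because cases and controls are balanced (500 each), every expected count is half of its column total, which matches the expected contingency table shown above.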
Problem definition – measure (3/3)
• Mutual information (1/3)
  – Non-parametric measure
Problem definition – measure (3/3)
• Mutual information (2/3)
Genotype   g1   g2   g3   g4   g5   g6   g7   g8   g9    sum
Case       39   91   95   92   14   31   63    4   71    500
Control   100   15   55    5   22  150   50   93   10    500
sum       139  106  150   97   36  181  113   97   81   1000
Frequency table
Genotype     g1     g2     g3     g4     g5     g6     g7     g8     g9    sum
Case      0.039  0.091  0.095  0.092  0.014  0.031  0.063  0.004  0.071  0.500
Control   0.100  0.015  0.055  0.005  0.022  0.150  0.050  0.093  0.010  0.500
sum       0.139  0.106  0.150  0.097  0.036  0.181  0.113  0.097  0.081  1.000
Problem definition – measure (3/3)
• Mutual information (3/3)
Entropy table
Genotype     g1     g2     g3     g4     g5     g6     g7     g8     g9    sum
Case      0.183  0.315  0.323  0.317  0.086  0.155  0.251  0.032  0.271  0.500
Control   0.332  0.091  0.230  0.038  0.121  0.411  0.216  0.319  0.066  0.500
sum       0.396  0.343  0.411  0.326  0.173  0.446  0.355  0.326  0.294
$I(\text{Genotype}; \text{Disease}) = H(\text{Genotype}) + H(\text{Disease}) - H(\text{Genotype}, \text{Disease})$
$= 3.07 + 1.00 - 3.76$
$= 0.31$
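The same number can be obtained directly from the counts. The Python sketch below computes the three entropies in bits (base-2 logarithm, as the table entries imply) and combines them as mutual information = H(genotype) + H(class) − H(genotype, class). The function names are assumptions of this example.

```python
from math import log2

# Observed counts from the slides: one row per class, one column per genotype combination.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def mutual_information(table):
    """Mutual information between class (rows) and genotype combination (columns)."""
    total = sum(sum(row) for row in table)
    joint = [count / total for row in table for count in row]
    p_class = [sum(row) / total for row in table]
    p_genotype = [sum(col) / total for col in zip(*table)]
    return entropy(p_genotype) + entropy(p_class) - entropy(joint)


mi = mutual_information([CASE, CONTROL])
print(f"mutual information = {mi:.2f} bits")
```

Each entry of the entropy table corresponds to one term $-p \log_2 p$, so summing a row or the whole table gives the marginal or joint entropy used here.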
Methods to detect epistasis
Methods – Computational approaches
• Multifactor Dimensionality Reduction (Ritchie et al. 2002)
• SNPHarvester (Yang et al. 2009)
• SNPRuler (Wan et al. 2010)
• Mutual Information With Clustering (Leem et al. 2014)
Methods
Multifactor dimensionality reduction (1/2)
Methods
Multifactor dimensionality reduction (2/2)
• Model-free, non-parametric method
• Pattern-based method
  – An association rule for each combination of SNPs and phenotype
  – e.g. IF SNP10 = 0 AND SNP13 = 4 THEN CLASS = 1
• Exhaustive search
  – Computational burden
• Cross-validation consistency
  – Used to select the best model
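The pattern-based step can be sketched for a single SNP pair. In the usual description of MDR, each genotype combination is labelled high-risk when its case-to-control ratio reaches a threshold (1.0 for balanced data), and that labelling is then used as a one-dimensional classifier. The Python sketch below applies this to the 2 × 9 counts from the measure slides; the function names are hypothetical, and the exhaustive loop over SNP combinations and the cross-validation are left out.

```python
# Counts per genotype combination of one SNP pair, taken from the measure slides.
CASE = [39, 91, 95, 92, 14, 31, 63, 4, 71]
CONTROL = [100, 15, 55, 5, 22, 150, 50, 93, 10]


def mdr_labels(case, control, threshold=1.0):
    """True for high-risk cells (case/control ratio >= threshold), else False."""
    labels = []
    for n_case, n_control in zip(case, control):
        ratio = n_case / n_control if n_control else float("inf")
        labels.append(ratio >= threshold)
    return labels


def training_accuracy(case, control, labels):
    """Share of subjects classified correctly: cases in high-risk, controls in low-risk cells."""
    correct = sum(
        n_case if high_risk else n_control
        for n_case, n_control, high_risk in zip(case, control, labels)
    )
    return correct / (sum(case) + sum(control))


labels = mdr_labels(CASE, CONTROL)
accuracy = training_accuracy(CASE, CONTROL, labels)
print(labels, accuracy)
```

A full run would repeat this for every SNP combination on each training fold, which is where the computational burden noted on the slide comes from.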
Methods
SNPHarvester (1/2)
Methods
SNPHarvester (2/2)
• Local search
  – Local optima problem
• PathSeeker algorithm
  – Successive runs
• Score function: χ² value
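The idea of a local search with a χ² score and repeated restarts can be sketched as a generic swap-based hill climb. This is an illustration in the spirit of the bullets above, not the published PathSeeker algorithm; the function names, the synthetic data, and the restart strategy are all assumptions of this example.

```python
import random
from collections import Counter


def chi_square_score(genotypes, labels, snps):
    """Chi-square statistic of the class-by-genotype-combination table for `snps`."""
    cells = Counter()
    for row, label in zip(genotypes, labels):
        cells[(label, tuple(row[s] for s in snps))] += 1
    class_totals = Counter(labels)
    combo_totals = Counter()
    for (_, combo), count in cells.items():
        combo_totals[combo] += count
    n = len(labels)
    score = 0.0
    for label, n_label in class_totals.items():
        for combo, n_combo in combo_totals.items():
            expected = n_label * n_combo / n
            score += (cells[(label, combo)] - expected) ** 2 / expected
    return score


def local_search(genotypes, labels, k, rng):
    """Start from k random SNPs; swap one SNP at a time while the score improves."""
    n_snps = len(genotypes[0])
    current = rng.sample(range(n_snps), k)
    best = chi_square_score(genotypes, labels, current)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n_snps):
                if candidate in current:
                    continue
                trial = current[:pos] + [candidate] + current[pos + 1:]
                score = chi_square_score(genotypes, labels, trial)
                if score > best:
                    current, best, improved = trial, score, True
    return sorted(current), best


def successive_runs(genotypes, labels, k, n_runs, seed=0):
    """Restart the local search several times and keep the best-scoring SNP set."""
    rng = random.Random(seed)
    runs = [local_search(genotypes, labels, k, rng) for _ in range(n_runs)]
    return max(runs, key=lambda run: run[1])


# Synthetic data: 200 subjects, 8 SNPs; disease depends only on SNPs 2 and 5.
data_rng = random.Random(1)
genotypes = [[data_rng.randint(0, 2) for _ in range(8)] for _ in range(200)]
labels = [1 if row[2] + row[5] >= 3 else 0 for row in genotypes]

best_snps, best_score = successive_runs(genotypes, labels, k=2, n_runs=5)
print(best_snps, round(best_score, 2))
```

The synthetic disease model here has strong single-SNP effects, so the climb has a gradient to follow. With a pure interaction and no marginal effect, a single swap may not improve the score, which is the local optima problem named on the slide and the reason for successive runs.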
Methods
SNPRuler
• Pattern-based method
  – Predictive rule
• Branch-and-bound algorithm
  – Uses an upper bound on the χ² value, where d.f. is 1
Methods
Mutual Information With Clustering (1/2)
Figure: SNPs and causative SNPs grouped around Centroid 1, Centroid 2 and Centroid 3, with m candidates taken from each cluster. Distances d1 and d2 are marked, with Score = d1 + d2, and the 3 SNPs with the highest mutual information value are highlighted.
Methods
Mutual Information With Clustering (2/2)
• Mutual information
  – Used as the distance measure for clustering
• K-means clustering algorithm
  – Candidate selection
  – Reduces the search space dramatically
• Can detect high-order epistatic interactions
• Also shows better performance (power, execution time) than previous methods
Challenges in epistasis detection
Challenges
• Reducing the computational burden
  – Filtering
  – Parallel processing
• Higher-order epistatic interaction detection
  – Order larger than 2
• Novel measures of association between SNPs and disease
