DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS
Readmission probability prediction
using open health care data
David Budaghyan
Erik Hambardzumyan
Habet Madoyan
Lilit Simonyan
Lusine Sargsyan
Vahe Movsisyan
The project and the team
Our team
- The team was formed to participate in the OpenData hackathon organized by Kolba Labs on December 3-4.
- We continued working after the hackathon; this deck is the result of those later efforts.
- The findings here are not final and can be revised with better data and a more thorough approach.
- This is just a prototype; we are open to any criticism and suggestions.
The context
- The initial data was collected by the Ministry of Health.
- Around 13 million records.
- Each record is an encounter of a citizen with a health care institution.
- The data is messy, with many missing values and inconsistencies.
The following variables were present in the data:
- Marz
- Gender
- Age Group
- Payment Type
- Eligibility Type
- Encounter Purpose
- Phc Diagnose Group
- Encounter Outcome Treatment
The context: Data Transformation
- The data is transformed to identify patients.
- Patient identification is done using 3 variables (gender, birthday, marz). Thanks to the Fimetech team.
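The identification step can be sketched as grouping encounters under a composite key. This is a minimal illustration, not the team's actual code: the record layout, the field names (`gender`, `birthday`, `marz`, `visit`) and the sample rows are all hypothetical. Note that two different people who share the same three values would be merged into one "patient" by such a key.

```python
from collections import defaultdict
from datetime import date

# Hypothetical encounter records; field names are assumptions for illustration.
encounters = [
    {"gender": "female", "birthday": date(1980, 5, 1), "marz": "Lori", "visit": date(2015, 1, 10)},
    {"gender": "female", "birthday": date(1980, 5, 1), "marz": "Lori", "visit": date(2015, 2, 4)},
    {"gender": "male", "birthday": date(1990, 3, 7), "marz": "Shirak", "visit": date(2015, 3, 1)},
]

def group_by_patient(records):
    """Group visit dates under a (gender, birthday, marz) pseudo-identifier."""
    visits = defaultdict(list)
    for r in records:
        key = (r["gender"], r["birthday"], r["marz"])
        visits[key].append(r["visit"])
    return {key: sorted(dates) for key, dates in visits.items()}

def days_to_second_visit(dates):
    """Days from the first to the second encounter, or None if there is only one."""
    return (dates[1] - dates[0]).days if len(dates) > 1 else None

patients = group_by_patient(encounters)
gaps = {key: days_to_second_visit(dates) for key, dates in patients.items()}
```

Patients with a gap of `None` never returned during the observed period; in a survival analysis they would be treated as censored rather than dropped.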
The goal of analysis
- Can we predict patient readmission over time?
- What is the probability that a patient will not need a new encounter within 30, 180, or 365 days after the first encounter?
- Why?
  - Healthcare costs: if a mandatory health insurance program is introduced, the model will help predict the overall costs for the economy.
  - Fraud detection: fraud goes hand in hand with insurance; survival analysis helps to identify deviant behavior (too many repeat visits for a given disease?).
  - Deviant behavior at the clinic/doctor level: this helps to understand skills gaps, ineffective management, etc. at the level of local clinics.
  - Modeling insurance premiums: if we understand the cost at the marz/disease level, we can offer tailored premiums.
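One common way to estimate the probability of no new encounter within 30, 180, or 365 days is the Kaplan-Meier estimator. The sketch below is an assumption about how such numbers could be produced; the deck does not say which estimator was used for its tables, and the sample durations are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: days from first encounter to readmission or end of follow-up.
    observed:  True if a readmission was seen, False if the patient is censored.
    """
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    curve, surv = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if o and d == t)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve

def survival_at(curve, t):
    """S(t): last estimate at or before day t (1.0 before the first event)."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s

# Hypothetical follow-up data, not taken from the Ministry of Health records.
durations = [10, 25, 25, 40, 200, 365, 400, 400]
observed = [True, True, True, True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for horizon in (30, 180, 365):
    print(horizon, round(survival_at(curve, horizon), 3))
```

Censored patients (those with no observed readmission) still count in the at-risk set until their follow-up ends, which is why they are kept rather than discarded.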
Cox Proportional-Hazards Model
- Marz
- Gender
- Age Group
- Payment Type
- Eligibility Type
- Encounter Purpose
- Phc Diagnose Group
- Encounter Outcome Treatment Type
(Rows without values are the reference level of each variable.)
coef exp(coef) robust se z Pr(>|z|)
Marz_ID_Aragatsotn
Marz_ID_Ararat -0.122 0.885 0.042 -2.921 0.0035 **
Marz_ID_Armavir 0.094 1.098 0.036 2.607 0.0091 **
Marz_ID_Gegharqunik -0.051 0.951 0.035 -1.431 0.1525
Marz_ID_Kotayk 0.063 1.065 0.041 1.545 0.1224
Marz_ID_Lori -0.059 0.943 0.037 -1.594 0.1109
Marz_ID_Shirak 0.076 1.079 0.035 2.161 0.0307 *
Marz_ID_Syunik -0.199 0.819 0.268 -0.743 0.4575
Marz_ID_Tavush 0.049 1.051 0.049 1.009 0.3131
Marz_ID_Vayots_Dzor -0.035 0.966 1.094 -0.032 0.9746
Marz_ID_Yerevan -0.086 0.918 0.041 -2.087 0.0369 *
Gender_ID_female
Gender_ID_male -0.042 0.959 0.017 -2.547 0.0109 *
Age_ID_0-5
Age_ID_5-10 -0.187 0.829 0.063 -2.974 0.0029 **
Age_ID_10-18 -0.310 0.734 0.145 -2.140 0.0324 *
Age_ID_18-30 -0.133 0.875 0.142 -0.942 0.3460
Age_ID_30-60 -0.049 0.952 0.122 -0.406 0.6850
Age_ID_60+ 0.083 1.086 0.122 0.680 0.4965
Payment_Type_ID_paid
Payment_Type_ID_state_ordered 0.153 1.166 0.166 0.923 0.3558
Eligibility_ID_Armed Forces
Eligibility_ID_children_vulnerable 0.240 1.271 0.244 0.984 0.3251
Eligibility_ID_disabled_people 0.379 1.461 0.212 1.785 0.0743 .
Eligibility_ID_elderly_people 0.277 1.320 0.218 1.270 0.2040
Eligibility_ID_family_vulnerable 0.432 1.540 0.230 1.880 0.0601 .
Eligibility_ID_other 0.018 1.019 0.221 0.083 0.9337
Eligibility_ID_poverty_beneficiary -0.084 0.919 0.296 -0.284 0.7765
Eligibility_ID_pregnancy -0.154 0.857 0.360 -0.428 0.6688
Eligibility_ID_social_package_beneficiary -0.350 0.704 0.231 -1.520 0.1285
Eligibility_ID_young_men -1.052 0.349 0.502 -2.094 0.0362 *
Encounter_Purpose_ID_disease
Encounter_Purpose_ID_control 0.120 1.127 0.027 4.487 0.0000 ***
Encounter_Purpose_ID_administrative 0.195 1.215 0.192 1.014 0.3108
Encounter_Purpose_ID_preventive -0.022 0.978 0.054 -0.404 0.6861
Encounter_Purpose_ID_reproductive 0.459 1.583 0.278 1.654 0.0982 .
Encounter_Purpose_ID_other -1.332 0.264 0.050 -26.386 0.0000 ***
Phc_Diagnose_ID_A
Phc_Diagnose_ID_I -0.107 0.899 0.024 -4.430 0.0000 ***
Phc_Diagnose_ID_J -0.028 0.973 0.039 -0.715 0.4744
Phc_Diagnose_ID_K 0.145 1.156 0.143 1.011 0.3119
Encounter_Outcome_Treatment_ID_chronic_condition
Encounter_Outcome_Treatment_ID_death -0.142 0.868 0.441 -0.322 0.7475
Encounter_Outcome_Treatment_ID_improvement 0.091 1.096 0.059 1.546 0.1222
Encounter_Outcome_Treatment_ID_recovery 0.049 1.050 0.057 0.855 0.3927
Encounter_Outcome_Treatment_ID_stabilisation 0.091 1.095 0.042 2.164 0.0305 *
Encounter_Outcome_Treatment_ID_treatment_stop 0.420 1.522 0.159 2.651 0.0080 **
Encounter_Outcome_Treatment_ID_unchanged 0.045 1.046 0.043 1.055 0.2915
Encounter_Outcome_Treatment_ID_worsening 0.132 1.141 0.084 1.567 0.1170
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Concordance= 0.537 (se = 0.003 )
Rsquare= 0.038 (max possible= 1 )
Likelihood ratio test= 660 on 41 df, p=0
Wald test = 7450 on 41 df, p=0
Score (logrank) test = 607.2 on 41 df, p=0, Robust = 472.8 p=0
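n= 17067, number of events= 15504 (90933 observations deleted due to missingness)
$S(t) = 1 - F(t)$: the proportion of patients who didn't return to hospital within $t$ days after the first visit, where $F(t)$ is the probability of readmission within $t$ days.
$h(t \mid x) = h_0(t) \exp(\beta_1 x_1 + \dots + \beta_k x_k)$
$S(t \mid x) = S_0(t)^{\exp(\beta_1 x_1 + \dots + \beta_k x_k)}$, where $S_0(t) = 1 - F_0(t)$ is the baseline survival function.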
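In a Cox proportional-hazards model, a coefficient shifts survival relative to the reference level through the relation S(t | x) = S0(t) ** exp(coef). The sketch below applies that relation to one coefficient from the output above. The baseline value `s0_30` is a hypothetical number chosen for illustration, since the deck does not report the baseline survival function.

```python
import math

def hazard_ratio(coef):
    """exp(coef): hazard relative to the reference level."""
    return math.exp(coef)

def adjusted_survival(baseline_survival, coef):
    """S(t | x) = S0(t) ** exp(coef) for a single dummy covariate."""
    return baseline_survival ** hazard_ratio(coef)

coef_ararat = -0.122  # Marz_ID_Ararat, taken from the model output
s0_30 = 0.50          # hypothetical baseline S0(30), NOT reported in the deck

print(round(hazard_ratio(coef_ararat), 3))
print(round(adjusted_survival(s0_30, coef_ararat), 3))
```

A negative coefficient gives a hazard ratio below 1, so the adjusted survival probability ends up above the baseline; a positive coefficient does the opposite.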
Findings
- Marzes: Ararat, Armavir, Shirak and Yerevan have significant coefficients. Based on the signs of the coefficients, Armavir and Shirak have a higher probability of readmission than Aragatsotn, while Ararat and Yerevan have a lower one.
- Gender: the significant negative coefficient of Gender_ID_male indicates that females visit hospital more frequently than males.
- Age: patients aged 5-18 have a lower readmission probability than children aged 0-5, as shown by the significant negative coefficients of Age_ID_5-10 and Age_ID_10-18.
- Eligibility: the only significant eligibility coefficient is that of young men; its negative sign means that the armed forces (the reference group) have a higher attendance probability than young men.
- Purpose: visits for control have a higher readmission probability than visits for disease, which can be explained by the fact that control visits imply regular attendance.
- Diagnosis: certain infectious and parasitic diseases (A) have a higher readmission probability than some heart diseases (I).
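The reading of signs and significance used in these findings can be reproduced from the printed coefficients and robust standard errors. This is a rough sketch under the usual normal approximation for the Wald test; because the printed values are rounded to three decimals, the recomputed z and p differ slightly from the table.

```python
import math

def wald_test(coef, se):
    """Return (hazard ratio, z statistic, two-sided p-value) for one coefficient."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return math.exp(coef), z, p

# (coef, robust se) pairs copied from the model output.
rows = {
    "Marz_ID_Ararat": (-0.122, 0.042),
    "Gender_ID_male": (-0.042, 0.017),
    "Age_ID_18-30": (-0.133, 0.142),
}
for name, (coef, se) in rows.items():
    hr, z, p = wald_test(coef, se)
    direction = "lower" if coef < 0 else "higher"
    flag = "significant" if p < 0.05 else "not significant"
    print(f"{name}: HR={hr:.3f}, z={z:.2f}, p={p:.4f} -> {direction} hazard, {flag} at 5%")
```

Each comparison is against the reference level of the same variable (for example Aragatsotn for marz, female for gender), so a coefficient says nothing about how two non-reference levels compare with each other.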
Marz
S(t): proportion of patients who didn't return to hospital within t days after the first visit.
              30 days   180 days   365 days
Aragatsotn    0.522     0.106      0.046
Ararat        0.521     0.106      0.045
Armavir       0.504     0.103      0.041
Gegharqunik   0.533     0.103      0.042
Kotayk        0.466     0.095      0.039
Lori          0.554     0.132      0.063
Shirak        0.429     0.068      0.022
Syunik        0.485     0.138      0.069
Tavush        0.558     0.079      0.034
Vayots Dzor   0.635     0.163      0.070
Yerevan       0.486     0.119      0.047
Shirak and Tavush have the lowest S(t) at 180 and 365 days.
Vayots Dzor, Syunik, and Lori have the highest S(t) at 180 and 365 days.
Gender
          30 days   180 days   365 days
female    0.491     0.102      0.042
male      0.493     0.097      0.038
Gender and hospital visit intensity are not related.
Age
         30 days   180 days   365 days
0-5      0.469     0.066      0.021
5-10     0.571     0.201      0.086
10-18    0.595     0.272      0.126
18-30    0.562     0.206      0.100
30-60    0.512     0.113      0.048
60+      0.455     0.056      0.019
0-5 and 60+ are the riskiest age groups (lowest survival probability).
Patients aged 10-18 have the lowest probability of readmission (highest survival probability).
Payment type
                 30 days   180 days   365 days
paid             0.513     0.138      0.032
state ordered    0.492     0.099      0.041
Patients with the state-ordered payment type have a lower survival probability at 30 and 180 days, meaning their visits are more frequent over those horizons.
Eligibility
                             30 days   180 days   365 days
children vulnerable          0.479     0.069      0.021
disabled people              0.470     0.031      0.006
elderly people               0.466     0.041      0.019
pregnancy                    0.514     0.105      0.045
social package beneficiary   0.593     0.341      0.249
The most frequent visitors are disabled and elderly people, in contrast with social package beneficiaries, who have the highest survival rate.
Encounter Purpose
                 30 days   180 days   365 days
disease          0.508     0.109      0.029
control          0.478     0.045      0.012
administrative   0.543     0.183      0.079
preventive       0.499     0.156      0.079
reproductive     0.532     0.126      0.041
other            0.418     0.044      0.009
As visits for administrative purposes are mostly occasional, they have a higher survival rate, whereas control visits have one of the lowest survival rates (only "other" is lower).
Diagnosis
     30 days   180 days   365 days
A    0.487     0.072      0.022
I    0.503     0.069      0.016
J    0.518     0.101      0.026
K    0.514     0.098      0.025
Certain infectious diseases (A) and heart diseases (I) have a lower survival probability than upper respiratory infections (J) and sclerosis and other mental disorders (K).
Encounter Outcome Treatment
                    30 days   180 days   365 days
chronic condition   0.497     0.051      0.017
improvement         0.525     0.102      0.024
recovery            0.525     0.109      0.030
stabilisation       0.475     0.055      0.016
treatment stop      0.458     0.122      0.087
unchanged           0.501     0.088      0.030
worsening           0.407     0.054      0.012
Worsening, chronic condition and stabilisation have the lowest survival rates at 180 days, since they require more attendances, whereas improvement, recovery and treatment stop have the highest, since they require fewer visits.
Predicting single readmission (machine learning case)
- The data is transformed so that each row is a person.
- An indicator variable shows whether the patient was readmitted, i.e. has more than 2 records in the database.
- The goal of the modeling is to predict the no-readmission rate based on 7 variables (age, gender, payment, treatment, etc.).
The Business goal
- Low readmission is a sign of good patient care.
- Low readmission means low insurance and healthcare costs.
Tested models
- Random Forests
- Naïve Bayes
- Decision trees
- Gradient boosting
AUC: 0.7035657
[Figure: ROC curve of the blender model on the testing set.]
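For readers unfamiliar with the metric, the sketch below shows how an AUC can be computed for a blended prediction. It makes two assumptions the deck does not confirm: that the blender averages the predicted probabilities of the individual models, and that the positive class is "not readmitted". The labels and scores are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def blend(*model_scores):
    """Average the predicted probabilities of several models, row by row."""
    return [sum(row) / len(row) for row in zip(*model_scores)]

# Hypothetical test-set labels (1 = not readmitted) and model outputs.
labels = [1, 0, 1, 1, 0, 0]
model_a = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
model_b = [0.7, 0.6, 0.8, 0.5, 0.1, 0.4]

blended = blend(model_a, model_b)
print(round(auc(labels, blended), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value near 0.70 means the model ranks a randomly chosen positive case above a randomly chosen negative one about 70% of the time.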
What next?
- What other data can we obtain?
  - From the Ministry of Health?
  - From clinics?
  - From insurance companies?
- How can we make the data cleaner and more reliable?
- What is the real need of stakeholders?
Datamotus LLC
