Survival analysis methods are applied to Armenian healthcare data.
The analysis was done during the Armenian OpenData Hackathon, December 3-4, 2016.
Health care data - survival analysis, draft
1. DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS
Readmission probability prediction
using open health care data
David Budaghyan
Erik Hambardzumyan
Habet Madoyan
Lilit Simonyan
Lusine Sargsyan
Vahe Movsisyan
2. The project and the team
David Budaghyan
Erik Hambardzumyan
Habet Madoyan
Lilit Simonyan
Lusine Sargsyan
Vahe Movsisyan
Our team
The team was formed to participate in the OpenData hackathon
organized by Kolba Labs on December 3-4.
We continued working after the hackathon; this is the result of
those later efforts.
The findings here are not final and can be revised with better
data and a more thorough approach.
This is just a prototype; we are open to any criticism and
suggestions.
3. The context
The initial data was collected by the Ministry of Health
Around 13 million records
Each record is an encounter of a citizen with a health care institution
The data is messy, with a lot of missing values and inconsistencies
The following variables were present in the data:
Marz
Gender
Age Group
Payment Type
Eligibility Type
Encounter Purpose
Phc Diagnose Group
Encounter Outcome Treatment
4. The context: Data Transformation
The data is transformed to identify patients
Patient identification is done using 3 variables (gender,
birthday, marz). Thanks to the Fimetech team.
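A minimal sketch of this identification step, assuming each encounter record carries gender, birthday and marz fields. The field names and the sample records are hypothetical and are not taken from the Ministry of Health data. Two different people who share all three values would be merged into one pseudo-patient, so the key is only an approximation.

```python
from collections import defaultdict

def assign_patient_ids(encounters):
    """Group encounter records into pseudo-patients keyed by (gender, birthday, marz)."""
    ids = {}                      # (gender, birthday, marz) -> integer patient id
    grouped = defaultdict(list)   # patient id -> list of that patient's encounters
    for rec in encounters:
        key = (rec["gender"], rec["birthday"], rec["marz"])
        # The first time a key is seen it receives the next unused integer id.
        patient_id = ids.setdefault(key, len(ids))
        grouped[patient_id].append(rec)
    return grouped

# Hypothetical records, for illustration only.
encounters = [
    {"gender": "F", "birthday": "1980-02-01", "marz": "Lori", "date": "2015-01-10"},
    {"gender": "F", "birthday": "1980-02-01", "marz": "Lori", "date": "2015-03-02"},
    {"gender": "M", "birthday": "1975-07-19", "marz": "Shirak", "date": "2015-01-11"},
]
patients = assign_patient_ids(encounters)
```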
5. The goal of analysis
Can we predict patient readmission over time?
What is the probability that a patient will not need a new encounter within
30, 180, or 365 days after the first encounter?
Why?
Healthcare costs: If a mandatory health insurance program is introduced,
the model will allow predicting overall costs for the economy
Fraud detection: Fraud goes hand in hand with insurance; survival analysis helps to identify
deviant behavior (too many repeat visits for a given disease?)
Deviant behavior at the clinic/doctor level: Will help to understand skills gaps, ineffective
management, etc. at the level of local clinics
Modeling insurance premiums: If we understand the cost at the marz/disease level, we
can offer tailored premiums
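The probability of no new encounter within 30, 180 or 365 days can be estimated with a survival curve. The slides do not say which estimator was used, so the sketch below assumes a Kaplan-Meier estimator. The function names and the sample durations are hypothetical. A patient with no return visit by the end of the observation window is treated as censored.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct day on which a return visit occurred.

    durations: days from the first encounter to the next encounter,
               or to the end of observation if no return visit was seen
    observed:  True if a return visit was seen, False if censored
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        # Collect everyone whose duration equals t.
        while i < len(pairs) and pairs[i][0] == t:
            events += pairs[i][1]
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

def survival_at(curve, day):
    """Estimated probability of no return visit up to and including `day`."""
    s = 1.0
    for t, value in curve:
        if t > day:
            break
        s = value
    return s

# Hypothetical durations in days, for illustration only.
durations = [10, 20, 40, 200, 400, 400]
observed = [True, True, True, True, False, False]
curve = kaplan_meier(durations, observed)
estimates = {day: survival_at(curve, day) for day in (30, 180, 365)}
```

Unlike a plain proportion, this estimate still uses patients whose first visit falls close to the end of the data window.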
7. Findings
Marzes: Ararat, Armavir, Shirak and Yerevan have significant coefficients. Based on the signs of the
coefficients, Armavir and Shirak have a higher probability of readmission than Aragatsotn,
while Ararat and Yerevan have a lower readmission probability than Aragatsotn.
Gender: the significant negative coefficient of Gender_ID_Male indicates that females visit
hospitals more frequently than males.
Age: people aged 5-18 have a lower readmission probability than children aged 0-5, as shown by
the significant negative coefficients in the model output.
8. Eligibility: The only significant eligibility coefficient is the one for young men, and it has a
negative sign, meaning that armed forces have a higher attendance probability than young men.
Purpose: Hospital visits for control have a higher readmission probability than those for disease,
which can be explained by the fact that control visits imply regular attendance.
Diagnose: Certain infectious and parasitic diseases have a higher readmission probability than some heart
diseases.
Findings
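The slides do not name the regression model behind these coefficients. Assuming a Cox proportional hazards model, which is a common choice in survival analysis, the sign of a coefficient can be read as follows. Here $h_0(t)$ is the baseline hazard of readmission and $x_j$ is a dummy variable such as Gender_ID_Male; the notation is ours.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p),
\qquad
\frac{h(t \mid x_j = 1)}{h(t \mid x_j = 0)} = e^{\beta_j}
```

Under this assumption, a negative $\beta_j$ gives a hazard ratio below 1, meaning a lower readmission hazard than the reference category, and a positive $\beta_j$ gives a hazard ratio above 1.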
9. 30 days 180 days 365 days
Aragatsotn 0.522 0.106 0.046
Ararat 0.521 0.106 0.045
Armavir 0.504 0.103 0.041
Gegharqunik 0.533 0.103 0.042
Kotayk 0.466 0.095 0.039
Lori 0.554 0.132 0.063
Shirak 0.429 0.068 0.022
Syunik 0.485 0.138 0.069
Tavush 0.558 0.079 0.034
Vayots Dzor 0.635 0.163 0.070
Yerevan 0.486 0.119 0.047
Shirak and Tavush have the lowest survival probability at 180 and 365 days
(proportion of patients who didn't return to
hospital within t days after the first visit)
Vayots Dzor, Syunik, and Lori have the highest
Marz
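The table values are described as the proportion of patients who did not return within t days after the first visit. A minimal sketch of that calculation per group is shown here, assuming one row per patient with a hypothetical days_to_return field (None if the patient never returned). The sample patients are made up, and this simple proportion ignores censoring.

```python
from collections import defaultdict

HORIZONS = (30, 180, 365)  # days after the first visit

def no_return_proportions(patients, group_key):
    """Share of patients per group with no return visit within each horizon."""
    totals = defaultdict(int)
    stayed_away = defaultdict(lambda: [0] * len(HORIZONS))
    for p in patients:
        group = p[group_key]
        totals[group] += 1
        for i, horizon in enumerate(HORIZONS):
            # A patient counts as "not returned" if there is no second visit
            # at all, or if it happened later than the horizon.
            if p["days_to_return"] is None or p["days_to_return"] > horizon:
                stayed_away[group][i] += 1
    return {g: [c / totals[g] for c in stayed_away[g]] for g in totals}

# Hypothetical patients, for illustration only.
patients = [
    {"marz": "Lori", "days_to_return": 10},
    {"marz": "Lori", "days_to_return": 200},
    {"marz": "Lori", "days_to_return": None},
    {"marz": "Lori", "days_to_return": 400},
    {"marz": "Shirak", "days_to_return": 5},
    {"marz": "Shirak", "days_to_return": 100},
]
by_marz = no_return_proportions(patients, "marz")
```

The same function can be applied to any of the other grouping variables (gender, age group, payment type and so on) by changing group_key.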
10. 30 days 180 days 365 days
female 0.491 0.102 0.042
male 0.493 0.097 0.038
Survival probabilities are nearly identical for females and males, so gender and
hospital visit intensity appear unrelated in this comparison
Gender
11. 30 days 180 days 365 days
0-5 0.469 0.066 0.021
5-10 0.571 0.201 0.086
10-18 0.595 0.272 0.126
18-30 0.562 0.206 0.100
30-60 0.512 0.113 0.048
60+ 0.455 0.056 0.019
0-5 and 60+ are the most risky age groups
(lowest survival probability)
10-18 year old patients have the lowest
probability of readmission.
Age
12. 30 days 180 days 365 days
paid 0.513 0.138 0.032
state ordered 0.492 0.099 0.041
Patients with the state ordered payment type
have lower survival probability at 30 and 180 days, meaning their
visits are more frequent; at 365 days the ordering reverses (0.041 vs 0.032).
Payment type
13. 30 days 180 days 365 days
children vulnerable 0.479 0.069 0.021
disabled people 0.470 0.031 0.006
elderly people 0.466 0.041 0.019
pregnancy 0.514 0.105 0.045
social package beneficiary 0.593 0.341 0.249
The most frequent visitors are disabled and
elderly people, in contrast with social package
beneficiaries, who have the highest survival rate.
Eligibility
14. 30 days 180 days 365 days
disease 0.508 0.109 0.029
control 0.478 0.045 0.012
administrative 0.543 0.183 0.079
preventive 0.499 0.156 0.079
reproductive 0.532 0.126 0.041
other 0.418 0.044 0.009
As visits for administrative purposes are mostly
occasional, they have a higher survival rate,
whereas control purpose visits have one of the
lowest survival rates (only "other" is lower).
Encounter Purpose
15. 30 days 180 days 365 days
A 0.487 0.072 0.022
I 0.503 0.069 0.016
J 0.518 0.101 0.026
K 0.514 0.098 0.025
Certain infectious diseases (A) and heart diseases (I) have
lower survival probability than upper
respiratory infections (J) and sclerosis and other mental
disorders (K).
Diagnosis
16. 30 days 180 days 365 days
chronic condition 0.497 0.051 0.017
improvement 0.525 0.102 0.024
recovery 0.525 0.109 0.030
stabilisation 0.475 0.055 0.016
treatment stop 0.458 0.122 0.087
unchanged 0.501 0.088 0.030
worsening 0.407 0.054 0.012
Worsening, chronic condition and stabilisation have the
lowest survival rates at 180 and 365 days since they require more
attendances, whereas improvement, recovery and
treatment stop have the highest survival rates at 180 days since they
require fewer visits.
Encounter Outcome Treatment
17. Predicting single readmission (machine learning case)
The data is transformed so that each row is a person
There is an indicator variable showing whether the patient was readmitted, that is, has
more than 2 records in the database
The goal of the modeling is to predict the no-readmission rate based on 7 variables
(age, gender, payment, treatment, etc.)
The Business goal
Low readmission is a sign of good patient care
Low readmission means low insurance and healthcare costs
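A minimal sketch of the one-row-per-person transformation described on this slide. The field names, the sample encounters and the patient_id column (taken to be the result of the earlier identification step) are hypothetical. The record threshold is left as a parameter: min_records=3 mirrors the slide's "more than 2 records", while min_records=2 would flag any repeat visit.

```python
from collections import Counter

def one_row_per_person(encounters, min_records):
    """Collapse encounter records into one row per patient with a readmission flag."""
    counts = Counter(rec["patient_id"] for rec in encounters)
    rows = {}
    # Keep the attributes of each patient's earliest encounter
    # (ISO-formatted dates sort correctly as plain strings).
    for rec in sorted(encounters, key=lambda r: r["date"]):
        pid = rec["patient_id"]
        if pid not in rows:
            rows[pid] = {
                "patient_id": pid,
                "age_group": rec["age_group"],
                "gender": rec["gender"],
                "readmitted": counts[pid] >= min_records,
            }
    return list(rows.values())

# Hypothetical encounters, for illustration only.
encounters = [
    {"patient_id": 1, "date": "2015-01-10", "age_group": "30-60", "gender": "female"},
    {"patient_id": 1, "date": "2015-02-01", "age_group": "30-60", "gender": "female"},
    {"patient_id": 1, "date": "2015-06-15", "age_group": "30-60", "gender": "female"},
    {"patient_id": 2, "date": "2015-01-12", "age_group": "0-5", "gender": "male"},
    {"patient_id": 3, "date": "2015-03-03", "age_group": "60+", "gender": "male"},
    {"patient_id": 3, "date": "2015-03-20", "age_group": "60+", "gender": "male"},
]
rows = one_row_per_person(encounters, min_records=3)
no_readmission_rate = sum(not r["readmitted"] for r in rows) / len(rows)
```

The resulting rows, with the remaining variables added as extra columns, are the kind of table a classifier could be trained on to predict the indicator.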
20. What next?
What other data can we obtain?
From the Ministry of Health?
From clinics?
From insurance companies?
How can we make the data cleaner and more reliable?
What are the real needs of the stakeholders?