A walk-through, end to end and in detail, of how a machine learning process works for a healthcare-related model (here I picked the Length of Stay problem) as a touch point to start the discussion; the scope is set to POC.
How to build a Length of Stay model for a Proof-of-Concept project
1. AGENDA
• The difference between POC, Pilot, and Production
• What makes a healthcare POC special?
• Things to watch out for
• The most important steps
• The simplified data science process
• Conceptual architecture – how to deploy & consume a model
2. POC – stakeholder
Output:
• Model + prediction
• Dashboard
Pilot – end user
Output:
• User-oriented MVP
• Match Azure components vs. user flow
Production – integration
Output:
• Gains: value-adds when optimizing
• Explore vs. exploit
4. • Patient data anonymization
• Hard to control data quality
• Hard to add IoT data sources (can't join with individual patient data)
• Hard to change the existing way of working
• Hard to find a level of data science application that can be directly utilized to show impact
6. Start: Define clearly – the objective → goals
End: Map carefully – goals → deliverables
Engage frequently – the data science process & sub-activities
7. The importance of domain expertise
If a patient has blood pressure measurements from every hour, how should the model use those measurements? (Take the average? Take the daily average? Use some weighting function?)
The prediction has the most value if it is made at the beginning of the hospital visit (e.g. when moving the patient from the Emergency Clinic to the Surgery Ward), but obviously more data is available later in the stay (close to the discharge time), so where do we draw the line?
11. Does your model make $ for the company?
Innovation Officer: let's run a POC to understand how data science works.
Goal 1: Set recall as the main measurement of model performance.
Goal 2: Document all activities; make sure to include
a. why the model works
b. what it looks like in production
c. how to scale & integrate with IT
Goal 3: Make sure the deliverables include a killer-looking dashboard/app so I can easily show/tell others.
Mapping goals to deliverables:
• Goal 1 → deliverable 1 (model performance)
• Goal 2 → deliverable 2
• Goal 3 → deliverable 3
Project leader: advocate
14. SIMPLIFICATION OF THE DATA SCIENCE PROCESS
Group 1 – Business understanding + Data understanding
Group 2 – Data exploration + Feature engineering
Group 3 – Model selection + Performance evaluation
Group 4 – Model deployment + Application development
15. GROUP 1: INVESTIGATE WHETHER THE DATA SUPPORTS THE OBJECTIVE
(1) Break down the objective into goals and iterate through them to form the data scope and deliverables
(2) Evaluate whether the goals can be achieved with the available data
(3) Establish the data pipeline (on-prem + cloud)
16. CONSIDER THE DATA PIPELINE: ON-PREM & IN THE CLOUD
On-prem: merge the data sources, anonymize, explore and select features → save the flat table and upload it to a cloud datastore.
In the cloud: model training / selection / deployment → trained model, which can be wrapped into applications.
Back on-prem: download (or have pushed) the trained model, save it, and wrap it into applications.
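As a rough sketch of the on-prem-to-cloud hand-off: the anonymized flat table is written locally and then pushed to a cloud datastore. This assumes the AzureStor R package and a hypothetical storage account, container name, and access key.
library(AzureStor)
# write the anonymized flat table locally, then upload it to the cloud datastore
write.csv(flat_table, "flat_table_anonymized.csv", row.names = FALSE)
endp <- storage_endpoint("https://<account>.blob.core.windows.net", key = "<access-key>")
cont <- storage_container(endp, "los-poc-data")
storage_upload(cont, src = "flat_table_anonymized.csv", dest = "flat_table_anonymized.csv")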
17. EXAMPLE OF ON-PREM DATA MERGING → FLAT TABLE
Join the patient demographic info table, hospital visits table, diagnosis code table, lab results table, hospital departments info table, and nurse schedule table into one flat table.
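A minimal sketch of that merging step in R with dplyr, assuming hypothetical table and key names (patient_id, visit_id, department_id); the real schema and join keys will differ.
library(dplyr)
# join the source tables step by step into one flat table
flat_table <- visits %>%
  left_join(demographics, by = "patient_id") %>%
  left_join(diagnosis_codes, by = "visit_id") %>%
  left_join(lab_results, by = "visit_id") %>%
  left_join(departments, by = "department_id") %>%
  left_join(nurse_schedule, by = "department_id")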
18. THIS IS THE FLAT TABLE FOR FURTHER USE
Columns: pseudo ID | features | label
19. CONSIDER TIME . . .
• Time as part of the key: pseudo ID + time | features | label
• Time as part of the features: pseudo ID | features (incl. time) | label
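A sketch of both options in R, assuming a hypothetical hourly blood pressure table with columns visit_id, measured_at, and systolic_bp; it ties back to the earlier domain question of how to aggregate hourly measurements.
library(dplyr)
library(lubridate)
# Option 1: time as part of the key - one row per visit and day
bp_daily <- bp_measurements %>%
  mutate(day = as_date(measured_at)) %>%
  group_by(visit_id, day) %>%
  summarise(mean_bp = mean(systolic_bp, na.rm = TRUE),
            max_bp = max(systolic_bp, na.rm = TRUE),
            .groups = "drop")
# Option 2: time folded into the features - one row per visit
bp_visit <- bp_measurements %>%
  arrange(visit_id, measured_at) %>%
  group_by(visit_id) %>%
  summarise(first_bp = first(systolic_bp),
            bp_trend = last(systolic_bp) - first(systolic_bp),
            .groups = "drop")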
21. GROUP 2: EXPLORE DATA AND FEATURE ENGINEERING
(1) Archive all statistical plots + the scripts used + intermediate output data for reproducibility and documentation
(2) So, what about the time aspect?
(3) Include domain expertise in all activities (for transparency)
22. COMMON WAYS TO FIND OUTLIERS
• Find outliers by plots
• Find outliers by statistical methods
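A small sketch of both approaches, assuming the flat table from earlier and its hypothetical lengthofstay column.
# find outliers by plot
boxplot(flat_table$lengthofstay, main = "Length of stay (days)")
# find outliers by a statistical rule (1.5 * IQR beyond the quartiles)
q <- quantile(flat_table$lengthofstay, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
is_outlier <- (flat_table$lengthofstay < q[1] - 1.5 * iqr) |
  (flat_table$lengthofstay > q[2] + 1.5 * iqr)
flat_table[which(is_outlier), ]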
23. COMMON WAYS TO DEAL WITH MISSING DATA
Impute by mean / median / constant:
library(Hmisc)
dataset$column <- impute(dataset$column, mean)    # replace missing values with the mean
dataset$column <- impute(dataset$column, median)  # replace missing values with the median
Impute by k nearest neighbors:
library(DMwR)
# impute all columns except "column" using each row's k nearest neighbors
knnOutput <- knnImputation(dataset[, !names(dataset) %in% "column"])
24. DATA EXPLORATION → FEATURE SELECTION
Use statistical plots to find features that might have prediction power, e.g. lengthofstay (days) against rcount (recency count).
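A plot like the one on the slide could be reproduced with ggplot2, assuming the flat table has rcount and lengthofstay columns.
library(ggplot2)
# distribution of length of stay for each recency-count bucket
ggplot(flat_table, aes(x = factor(rcount), y = lengthofstay)) +
  geom_boxplot() +
  labs(x = "rcount (recency count)", y = "Length of stay (days)")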
31. FEATURE SELECTION – AUTOMATIC (EXAMPLE: GENETIC ALGORITHM, CONTINUED ...)
Each generation is produced by a cycle of fitness evaluation, selection, crossover, and mutation.
32. Run the genetic algorithm; the result shows which features the genetic algorithm selected.
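One way to run such a genetic-algorithm search in R is caret's gafs(); this is only a sketch under the assumption that flat_table holds the features and lengthofstay is the label, with deliberately tiny iteration counts.
library(caret)
set.seed(42)
ga_ctrl <- gafsControl(functions = rfGA,   # random-forest-based fitness function
                       method = "cv", number = 3)
ga_result <- gafs(x = subset(flat_table, select = -lengthofstay),
                  y = flat_table$lengthofstay,
                  iters = 5,               # generations of select / crossover / mutation
                  gafsControl = ga_ctrl)
ga_result$optVariables                     # the features selected by the genetic algorithm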
33. GROUP 3: MODEL DEVELOPMENT & PERFORMANCE EVALUATION
(1) What types of models are available to use, and what considerations apply?
(2) Beware of the trade-off between model explainability and model performance
(3) Can it be scaled up and out? (consider production)
40. When to use which Azure ML services to train & develop the model (the details are in another ppt).
Develop the model locally and train it on a remote compute target.
41. Split the processed dataset into a training set, a validation set, and a test set.
Train the ML models on the training set, check them on the validation set, and keep the models that pass the test set.
Select one winning model → document + dashboard → next step.
Local development (experiment) → remote compute (for training).
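A condensed sketch of this split-train-select cycle with caret, assuming flat_table and the lengthofstay label; the two model types are only examples, and cross-validation stands in for a separate validation set.
library(caret)
set.seed(42)
idx <- createDataPartition(flat_table$lengthofstay, p = 0.7, list = FALSE)
train_set <- flat_table[idx, ]
test_set <- flat_table[-idx, ]
ctrl <- trainControl(method = "cv", number = 5)   # validation via cross-validation
gbm_fit <- train(lengthofstay ~ ., data = train_set, method = "gbm",
                 trControl = ctrl, verbose = FALSE)
rf_fit <- train(lengthofstay ~ ., data = train_set, method = "rf",
                trControl = ctrl)
# check the candidates on the held-out test set and keep the winner
postResample(predict(gbm_fit, test_set), test_set$lengthofstay)
postResample(predict(rf_fit, test_set), test_set$lengthofstay)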
45. EXAMPLE OF MODEL EVALUATION
criteria          SGD Regressor   trees    RF       GBM      weights
Performance       86.30%          92.20%   91.40%   96.60%   50%
Interpretability  1               0.9      0.8      0.9      20%
Time to compute   1               0.9      0.2      0.2      10%
# of parameters   1               1        0.7      1        10%
Ranking           83%             56%      56%      58%      100%
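The weighted-criteria idea behind the table can be expressed as a weighted sum; the slide's exact normalization of the performance column isn't stated, so this sketch only illustrates the mechanism.
scores <- data.frame(
  model            = c("SGD Regressor", "trees", "RF", "GBM"),
  performance      = c(0.863, 0.922, 0.914, 0.966),
  interpretability = c(1, 0.9, 0.8, 0.9),
  time_to_compute  = c(1, 0.9, 0.2, 0.2),
  n_parameters     = c(1, 1, 0.7, 1))
weights <- c(performance = 0.5, interpretability = 0.2,
             time_to_compute = 0.1, n_parameters = 0.1)
# weighted sum of the criteria per model, then rank from best to worst
scores$ranking <- as.vector(as.matrix(scores[, names(weights)]) %*% weights)
scores[order(-scores$ranking), ]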
46. GROUP 4: APPLY THE MODEL
(1) In a POC, only a prototype is built
(2) In a Pilot, it is important to test out the entire data pipeline + architecture
(3) In Production, optimizing cost vs. performance and monitoring the model lifecycle (to retire/retrain the model) become important
(4) Application development is usually handed over to the IT department
47. DATA SCIENCE MACHINE LEARNING MODEL DEPLOYMENT (ONE-OFF)
Usually used in a POC to show the potential: a one-off deployment as a web API service.
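A one-off deployment of the winning model as a web API could look like this with the plumber package; this is only a sketch, assuming the model was saved earlier with saveRDS() and that the request body carries the feature values as JSON (file names and paths are hypothetical).
# plumber_api.R
library(plumber)
model <- readRDS("winning_model.rds")   # the winning model saved earlier

#* Predict length of stay for one patient record sent as JSON
#* @post /predict
function(req) {
  newdata <- as.data.frame(jsonlite::fromJSON(req$postBody))
  list(lengthofstay = predict(model, newdata))
}

# start the service: plumber::plumb("plumber_api.R")$run(port = 8000)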
49. DATA SCIENCE MACHINE LEARNING PIPELINE
A model train-deploy-manage architecture framework that consolidates everything into one pipeline.