How to build a Length of Stay model for a Proof of Concept project
AGENDA
• The difference between POC, Pilot, and Production
• What makes a healthcare POC special?
• Things to watch out for
• The most important step
• The simplified data science process
• Conceptual architecture: how to deploy & consume a model
POC
Stakeholder
Output:
• Model + prediction
• Dashboard
Pilot
End user
Output:
• User-oriented MVP
• Match Azure components to the user flow
Production
Integration
Output:
• Gains: value adds when optimizing
• Explore vs. exploit
WHAT MAKES A HEALTHCARE POC SPECIAL?
• Patient data anonymization
• Hard to control data quality
• Hard to add IoT data sources (can't join them with individual patient data)
• Hard to change the existing way of working
• Hard to find a level of data science application that can be directly used to show impact
THINGS TO WATCH OUT FOR
Generic key elements in data science that you want to get right from the start!
Start
Define clearly – the objective → goals
End
Map carefully – goals → deliverables
Engage frequently – the data science process & sub-activities
The importance of domain expertise
If a patient has blood pressure measurements from every hour, how should the model use those measurements? (Take the average? Take the daily average? Use some weighting function?)
The prediction has the most value if it is made at the beginning of the hospital visit (e.g. when moving the patient from the Emergency Clinic to the Surgery Ward), but obviously more data is available later in the stay (close to the discharge time), so where do we draw the line?
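As a hedged illustration of the first question, here is a minimal R sketch of three aggregation choices for hourly readings; the data frame bp_readings and its columns visit_id, timestamp, and systolic are hypothetical names, not from the deck.

library(dplyr)

# Hypothetical hourly readings: one row per visit_id, timestamp, systolic value
bp_features <- bp_readings %>%
  group_by(visit_id) %>%
  summarise(
    bp_mean_overall = mean(systolic, na.rm = TRUE),       # plain average over the visit
    bp_mean_last_day = mean(systolic[as.Date(timestamp) == max(as.Date(timestamp))],
                            na.rm = TRUE),                 # average over the most recent day
    bp_weighted = weighted.mean(systolic,
                                w = exp(as.numeric(difftime(timestamp, max(timestamp),
                                                            units = "hours"))),
                                na.rm = TRUE),             # recency-weighted average
    .groups = "drop")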
THE MOST IMPORTANT STEP
IT’S ALL ABOUT THE OBJECTIVE
(1) Define the objective
(2) Break the objective down into goals: goal 1, goal 2, goal 3, goal 4, ... goal N
(3) Deliverables: deliverable 1, deliverable 2, deliverable 3, deliverable 4, ... deliverable N
(4) Data science activities
EXAMPLE OF MAPPING (1)+(2) → (3)
Innovation Officer: "Does your model make $ for the company?" "Let's run a POC to understand how data science works."
Goal 1: Set recall as the main measurement of model performance (see the sketch after this slide).
Goal 2: Document all activities, making sure to include:
a. why the model works
b. what it looks like in production
c. how to scale and integrate with IT
Goal 3: Make sure the deliverables include a killer-looking dashboard/app so I can easily show and tell others.
(1)+(2) → (3): Model performance
Mapping Goal 1 to deliverable 1
Mapping Goal 2 to deliverable 2
Mapping Goal 3 to deliverable 3
(4) Project leader / advocate
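Goal 1 above names recall as the headline metric. As a minimal R sketch with made-up predictions and labels (not the deck's data): recall = true positives / (true positives + false negatives).

# Hypothetical binary predictions and ground-truth labels
pred   <- c(1, 1, 0, 1, 0)
actual <- c(1, 0, 0, 1, 1)

tp <- sum(pred == 1 & actual == 1)   # true positives
fn <- sum(pred == 0 & actual == 1)   # false negatives
recall <- tp / (tp + fn)             # 2 / 3 in this toy example
recall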
NOW THE DATA SCIENCE STUFF (4)
The Microsoft Data Science Process
SIMPLIFICATION OF THE DATA SCIENCE PROCESS
Group 1 – Business understanding + data understanding
Group 2 – Data exploration + feature engineering
Group 3 – Model selection + performance evaluation
Group 4 – Model deployment + application development
GROUP 1: INVESTIGATE WHETHER THE DATA SUPPORTS THE OBJECTIVE
(1) Break the objective down into goals and iterate through them to form the data scope and deliverables
(2) Evaluate whether the goals can be achieved with the available data
(3) Establish the data pipeline (on-prem + cloud)
CONSIDER THE DATA PIPELINE: ON-PREM & IN THE CLOUD
On-prem: merge and anonymize the data sources, then push/upload them to a cloud datastore.
In the cloud: data exploration and feature selection, then model training/selection/deployment; the trained model is wrapped into applications.
Back on-prem: the trained model can also be downloaded, saved, and wrapped into applications locally.
EXAMPLE OF ON-PREM DATA MERGING → FLAT TABLE
Join the source tables (patient demographic info, hospital visits, diagnosis codes, lab results, hospital departments info, nurse schedule) step by step into a single flat table.
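A minimal dplyr sketch of that merge; the table names and join keys (pseudo_id, visit_id, department_id, visit_date) are hypothetical, since the real hospital schema will differ.

library(dplyr)

flat_table <- patient_demographics %>%
  left_join(hospital_visits,  by = "pseudo_id") %>%
  left_join(diagnosis_codes,  by = c("pseudo_id", "visit_id")) %>%
  left_join(lab_results,      by = c("pseudo_id", "visit_id")) %>%
  left_join(department_info,  by = "department_id") %>%
  left_join(nurse_schedule,   by = c("department_id", "visit_date"))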
THIS IS THE FLAT TABLE FOR FURTHER USE
Columns: pseudo ID | features | label
CONSIDER TIME . . .
Option A – time as part of the key: pseudo ID + time | features | label
Option B – time as part of the features: pseudo ID | features (including time-derived columns) | label
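A hedged R sketch of the two layouts, assuming a hypothetical measurements frame with pseudo_id, timestamp, and blood_pressure columns.

library(dplyr)

# Option A: time as part of the key -> one row per patient per day
daily <- measurements %>%
  mutate(day = as.Date(timestamp)) %>%
  group_by(pseudo_id, day) %>%
  summarise(mean_bp = mean(blood_pressure, na.rm = TRUE), .groups = "drop")

# Option B: time folded into the features -> one row per patient,
# with time-derived columns such as the latest and the average reading
per_patient <- measurements %>%
  arrange(timestamp) %>%
  group_by(pseudo_id) %>%
  summarise(last_bp = last(blood_pressure),
            mean_bp = mean(blood_pressure, na.rm = TRUE),
            .groups = "drop")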
WHY DO IT IN THE CLOUD AT ALL?
GROUP 2: EXPLORE DATA AND FEATURE ENGINEERING
(1) Archive all statistical plots, the scripts used, and intermediate output data for reproducibility and documentation
(2) So what about the time aspect?
(3) Include domain expertise in all activities (for transparency)
COMMON WAYS TO FIND OUTLIERS
Find outliers with plots; find outliers with statistical methods.
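As a hedged base-R sketch of both approaches side by side, using the hypothetical flat_table and a length_of_stay column.

# Find outliers with a plot
boxplot(flat_table$length_of_stay, main = "Length of stay (days)")

# Find outliers with a statistical rule: flag values outside 1.5 * IQR
q   <- quantile(flat_table$length_of_stay, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
is_outlier <- flat_table$length_of_stay < q[1] - 1.5 * iqr |
              flat_table$length_of_stay > q[2] + 1.5 * iqr
sum(is_outlier, na.rm = TRUE)   # how many rows look suspicious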
COMMON WAYS TO DEAL WITH MISSING DATA
Impute with the mean / median / a constant, or impute with k-nearest neighbors:
library(Hmisc)
impute(dataset$column, mean)    # replace missing values with the mean
impute(dataset$column, median)  # replace missing values with the median
library(DMwR)
# replace missing values using each row's k nearest neighbors
knnOutput <- knnImputation(dataset[, !names(dataset) %in% "column"])
DATA EXPLORATION → FEATURE SELECTION
Use statistical plots to find features that might have predictive power, e.g. rcount (recency count) against length of stay (days).
DIMENSION REDUCTION
USE WITH CARE, ESPECIALLY IN THE HEALTHCARE DOMAIN → RISK OF LOSING INTERPRETABILITY!
Cluster features by correlation matrix
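A minimal base-R sketch of this idea (again assuming the hypothetical flat_table): cluster features on 1 - |correlation|, so strongly correlated features land in the same cluster and one representative per cluster can be kept.

num_features  <- flat_table[, sapply(flat_table, is.numeric)]
corr          <- cor(num_features, use = "pairwise.complete.obs")

# Treat 1 - |correlation| as a distance and cluster hierarchically
feat_clusters <- hclust(as.dist(1 - abs(corr)))
plot(feat_clusters)                      # dendrogram of feature similarity
groups <- cutree(feat_clusters, k = 5)   # hypothetical number of clusters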
FEATURE SELECTION – MANUAL
Use correlation
Forward/backward selection
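A hedged base-R sketch of forward/backward (stepwise) selection with step(), assuming a numeric length_of_stay label in the hypothetical flat_table.

full_model <- lm(length_of_stay ~ ., data = flat_table)
null_model <- lm(length_of_stay ~ 1, data = flat_table)

# Backward elimination: start from the full model, drop features while AIC improves
backward <- step(full_model, direction = "backward", trace = FALSE)

# Forward selection: start from the empty model, add features while AIC improves
forward <- step(null_model, direction = "forward",
                scope = formula(full_model), trace = FALSE)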
FEATURE SELECTION – AUTOMATIC (EXAMPLE: GENETIC ALGORITHM)
Crossover
Mutation
FEATURE SELECTION – AUTOMATIC (EXAMPLE: GENETIC ALGORITHM, CONTINUED...)
Each generation is produced by a cycle of fitness evaluation, selection, crossover, and mutation.
Run the genetic algorithm; the result shows which features the genetic algorithm selected.
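The r-bloggers post cited in the end notes drives the genetic algorithm by hand; as a hedged alternative sketch, caret's gafs() wraps the same select → crossover → mutation cycle. The random-forest fitness helper is just an illustrative choice, and flat_table / length_of_stay are hypothetical names.

library(caret)

ga_ctrl <- gafsControl(functions = rfGA,   # random-forest-based fitness helpers
                       method    = "cv",   # cross-validated fitness
                       number    = 5)

set.seed(42)
ga_result <- gafs(x = flat_table[, setdiff(names(flat_table), "length_of_stay")],
                  y = flat_table$length_of_stay,
                  iters = 10,              # number of generations
                  gafsControl = ga_ctrl)

ga_result$optVariables                     # features selected by the genetic algorithm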
GROUP 3: MODEL DEVELOPMENT & PERFORMANCE EVALUATION
(1) What types of models are available to use, and what are the considerations?
(2) Beware of the trade-off between model explainability and model performance
(3) Can it be scaled up and out? (consider production)
TYPES OF MODELS → SELECT THE ONES THAT FIT YOUR CRITERIA
Source: https://www.datasciencecentral.com/profiles/blog/show?id=6448529%3ABlogPost%3A598753&commentId=6448529%3AComment%3A708763&xg_source=activity
Understand what types of models can be used for the specific task.
Source: https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464
ALSO – WHAT'S NORMALLY USED IN HEALTHCARE
(Source in the end notes)
Mainly due to interpretability
MODEL TRAINING: DEVELOP LOCALLY → TRAIN ON REMOTE COMPUTE
When to use which Azure ML services to train & develop the model is detailed in another deck.
Develop the model locally and train it on a remote compute target.
Split the processed dataset into a training set, a validation set, and a test set.
Train the ML models, check them, keep the models that pass on the test set, and select one winning model.
Next step: document the winning model + build a dashboard.
Develop locally (experiment) → remote compute (for training).
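A minimal base-R sketch of this split-train-check flow, assuming the hypothetical flat_table with a length_of_stay label.

set.seed(42)
idx <- sample(c("train", "validation", "test"), nrow(flat_table),
              replace = TRUE, prob = c(0.7, 0.15, 0.15))

train_set <- flat_table[idx == "train", ]
valid_set <- flat_table[idx == "validation", ]
test_set  <- flat_table[idx == "test", ]

# Fit candidate models on train_set, compare them on valid_set,
# and confirm the single winning model on the held-out test_set
candidate <- lm(length_of_stay ~ ., data = train_set)
mean(abs(predict(candidate, valid_set) - valid_set$length_of_stay))   # validation MAE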
MODEL TRAINING: UTILIZING THE CLOUD → FAST DISTRIBUTED TRAINING
EVALUATE MODELS
• Model performance
• Costs vs. gains
• Other considerations
EXAMPLE OF MODEL EVALUATION
Criteria          SGD Regressor  Trees   RF      GBM     Weight
Performance       86.30%         92.20%  91.40%  96.60%  50%
Interpretability  1              0.9     0.8     0.9     20%
Time to compute   1              0.9     0.2     0.2     10%
# of parameters   1              1       0.7     1       10%
Ranking           83%            56%     56%     58%     100%
GROUP 4: APPLY THE MODEL
(1) In a POC, only a prototype is built
(2) In a Pilot, testing the entire data pipeline + architecture is important
(3) In Production, optimizing cost vs. performance and monitoring the model lifecycle (retiring/retraining the model) become important
(4) Application development is usually handed over to the IT department
DATA SCIENCE / MACHINE LEARNING MODEL DEPLOYMENT (ONE-OFF)
Usually used in a POC to show the potential: a one-off deployment as a web API service.
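The deck itself deploys through Azure ML web services (see the end notes). Purely as an illustration of the "model behind a web API" idea, here is a hedged R sketch using the plumber package, with a hypothetical saved model file los_model.rds.

# api.R
library(plumber)
library(jsonlite)

model <- readRDS("los_model.rds")   # hypothetical path to the trained model

#* Predict length of stay for one patient record sent as a JSON body
#* @post /predict
function(req) {
  newdata <- as.data.frame(fromJSON(req$postBody))
  list(predicted_los = predict(model, newdata = newdata))
}

# Start the one-off service with:
#   plumber::pr_run(plumber::pr("api.R"), port = 8000)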
DATA SCIENCE / MACHINE LEARNING PIPELINE
A model train-deploy-manage architecture framework consolidates everything in one pipeline.
ONE-STOP PIPELINE: TRAIN + DEPLOY + MANAGE THE ML MODEL
EMBEDDED DASHBOARD
Thank you

Editor's Notes

• #5: https://www.healthcatalyst.com/success_stories/machine-learning-to-reduce-readmissions-mission-health
• #14: Microsoft: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
• #23: http://r-statistics.co/Missing-Value-Treatment-With-R.html and https://blog.revolutionanalytics.com/2018/03/outliers.html
• #24: http://r-statistics.co/Missing-Value-Treatment-With-R.html and https://blog.revolutionanalytics.com/2018/03/outliers.html
• #31: https://www.r-bloggers.com/feature-selection-using-genetic-algorithms-in-r/
• #32: https://www.r-bloggers.com/feature-selection-using-genetic-algorithms-in-r/
• #39: https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning
• #42: https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-more-machine-learning
• #46: Refer to auto-ml-regression_LOS.html
• #52: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/enable-data-collection-for-models-in-aks/enable-data-collection-for-models-in-aks.ipynb