This document discusses machine learning engineering and the importance of addressing technical debt. It notes that while developing and deploying ML systems is fast, maintaining them over time can be difficult and expensive due to various sources of technical debt, such as complex models, expensive data dependencies, feedback loops, and changes in the external world. It provides examples and recommendations from papers on how to monitor systems, test features and data, and measure technical debt to help reduce maintenance costs over the long run.
10. ML engineering reading list
[Zinkevich] M. Zinkevich, Rules of Machine Learning: Best Practices for ML Engineering.
Reliable Machine Learning in the Wild - NIPS 2016 Workshop
11. ML engineering reading list
[Breck] E. Breck et al., What's your ML Test Score? A rubric for ML production systems.
Reliable Machine Learning in the Wild - NIPS 2016 Workshop
There is also a presentation on this topic:
https://sites.google.com/site/wildml2016nips/Sculley際際滷s1.pdf
12. One more cool thing about the above papers
[Hype-curve figure: visibility vs. time, marking where ML is now and where the discussed papers sit]
14. Wisdom learnt the hard way [Sculley]
As the machine learning (ML) community continues to accumulate years of experience with live systems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive. This dichotomy can be understood through the lens of technical debt (...)
17. Sources of technical debt in ML [Sculley]
Complex models erode boundaries
Expensive data dependencies
Feedback loops
Common anti-patterns
Configuration management deficiencies
Changes in the external world
18. Complex models erode boundaries [Sculley]
In programming we strive for separation of concerns, isolation, and encapsulation. More often than not, ML makes that difficult
Entanglement
CACE principle = changing anything changes everything (a toy sketch follows this slide)
Correction cascades
Undeclared consumers
Undeclared consumers are expensive at best and dangerous at worst
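A toy illustration of the CACE idea, using only numpy and made-up features (nothing here is from the paper): two inputs share a hidden cause, and merely reducing the noise on one of them shifts the learned weights of the correlated features as well, not just the weight of the improved one.

# CACE toy demo: "improving" one feature changes all correlated weights.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
latent = rng.normal(size=n)                   # shared hidden cause
x1 = latent + rng.normal(scale=1.0, size=n)   # noisy view of the latent signal
x2 = latent + rng.normal(scale=0.5, size=n)   # another, cleaner view
x3 = rng.normal(size=n)                       # unrelated feature
y = 2.0 * latent + rng.normal(scale=0.1, size=n)

def fit(first_feature):
    """Least-squares fit of y on [first_feature, x2, x3]."""
    X = np.column_stack([first_feature, x2, x3])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.round(w, 3)

print("weights before:", fit(x1))
# Reduce the noise on x1 only and retrain: the weights of x1 AND x2 both move.
x1_better = latent + rng.normal(scale=0.2, size=n)
print("weights after: ", fit(x1_better))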
19. Expensive data dependencies [Sculley]
Data dependencies cost more than code dependencies.
Unstable data dependencies
Underutilized data dependencies
Legacy features
Bundled features
Epsilon features
Correlated features, esp. with one root-cause feature
Static analysis of data dependencies is extremely helpful
Think workflow tools and provenance tracking!
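As a minimal sketch of what such static analysis could look like (the feature names are hypothetical, and the dependency map is assumed to be maintained by hand or extracted from a workflow tool), the transitive closure answers "what does this model ultimately depend on?":

# Toy static analysis of data dependencies via a hand-written dependency map.
deps = {
    "model_score": {"clicks_7d", "spam_score"},      # hypothetical names
    "clicks_7d":   {"raw_click_log"},
    "spam_score":  {"raw_text", "legacy_blacklist"},
}

def closure(node, deps):
    """Return every upstream dependency reachable from `node`."""
    seen = set()
    stack = [node]
    while stack:
        for parent in deps.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(closure("model_score", deps))
# -> clicks_7d, raw_click_log, spam_score, raw_text, legacy_blacklist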
20. Feedback loops [Sculley]
Direct feedback loops
Hidden feedback loops
Indirect feedback loops are especially difficult to track!
21. Common anti-patterns [Sculley]
Glue code
Real systems = 5% ML code + 95% glue code
Rewrite general-purpose packages or wrap them in a common API (see the sketch after this slide)
Pipeline jungles
Dead experimental code paths
Knight Capital case: $465M lost in 45 minutes due to an obsolete experimental code path
Abstraction debt
ML abstractions much less developed than, e.g., in relational databases
Bad code smells (less severe anti-patterns)
Plain old data smell
Multi-language smell
Prototype smell
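A hedged sketch of the "wrap in a common API" advice above: hide the general-purpose package behind one thin interface so its details do not leak into glue code everywhere. The class and method names are illustrative, not from the papers.

# Thin common API so serving code does not depend on any one ML package.
from typing import Protocol, Sequence

class Scorer(Protocol):
    def score(self, rows: Sequence[dict]) -> list[float]: ...

class SklearnScorer:
    """Adapter that keeps scikit-learn specifics out of the serving code."""
    def __init__(self, model, feature_order: Sequence[str]):
        self._model = model
        self._order = list(feature_order)

    def score(self, rows: Sequence[dict]) -> list[float]:
        X = [[row[name] for name in self._order] for row in rows]
        return list(self._model.predict(X))

# Serving code depends only on Scorer, so the backing package can be swapped
# (or rewritten in-house) without touching the rest of the pipeline.
def serve(scorer: Scorer, rows: Sequence[dict]) -> list[float]:
    return scorer.score(rows)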
22. Configuration debt [Sculley]
Another potentially surprising area where debt can accumulate is in the configuration of ML systems. (...) In a mature system which is being actively developed, the number of lines of configuration can far exceed the number of lines of the traditional code. Each configuration line has a potential for mistakes.
It should be easy to specify a configuration as a small change from a
previous configuration
Configurations should undergo a full code review and be checked into a
repository
It should be hard to make manual errors, omissions, or oversights
It should be easy to see, visually, the difference in configuration between
two models
It should be easy to automatically assert and verify basic facts about the
configuration: features used, transitive closure of data dependencies, etc.
It should be possible to detect unused or redundant settings
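A small sketch of what several of these recommendations could look like in practice, assuming configurations are plain Python objects (the field names are illustrative): a new configuration is a reviewable delta on a checked-in base, the difference between two models is easy to print, and basic facts can be asserted automatically.

# Configuration as data: diffs against a base config plus automated checks.
import dataclasses

@dataclasses.dataclass(frozen=True)
class ModelConfig:
    features: tuple                 # feature names the model consumes
    learning_rate: float = 0.05
    train_window_days: int = 30

BASE = ModelConfig(features=("clicks_7d", "spam_score"))

# A new experiment is expressed as a small change, not a fresh copy.
experiment = dataclasses.replace(BASE, learning_rate=0.01)

def diff(a, b):
    """Show exactly which fields differ between two configurations."""
    return {f.name: (getattr(a, f.name), getattr(b, f.name))
            for f in dataclasses.fields(a)
            if getattr(a, f.name) != getattr(b, f.name)}

print(diff(BASE, experiment))       # {'learning_rate': (0.05, 0.01)}

# Automated sanity check: every configured feature must be a known one.
KNOWN_FEATURES = {"clicks_7d", "spam_score", "country"}
assert set(experiment.features) <= KNOWN_FEATURES, "unknown feature in config"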
23. Changes in the external world [Sculley]
External world not stable and beyond control of ML system maintainers
Comprehensive live monitoring of the system is crucial for maintenance
What to monitor?
Prediction bias (a minimal check is sketched after this slide)
Action limits
Up-stream producers
Sample sources of problems:
Fixed or manually updated thresholds in configuration
Spurious/vanishing correlations
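As one concrete example of such live monitoring, a minimal prediction-bias check (the tolerance and the data below are made up): if the mean of the served predictions drifts away from the mean of the observed labels, something upstream has probably changed.

# Prediction-bias monitor: compare mean predicted rate with observed rate.
import numpy as np

def prediction_bias_alert(predictions, observed_labels, tolerance=0.02):
    """Alert when mean predicted rate and observed rate disagree too much."""
    bias = float(np.mean(predictions)) - float(np.mean(observed_labels))
    return abs(bias) > tolerance, bias

rng = np.random.default_rng(1)
preds = rng.uniform(0.0, 0.2, size=5_000)      # model scores being served
labels = rng.binomial(1, 0.05, size=5_000)     # what actually happened
fire, bias = prediction_bias_alert(preds, labels)
print(f"bias={bias:+.3f}, alert={fire}")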
24. Monitoring [Zinkevich]
Rule #8: Know the freshness requirements of your system
Rule #9: Detect problems before exporting models
Rule #10: Watch for silent failures
Rule #11: Give feature columns owners and documentation
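A minimal sketch around Rule #8, assuming the training timestamp is available to the serving side (the seven-day requirement is an invented example): the freshness requirement becomes an explicit, monitored number rather than an unstated assumption.

# Freshness check: alert when the serving model is older than allowed.
import datetime

MAX_MODEL_AGE = datetime.timedelta(days=7)     # freshness requirement

def model_is_stale(trained_at, now=None):
    """True when the serving model exceeds the freshness requirement."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now - trained_at > MAX_MODEL_AGE

trained_at = datetime.datetime(2016, 12, 1, tzinfo=datetime.timezone.utc)
check_time = datetime.datetime(2016, 12, 12, tzinfo=datetime.timezone.utc)
print(model_is_stale(trained_at, now=check_time))
# True -> block the export or page someone, rather than fail silently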
25. What should be tested/monitored in ML sys. [Breck]
Testing features and data
Test distribution, correlation, other statistical properties, cost of each feature ...
Testing model development
Test off-line scores vs. on-line performance (e.g., via A/B test), impact of
hyperparameters, impact of model freshness, quality on data slices,
comparison with simple baseline, ...
Testing ML infrastructure
Reproducibility of training, model quality before serving, fast roll-backs to
previous versions, ...
Monitoring ML in production
NaNs or infinities in the output, computational performance problems or RAM usage, decrease in quality of results, ...
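A hedged sketch of two of these checks, assuming batches arrive as numpy arrays and the expected ranges are supplied by the feature owner (the names and thresholds below are illustrative):

# Basic data tests (per feature) and a serving-side sanity check.
import numpy as np

def check_feature(values, name, min_ok, max_ok, max_null_frac=0.01):
    """Per-feature tests: missing-value rate and value range."""
    values = np.asarray(values, dtype=float)
    problems = []
    null_frac = np.mean(np.isnan(values))
    if null_frac > max_null_frac:
        problems.append(f"{name}: {null_frac:.1%} missing")
    finite = values[np.isfinite(values)]
    if finite.size and (finite.min() < min_ok or finite.max() > max_ok):
        problems.append(f"{name}: values outside [{min_ok}, {max_ok}]")
    return problems

def check_predictions(scores):
    """Monitoring-side check: no NaNs/infinities in what the model serves."""
    scores = np.asarray(scores, dtype=float)
    return [] if np.all(np.isfinite(scores)) else ["non-finite prediction scores"]

print(check_feature([0.1, 0.4, np.nan, 7.0], "ctr_7d", min_ok=0.0, max_ok=1.0))
print(check_predictions([0.2, np.inf]))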
26. Other areas of ML-related debt [Sculley]
Culture
Deletion of features, reduction of complexity, and improvements in reproducibility, stability, and monitoring are valued as much as (or more than!) improvements in accuracy
(...) This is most likely to occur within heterogeneous teams with strengths in
both ML research and engineering
Reproducibility debt
ML-system behaviour is difficult to reproduce exactly because of randomized algorithms, non-determinism inherent in parallel processing, reliance on initial conditions, interactions with the external world, ... (a minimal seeding sketch follows this slide)
Data testing debt
ML converts data into code. For that code to be correct, data need to be
correct. But how do you test data?
Process management debt
How are deployment, maintenance, configuration, and recovery of the infrastructure handled? Bad smell: a lot of manual work
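A minimal seeding sketch for the controllable part of reproducibility debt (everything here is standard Python/numpy, not from the paper); it does not remove the non-determinism coming from parallel execution or the external world, which still needs to be documented and monitored.

# Pin the explicit sources of randomness and record them with the artifacts.
import random
import numpy as np

def seeded_run(seed: int = 42):
    """Fix the random seeds we control and log them alongside the results."""
    random.seed(seed)
    np.random.seed(seed)
    rng = np.random.default_rng(seed)   # preferred: pass rng around explicitly
    weights = rng.normal(size=3)        # stand-in for "training"
    return {"seed": seed, "weights": weights.tolist()}

print(seeded_run())   # identical output on every run of this snippet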
27. Measuring technical debt [Sculley]
Does improving one model or signal degrade others?
What is the transitive closure of all data dependencies?
How easily can an entirely new algorithmic approach be tested at
full scale?
How precisely can the impact of a new change to the system
be measured?
How quickly can new members of the team be brought up to speed?