XGBOOST: A SCALABLE
TREE BOOSTING SYSTEM
(T. CHEN, C. GUESTRIN, 2016)
NATALLIE BAIKEVICH
HARDWARE ACCELERATION FOR
DATA PROCESSING SEMINAR
ETH ZÜRICH
MOTIVATION
• Effective statistical models
• Scalable system
• Successful real-world applications
XGBoost = eXtreme Gradient Boosting
BIAS-VARIANCE TRADEOFF
• Random Forest: reduces variance
• Boosting: reduces bias
• Both combine many trees by voting / summing their predictions
A BIT OF HISTORY
• AdaBoost, 1996
• Random Forests, 1999
• Gradient Boosting Machine, 2001
• Various improvements in tree boosting
• XGBoost package
• 1st Kaggle success: Higgs Boson Challenge
• 17/29 winning solutions in 2015
WHY DOES XGBOOST WIN "EVERY" MACHINE LEARNING COMPETITION?
(Master thesis, D. Nielsen, 2016)
Source: https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions
TREE ENSEMBLE
REGULARIZED LEARNING
OBJECTIVE
Prediction of the ensemble of K trees:
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

Objective = loss + regularization:
L = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)

\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
(T = number of leaves, w = leaf weights)

Source: http://xgboost.readthedocs.io/en/latest/model.html
SCORE CALCULATION
Per-leaf statistics built from 1st order and 2nd order gradients give each leaf its weight and the tree its score.
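Restated from the paper: the per-leaf gradient statistics, the optimal leaf weight, the structure score, and the split gain are

g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \qquad
h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i

w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad
\text{obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma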
ALGORITHM FEATURES
• Regularized objective
• Shrinkage and column subsampling
• Split finding: exact & approximate, global & local
• Weighted quantile sketch
• Sparsity-awareness
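Not from the slides, but as a rough pointer: most of these features map directly to hyperparameters of the xgboost Python package. A minimal sketch with placeholder data and arbitrary parameter values:

import numpy as np
import xgboost as xgb

# toy data, only so the snippet runs end to end
X = np.random.rand(1000, 20)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,               # shrinkage (learning rate)
    "colsample_bytree": 0.8,  # column subsampling
    "gamma": 1.0,             # min loss reduction to split (gamma in the objective)
    "lambda": 1.0,            # L2 regularization of leaf weights (lambda in the objective)
    "max_depth": 6,
    "tree_method": "approx",  # "exact" = exact greedy, "approx" = quantile-sketch splits
}
bst = xgb.train(params, dtrain, num_boost_round=100)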
SYSTEM DESIGN:
BLOCK STRUCTURE
Split-finding cost:
O(K d \|x\|_0 \log n) \;\to\; O(K d \|x\|_0 + \|x\|_0 \log B) with the block structure
(K = # trees, d = max depth, \|x\|_0 = # non-missing entries, B = max rows per block)

Blocks can be
• Distributed across machines
• Stored on disk in the out-of-core setting
Sorted column structure → split finding becomes a linear scan
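A toy illustration (not the actual C++ implementation) of why the pre-sorted column blocks help: with a precomputed sort order, the best split of one feature is found by a single linear scan over accumulated gradient statistics. Gathering g and h into contiguous buffers first also mirrors the cache-aware buffering idea on the next slide. Function and variable names here are hypothetical:

import numpy as np

def best_split_presorted(values, sort_idx, g, h, lam=1.0, gamma=0.0):
    # values: one feature column; g, h: per-row 1st/2nd order gradients
    # sort_idx: precomputed argsort of the column (stored once, like a block)
    fv = values[sort_idx]
    gs = g[sort_idx]          # gather into contiguous buffers (cache-friendly)
    hs = h[sort_idx]
    G, H = gs.sum(), hs.sum()
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for i in range(len(fv) - 1):
        GL += gs[i]
        HL += hs[i]
        if fv[i] == fv[i + 1]:            # only split between distinct values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam)
                      + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (fv[i] + fv[i + 1])
    return best_gain, best_thr

# usage: sort once, then each candidate split is a linear scan
x = np.random.rand(1000)
g = np.random.randn(1000)    # placeholder gradient statistics
h = np.ones(1000)
print(best_split_presorted(x, np.argsort(x), g, h))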
SYSTEM DESIGN:
CACHE-AWARE ACCESS
Problem: fetching gradient statistics by row index causes non-continuous memory access
Improved split finding:
• Allocate an internal buffer
• Prefetch gradient statistics into it
The effect is larger on bigger datasets than on smaller ones
SYSTEM DESIGN:
BLOCK STRUCTURE
• Block compression by columns (CSC): trades decompression work for reduced disk reading
• Block sharding: data split across multiple disks, prefetched by an independent thread per disk
• Block size trade-off: too large → cache misses; too small → inefficient parallelization
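To connect the out-of-core design to the package: the Python API exposes it by pointing DMatrix at an on-disk libsvm file with a cache-file suffix, as described in the xgboost external-memory documentation (file names below are placeholders, and the exact syntax varies between versions):

import xgboost as xgb

# the '#dtrain.cache' suffix asks xgboost to stream the data from disk,
# caching preprocessed blocks next to the source file
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
bst = xgb.train({"objective": "binary:logistic", "tree_method": "approx"},
                dtrain, num_boost_round=10)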
EVALUATION
• Single machine: AWS c3.8xlarge, 32 virtual cores, 2x 320 GB SSD, 60 GB RAM
• Distributed: 32 m3.2xlarge machines, each with 8 virtual cores, 2x 80 GB SSD, 30 GB RAM
DATASETS
Dataset      | n (examples) | m (features) | Task
Allstate     | 10M          | 4227         | Insurance claim classification
Higgs Boson  | 10M          | 28           | Event classification
Yahoo LTRC   | 473K         | 700          | Learning to rank
Criteo       | 1.7B         | 67           | Click-through rate prediction
WHAT'S NEXT?
• Model extensions: DART (+ dropouts), LinXGBoost
• Parallel processing: GPU, FPGA
• Tuning: hyperparameter optimization
• More applications

XGBoost:
• Scalability
• Weighted quantiles
• Sparsity-awareness
• Cache-awareness
• Data compression
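As a hedged sketch of two of the "what's next" items using the Python package: current xgboost releases ship a DART booster (boosted trees with dropout), and a crude hyperparameter search can be driven by the built-in cross validation. Parameter values below are arbitrary placeholders:

import numpy as np
import xgboost as xgb

X = np.random.rand(2000, 30)
y = (X[:, :3].sum(axis=1) > 1.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# DART booster: gradient boosting with dropped trees
dart_params = {
    "booster": "dart",
    "objective": "binary:logistic",
    "eta": 0.1,
    "rate_drop": 0.1,   # fraction of trees dropped each round
    "skip_drop": 0.5,   # probability of skipping dropout in a round
}

# crude hyperparameter search via built-in cross validation
best = None
for max_depth in (4, 6, 8):
    cv = xgb.cv({**dart_params, "max_depth": max_depth}, dtrain,
                num_boost_round=100, nfold=5,
                early_stopping_rounds=10, metrics="logloss")
    score = cv["test-logloss-mean"].min()
    if best is None or score < best[0]:
        best = (score, max_depth)
print("best max_depth:", best[1], "cv logloss:", best[0])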