Scaling XGBoost to Large-Scale
Clusters with Fault Tolerance and
Recovery
Aug 20, 2019
Chen Qin Big Data - ML @ Uber
1
Agenda
01 Production ML with Spark / XGBoost at Uber
02 Fault Recovery AllReduce on MapReduce
03 Results
2
Typical XGB Pipeline at Uber
USE CASES:
● Uber Map
● Uber Eats
● Uber Freight
● Uber Driver Safety
3
Distributed XGBoost - AllReduce
AllReduce
● Widely used in iterative distributed ML
● All-or-nothing: one failure fails all workers in the ring/tree graph
● Checkpoint-restart
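
The all-or-nothing behaviour and checkpoint-restart can be illustrated with a small simulation. This is a minimal sketch, not XGBoost or Rabit code: `ring_allreduce_sum`, `train_with_checkpoint_restart`, and `WorkerFailed` are hypothetical names, and a failed worker is modelled simply as a `None` contribution.

```python
from typing import List, Optional, Sequence, Tuple


class WorkerFailed(RuntimeError):
    """Raised when any participant in the collective is unreachable."""


def ring_allreduce_sum(values: Sequence[Optional[float]]) -> List[float]:
    """Simulate a sum allreduce; a None entry marks a failed worker."""
    total = 0.0
    for rank, value in enumerate(values):
        if value is None:
            # One missing participant aborts the collective for everyone.
            raise WorkerFailed(f"worker {rank} is down")
        total += value
    # Every worker ends up with the same reduced value.
    return [total] * len(values)


def train_with_checkpoint_restart(
    rounds: Sequence[Sequence[float]], fail_once_at: Optional[int] = None
) -> Tuple[float, int]:
    """Run all rounds; on failure, every worker restarts from the checkpoint."""
    checkpoint = {"round": 0, "model": 0.0}
    restarts = 0
    already_failed = False
    while checkpoint["round"] < len(rounds):
        current = checkpoint["round"]
        contributions: List[Optional[float]] = list(rounds[current])
        if current == fail_once_at and not already_failed:
            contributions[0] = None  # simulated one-off worker loss
            already_failed = True
        try:
            reduced = ring_allreduce_sum(contributions)
        except WorkerFailed:
            restarts += 1
            continue  # all workers resume from the last checkpoint
        checkpoint = {"round": current + 1,
                      "model": checkpoint["model"] + reduced[0]}
    return checkpoint["model"], restarts
```

In this sketch a single lost worker costs every worker a restart of the round, which is the cost the rest of the talk tries to reduce.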
4
Distributed XGBoost - MapReduce
MapReduce
● De facto standard in data processing
● Retryable parallel tasks in bipartite graph
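
By contrast, MapReduce-style tasks are independent, so only the failed task needs to be retried. The following is a hedged sketch of that idea in plain Python; `run_map_stage` and `flaky_sum` are hypothetical helpers, not Spark or Hadoop APIs, and the failure is simulated.

```python
from typing import Callable, List, Sequence, Tuple


def run_map_stage(
    partitions: Sequence[Sequence[int]],
    task: Callable[[int, Sequence[int], int], int],
    max_attempts: int = 3,
) -> Tuple[List[int], List[int]]:
    """Run one task per partition, retrying only the task that failed."""
    results: List[int] = []
    attempts_used: List[int] = []
    for index, partition in enumerate(partitions):
        for attempt in range(1, max_attempts + 1):
            try:
                value = task(index, partition, attempt)
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up on the job only after repeated failures
                continue  # retry this task; other tasks are unaffected
            results.append(value)
            attempts_used.append(attempt)
            break
    return results, attempts_used


def flaky_sum(index: int, partition: Sequence[int], attempt: int) -> int:
    """Map task that is preempted once on partition 1 (simulated)."""
    if index == 1 and attempt == 1:
        raise RuntimeError("simulated preemption")
    return sum(partition)


partials, attempts = run_map_stage([[1, 2], [3, 4], [5, 6]], flaky_sum)
total = sum(partials)  # the reduce side combines the partial results
```

Here only partition 1 runs twice; the other partitions keep their first result.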
5
AllReduce on MapReduce
“I will suggest to somehow build some
fault tolerance in the job/framework as you
can not guarantee all the machines in the
cluster are healthy … broadly we need to
have fault tolerance built into the job.”
6
Goals
● Improve the fault tolerance of AllReduce running on top of MapReduce frameworks
● Improve the stability of long-running XGB jobs on preemptible clusters
● Minimize the additional data reshuffling that failures cause on large datasets
7
Rabit - Retryable XGB Synchronization
Rabit was initially designed and implemented by Tianqi Chen, Ignacio Cano,
and Tianyi Zhou to solve the synchronization problem in iterative ML.
Rabit consists of two major parts that enable a retryable XGB trainer:
● AllReduce Consensus Protocol
○ Reach agreement on the next transaction
● Peer-to-Peer Recovery
○ Backfill a failed trainer so it can catch up
We recently contributed a bootstrap cache feature, debugging tools, fixes, etc.
8
AllReduce Consensus Protocol
Every node has to be alive and connected; each node sends a proposal when it
calls the Rabit API.
An Allreduce runs a reduce function over those proposals and shares the same
ack with every node, even when the nodes' proposals differ.
Certain proposal(s) are accepted, and the corresponding routing and message
passing are executed.
Eventually, every node reaches the same phase, where all proposals are
accepted at the same time and executed collectively.
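
The proposal/ack exchange can be sketched as a reduction over small per-node records. This is an illustrative assumption, not Rabit's actual data structures: the flag names, the `Proposal` type, and the decision rules in `next_action` are hypothetical and chosen only to show how one reduce function can give every node the same answer.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Sequence

# Hypothetical flags for illustration only.
LOAD_CHECKPOINT = 1  # a restarted node needs to reload state
CHECKPOINT = 2       # a node wants to take a checkpoint
DIFF_SEQ = 4         # set by the reducer when sequence numbers disagree


@dataclass(frozen=True)
class Proposal:
    flags: int
    seqno: int  # sequence number of the operation the node wants to run


def combine(a: Proposal, b: Proposal) -> Proposal:
    """Reduce function: OR the flags, keep the smallest sequence number."""
    flags = a.flags | b.flags
    if a.seqno != b.seqno:
        flags |= DIFF_SEQ
    return Proposal(flags, min(a.seqno, b.seqno))


def agree(proposals: Sequence[Proposal]) -> List[Proposal]:
    """Simulate the allreduce: every node receives the same ack."""
    ack = reduce(combine, proposals)
    return [ack] * len(proposals)


def next_action(ack: Proposal) -> str:
    """Each node derives the same next step from the shared ack."""
    if ack.flags & LOAD_CHECKPOINT:
        return "recover"
    if ack.flags & DIFF_SEQ:
        return f"replay seq {ack.seqno}"
    if ack.flags & CHECKPOINT:
        return "checkpoint"
    return f"execute seq {ack.seqno}"
```

Because every node applies `next_action` to an identical ack, a node that is behind or has just restarted pulls the others into helping it before the group moves on together.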
9
10
Failure Recovery - P2P
Allreduce/Broadcast results are cached for up to one iteration
A failed trainer backfills the latest checkpoint / results from its nearest
neighbour
Only the failed trainer retries (not the entire job)
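
The recovery steps above can be sketched as follows. This is a simplified model under stated assumptions, not Rabit's implementation: the `Worker` record, the ring-distance notion of "nearest", and the `backfill` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Worker:
    rank: int
    alive: bool = True
    checkpoint_version: int = 0
    checkpoint: Optional[dict] = None  # None means no state yet
    # Results of collectives since the last checkpoint: seqno -> value.
    result_cache: Dict[int, float] = field(default_factory=dict)


def nearest_healthy_neighbor(workers: List[Worker], rank: int) -> Worker:
    """Pick the closest peer (by ring distance) that is alive and has state."""
    n = len(workers)
    for distance in range(1, n):
        for candidate in ((rank + distance) % n, (rank - distance) % n):
            peer = workers[candidate]
            if peer.alive and peer.checkpoint is not None:
                return peer
    raise RuntimeError("no healthy peer holds a checkpoint")


def backfill(workers: List[Worker], rank: int) -> int:
    """Copy checkpoint and cached results into the restarted worker."""
    restarted = workers[rank]
    peer = nearest_healthy_neighbor(workers, rank)
    restarted.checkpoint_version = peer.checkpoint_version
    restarted.checkpoint = dict(peer.checkpoint)
    restarted.result_cache = dict(peer.result_cache)
    return peer.rank  # which peer served the recovery
```

Only the restarted worker changes state here; its peers keep running from where they were, which is the difference from restarting the entire job.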
11
Better AllReduce on MapReduce
Recall our goals:
● Improve the fault tolerance of AllReduce running on top of MapReduce frameworks
● Improve the stability of long-running XGB jobs on preemptible clusters
● Minimize the additional data reshuffling that failures cause on large datasets
12
Results
● Supports very large datasets
● Resilient to a series of failures/preemptions
● Also runs much faster 1? - 4x
● Production roll out: alpha customer tests run
13
Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains
information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified
that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate,
or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent
necessary for consultations with authorized personnel of Uber.
14
