Kubernetes Modifications for
GPUs
Sanjeev Mehrotra
Kubernetes resource scheduling
Terminology:
- Allocatable - what is available at a node
- Used - what is already being used from a node (called RequestedResource)
- Requests - what is requested by the container(s) for the pod
[Diagram: kubelets on Worker 1..N send Allocatable resources for their nodes; a Pod (Container) Spec carrying the container Requests arrives at the scheduler as a scheduling request; the scheduler keeps track of Used.]
Resources
• All resources (allocatable, used, and requests) are represented as a ResourceList, which is simply a list of key-value pairs, e.g.
memory: 64GiB
cpu: 8
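As a rough Go sketch of this shape (the actual ResourceList in pkg/api/types.go maps ResourceName to resource.Quantity; plain strings are used here to keep the example self-contained):

package main

import "fmt"

// ResourceName identifies a resource such as "cpu" or "memory".
type ResourceName string

// ResourceList maps resource names to amounts; the real Kubernetes type
// uses resource.Quantity values instead of strings.
type ResourceList map[ResourceName]string

func main() {
    allocatable := ResourceList{
        "cpu":    "8",
        "memory": "64Gi",
    }
    fmt.Println(allocatable)
}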
Simple scheduling
1. Find worker nodes that can fit a pod spec
• plugin/pkg/scheduler/algorithm/predicates
2. Prioritize the list of nodes
• plugin/pkg/scheduler/algorithm/priorities
3. Try to schedule the pod on a node - the node may have an additional admission policy, so the pod may still fail
4. If it fails, try the next node on the list
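A hedged Go sketch of this loop; the types and helper names (fits, prioritize, bind) are illustrative stand-ins, not the actual Kubernetes functions:

package scheduler

import "errors"

type Pod struct{ Name string }
type Node struct{ Name string }

// schedule mirrors steps 1-4 above: filter by predicates, order by
// priority, then attempt to bind, falling through to the next node on failure.
func schedule(pod Pod, nodes []Node,
    fits func(Pod, Node) bool,
    prioritize func(Pod, []Node),
    bind func(Pod, Node) error) (Node, error) {

    feasible := []Node{}
    for _, n := range nodes { // step 1: keep nodes that fit
        if fits(pod, n) {
            feasible = append(feasible, n)
        }
    }
    prioritize(pod, feasible)    // step 2: best candidates first
    for _, n := range feasible { // steps 3-4: try each node in order
        if err := bind(pod, n); err == nil {
            return n, nil
        }
    }
    return Node{}, errors.New("no feasible node admitted the pod")
}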
Find nodes that fit
• For simple scheduling, a node will NOT fit if
Allocatable < Request + Used
• Example:
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU,
        podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory,
        podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU,
        podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
}
Why do we need modifications?
• Only allows for constraints like the following in a pod spec:
Need 4 GPUs
• Does NOT allow for constraints like the following in a pod spec:
Need 4 GPUs with minimum memory 12GiB, OR
Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)
Solution 1
• Label nodes and use a node selector (see the sketch after this list)
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
• However, this is not optimal in cases with heterogeneous configurations
• For example, one machine may have GPUs of several types, some with large amounts of memory and some with small
• If a label is used, then we don't know which GPUs will get assigned. Thus only the minimally performant GPU can be used to label the node
• Also, even in homogeneous configurations, the kubelet running on the worker nodes needs to keep track of bookkeeping and which GPUs are in use
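For reference, a minimal Go sketch of the node-selector approach using the client-go pod types; the label key and value here are illustrative, not a Kubernetes convention:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // Pin the pod to nodes that carry a particular GPU-type label;
    // "gpu-type"/"tesla-k80" are made-up examples.
    pod := v1.Pod{
        Spec: v1.PodSpec{
            NodeSelector: map[string]string{"gpu-type": "tesla-k80"},
        },
    }
    fmt.Println(pod.Spec.NodeSelector)
}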
Solution 2 - Group Scheduler
• Define a richer syntax on ResourceLists to allow for such constraints to be scheduled
• Example:
• Instead of:
NvidiaGPU: 2
• Use something like the following - now the memory for each GPU is clearly specified:
Gpu/0/cards: 1
Gpu/0/memory: 12GiB
Gpu/1/cards: 1
Gpu/1/memory: 6GiB
• The "cards" resource is present to prevent sharing of GPU cards
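A minimal Go sketch of how such a per-GPU resource list could be assembled; the helper name is illustrative, not part of the actual patch:

package main

import "fmt"

// perGPUResources builds the hierarchical keys shown above for one GPU,
// e.g. "Gpu/0/cards" -> "1" and "Gpu/0/memory" -> "12GiB".
func perGPUResources(index int, memory string) map[string]string {
    prefix := fmt.Sprintf("Gpu/%d", index)
    return map[string]string{
        prefix + "/cards":  "1",
        prefix + "/memory": memory,
    }
}

func main() {
    fmt.Println(perGPUResources(0, "12GiB"))
    fmt.Println(perGPUResources(1, "6GiB"))
}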
Example - GPU with NVLink
• For 4 GPUs in two groups, each GPU connected via NVLink to the other in its group:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
[Diagram: GpuGrp0 contains Gpu0 and Gpu1; GpuGrp1 contains Gpu2 and Gpu3.]
Group scheduler
• All resource lists (allocatable, used, and requests) are specified in this manner
• Scheduling can no longer simply compare values with the same key to check fit
• e.g. a check like allocatable[memory] < used[memory] + requested[memory] no longer applies
• Example
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
Group scheduler
• The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for fit and allocation
• An allocation is a string-to-string key-value list which specifies a mapping from Requests to Allocatable (see the sketch after this example)
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
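For the example above, both requested GPUs need 10GiB, so any valid allocation must bind them to the 12GiB GPUs under GpuGrp/0. A sketch of that mapping in Go (the concrete map type is illustrative):

package scheduler

// One possible allocation: both 10GiB requests land on the 12GiB GPUs;
// the 8GiB GPUs under GpuGrp/1 cannot satisfy them.
var allocateFrom = map[string]string{
    "GpuGrp/A/Gpu/0/cards":  "GpuGrp/0/Gpu/0/cards",
    "GpuGrp/A/Gpu/0/memory": "GpuGrp/0/Gpu/0/memory",
    "GpuGrp/B/Gpu/1/cards":  "GpuGrp/0/Gpu/1/cards",
    "GpuGrp/B/Gpu/1/memory": "GpuGrp/0/Gpu/1/memory",
}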
Group Allocation
Allocatable:
Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1
Gpugrp1/0/Gpugrp0/0/gpu/dev1/cards: 1
Gpugrp1/0/Gpugrp0/1/gpu/dev2/cards: 1
Gpugrp1/0/Gpugrp0/1/gpu/dev3/cards: 1
Gpugrp1/1/Gpugrp0/2/gpu/dev4/cards: 1
Gpugrp1/1/Gpugrp0/2/gpu/dev5/cards: 1
Gpugrp1/1/Gpugrp0/3/gpu/dev6/cards: 1
Gpugrp1/1/Gpugrp0/3/gpu/dev7/cards: 1
Requests:
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1
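One valid allocation for this example, sketched as the same kind of request-to-allocatable mapping: R0's pair must share a single Gpugrp0 group, and R1's two pairs must land in two Gpugrp0 groups under the same Gpugrp1 (the specific assignment below is one of several valid results):

package scheduler

// R0 -> Gpugrp1/0 (its pair shares Gpugrp0/0); R1 -> Gpugrp1/1
// (its RA and RB pairs take Gpugrp0/2 and Gpugrp0/3 respectively).
var groupAllocation = map[string]string{
    "Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards": "Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards",
    "Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards": "Gpugrp1/0/Gpugrp0/0/gpu/dev1/cards",
    "Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards": "Gpugrp1/1/Gpugrp0/2/gpu/dev4/cards",
    "Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards": "Gpugrp1/1/Gpugrp0/2/gpu/dev5/cards",
    "Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards": "Gpugrp1/1/Gpugrp0/3/gpu/dev6/cards",
    "Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards": "Gpugrp1/1/Gpugrp0/3/gpu/dev7/cards",
}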
Main Modifications - scheduler side
1. Addition of an AllocateFrom field in the pod specification. This is a list of key-value pairs which specifies the mapping from Requests to Allocatable (a sketch follows this list)
pkg/api/types.go
2. Addition of the group scheduler code
plugin/pkg/scheduler/algorithm/predicates/grpallocate.go
plugin/pkg/scheduler/algorithm/scorer
3. Modification in the scheduler to write the pod update after scheduling and to call the group allocator
plugin/pkg/scheduler/generic_scheduler.go
plugin/pkg/scheduler/scheduler.go
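A hedged sketch of what the new field could look like; the shape is inferred from the description in point 1, not copied from the actual pkg/api/types.go change:

package api

// PodSpec (excerpt). AllocateFrom records, for each raw resource request
// key, the allocatable resource key the group scheduler bound it to, e.g.
// "GpuGrp/A/Gpu/0/cards" -> "GpuGrp/0/Gpu/0/cards".
type PodSpec struct {
    // ...existing fields elided...

    AllocateFrom map[string]string
}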
Kubelet modifications
• The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and uses /dev/nvidia* to see the number of devices - both of which are hacks
• With the addition of the AllocateFrom field, the scheduler decides which GPUs to use and keeps track of which ones are in use
Main Modifications - kubelet side
1. Use of AllocateFrom to decide which GPUs to use
2. Use of nvidia-docker-plugin to find GPUs, instead of looking at /dev/nvidia* (a query sketch follows this list)
• This is also needed to get richer information such as GPU memory, GPU type, and topology information (e.g. NVLink)
3. Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver)
4. Allow specification of a driver when specifying a mount - needed to use the nvidia-docker driver
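As a sketch of point 2: nvidia-docker-plugin (v1) serves GPU information over a local REST API, by default on port 3476; the exact path below is an assumption from the v1 plugin's conventions:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Query the local nvidia-docker-plugin for GPU info (port is the
    // plugin's default; the path is assumed, not verified here).
    resp, err := http.Get("http://localhost:3476/v1.0/gpu/info/json")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The JSON response describes each GPU: model, memory, topology, ...
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}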
Integration with community
• Eventual goal:
[Diagram: device plugins (e.g. for GPUs) tell the kubelets which resources to advertise and supply resource usage / docker params, so the kubelets themselves know nothing about GPUs; kubelets on Worker 1..N send Allocatable resources for their nodes; the Pod (Container) Spec with its container Requests arrives at the scheduler as a scheduling request; the scheduler keeps track of Used and asks a scheduler extender for fit; the extender performs the group allocation and writes an update to the pod spec with the allocation.]
Needed in Kubernetes core
• We will need a few things in order to achieve separation from the core, which will allow for directly using the latest Kubernetes binaries
• Resource Class, scheduled for v1.9, will allow for non-identity mappings between requests and allocatable
• Device plugins and native Nvidia GPU support are at v1.13 for now:
https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw
Other future Kubernetes/Scheduler work
• Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training)
• For example, networking constraints for distributed training when scheduling
• Container networking for faster cross-pod communication (e.g. using RDMA / IB)
