Kubernetes Modifications for
GPUs
Sanjeev Mehrotra
Kubernetes resource scheduling
Terminology:
- Allocatable - what is available at a node
- Used - what is already being used from a node (called RequestedResource)
- Requests - what is requested by the container(s) for the pod
[Diagram: kubelets on Worker 1..N send Allocatable resources for their nodes; a Pod (Container) Spec carrying the container Requests arrives at the scheduler as a scheduling request; the scheduler keeps track of Used.]
Resources
• All resources (allocatable, used, and requests) are represented as a ResourceList, which is simply a list of key-value pairs, e.g.
memory: 64GiB
cpu: 8
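As a rough Go sketch of this shape (the actual ResourceList in pkg/api/types.go maps ResourceName to resource.Quantity; plain strings are used here to keep the example self-contained):

package main

import "fmt"

// ResourceName identifies a resource such as "cpu" or "memory".
type ResourceName string

// ResourceList maps resource names to amounts; the real Kubernetes type
// uses resource.Quantity values instead of strings.
type ResourceList map[ResourceName]string

func main() {
    allocatable := ResourceList{
        "cpu":    "8",
        "memory": "64Gi",
    }
    fmt.Println(allocatable)
}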
Simple scheduling
1. Find worker nodes that can fit a pod spec
• plugin/pkg/scheduler/algorithm/predicates
2. Prioritize the list of nodes
• plugin/pkg/scheduler/algorithm/priorities
3. Try to schedule the pod on a node - the node may have an additional admission policy, so the pod may still fail
4. If it fails, try the next node on the list
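A hedged Go sketch of this loop; the types and helper names (fits, prioritize, bind) are illustrative stand-ins, not the actual Kubernetes functions:

package scheduler

import "errors"

type Pod struct{ Name string }
type Node struct{ Name string }

// schedule mirrors steps 1-4 above: filter by predicates, order by
// priority, then attempt to bind, falling through to the next node on failure.
func schedule(pod Pod, nodes []Node,
    fits func(Pod, Node) bool,
    prioritize func(Pod, []Node),
    bind func(Pod, Node) error) (Node, error) {

    feasible := []Node{}
    for _, n := range nodes { // step 1: keep nodes that fit
        if fits(pod, n) {
            feasible = append(feasible, n)
        }
    }
    prioritize(pod, feasible)    // step 2: best candidates first
    for _, n := range feasible { // steps 3-4: try each node in order
        if err := bind(pod, n); err == nil {
            return n, nil
        }
    }
    return Node{}, errors.New("no feasible node admitted the pod")
}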
Find nodes that fit
• For simple scheduling, a node will NOT fit if
Allocatable < Request + Used
• Example:
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceCPU,
        podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceMemory,
        podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.NvidiaGPU < podRequest.NvidiaGPU+nodeInfo.RequestedResource().NvidiaGPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(api.ResourceNvidiaGPU,
        podRequest.NvidiaGPU, nodeInfo.RequestedResource().NvidiaGPU, allocatable.NvidiaGPU))
}
Why do we need modifications?
• Only allows for constraints like the following in a pod spec:
Need 4 GPUs
• Does NOT allow for constraints like the following in a pod spec:
Need 4 GPUs with minimum memory 12GiB, OR
Need 2 GPUs with minimum memory 4GiB and 2 GPUs with 12GiB
Need 2 GPUs interconnected via NVLink (peer-to-peer for high-speed inter-GPU communication)
Solution 1
• Label nodes and use a node selector (see the sketch after this list)
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
• However, this is not optimal in cases with heterogeneous configurations
• For example, one machine may have GPUs of several types, some with large amounts of memory and some with small
• If a label is used, then we don't know which GPUs will get assigned. Thus only the minimally performant GPU can be used to label the node
• Also, even in homogeneous configurations, the kubelet running on the worker nodes needs to keep track of bookkeeping and which GPUs are in use
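For reference, a minimal Go sketch of the node-selector approach using the client-go pod types; the label key and value here are illustrative, not a Kubernetes convention:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // Pin the pod to nodes that carry a particular GPU-type label;
    // "gpu-type"/"tesla-k80" are made-up examples.
    pod := v1.Pod{
        Spec: v1.PodSpec{
            NodeSelector: map[string]string{"gpu-type": "tesla-k80"},
        },
    }
    fmt.Println(pod.Spec.NodeSelector)
}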
Solution 2 - Group Scheduler
• Define a richer syntax on ResourceLists to allow for such constraints to be scheduled
• Example:
• Instead of:
NvidiaGPU: 2
• Use something like the following - now the memory for each GPU is clearly specified:
Gpu/0/cards: 1
Gpu/0/memory: 12GiB
Gpu/1/cards: 1
Gpu/1/memory: 6GiB
• The "cards" resource is present to prevent sharing of GPU cards
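A minimal Go sketch of how such a per-GPU resource list could be assembled; the helper name is illustrative, not part of the actual patch:

package main

import "fmt"

// perGPUResources builds the hierarchical keys shown above for one GPU,
// e.g. "Gpu/0/cards" -> "1" and "Gpu/0/memory" -> "12GiB".
func perGPUResources(index int, memory string) map[string]string {
    prefix := fmt.Sprintf("Gpu/%d", index)
    return map[string]string{
        prefix + "/cards":  "1",
        prefix + "/memory": memory,
    }
}

func main() {
    fmt.Println(perGPUResources(0, "12GiB"))
    fmt.Println(perGPUResources(1, "6GiB"))
}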
Example - GPU with NVLink
• For 4 GPUs in two groups, each GPU connected via NVLink to the other in its group:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
[Diagram: GpuGrp0 contains Gpu0 and Gpu1; GpuGrp1 contains Gpu2 and Gpu3.]
Group scheduler
• All resource lists (allocatable, used, and requests) are specified in this manner
• Scheduling can no longer simply compare values with the same key to check fit
• e.g. a check like allocatable[memory] < used[memory] + requested[memory] no longer applies
• Example
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
Group scheduler
• The group scheduler uses hierarchical group allocation with arbitrary scorers to accomplish both checking for fit and allocation
• An allocation is a string-to-string key-value list which specifies a mapping from Requests to Allocatable (see the sketch after this example)
Allocatable:
GpuGrp/0/Gpu/0/cards: 1
GpuGrp/0/Gpu/0/memory: 12GiB
GpuGrp/0/Gpu/1/cards: 1
GpuGrp/0/Gpu/1/memory: 12GiB
GpuGrp/1/Gpu/2/cards: 1
GpuGrp/1/Gpu/2/memory: 8GiB
GpuGrp/1/Gpu/3/cards: 1
GpuGrp/1/Gpu/3/memory: 8GiB
Requested (two GPUs with minimum memory 10GiB, don't care about NVLink):
GpuGrp/A/Gpu/0/cards: 1
GpuGrp/A/Gpu/0/memory: 10GiB
GpuGrp/B/Gpu/1/cards: 1
GpuGrp/B/Gpu/1/memory: 10GiB
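For the example above, both requested GPUs need 10GiB, so any valid allocation must bind them to the 12GiB GPUs under GpuGrp/0. A sketch of that mapping in Go (the concrete map type is illustrative):

package scheduler

// One possible allocation: both 10GiB requests land on the 12GiB GPUs;
// the 8GiB GPUs under GpuGrp/1 cannot satisfy them.
var allocateFrom = map[string]string{
    "GpuGrp/A/Gpu/0/cards":  "GpuGrp/0/Gpu/0/cards",
    "GpuGrp/A/Gpu/0/memory": "GpuGrp/0/Gpu/0/memory",
    "GpuGrp/B/Gpu/1/cards":  "GpuGrp/0/Gpu/1/cards",
    "GpuGrp/B/Gpu/1/memory": "GpuGrp/0/Gpu/1/memory",
}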
Group Allocation
Allocatable:
Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards: 1
Gpugrp1/0/Gpugrp0/0/gpu/dev1/cards: 1
Gpugrp1/0/Gpugrp0/1/gpu/dev2/cards: 1
Gpugrp1/0/Gpugrp0/1/gpu/dev3/cards: 1
Gpugrp1/1/Gpugrp0/2/gpu/dev4/cards: 1
Gpugrp1/1/Gpugrp0/2/gpu/dev5/cards: 1
Gpugrp1/1/Gpugrp0/3/gpu/dev6/cards: 1
Gpugrp1/1/Gpugrp0/3/gpu/dev7/cards: 1
Requests:
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards: 1
Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards: 1
Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards: 1
Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards: 1
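One valid allocation for this example, sketched as the same kind of request-to-allocatable mapping: R0's pair must share a single Gpugrp0 group, and R1's two pairs must land in two Gpugrp0 groups under the same Gpugrp1 (the specific assignment below is one of several valid results):

package scheduler

// R0 -> Gpugrp1/0 (its pair shares Gpugrp0/0); R1 -> Gpugrp1/1
// (its RA and RB pairs take Gpugrp0/2 and Gpugrp0/3 respectively).
var groupAllocation = map[string]string{
    "Gpugrp1/R0/Gpugrp0/RA/gpu/gpu0/cards": "Gpugrp1/0/Gpugrp0/0/gpu/dev0/cards",
    "Gpugrp1/R0/Gpugrp0/RA/gpu/gpu1/cards": "Gpugrp1/0/Gpugrp0/0/gpu/dev1/cards",
    "Gpugrp1/R1/Gpugrp0/RA/gpu/gpu2/cards": "Gpugrp1/1/Gpugrp0/2/gpu/dev4/cards",
    "Gpugrp1/R1/Gpugrp0/RA/gpu/gpu3/cards": "Gpugrp1/1/Gpugrp0/2/gpu/dev5/cards",
    "Gpugrp1/R1/Gpugrp0/RB/gpu/gpu4/cards": "Gpugrp1/1/Gpugrp0/3/gpu/dev6/cards",
    "Gpugrp1/R1/Gpugrp0/RB/gpu/gpu5/cards": "Gpugrp1/1/Gpugrp0/3/gpu/dev7/cards",
}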
Main Modifications - scheduler side
1. Addition of an AllocateFrom field in the pod specification. This is a list of key-value pairs which specifies the mapping from Requests to Allocatable (a sketch follows this list)
pkg/api/types.go
2. Addition of the group scheduler code
plugin/pkg/scheduler/algorithm/predicates/grpallocate.go
plugin/pkg/scheduler/algorithm/scorer
3. Modification in the scheduler to write the pod update after scheduling and to call the group allocator
plugin/pkg/scheduler/generic_scheduler.go
plugin/pkg/scheduler/scheduler.go
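A hedged sketch of what the new field could look like; the shape is inferred from the description in point 1, not copied from the actual pkg/api/types.go change:

package api

// PodSpec (excerpt). AllocateFrom records, for each raw resource request
// key, the allocatable resource key the group scheduler bound it to, e.g.
// "GpuGrp/A/Gpu/0/cards" -> "GpuGrp/0/Gpu/0/cards".
type PodSpec struct {
    // ...existing fields elided...

    AllocateFrom map[string]string
}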
Kubelet modifications
• The existing multi-GPU code makes the kubelet do the work of keeping track of which GPUs are available, and uses /dev/nvidia* to see the number of devices - both of which are hacks
• With the addition of the AllocateFrom field, the scheduler decides which GPUs to use and keeps track of which ones are in use
Main Modifications - kubelet side
1. Use of AllocateFrom to decide which GPUs to use
2. Use of nvidia-docker-plugin to find GPUs, instead of looking at /dev/nvidia* (a query sketch follows this list)
• This is also needed to get richer information such as GPU memory, GPU type, and topology information (e.g. NVLink)
3. Use of nvidia-docker-plugin to find the correct location for the NVIDIA drivers inside the container (in conjunction with the nvidia-docker driver)
4. Allow specification of a driver when specifying a mount - needed to use the nvidia-docker driver
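As a sketch of point 2: nvidia-docker-plugin (v1) serves GPU information over a local REST API, by default on port 3476; the exact path below is an assumption from the v1 plugin's conventions:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Query the local nvidia-docker-plugin for GPU info (port is the
    // plugin's default; the path is assumed, not verified here).
    resp, err := http.Get("http://localhost:3476/v1.0/gpu/info/json")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The JSON response describes each GPU: model, memory, topology, ...
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}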
Integration with community
• Eventual goal:
[Diagram: device plugins (e.g. for GPUs) tell the kubelets which resources to advertise and supply resource usage / docker params, so the kubelets themselves know nothing about GPUs; kubelets on Worker 1..N send Allocatable resources for their nodes; the Pod (Container) Spec with its container Requests arrives at the scheduler as a scheduling request; the scheduler keeps track of Used and asks a scheduler extender for fit; the extender performs the group allocation and writes an update to the pod spec with the allocation.]
Needed in Kubernetes core
• We will need a few things in order to achieve separation from the core, which will allow for directly using the latest Kubernetes binaries
• Resource Class, scheduled for v1.9, will allow for non-identity mappings between requests and allocatable
• Device plugins and native Nvidia GPU support are at v1.13 for now:
https://docs.google.com/a/google.com/spreadsheets/d/1NWarIgtSLsq3izc5wOzV7ItdhDNRd-6oBVawmvs-LGw
Other future Kubernetes/Scheduler work
• Pod placement using other constraints, such as pod-level constraints or higher (e.g. multiple pods for distributed training)
• For example, networking constraints for distributed training when scheduling
• Container networking for faster cross-pod communication (e.g. using RDMA / IB)
