
【Kubernetes】Device Plugin

Kubernetes natively supports discovery of CPU and memory resources, but there are many other devices the kubelet cannot handle out of the box, such as GPUs, FPGAs, RDMA NICs, storage devices, and other heterogeneous compute devices. Using these devices requires vendor-specific initialization and setup. Following Kubernetes' out-of-tree philosophy, vendor device-initialization code should not live alongside the Kubernetes core code. Instead, we need a mechanism that lets device vendors report their device resources to the kubelet without modifying Kubernetes core code. That is where the Device Plugin mechanism comes from. This article explains how Device Plugins work and how to use them.

How Device Plugins Work

A Device Plugin is essentially a gRPC server. It is usually deployed as a DaemonSet, with /var/lib/kubelet/device-plugins mounted into the container as a Volume. It can also be started by hand, but then you lose automatic recovery after a failure.

Using a particular vendor's device generally takes two steps:

  • kubectl create -f http://vendor.com/device-plugin-daemonset.yaml
  • run kubectl describe nodes; the device now appears in the node status as vendor-domain/vendor-device

Once a Device Plugin has registered itself, the kubelet interacts with it over RPC:

  • ListAndWatch(): lets the kubelet discover the device resources and their properties, and receive notifications whenever the devices change (see the sketch after this list)
  • Allocate(): called by the kubelet before creating a container, to request the devices it needs
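
To make the protocol concrete, here is a minimal ListAndWatch sketch in Go, assuming the v1beta1 API from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1; the examplePlugin type and its device list are illustrative, not part of the API:

package main

import (
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// examplePlugin is an illustrative plugin holding a static device list.
type examplePlugin struct {
    devices []*pluginapi.Device // e.g. {ID: "dev-0", Health: pluginapi.Healthy}
    update  chan struct{}       // signaled whenever a device's state changes
}

// ListAndWatch sends the full device list once, then resends it on
// every state change so the kubelet always sees the current view.
func (p *examplePlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devices}); err != nil {
        return err
    }
    for range p.update {
        if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devices}); err != nil {
            return err
        }
    }
    return nil
}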

Process

Registration

To make the kubelet aware of its existence, a Device Plugin must send a registration request to the kubelet; only after that will the kubelet interact with it over gRPC. The process is:

  • The Device Plugin sends a RegisterRequest to the kubelet
  • The kubelet replies with a RegisterResponse; if the kubelet hits any error, it attaches the error to the response
  • If the Device Plugin receives no error, it starts its gRPC server

After starting, the plugin must keep monitoring the kubelet's state and re-register itself after a kubelet restart. For example, the kubelet wipes the /var/lib/kubelet/device-plugins/ directory right after it starts, so a plugin author can watch for the deletion of the unix socket the plugin listens on and re-register when that happens.
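
As an illustration, a minimal registration sketch in Go, assuming the v1beta1 API from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1 (the socket name example.sock and the resource name are placeholders):

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func register() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Dial the kubelet's registration endpoint over its unix socket.
    conn, err := grpc.DialContext(ctx, "unix://"+pluginapi.KubeletSocket,
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return err
    }
    defer conn.Close()

    client := pluginapi.NewRegistrationClient(conn)
    _, err = client.Register(ctx, &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,      // e.g. "v1beta1"
        Endpoint:     "example.sock",         // socket name under /var/lib/kubelet/device-plugins/
        ResourceName: "vendor-domain/device", // the schedulable resource name (placeholder)
    })
    return err
}

func main() {
    if err := register(); err != nil {
        log.Fatalf("device plugin registration failed: %v", err)
    }
}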

Unix Socket

The Device Plugin and the kubelet talk gRPC over a unix socket. When it starts its gRPC server, the Device Plugin creates a unix socket under the /var/lib/kubelet/device-plugins/ host path, for example /var/lib/kubelet/device-plugins/nvidiaGPU.sock.
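
A minimal sketch of the server side, again assuming the v1beta1 API; example.sock is a placeholder, and plugin stands for any type implementing pluginapi.DevicePluginServer:

package main

import (
    "net"
    "os"
    "path"

    "google.golang.org/grpc"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// serve starts the plugin's gRPC server on a unix socket under
// /var/lib/kubelet/device-plugins/.
func serve(plugin pluginapi.DevicePluginServer) error {
    socket := path.Join(pluginapi.DevicePluginPath, "example.sock")
    _ = os.Remove(socket) // clean up a stale socket from a previous run

    lis, err := net.Listen("unix", socket)
    if err != nil {
        return err
    }

    server := grpc.NewServer()
    pluginapi.RegisterDevicePluginServer(server, plugin)
    return server.Serve(lis) // blocks; typically run in a goroutine
}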

Points to keep in mind when implementing a Device Plugin:

  • On startup, the plugin registers with the kubelet via /var/lib/kubelet/device-plugins/kubelet.sock, providing its own unix socket name, the API version, and its resource name (in the form vendor-domain/resource, e.g. nvidia.com/gpu). The kubelet exposes these devices in the node status so the scheduler can use them
  • After startup, the plugin keeps sending the kubelet its device list, allocates devices on demand, and continuously monitors device health

Protocol Overview

API specification

// Registration is the service advertised by the Kubelet
// Only when Kubelet answers with a success code to a Register Request
// may Device Plugins start their service
// Registration may fail when device plugin version is not supported by
// Kubelet or the registered resourceName is already taken by another
// active device plugin. Device plugin is expected to terminate upon registration failure
service Registration {
    rpc Register(RegisterRequest) returns (Empty) {}
}

message DevicePluginOptions {
    // Indicates if PreStartContainer call is required before each container start
    bool pre_start_required = 1;
}

message RegisterRequest {
    // Version of the API the Device Plugin was built against
    string version = 1;
    // Name of the unix socket the device plugin is listening on
    // PATH = path.Join(DevicePluginPath, endpoint)
    string endpoint = 2;
    // Schedulable resource name. As of now it's expected to be a DNS Label
    string resource_name = 3;
    // Options to be communicated with Device Manager
    DevicePluginOptions options = 4;
}

message Empty {
}

// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
    // GetDevicePluginOptions returns options to be communicated with Device
    // Manager
    rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

    // ListAndWatch returns a stream of List of Devices
    // Whenever a Device state changes or a Device disappears, ListAndWatch
    // returns the new list
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

    // Allocate is called during container creation so that the Device
    // Plugin can run device specific operations and instruct Kubelet
    // of the steps to make the Device available in the container
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

    // PreStartContainer is called, if indicated by Device Plugin during registration phase,
    // before each container start. Device plugin can run device specific operations
    // such as resetting the device before making devices available to the container
    rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}

// ListAndWatch returns a stream of List of Devices
// Whenever a Device state changes or a Device disappears, ListAndWatch
// returns the new list
message ListAndWatchResponse {
    repeated Device devices = 1;
}

/* E.g:
 * struct Device {
 *     ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
 *     State: "Healthy",
 * }
 */
message Device {
    // A unique ID assigned by the device plugin used
    // to identify devices during the communication
    // Max length of this field is 63 characters
    string ID = 1;
    // Health of the device, can be healthy or unhealthy, see constants.go
    string health = 2;
}

// - PreStartContainer is expected to be called before each container start if indicated by plugin during registration phase.
// - PreStartContainer allows kubelet to pass reinitialized devices to containers.
// - PreStartContainer allows Device Plugin to run device specific operations on
//   the Devices requested
message PreStartContainerRequest {
    repeated string devicesIDs = 1;
}

// PreStartContainerResponse will be sent by plugin in response to PreStartContainerRequest
message PreStartContainerResponse {
}

// - Allocate is expected to be called during pod creation since allocation
//   failures for any container would result in pod startup failure.
// - Allocate allows kubelet to expose additional artifacts in a pod's
//   environment as directed by the plugin.
// - Allocate allows Device Plugin to run device specific operations on
//   the Devices requested
message AllocateRequest {
    repeated ContainerAllocateRequest container_requests = 1;
}

message ContainerAllocateRequest {
    repeated string devicesIDs = 1;
}

// AllocateResponse includes the artifacts that need to be injected into
// a container for accessing 'deviceIDs' that were mentioned as part of
// 'AllocateRequest'.
// Failure Handling:
// if Kubelet sends an allocation request for dev1 and dev2.
// Allocation on dev1 succeeds but allocation on dev2 fails.
// The Device plugin should send a ListAndWatch update and fail the
// Allocation request
message AllocateResponse {
    repeated ContainerAllocateResponse container_responses = 1;
}

message ContainerAllocateResponse {
    // List of environment variables to be set in the container to access one or more devices.
    map<string, string> envs = 1;
    // Mounts for the container.
    repeated Mount mounts = 2;
    // Devices for the container.
    repeated DeviceSpec devices = 3;
    // Container annotations to pass to the container runtime
    map<string, string> annotations = 4;
}

// Mount specifies a host volume to mount into a container.
// where device library or tools are installed on host and container
message Mount {
    // Path of the mount within the container.
    string container_path = 1;
    // Path of the mount on the host.
    string host_path = 2;
    // If set, the mount is read-only.
    bool read_only = 3;
}

// DeviceSpec specifies a host device to mount into a container.
message DeviceSpec {
    // Path of the device within the container.
    string container_path = 1;
    // Path of the device on the host.
    string host_path = 2;
    // Cgroups permissions of the device, candidates are one or more of
    // * r - allows container to read from the specified device.
    // * w - allows container to write to the specified device.
    // * m - allows container to create device files that do not yet exist.
    string permissions = 3;
}

Plugin lifecycle management

On startup, the plugin registers with the kubelet over gRPC via /var/lib/kubelet/device-plugins/kubelet.sock, providing the unix socket it listens on, the API version, and the resource name (for example nvidia.com/gpu). The kubelet exposes these devices in the node status and reports them to the API server as Extended Resources, which the scheduler then uses when placing pods.

After the plugin starts, the kubelet establishes a long-lived ListAndWatch connection to it. When the plugin detects that a device has become unhealthy, it proactively notifies the kubelet. If the device is idle, the kubelet moves it out of the allocatable list; if the device is already in use by a pod, the kubelet kills that pod.
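
A hedged sketch of that notification path, reusing the illustrative examplePlugin type from the ListAndWatch sketch above (the device-specific health check that triggers it is omitted):

// markUnhealthy flags one device as unhealthy and wakes up the
// ListAndWatch stream, which then resends the device list so the
// kubelet can update the node's allocatable devices.
func (p *examplePlugin) markUnhealthy(id string) {
    for _, dev := range p.devices {
        if dev.ID == id {
            dev.Health = pluginapi.Unhealthy
        }
    }
    p.update <- struct{}{}
}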

The plugin can also use the kubelet's socket to keep checking the kubelet's state: if the kubelet restarts, the plugin restarts as well and re-registers itself with the kubelet.
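
One common way to do this is to watch the plugin directory with the fsnotify library (github.com/fsnotify/fsnotify) and treat the recreation of kubelet.sock as a kubelet restart; a sketch under the same v1beta1 API assumption:

package main

import (
    "log"

    "github.com/fsnotify/fsnotify"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// watchKubelet signals on restart whenever kubelet.sock is recreated,
// which happens when the kubelet restarts and wipes the plugin directory.
func watchKubelet(restart chan<- struct{}) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer watcher.Close()

    if err := watcher.Add(pluginapi.DevicePluginPath); err != nil {
        return err
    }
    for event := range watcher.Events {
        if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
            log.Println("kubelet.sock recreated; re-registering the plugin")
            restart <- struct{}{}
        }
    }
    return nil
}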

NVIDIA Device Plugin

NVIDIA provides a GPU device plugin built on the Device Plugin interface: NVIDIA/k8s-device-plugin.

Build

git clone https://github.com/NVIDIA/k8s-device-plugin
cd k8s-device-plugin
docker build -t nvidia-device-plugin:1.0.0 .

Deploy

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

Requesting GPU resources when creating a Pod

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvidia/cuda
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        nvidia.com/gpu: 1

Note: this plugin requires nvidia-docker 2.0 with nvidia configured as the default runtime (i.e. the docker daemon option --default-runtime=nvidia). nvidia-docker 2.0 can be installed as follows (using Ubuntu Xenial as an example; installation steps for other distributions can be found here):

# Configure repository
curl -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
sudo tee /etc/apt/sources.list.d/nvidia-docker.list <<< \
"deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64 /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64 /
deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64 /"
sudo apt-get update

# Install nvidia-docker 2.0
sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd

# Check installation
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Source code analysis

The Allocate method does exactly one thing: it sets the NVIDIA_VISIBLE_DEVICES environment variable on the container. https://github.com/NVIDIA/k8s-device-plugin/blob/v1.11/server.go#L153

// Allocate which returns the list of devices.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    devs := m.devs
    responses := pluginapi.AllocateResponse{}
    for _, req := range reqs.ContainerRequests {
        response := pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                "NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
            },
        }

        for _, id := range req.DevicesIDs {
            if !deviceExists(devs, id) {
                return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
            }
        }

        responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }

    return &responses, nil
}

NVIDIA's gpu-container-runtime uses the container's NVIDIA_VISIBLE_DEVICES environment variable to decide whether the container is a GPU container and which GPU devices it may use. What the NVIDIA GPU device plugin does is translate the GPU device IDs in the kubelet's request into the NVIDIA_VISIBLE_DEVICES environment variable and return it to the kubelet; the kubelet then injects the returned variable into the container. When such a container starts, gpu-container-runtime maps the devices declared in NVIDIA_VISIBLE_DEVICES into the container, along with the corresponding NVIDIA driver libraries.

Overall flow

The end-to-end flow of GPU scheduling in Kubernetes is:

  • The GPU device plugin is deployed on the GPU nodes and, through the ListAndWatch interface, reports the node's GPU information and the corresponding device IDs.
  • When a Pod requesting nvidia.com/gpu is created, the scheduler takes GPU availability into account and places the Pod on a node with enough free GPU devices.
  • When the kubelet on that node starts the Pod, it calls each device plugin's Allocate interface according to the resource requests. Since the container requests GPUs, the kubelet picks suitable devices from the device information previously received via ListAndWatch and calls the GPU device plugin's Allocate interface with the device IDs as parameters.
  • The GPU device plugin receives the call, translates the device IDs into the NVIDIA_VISIBLE_DEVICES environment variable, and returns it to the kubelet.
  • The kubelet injects the environment variable into the Pod and starts the container.
  • At container start, gpu-container-runtime invokes gpu-containers-runtime-hook.
  • gpu-containers-runtime-hook converts the container's NVIDIA_VISIBLE_DEVICES environment variable into --devices arguments and invokes nvidia-container-cli prestart.
  • nvidia-container-cli maps the GPU devices specified by --devices into the container, and also maps the host's NVIDIA driver library .so files into the container. The container can then call the host's NVIDIA driver through these libraries.
