
[Heterogeneous Computing] Using GPUs in Docker

In the blog post "GPU 与 CUDA 编程入门" we gave an initial introduction to using GPUs on Linux. With the rapid growth of containers and Kubernetes, the demand for using GPUs inside containers has become much stronger. Building on that post, this article describes how to use GPUs inside containers, then how to schedule GPUs in Kubernetes, and finally, taking TensorFlow as an example, how to build a GPU-enabled deep learning development environment on Docker.

NVIDIA Container Toolkit

Background

Containers were originally used to seamlessly deploy CPU-based applications: they are agnostic of the underlying hardware and platform, but that assumption clearly breaks down for GPUs. Different GPUs require different hardware drivers on the machine, which greatly limited GPU use inside containers. The earliest workaround was to reinstall the full NVIDIA driver inside the container and then pass the GPU to the container as a character device such as /dev/nvidia0 at startup. However, this requires the driver version inside the container to exactly match the driver version on the host, so the same Docker image cannot be reused across machines, which severely limits portability.

To solve this problem, containers must be agnostic of the NVIDIA driver. To that end, NVIDIA released the NVIDIA Container Toolkit.

(Figure: nvidia-gpu-docker)

As shown in the figure above, NVIDIA splits the API environment that a CUDA application depends on into two parts:

  • Driver-level API: provided by the libcuda.so.major.minor shared library and the kernel module, shown as "CUDA Driver" in the figure.
    • The driver-level API is a low-level API. Whenever NVIDIA releases a new driver version and you want to upgrade the driver on a host, the kernel module and libcuda.so.major.minor must be upgraded to the same version together so that existing programs keep working.
    • Different driver versions cannot coexist on the same host.
  • Non-driver-level API: made up of user-space libraries such as libcublas.so, shown as "CUDA Toolkit" in the figure.
    • Non-driver-level APIs are versioned by the toolkit's own version number, e.g. cuda-10, cuda-11.
    • Different toolkit versions can run side by side on the same host.
    • The non-driver-level API is a higher-level wrapper over the driver-level API and ultimately calls into it to do the actual work (a quick way to observe this split is shown below).
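
On a host that already has the driver installed (and, optionally, a CUDA Toolkit), the two halves report their versions independently; this is only an illustrative check:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver-level version (kernel module + libcuda)
$ nvcc --version                                                # non-driver-level version (CUDA Toolkit), if one is installed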

To make GPU containers more portable, NVIDIA packaged the non-driver-level part into the NVIDIA Container Toolkit. As a result, each machine only needs the NVIDIA driver installed; once the NVIDIA Container Toolkit is configured, GPUs can be used from containers conveniently.

Overall Architecture

The NVIDIA container toolkit essentially provides GPU container creation through an NVIDIA-aware runc wrapper: it adds a few hooks to the OCI spec created for the container so that the GPU devices are prepared before the container runs. It consists of the following components, shown from top to bottom in the figure:

  • nvidia-docker2
  • nvidia-container-runtime
  • nvidia-container-toolkit
  • libnvidia-container

Each of these components is described below:

libnvidia-container

This component provides a library and a simple CLI utility to automatically configure GNU/Linux containers leveraging NVIDIA GPUs. The implementation relies on kernel primitives and is designed to be agnostic of the container runtime.

libnvidia-container provides a well-defined API and a wrapper CLI (called nvidia-container-cli) that different runtimes can invoke to inject NVIDIA GPU support into their containers.
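
On a host with the driver and libnvidia-container installed, the wrapper CLI can be invoked directly to inspect what would be injected into a container; a minimal sketch (output varies by machine):

$ sudo nvidia-container-cli --load-kmods info    # driver and GPU information as seen by the library
$ sudo nvidia-container-cli list                 # device nodes and driver libraries that would be injected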

nvidia-container-toolkit

This component includes a script that implements the interface required by a runC prestart hook. This script is invoked by runC after a container has been created, but before it has been started, and is given access to the config.json associated with the container (e.g. this config.json ). It then takes information contained in the config.json and uses it to invoke the libnvidia-container CLI with an appropriate set of flags. One of the most important flags being which specific GPU devices should be injected into the container.

Note that the previous name of this component was nvidia-container-runtime-hook. nvidia-container-runtime-hook is now simply a symlink to nvidia-container-toolkit on the system.
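
For illustration, the hook entry that ends up in a container's config.json looks roughly like the fragment below; the exact path and arguments depend on the installation, so treat this as a sketch rather than the spec generated on every system:

"hooks": {
    "prestart": [
        {
            "path": "/usr/bin/nvidia-container-runtime-hook",
            "args": ["nvidia-container-runtime-hook", "prestart"]
        }
    ]
}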

nvidia-container-runtime

This component used to be a complete fork of runC with NVIDIA specific code injected into it. Since 2019, it is a thin wrapper around the native runC installed on the host system. nvidia-container-runtime takes a runC spec as input, injects the nvidia-container-toolkit script as a prestart hook into it, and then calls out to the native runC, passing it the modified runC spec with that hook set. It’s important to note that this component is not necessarily specific to docker (but it is specific to runC).

When the package is installed, the Docker daemon.json is updated to point to the binary as can be seen below:

/etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

nvidia-docker2

This package is the only docker-specific package of the hierarchy. It takes the script associated with the nvidia-container-runtime and installs it into docker’s /etc/docker/daemon.json file. This then allows you to run (for example) docker run --runtime=nvidia ... to automatically add GPU support to your containers. It also installs a wrapper script around the native docker CLI called nvidia-docker which lets you invoke docker without needing to specify --runtime=nvidia every single time. It also lets you set an environment variable on the host (NV_GPU) to specify which GPUs should be injected into a container.
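
For example, assuming nvidia-docker2 is installed and using a public CUDA image, the two invocations below are roughly equivalent; the wrapper translates NV_GPU into the NVIDIA_VISIBLE_DEVICES environment variable consumed by the hook:

$ NV_GPU=0 nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi
$ docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:11.0-base nvidia-smi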

Deployment and Verification

Here we again use a Tencent Cloud CentOS 7 machine to demonstrate how to install and configure the NVIDIA Container Toolkit; for other platforms, refer to the official documentation.

Install Docker CE

$ curl https://get.docker.com | sh \
&& sudo systemctl start docker \
&& sudo systemctl enable docker

Install the NVIDIA Container Toolkit

Setup the stable repository (on CentOS the repository is registered through yum rather than apt):

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

Install the nvidia-docker2 package (and dependencies) after refreshing the package cache:

$ sudo yum clean expire-cache
$ sudo yum install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

This should result in a console output shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Configuring the NVIDIA Runtime

To register the nvidia runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration. Three options are available:

Systemd drop-in file

$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
$ sudo systemctl daemon-reload \
&& sudo systemctl restart docker

Daemon configuration file

The nvidia runtime can also be registered with Docker using the daemon.json configuration file:

$ sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo pkill -SIGHUP dockerd

You can optionally reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

"default-runtime": "nvidia"

Command Line

Use dockerd to add the nvidia runtime:

$ sudo dockerd --add-runtime=nvidia=/usr/bin/nvidia-container-runtime [...]

Managing GPUs in k8s

To manage and use GPUs in k8s, besides configuring the NVIDIA Container Toolkit we also need to install NVIDIA's NVIDIA/k8s-device-plugin; for installation details, see my earlier post. All of these steps together are still somewhat tedious. If you use Tencent Cloud TKE, the NVIDIA Container Toolkit and NVIDIA/k8s-device-plugin are installed and configured automatically when you add a GPU node to the cluster, which is very convenient. Next, we take TensorFlow as an example and run GPU-enabled TensorFlow in a k8s environment.
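
Once the device plugin is running, pods request GPUs through the extended resource nvidia.com/gpu in their resource limits, and the scheduler only places them on nodes with free GPUs. The example manifests below do not request GPUs explicitly; to pin them to GPU nodes, a limit like the following minimal fragment (value chosen as an example) would be added to the container spec:

        resources:
          limits:
            nvidia.com/gpu: 1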

Single-node TensorFlow

First, the single-node version of TensorFlow: run kubectl apply -f tensorflow.yaml to start a Jupyter Notebook.

tensorflow.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow
  labels:
    k8s-app: tensorflow
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: tensorflow
  template:
    metadata:
      labels:
        k8s-app: tensorflow
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:2.2.1-gpu-py3-jupyter
        ports:
        - containerPort: 8888
        resources:
          limits:
            cpu: 4
            memory: 2Gi
          requests:
            cpu: 2
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8888
    name: tensorflow
  selector:
    k8s-app: tensorflow

The container comes up quickly, and the Jupyter Notebook is reachable at http://<nodeIP>:<nodePort>, but it asks for a token:

Check the TensorFlow logs to obtain the token: aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20

[root@VM-1-14-centos single]# kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
tensorflow-6cbc85744b-c567p   1/1     Running   0          7m37s
[root@VM-1-14-centos single]# kubectl logs tensorflow-6cbc85744b-c567p

________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

[I 04:47:52.083 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 04:47:52.315 NotebookApp] Serving notebooks from local directory: /tf
[I 04:47:52.315 NotebookApp] Jupyter Notebook 6.1.4 is running at:
[I 04:47:52.315 NotebookApp] http://tensorflow-6cbc85744b-c567p:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20
[I 04:47:52.315 NotebookApp] or http://127.0.0.1:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20
[I 04:47:52.315 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 04:47:52.319 NotebookApp]

To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
Or copy and paste one of these URLs:
http://tensorflow-6cbc85744b-c567p:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20
or http://127.0.0.1:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20
[I 04:49:28.692 NotebookApp] 302 GET / (172.16.0.193) 0.57ms
[I 04:49:28.700 NotebookApp] 302 GET /tree? (172.16.0.193) 0.67ms

After logging in, the Jupyter Notebook is displayed.

Create a new notebook and run a few commands to check the GPU (a minimal sketch of such a check appears after the list below).

As we can see, TensorFlow supports computation on the GPU:

  • "/device:GPU:0":TensorFlow 可见的机器上第一个 GPU 的速记表示法。
  • "/job:localhost/replica:0/task:0/device:GPU:0":TensorFlow 可见的机器上第一个 GPU 的完全限定名称。

Distributed TensorFlow

Overall architecture:

This architecture diagram shows a hands-on distributed TensorFlow setup, which contains:

  • two parameter servers
  • multiple worker services
  • and a shuffle-and-sampling service

The shuffle service shuffles the samples according to their labels and then provides batch sampling (with or without replacement; sampling is a discipline of its own, see the book "Sampling Techniques" for details). Each batch is triggered by a worker: after obtaining the sample IDs for a batch, the worker fetches that batch of sample data from a distributed database built on Kubernetes and runs the training computation. Since distributed TensorFlow supports asynchronous gradient descent, every training batch iterates on the latest parameters; the parameter updates themselves are handled by the two parameter servers. The model (parameters) is stored on NFS, so the parameter servers and the workers can share parameters. Note also that all of the training data lives in a distributed database (the choice of database depends on the concrete scenario).

Why do we need a dedicated shuffle-and-sampling service? When the data volume is large, shuffling and sampling over the full sample data would waste a lot of resources, so a dedicated service extracts only the (id, label) pairs for shuffling and sampling. If even the (id, label) data is very large, we can consider doing the shuffling and sampling in a distributed fashion with Spark; Spark 2.3 already supports Kubernetes scheduling natively. (A sketch of how the ps and worker containers can be wired together is shown after the worker manifest below.)

First, the Parameter Server:

tf-ps.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-ps
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tensorflow-ps
  template:
    metadata:
      labels:
        name: tensorflow-ps
        role: ps
    spec:
      containers:
      - name: ps
        image: tensorflow/tensorflow:2.2.1-gpu-py3-jupyter
        ports:
        - containerPort: 2222
        resources:
          limits:
            cpu: 4
            memory: 2Gi
          requests:
            cpu: 2
            memory: 1Gi
        volumeMounts:
        - mountPath: /datanfs
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: <your NFS server address>
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-ps-service
  labels:
    name: tensorflow-ps
    role: service
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-ps

Then, the Workers:

tf-worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      name: tensorflow-worker
  template:
    metadata:
      labels:
        name: tensorflow-worker
        role: worker
    spec:
      containers:
      - name: worker
        image: tensorflow/tensorflow:2.2.1-gpu-py3-jupyter
        ports:
        - containerPort: 2222
        resources:
          limits:
            cpu: 4
            memory: 2Gi
          requests:
            cpu: 2
            memory: 1Gi
        volumeMounts:
        - mountPath: /datanfs
          readOnly: false
          name: nfs
      volumes:
      - name: nfs
        nfs:
          server: <your NFS server address>
          path: "/data/nfs"
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-wk-service
  labels:
    name: tensorflow-worker
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-worker
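
How the ps and worker processes discover each other is not captured in the manifests above. With distributed TensorFlow this is typically done through the TF_CONFIG environment variable; the fragment below is only an illustrative sketch that reuses the Service names defined above, and in a real deployment each worker task needs its own stable address (for example via a headless Service or one Service per replica) rather than a single shared Service:

        env:
        - name: TF_CONFIG
          value: |
            {
              "cluster": {
                "ps": ["tensorflow-ps-service:2222"],
                "worker": ["tensorflow-wk-service:2222"]
              },
              "task": {"type": "ps", "index": 0}
            }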

References