
【异构计算】NVIDIA GPU MIG

MIG(Multi-Instance GPU)是 NVIDIA 在 GTC 2020 上随最新 Ampere 架构的 A100 GPU 一起发布的新特性。当 A100 配置为 MIG 模式运行时,一块卡最多可以被切分成 7 个相互独立的 GPU 实例,帮助供应商在不额外投入硬件的情况下提高 GPU 服务器的利用率。MIG 提供了一种多用户使用相互隔离的 GPU 资源、提高 GPU 资源使用率的新方式,特别适合云服务提供商的多租户场景,可以保证一个租户的运行不干扰其他租户。本文将介绍 MIG 的新特性和使用方法,以及在容器和 k8s 中使用 MIG 的方案。

MIG 技术简介

随着深度学习的广泛应用,使用 GPU 加速训练和推理越来越普遍。然而 GPU 价格高昂,是不可忽视的成本,而单个 GPU 往往并没有得到充分利用。如何让多租户共享 GPU 并且互不干扰,成为一个重要课题,尤其是在云服务环境中使用 GPU 的场景下。针对这个问题的解决方案大体分为软件级虚拟化 GPU 和硬件级虚拟化 GPU 两类,MIG 即是硬件级虚拟化 GPU 的一种方式:

Data center managers aim to keep resource utilization high, so an ideal data center accelerator doesn’t just go big- it also efficiently accelerates many smaller workloads.

MIG主要技术特点

  1. 每个 GI 拥有独立的 SM 和完全隔离的显存路径(包括独立的显存、L2 Cache、DMA 控制器等),从而可以保证每个 GI 的 QoS
  2. 支持在虚拟机、容器、进程等层面使用

首先看一下传统 GPU 的内部架构,MIG 的目的是让切分出的每个 GPU 实例都拥有与之类似的完整架构。

基本概念

MIG 对资源的划分分为两级,分别是 GPU Instance 和 Compute Instance。

GPU Instance

MIG 功能可以将单个 GPU 划分为多个 GPU 分区,称为 GPU Instance。创建 GPU Instance 可以认为是将一个大 GPU 拆分为多个较小的 GPU,每个 GPU Instance 都具有专用的计算和显存资源。

每个GPU实例的行为就像一个较小的,功能齐全的独立GPU,其中包括:

  • 预定义数量的GPC
  • SMs
  • L2 Cache
  • Frame buffer

注意:在MIG操作模式下,每个GPU实例中的单个GPC启用了7个TPC(14个SM),这使所有GPU切片具有相同的一致计算性能。

  • GPU Engine:GPU Engine 是 GPU 中实际执行工作的组件,每个 Engine 都能够被独立调度,并为不同的 GPU Context 执行工作,常用的 GPU Engine 如下:
    • Compute/Graphics engine that executes the compute instructions
    • the copy engine (CE) that is responsible for performing DMAs
    • NVDEC for video decoding
    • NVENC for encoding
  • GPU Memory Slice:一个 GPU Memory Slice 是 A100 GPU Memory 的一个最小片段,包括对应的 memory controllers 和 cache。粗略来说,一个 GPU Memory Slice 大致是 GPU Memory 总资源(包括 memory 的 capacity 和 bandwidth)的 1/8。
  • GPU SM Slice:一个 GPU SM Slice 是 A100 GPU SMs 的一个最小片段,粗略来说一个 GPU SM Slice 大致是总的GPU SM资源的 1/7

  • GPU Slice:一个 GPU Slice 是 A100 GPU 中集合一个 GPU Memory Slice 和 一个 GPU SM Slice 的最小片段

  • GPU Instance:一个 GPU Instance 是 GPU Slices 和 GPU Engines (DMAs, NVDECs, etc.)的结合

Compute Instance

一个 GPU Instance 可以被划分为多个 Compute Instance,多个 Compute Instance 之间共享 Memory 和 Engine;每个 Compute Instance 包含了原 GPU Instance 中 GPU SM Slices 和 GPU Engines(DMAs、NVDECs 等)的一个子集:

  • 默认情况下,每个 GPU 实例下会创建一个 Compute Instance,从而公开 GPU 实例中可用的所有 GPU 计算资源。
  • 可以将GPU实例细分为多个较小的 Compute Instances,以进一步拆分其计算资源。

架构对比

pre-A100 GPU每个用户独占SM、Frame Buffer、L2 Cache。

A100 MIG将GPU进行物理切割,每个虚拟GPU instance具有独立的SM、L2 Cache、DRAM。

下图是 MIG 配置多个相互独立的 GPU Compute workload 的示意。每个 GPC 分配固定的 CE 和 DEC,A100 中共有 5 个 decoder。

当1个GPU instance中包含2个Compute instance时,2个Compute instance共享CE、DEC和L2、Frame Buffer。

  • GPC:Graphics Processing Cluster
  • TPC:Texture Processing Cluster

Compute instance使多个上下文可以在GPU实例上同时运行。

MIG 隔离

和上一代Volta MPS技术的对比

MPS was designed for sharing the GPU among applications from a single user, but not for multi-user or multi-tenant use cases.

MIG 解决了 MPS 中 memory system resources 在所有应用之间共享、缺乏隔离的问题,同时继承了 Volta MPS 的所有功能。

| 对比项 | MPS | MIG |
| --- | --- | --- |
| Partition Type | Logical | Physical |
| Max Partitions | 48 | 7 |
| SM Performance Isolation | Yes (by percentage, not partitioning) | Yes |
| Memory Protection | Yes | Yes |
| Memory Bandwidth QoS | No | Yes |
| Error Isolation | No | Yes |
| Cross-Partition Interop | IPC | Limited IPC |
| Reconfigure | Process Launch | When Idle |

GPU Partitioning

每个 GI 包括的资源不是随意定义的,NVIDIA 提供了一系列 GPU Instance Profile,用户在创建 GI 时必须按照这些 Profile 来切割。我们知道,A100 总共有 8 个 GPU Memory Slice 和 7 个 GPU SM Slice,对应共有 5 种 Profile:

| Profile Name | Fraction of Memory | Fraction of SMs | Hardware Units | Number of Instances Available |
| --- | --- | --- | --- | --- |
| MIG 1g.5gb | 1/8 | 1/7 | 0 NVDECs | 7 |
| MIG 2g.10gb | 2/8 | 2/7 | 1 NVDEC | 3 |
| MIG 3g.20gb | 4/8 | 3/7 | 2 NVDECs | 2 |
| MIG 4g.20gb | 4/8 | 4/7 | 2 NVDECs | 1 |
| MIG 7g.40gb | Full | 7/7 | 5 NVDECs | 1 |

注意:这里对于 A100-SXM4-40GB 总的 Memory大小是40GB,所以最小单位是 1g.5gb,如果对于 A100-SXM4-80GB,则最小单位是 1g.10gb

也就是说,这几种 Profile 确定了 A100 GPU 可以被切分的方式:所有合法的切分方式相当于在下图中从左到右依次选择不同的 Profile,且所选的 Profile 在上下方向不能重叠。唯一的例外是,目前 NVIDIA 不支持 (4 memory, 4 compute) 与 (4 memory, 3 compute) 同时存在的组合:

下图就是组合的一种方式:A100 GPU 被切割成了3个GPU Instance,分别的大小是

  • 4 memory,4 compute
  • 2 memory,2 compute
  • 1 memory,1 compute

Example Configuration of GPU Instances.

下图也是组合的一种可能:

前面提到,硬件上 NVIDIA 不支持 (4 memory, 4 compute) 与 (4 memory, 3 compute) 同时存在的组合,但是支持两个 (4 memory, 3 compute) 的组合:此时左边的那个 (4 memory, 3 compute) 实际上是把 (4 memory, 4 compute) 的资源实例化成了一个 (4 memory, 3 compute)。如下图,A100 被切分成两个 GPU Instance,每个 GPU Instance 都是 (4 memory, 3 compute)。

或者切分成3个GPU Instance:

也可以切分成下面这种4个GPU Instance:

总的来说,一共有 18 种切分方法:

注意,下图中的两种切分并不相同,因为每个切分的Instance 的 physical layout 也很重要:

Placement of GPU Instances.

MIG 技术使用

具体到 A100 卡,实际涉及两种 SM 规格,分别是

  • GA100 Full GPU with 128 SMs
  • A100 Tensor Core GPU with 108 SMs

本次调研中使用的卡是108 SM版本

驱动安装

$ nvidia-smi
Wed Jan 13 11:42:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:08.0 Off | 0 |
| N/A 26C P0 43W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ tree /dev/
├── nvidia0
├── nvidia-caps
│   ├── nvidia-cap1
│   └── nvidia-cap2
├── nvidiactl
├── nvidia-modeset
├── nvidia-uvm
├── nvidia-uvm-tools

开启MIG支持

查询是否开启MIG

$ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
pci.bus_id, mig.mode.current
00000000:00:08.0, Disabled

对指定卡开启 MIG。注意只有在卡空闲(没有进程占用)时才能更改 MIG enable 设置:

$ nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:08.0:In use by another client
00000000:00:08.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.
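
出现上面的 pending 提示时,也可以顺带查询 pending 状态,确认 MIG 设置已被驱动记录、等待复位后生效(mig.mode.pending 是 nvidia-smi 支持的查询字段,此处仅作补充示意):

$ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current,mig.mode.pending --format=csv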

If you are using MIG inside a VM with GPU passthrough, then you may need to reboot the VM to allow the GPU to be in MIG mode as in some cases, GPU reset is not allowed via the hypervisor for security reasons. This can be seen in the following example:

重启之后

$ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
pci.bus_id, mig.mode.current
00000000:00:08.0, Enabled

查询可分配 GI 信息

# nvidia-smi mig -lgip
+--------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|==========================================================================|
| 0 MIG 1g.5gb 19 7/7 4.75 No 14 0 0 |
| 1 0 0 |
+--------------------------------------------------------------------------+
| 0 MIG 2g.10gb 14 3/3 9.75 No 28 1 0 |
| 2 0 0 |
+--------------------------------------------------------------------------+
| 0 MIG 3g.20gb 9 2/2 19.62 No 42 2 0 |
| 3 0 0 |
+--------------------------------------------------------------------------+
| 0 MIG 4g.20gb 5 1/1 19.62 No 56 2 0 |
| 4 0 0 |
+--------------------------------------------------------------------------+
| 0 MIG 7g.40gb 0 1/1 39.50 No 98 5 0 |
| 7 1 1 |
+--------------------------------------------------------------------------+

查询 GI placements(输出格式为 {可选的起始位置}:占用的 slice 数):

# nvidia-smi mig -lgipp
GPU 0 Profile ID 19 Placements: {0,1,2,3,4,5,6}:1
GPU 0 Profile ID 14 Placements: {0,2,4}:2
GPU 0 Profile ID 9 Placements: {0,4}:4
GPU 0 Profile ID 5 Placement : {0}:4
GPU 0 Profile ID 0 Placement : {0}:8

创建 GPU Instances

# nvidia-smi mig -cgi 9,14,19
Successfully created GPU instance ID 2 on GPU 0 using profile MIG 3g.20gb (ID 9)
Successfully created GPU instance ID 3 on GPU 0 using profile MIG 2g.10gb (ID 14)
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19)

查询 GPU Instance

# nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|====================================================|
| 0 MIG 1g.5gb 19 9 2:1 |
+----------------------------------------------------+
| 0 MIG 2g.10gb 14 3 0:2 |
+----------------------------------------------------+
| 0 MIG 3g.20gb 9 2 4:4 |
+----------------------------------------------------+

创建 Compute Instance

创建 CI 前,首先需要查询对应 GI 支持的 CI Profile 列表。可以发现,上文创建的 ID 为 2 的 GI(3g.20gb)可以进一步划分为 3 种类型的 CI:

# nvidia-smi mig -lcip -gi 2
+--------------------------------------------------------------------------------------+
| Compute instance profiles: |
| GPU GPU Name Profile Instances Exclusive Shared |
| Instance ID Free/Total SM DEC ENC OFA |
| ID CE JPEG |
|======================================================================================|
| 0 2 MIG 1c.3g.20gb 0 3/3 14 2 0 0 |
| 3 0 |
+--------------------------------------------------------------------------------------+
| 0 2 MIG 2c.3g.20gb 1 1/1 28 2 0 0 |
| 3 0 |
+--------------------------------------------------------------------------------------+
| 0 2 MIG 3g.20gb 2* 1/1 42 2 0 0 |
| 3 0 |
+--------------------------------------------------------------------------------------+

# nvidia-smi mig -lcip -gi 3
+--------------------------------------------------------------------------------------+
| Compute instance profiles: |
| GPU GPU Name Profile Instances Exclusive Shared |
| Instance ID Free/Total SM DEC ENC OFA |
| ID CE JPEG |
|======================================================================================|
| 0 3 MIG 1c.2g.10gb 0 2/2 14 1 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+
| 0 3 MIG 2g.10gb 1* 1/1 28 1 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+

然后进一步将ID为2的GI划分为两个CI,Profile分别是1c.3g.20gb,2c.3g.20gb,具体命令如下

# nvidia-smi mig -cci 0,1 -gi 2
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 2 using profile MIG 1c.3g.20gb (ID 0)
Successfully created compute instance ID 1 on GPU 0 GPU instance ID 2 using profile MIG 2c.3g.20gb (ID 1)

查询 Compute Instance

# nvidia-smi mig -lci -gi 2
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 2 MIG 1c.3g.20gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 2 MIG 2c.3g.20gb 1 1 1:2 |
+--------------------------------------------------------------------+

执行 nvidia-smi 也可以看到如下输出

# nvidia-smi
Wed Jan 13 12:04:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:00:08.0 Off | On |
| N/A 26C P0 43W / 400W | 11MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 5MiB / 20096MiB | 14 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+ +-----------+-----------------------+
| 0 2 1 1 | | 28 0 | 3 0 2 0 0 |
| | | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

执行nvidia-smi -L 可以列出每个设备的UUID,供后续计算时使用

# nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1c.3g.20gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/0)
MIG 2c.3g.20gb Device 1: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1)

删除 Compute Instance

可以使用如下命令删除 GI 实例 1 上的 CI 实例 0:

nvidia-smi mig -dci -ci 0 -gi 1
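
如果要在删除 CI 之后把 GI 本身也释放掉,可以使用 -dgi 参数(下面的 GI ID 仅作示意):

# 删除 GPU Instance,例如 ID 为 1 的 GI
nvidia-smi mig -dgi -gi 1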

使用 MIG

Bare-Metal

暂时没有拿到 bare metal 的 A100 机器,TODO
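
不过原理上,bare-metal 下的用法与容器一致:把 nvidia-smi -L 列出的 MIG 设备 UUID 传给 CUDA_VISIBLE_DEVICES,即可把 CUDA 程序限制在对应的 CI 上运行。下面只是一个示意,my_cuda_app 为假设的任意 CUDA 程序:

# 将计算任务绑定到 GI 2 / CI 0 对应的 MIG 设备(UUID 来自上文 nvidia-smi -L 的输出)
$ CUDA_VISIBLE_DEVICES=MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/0 ./my_cuda_app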

Container

前置条件

  • 安装Docker
  • 安装NVIDIA Container Toolkit:
    • Nvidia-docker2 版本推荐在 v2.5.0 以上

运行容器

# docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1 nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1 --compute --utility --require=cuda>=11.1 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 --pid=11936 /var/lib/docker/overlay2/5ee3e036c29f6cd488a3ad1ab1c55a47e595ffff530075853396745de546e4a8/merged]\\\\nnvidia-container-cli: device error: unknown device id: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

怀疑是 NVIDIA Docker Toolkit 版本太老

# /usr/bin/nvidia-container-runtime -v
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

安装新版本的 NVIDIA Docker Toolkit

Dependencies Resolved

==========================================================================================================================================================================
Package Arch Version Repository Size
==========================================================================================================================================================================
Installing:
nvidia-docker2 noarch 2.5.0-1 nvidia-docker 8.4 k
Installing for dependencies:
container-selinux noarch 2:2.119.1-1.c57a6f9.tl2 tlinux 39 k
containerd.io x86_64 1.2.5-3.1.el7 tlinux 22 M
docker-ce x86_64 3:18.09.5-3.el7 tlinux 19 M
docker-ce-cli x86_64 1:18.09.5-3.el7 tlinux 14 M
Updating for dependencies:
libnvidia-container-tools x86_64 1.3.1-1 libnvidia-container 42 k
libnvidia-container1 x86_64 1.3.1-1 libnvidia-container 86 k
nvidia-container-runtime x86_64 3.4.0-1 nvidia-container-runtime 693 k
nvidia-container-toolkit x86_64 1.4.0-2 nvidia-container-runtime 819 k

环境配置好后,即可通过 docker 运行容器使用GPU:

$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0 nvidia/cuda nvidia-smi
Wed Jan 13 11:30:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:00:08.0 Off | On |
| N/A 26C P0 42W / 400W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 5MiB / 20096MiB | 14 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
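
在 Docker 19.03+ 搭配较新的 NVIDIA Container Toolkit 时,理论上也可以不设置 NVIDIA_VISIBLE_DEVICES,而是通过 --gpus 参数直接指定 MIG 设备。以下写法仅作示意,是否可用取决于运行时版本,注意 device 字符串外层需要额外加引号:

$ docker run --rm --gpus '"device=MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0"' nvidia/cuda nvidia-smi -L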

Kubernetes

前置依赖

  • NVIDIA R450+ datacenter driver: 450.80.02+
  • NVIDIA Container Toolkit (nvidia-docker2): v2.5.0+
  • NVIDIA k8s-device-plugin: v0.7.0+
  • NVIDIA gpu-feature-discovery: v0.2.0+

None

确认 Node 上的 MIG 特性开启,此时没有创建任何GI:

Thu Jan 14 16:35:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:00:08.0 Off | On |
| N/A 26C P0 43W / 400W | 0MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

启动 Device Plugin,此时 mig-strategy 为 none:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
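
部署之后,可以先确认 Device Plugin 的 DaemonSet Pod 在目标节点上正常运行(label 取自上面 YAML 中的 name: nvidia-device-plugin-ds):

$ kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide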

通过 kubectl describe node 可以看到 Node 上可用的 nvidia.com/gpu 资源数目:

Capacity:
...
nvidia.com/gpu: 1
Allocatable:
...
nvidia.com/gpu: 1

部署 Pod

$ kubectl run -it --rm \
    --image=nvidia/cuda \
    --restart=Never \
    --limits=nvidia.com/gpu=1 \
    mig-none-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
pod "mig-none-example" deleted

Single

确认 Node 上的MIG特性开启后,创建大小相同的7个GI,每个GI对应着一个CI:

$ nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0)
Successfully created GPU instance ID 10 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 10 using profile MIG 1g.5gb (ID 0)
$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/10/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/11/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/12/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/13/0)

部署 Device Plugin

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false", "--mig-strategy=single"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

这时候可以看到 Node 上的 nvidia.com/gpu 资源变成了 7 个:

Capacity:
...
nvidia.com/gpu: 7
Allocatable:
...
nvidia.com/gpu: 7

部署 gpu-feature-discovery

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-feature-discovery
  labels:
    app.kubernetes.io/name: gpu-feature-discovery
    app.kubernetes.io/version: 0.2.0
    app.kubernetes.io/part-of: nvidia-gpu
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gpu-feature-discovery
      app.kubernetes.io/part-of: nvidia-gpu
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/version: 0.2.0
        app.kubernetes.io/part-of: nvidia-gpu
    spec:
      containers:
      - image: nvidia/gpu-feature-discovery:v0.2.0
        name: gpu-feature-discovery
        args: ["--mig-strategy=single"]
        volumeMounts:
        - name: output-dir
          mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
        - name: dmi-product-name
          mountPath: "/sys/class/dmi/id/product_name"
        env:
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: all
        securityContext:
          privileged: true
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true" # NVIDIA vendor ID
      volumes:
      - name: output-dir
        hostPath:
          path: "/etc/kubernetes/node-feature-discovery/features.d"
      - name: dmi-product-name
        hostPath:
          path: "/sys/class/dmi/id/product_name"

运行 Pod 申请GPU:

$  for i in $(seq 7); do
kubectl run \
--image=nvidia/cuda:11.0-base \
--restart=Never \
--limits=nvidia.com/gpu=1 \
mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created

$ for i in $(seq 7); do
echo "mig-single-example-${i}";
kubectl logs mig-single-example-${i}
echo "";
done

mig-single-example-1
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/11/0)

mig-single-example-2
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/7/0)

mig-single-example-3
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/8/0)

...

$ for i in $(seq 7); do
kubectl delete pod mig-single-example-${i};
done

pod "mig-single-example-1" deleted
pod "mig-single-example-2" deleted
...

Mixed

确认 Node 上的MIG特性开启后,创建不同大小的3个GI,每个GI对应着一个CI:

$ nvidia-smi mig -cgi 9,14,19 -C
Successfully created GPU instance ID 2 on GPU 0 using profile MIG 3g.20gb (ID 9)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 2 using profile MIG 3g.20gb (ID 2)
Successfully created GPU instance ID 3 on GPU 0 using profile MIG 2g.10gb (ID 14)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 3 using profile MIG 2g.10gb (ID 1)
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.5gb (ID 0)
$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1)
MIG 3g.20gb Device 0: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0)
MIG 2g.10gb Device 1: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/3/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/9/0)

启动 Device Plugin

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false", "--mig-strategy=mixed"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

启动 Device Plugin 之后,可以看到 Node 上有了 MIG 对应的 resource type:

Capacity:
...
nvidia.com/gpu: 0
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
pods: 61
Allocatable:
...
nvidia.com/gpu: 0
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
pods: 61

这时候启动 gpu-feature-discovery,启动策略是 mixed

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-feature-discovery
  labels:
    app.kubernetes.io/name: gpu-feature-discovery
    app.kubernetes.io/version: 0.2.0
    app.kubernetes.io/part-of: nvidia-gpu
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gpu-feature-discovery
      app.kubernetes.io/part-of: nvidia-gpu
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/version: 0.2.0
        app.kubernetes.io/part-of: nvidia-gpu
    spec:
      containers:
      - image: nvidia/gpu-feature-discovery:v0.2.0
        name: gpu-feature-discovery
        args: ["--mig-strategy=mixed"]
        volumeMounts:
        - name: output-dir
          mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
        - name: dmi-product-name
          mountPath: "/sys/class/dmi/id/product_name"
        env:
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: all
        securityContext:
          privileged: true
      #nodeSelector:
      #  feature.node.kubernetes.io/pci-10de.present: "true" # NVIDIA vendor ID
      volumes:
      - name: output-dir
        hostPath:
          path: "/etc/kubernetes/node-feature-discovery/features.d"
      - name: dmi-product-name
        hostPath:
          path: "/sys/class/dmi/id/product_name"

这时候查看 Node 的 label,预期应能看到 MIG 相关的 label(实测此处返回为空,原因待排查):

$ kubectl get node -o json | \
jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
{}

使用 kubectl 启动 Pod:

$ kubectl run -it --rm \
--image=nvidia/cuda:11.0-base \
--restart=Never \
--limits=nvidia.com/mig-1g.5gb=1 \
mig-mixed-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/9/0)
pod "mig-mixed-example" deleted

当前TKE的问题

  • 驱动版本和 NVIDIA Container Toolkit 版本较老,需要更新:
# TKE GPU Node查看到 Driver 信息
$ nvidia-smi -a
Driver Version : 418.67
CUDA Version : 10.1

# Nvidia Container Toolkit 版本
nvidia-container-runtime-3.1.0-1
nvidia-container-toolkit-1.0.1-2
libnvidia-container-tools-1.0.2-1
libnvidia-container1-1.0.2-1

# NVIDIA Device Plugin 版本较老
nvidia/k8s-device-plugin:1.10

应该用 NVIDIA k8s-device-plugin: v0.7.0+
  • VM 中使用 MIG,开启MIG特性需要重启VM

If you are using MIG inside a VM with GPU passthrough, then you may need to reboot the VM to allow the GPU to be in MIG mode as in some cases, GPU reset is not allowed via the hypervisor for security reasons. This can be seen in the following example:

$ sudo nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:03.0:Not Supported
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:00:03.0
All done.

$ sudo nvidia-smi --gpu-reset
Resetting GPU 00000000:00:03.0 is not supported.

划分MIG后的性能对比

整块卡的性能

| 测试项目 | 实测性能 | 官方标准性能 |
| --- | --- | --- |
| FP32MAD | 19.436 TF | 19.5 TF |
| FP64MAD | 9.690 TF | 9.7 TF |
| INT32MAD | 19.446 TF | - |
| INT32ADD | 18.906 TF | - |
| FP32GEMMTensor(矩阵大小满足最佳性能要求) | 158.426 TF | 156 TF |
| FP32GEMMTensor(不满足最佳性能要求) | 68.054 TF | - |
| FP32GEMM | 19.047 TF | - |

备注

  1. GEMM 测试程序需要在 CUDA 11.0 下重新编译,才能达到以上效果
  2. 满足最佳性能要求时的 GEMM 大小参数为 9000 6000 6000
  3. 不满足最佳性能要求的 GEMM 大小参数为 8997 5998 5998

MIG卡的性能

为了测试各个CI和GI的性能,对3g.20gb GI进行进一步划分,分为 2c.3g.20gb, 1c.3g.20gb,另外两个GI不做进一步划分,直接在GI基础上创建CI。

至此,一块 GPU 卡上共划分出四个 CI,分别是:

  • MIG 1c.3g.20gb
  • MIG 2c.3g.20gb
  • MIG 2g.10gb
  • MIG 1g.5gb

各CI串行执行

MIG 1c.3g.20gb

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| FP32MAD | 2.523 T |
| FP64MAD | 1.261 T |
| INT32MAD | 2.524 T |
| INT32ADD | 2.455 T |
| FP32GEMMTensor(矩阵大小满足最佳性能要求) | 23.081 T |
| FP32GEMMTensor(不满足最佳性能要求) | 8.940 T |
| FP32GEMM | 2.476 T |

MIG 2c.3g.20gb

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| FP32MAD | 5.046 T |
| FP64MAD | 2.521 T |
| INT32MAD | 5.049 T |
| INT32ADD | 4.908 T |
| FP32GEMMTensor(矩阵大小满足最佳性能要求) | 44.941 T |
| FP32GEMMTensor(不满足最佳性能要求) | 18.920 T |
| FP32GEMM | 4.909 T |

MIG 2g.10gb

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| FP32MAD | 5.046 T |
| FP64MAD | 2.521 T |
| INT32MAD | 5.049 T |
| INT32ADD | 4.908 T |
| FP32GEMMTensor(矩阵大小满足最佳性能要求) | 40.151 T |
| FP32GEMMTensor(不满足最佳性能要求) | 17.514 T |
| FP32GEMM | 4.909 T |

MIG 1g.5gb

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| FP32MAD | 2.523 T |
| FP64MAD | 1.261 T |
| INT32MAD | 2.524 T |
| INT32ADD | 2.454 T |
| FP32GEMMTensor(矩阵大小满足最佳性能要求) | 16.453 T |
| FP32GEMMTensor(不满足最佳性能要求) | 8.261 T |
| FP32GEMM | 2.476 T |

备注

在串行执行 FP32MAD 任务时,1c.3g.20gb、1g.5gb 的相关利用率指标保持在 14.3%(约 1/7)附近,2c.3g.20gb、2g.10gb 保持在 28.6%(约 2/7)附近。

各CI并行执行

统一执行FP32MAD

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| 1c.3g.20gb | 2.523 T |
| 2c.3g.20gb | 5.044 T |
| 2g.10gb | 5.046 T |
| 1g.5gb | 2.523 T |

统一执行FP32GEMMTensor

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| 1c.3g.20gb | 20.450 T |
| 2c.3g.20gb | 41.194 T |
| 2g.10gb | 39.773 T |
| 1g.5gb | 16.336 T |

备注

在并行执行 FP32MAD 任务时,SmActivity、SmOccupancy、FP32Activity 三项监控指标保持在 85.7%(约 6/7,对应 4 个 CI 共占用 6 个 SM Slice)附近。

分别执行不同类型的计算

| 测试项目 | 实测性能(OPS) |
| --- | --- |
| 1c.3g.20gb FP32MAD | 2.523 T |
| 2c.3g.20gb FP64MAD | 2.521 T |
| 2g.10gb INT32MAD | 5.048 T |
| 1g.5gb INT32ADD | 2.454 T |

备注

在并行执行不同计算任务时,SmActivity,SmOccupancy,FP64Activity,FP32Activity分别为85.7%, 78.1%, 28.5%, 57.0%

根据测试结果,验证了CI,GI隔离的有效性,具体结论如下

  1. 对比各个MIG上任务串行执行,以及并行执行的性能数据,可以有效验证CI,GI隔离的有效性
  2. 划分 CI、GI 存在一定的性能损失:1g.5gb 上测得的性能并不等于整张卡的 1/7,从整张卡的维度来看约有 10% 的性能损失。究其原因,A100 卡总共有 108 个 SM,但划分为 7 个 MIG 实例后每个实例只有 14 个 SM,14 * 7 = 98,约有 10 个 SM 无法使用,这部分浪费就是性能损失的来源。
  3. 对比 2c.3g.20gb 和 2g.10gb 可以发现,2c.3g.20gb(同一个 GI 上软隔离出的 CI)的 Tensor 计算性能比 2g.10gb(完全隔离的 GI)更好一些;同时对比所有 CI 同时执行 FP32GEMMTensor 的结果,可以发现同一个 GI 上的两个 CI 同时执行 GEMM(相比 FP32MAD,GEMM 有一定的显存读写)时,计算性能比各自单独执行时会有所下降,更接近独立 GI 的性能。这说明 2c.3g.20gb 性能强于 2g.10gb,是由 CI 之间隔离不完全导致的。

A100卡可以为后续工作带来的价值

  1. 每个MIG实例的完整隔离,可以支持多种虚拟化场景,包括虚拟机,容器
  2. 最小实例(1g.5gb)的基础计算能力大约为 T4 卡的三分之一、P4 卡的一半,计算能力适中;显存带宽方面,A100 整卡为 1,555 GB/s(1g.5gb 实例约占其 1/8),相比 P4 卡的 192 GB/s、T4 卡的 320+ GB/s,带宽足够充裕,不会成为瓶颈
  3. 存在离线计算和在线推理使用同一种GPU的可能性,打通离线,在线两个GPU资源池
    • T4卡的具体性能指标 16G显存,SM 40, 8 TensorCores/SM, 64 INT32Cores/SM, 64 FP32Cores/SM,
    • A100卡的具体性能指标 40G显存,SM 108, 4 Third-generation Tensor Cores/SM, 64 FP32 CUDA Cores/SM,

参考资料