
[System Monitoring] GPU Monitoring

Background

When using GPUs for deep-learning training and inference, we need visibility into how the GPUs in the cluster are being used:

  • We need the current GPU resource usage to judge whether new applications can still be deployed and whether the cluster needs to be scaled out, so that GPU services get capacity guarantees on par with CPU services and GPUs stop being the weak spot in capacity planning
  • We need the current GPU resource usage to identify bottlenecks and weak points, drive optimizations, and improve resource utilization and service performance

To obtain GPU monitoring data, NVIDIA provides the following three approaches:

  • NVML: the NVIDIA Management Library, a C library for monitoring and managing GPUs; the nvidia-smi command is built on top of it
  • DCGM: Data Center GPU Manager, a complete set of GPU monitoring and management tools built on NVML and CUDA
  • Third-party tools: monitoring tools built on DCGM or NVML that integrate with systems such as Prometheus and add databases, UIs, and so on

Comparing the characteristics of these three approaches:

  • NVML
    • Stateless queries; only the current values can be read (see the example after this list)
    • A low-level API for controlling GPUs
    • Management tools built on NVML are cheap to run but costly to develop
    • Management tools built on NVML must run on the same node as the GPUs
  • DCGM
    • Can query metrics covering the past few hours
    • Provides GPU health checks and diagnostics
    • Can query a group of GPUs in one batch
    • Can run in either remote or local mode
  • Third-party tools
    • Provide a database, graphs, and a polished UI
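
To get a feel for the NVML-style stateless query, nvidia-smi (which, as noted above, is built on NVML) reports only the instantaneous state of each device. The query below uses standard nvidia-smi options and is shown purely for illustration:

$ nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv

Each invocation returns the values at that moment only; there is no history, which is exactly the gap DCGM fills.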

The rest of this article focuses on DCGM.

DCGM

The figure below shows how DCGM runs in a cluster: DCGM is deployed as an agent on each compute node, and tools on the management node manage and monitor the GPUs through the APIs DCGM exposes.

DCGM provides the following four key features:

  • Active Health Monitoring
  • GPU Diagnostics
  • Policy and Alerting
  • Configuration Management

Installation and Deployment

DCGM must be downloaded and installed separately. Download the appropriate package from the NVIDIA website (the rpm package is used here), then:

# Remove any previously installed version of DCGM
$ yum remove datacenter-gpu-manager
# Install the new package
$ rpm -ivh datacenter-gpu-manager-2.0.13-1-x86_64.rpm
  • The DCGM shared libraries are installed under /usr/lib64
  • The Python bindings are installed under /usr/local/dcgm/bindings

DCGM is a cluster-oriented management tool, so before using it you must first start an agent, nv-hostengine, on the target machine:

# Start nv-hostengine
$ nv-hostengine --port 39999 --bind-interface 127.0.0.1
Host Engine Listener Started
Started host engine version 2.0.13 using port number: 39999

# List the devices
$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:08.0 |
| | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f |
+--------+----------------------------------------------------------------------+
| 1 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:09.0 |
| | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619 |
+--------+----------------------------------------------------------------------+
| 2 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:0A.0 |
| | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca |
+--------+----------------------------------------------------------------------+
| 3 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:0B.0 |
| | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

# Stop nv-hostengine (shown here only for demonstration; it must be running for the following steps)
$ nv-hostengine -t
Host engine successfully terminated.

The --port and --bind-interface options set the listening port and the bind address, respectively. Communication over a UNIX socket is also supported.

Once nv-hostengine is running, we can use dcgmi to operate on it.

Group Operations

Unlike NVML, most DCGM functionality is group-oriented, so before using DCGM you first create a group and then apply DCGM's features to it.

# After obtaining the device list, create a group with the following command
# On success it prints the new group ID, which all subsequent operations use (group ID 2 below)
$ dcgmi group --host 127.0.0.1:39999 -c GPU_GROUP
Successfully created group "GPU_GROUP" with a group ID of 2

$ dcgmi group --host 127.0.0.1:39999 -l
+-------------------+----------------------------------------------------------+
| GROUPS |
| 1 group found. |
+===================+==========================================================+
| Groups | |
| -> 2 | |
| -> Group ID | 2 |
| -> Group Name | GPU_GROUP |
| -> Entities | None |
+-------------------+----------------------------------------------------------+

$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:08.0 |
| | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f |
+--------+----------------------------------------------------------------------+
| 1 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:09.0 |
| | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619 |
+--------+----------------------------------------------------------------------+
| 2 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:0A.0 |
| | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca |
+--------+----------------------------------------------------------------------+
| 3 | Name: Tesla T4 |
| | PCI Bus ID: 00000000:00:0B.0 |
| | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

# After creating the group, add devices to it with the following command
$ dcgmi group --host 127.0.0.1:39999 -g 2 -a 0,1
Add to group operation successful.

$ dcgmi group --host 127.0.0.1:39999 -g 2 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO |
+===================+==========================================================+
| 2 | |
| -> Group ID | 2 |
| -> Group Name | GPU_GROUP |
| -> Entities | GPU 0, GPU 1 |
+-------------------+----------------------------------------------------------+

# Remove devices from the group with the following command
$ dcgmi group --host 127.0.0.1:39999 -g 2 -r 0,1
Remove from group operation successful.

# Delete the group with the following command
$ dcgmi group --host 127.0.0.1:39999 -d 2

Note: groups and devices have a many-to-many relationship.
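
For illustration, the same GPU can be a member of several groups at once. Here ANOTHER_GROUP is a hypothetical second group; its ID has to be taken from the command's output:

$ dcgmi group --host 127.0.0.1:39999 -c ANOTHER_GROUP
$ dcgmi group --host 127.0.0.1:39999 -g <new group ID> -a 0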

Job Statistics

When a job is accelerated on GPUs, we typically want to know:

  • Which GPUs did my job run on?
  • How many GPUs did my job use?
  • Were there any errors or warnings while my job was running?
  • Are the system's GPUs healthy and ready for the next job?

# The current Group 3 looks like this
$ dcgmi group --host 127.0.0.1:39999 -g 3 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO |
+===================+==========================================================+
| 3 | |
| -> Group ID | 3 |
| -> Group Name | GPU_GROUP |
| -> Entities | GPU 0, GPU 1, GPU 2, GPU 3 |
+-------------------+----------------------------------------------------------+

# Before collecting GPU statistics with dcgmi, enable stats watches with the following command
$ dcgmi stats --host 127.0.0.1:39999 -g 3 --enable
Successfully started process watches.

# Once watches are enabled, query the statistics of a specific process with the following command
# A CUDA application process is assumed to be running on the GPU here
$ dcgmi stats --host 127.0.0.1:39999 -g 3 -p 41861 -v
Successfully retrieved process info for PID: 41861. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 3 |
+====================================+=========================================+
|----- Execution Stats ------------+-----------------------------------------|
| Start Time * | Wed Jan 6 16:54:16 2021 |
| End Time * | Still Running |
| Total Execution Time (sec) * | Still Running |
| No. of Conflicting Processes * | 0 |
+----- Performance Stats ----------+-----------------------------------------+
| Energy Consumed (Joules) | 2985 |
| Max GPU Memory Used (bytes) * | 12107907072 |
| SM Clock (MHz) | Avg: 1590, Max: 1590, Min: 1590 |
| Memory Clock (MHz) | Avg: 5000, Max: 5000, Min: 5000 |
| SM Utilization (%) | Avg: 100, Max: 100, Min: 100 |
| Memory Utilization (%) | Avg: 5, Max: 5, Min: 5 |
| PCIe Rx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
| PCIe Tx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
+----- Event Stats ----------------+-----------------------------------------+
| Double Bit ECC Errors | 0 |
| PCIe Replay Warnings | 0 |
| Critical XID Errors | 0 |
+----- Slowdown Stats -------------+-----------------------------------------+
| Due to - Power (%) | 0 |
| - Thermal (%) | 0 |
| - Reliability (%) | 0 |
| - Board Limit (%) | 0 |
| - Low Utilization (%) | 0 |
| - Sync Boost (%) | 0 |
+----- Process Utilization --------+-----------------------------------------+
| PID | 41861 |
| Avg SM Utilization (%) | 99 |
| Avg Memory Utilization (%) | 3 |
+----- Overall Health -------------+-----------------------------------------+
| Overall Health | Healthy |
+------------------------------------+-----------------------------------------+

(*) Represents a process statistic. Otherwise device statistic during
process lifetime listed.
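
Besides the per-PID query above, dcgmi stats can also aggregate statistics under a job ID, which answers the "which GPUs did my job use" questions directly. The sketch below assumes the job start/stop/view options (-s, -x, -j) of dcgmi stats and an arbitrary job name train-job-01:

# Start recording statistics for the job on group 3
$ dcgmi stats --host 127.0.0.1:39999 -g 3 -s train-job-01

# ... run the GPU workload ...

# Stop recording and view the aggregated report for the job
$ dcgmi stats --host 127.0.0.1:39999 -x train-job-01
$ dcgmi stats --host 127.0.0.1:39999 -j train-job-01 -v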

Configuration Management

DCGM can change GPU settings. The supported settings are listed below; to view the current configuration:

$ dcgmi config  --host 127.0.0.1:39999 -g 3 --get
+------------------------------+------------------------------+------------------------------+
| GPU_GROUP |
| Group of 4 GPUs |
+==============================+==============================+==============================+
| Field | Target | Current |
+------------------------------+------------------------------+------------------------------+
| Compute Mode | Not Specified | Unrestricted |
| ECC Mode | Not Specified | Enabled |
| Sync Boost | Not Specified | Not Supported |
| Memory Application Clock | Not Specified | 5001 |
| SM Application Clock | Not Specified | 585 |
| Power Limit | Not Specified | 70 |
+------------------------------+------------------------------+------------------------------+

# Option descriptions
$ dcgmi config -h

config -- Used to configure settings for groups of GPUs.

Usage: dcgmi config
dcgmi config [--host <IP/FQDN>] [-g <groupId>] --enforce
dcgmi config [--host <IP/FQDN>] [-g <groupId>] --get [-v] [-j]
dcgmi config [--host <IP/FQDN>] [-g <groupId>] --set [-e <0/1>] [-s
<0/1>] [-a <mem,proc>] [-P <limit>] [-c <mode>]

...
-c --compmode mode Configure Compute Mode. Can be any of the
following:
0 - Unrestricted
1 - Prohibited
2 - Exclusive Process
-P --powerlimit limit Configure Power Limit (Watts).
-a --appclocks mem,proc Configure Application Clocks. Must use memory,proc
clocks (csv) format(MHz).
-s --syncboost 0/1 Configure Syncboost. (1 to Enable, 0 to Disable)
-e --eccmode 0/1 Configure Ecc mode. (1 to Enable, 0 to Disable)

# Change a setting
$ dcgmi config --host 127.0.0.1:39999 -g 3 --set -c 2

# Query the result
$ dcgmi config --host 127.0.0.1:39999 -g 3 --get
+------------------------------+------------------------------+------------------------------+
| GPU_GROUP |
| Group of 4 GPUs |
+==============================+==============================+==============================+
| Field | Target | Current |
+------------------------------+------------------------------+------------------------------+
| Compute Mode | E. Process | E. Process |
| ECC Mode | Not Specified | Enabled |
| Sync Boost | Not Specified | Not Supported |
| Memory Application Clock | Not Specified | 5001 |
| SM Application Clock | Not Specified | 585 |
| Power Limit | Not Specified | 70 |
+------------------------------+------------------------------+------------------------------+

Note that DCGM applies configuration declaratively: the user specifies the desired target settings through dcgmi, and nv-hostengine automatically adjusts the actual settings until they match the target.
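
Because nv-hostengine stores the target configuration, it can be re-applied on demand, for example after a GPU reset, with the --enforce option listed in the help output above:

$ dcgmi config --host 127.0.0.1:39999 -g 3 --enforce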

Policy and Alerting

dcgmi provides a policy feature. A policy is essentially a watch mechanism: you first define a violation condition and then attach a handling strategy to it. Typically, you set a condition, register a listener, and wait for DCGM notifications.

For example:

# Set a violation condition with a maximum temperature of 50°C
$ dcgmi policy --host 127.0.0.1:39999 -g 3 --set 0,0 -T 50

# Query the resulting policy with the following command
$ dcgmi policy --host 127.0.0.1:39999 -g 2 --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information |
| GPU_GROUP |
+=============================+================================================+
| Violation conditions | Max temperature threshold - 50 |
| Isolation mode | Manual |
| Action on violation | None |
| Validation after action | None |
| Validation failure action | None |
+-----------------------------+------------------------------------------------+

$ dcgmi policy --host 127.0.0.1:39999 -g 2 --reg
Timestamp: Wed Jan 6 17:02:27 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
Listening for violations.
Timestamp: Wed Jan 6 17:02:37 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
...

Parameter settings

 --set        actn,val   (OR required)  Set the current violation policy.
Use csv action,validation (ie. 1,2)
-----
Action to take when any of the violations
specified occur.
0 - None
1 - GPU Reset
-----
Validation to take after the violation action has
been performed.
0 - None
1 - System Validation (short)
2 - System Validation (medium)
3 - System Validation (long)
-x --xiderrors Add XID errors to the policy conditions.
-n --nvlinkerrors Add NVLink errors to the policy conditions.
-p --pcierrors Add PCIe replay errors to the policy conditions.
-e --eccerrors Add ECC double bit errors to the policy
conditions.
-P --maxpower max Specify the maximum power a group's GPUs can reach
before triggering a violation.
-T --maxtemp max Specify the maximum temperature a group's GPUs can
reach before triggering a violation.
-M --maxpages max Specify the maximum number of retired pages that
will trigger a violation.
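
Combining the options above, a policy can also carry an action and a follow-up validation. For example, to treat ECC double-bit errors, XID errors, and the 50°C temperature threshold as violations, reset the GPU on violation, and run a short system validation afterwards:

$ dcgmi policy --host 127.0.0.1:39999 -g 3 --set 1,1 -e -x -T 50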

Health Check

DCGM's health checks are non-intrusive and provide both real-time and aggregated health data. The workflow is:

  1. Enable health watches and select which subsystems to check (see the command sketch below)
  2. DCGM monitors the selected components in the background according to that configuration
  3. The user queries the errors found so far with the dcgmi health command
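
Step 1 is also done with dcgmi health; a minimal sketch, assuming its -s option selects the watched subsystems (a for all of them):

# Enable health watches for all subsystems on group 1
$ dcgmi health -g 1 -s a
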
$ dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy |
+==================+=========================================================+

$ dcgmi health --check -g 1
Health Monitor Report
+----------------------------------------------------------------------+
| Group 1 | Overall Health: Warning |
+==================+===================================================+
| GPU ID: 0 | Warning |
| | PCIe system: Warning - Detected more than 8 PCIe |
| | replays per minute for GPU 0: 13 |
+---------------+------------------------------------------------------+
| GPU ID: 1 | Warning |
| | InfoROM system: Warning - A corrupt InfoROM has been |
| | detected in GPU 1. |
+---------------+------------------------------------------------------+

GPU Diagnostics

Diagnostics are an active-checking mode. Three levels of checks are available; each run executes the test programs for the chosen level to uncover problems.

Run it as follows:

$ dcgmi diag --host 127.0.0.1:39999 -g 3 -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Deployment --------+------------------------------------------------|
| Blacklist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement | Pass |
| Graphics Processes | Pass |
| Inforom | Pass |
+---------------------------+------------------------------------------------+

Profile

The profiling feature retrieves GPU utilization data and per-process performance data with very little overhead. It has hard requirements on the driver version and the card model:

  1. DCGM version 1.7 or later
  2. Driver version 418.43 or later
  3. nv-hostengine started as root
  4. Currently only Tesla V100 and Tesla T4 cards are supported

The performance metrics that can be collected are:

  • Graphics Engine Activity: ratio of time the graphics engine is active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy. PROF_GR_ENGINE_ACTIVE (ID: 1001)
  • SM Activity: the ratio of cycles an SM has at least 1 warp assigned (computed from the number of cycles and elapsed cycles). PROF_SM_ACTIVE (ID: 1002)
  • SM Occupancy: the ratio of the number of warps resident on an SM (number of resident warps as a percentage of the theoretical maximum number of warps per elapsed cycle). PROF_SM_OCCUPANCY (ID: 1003)
  • Tensor Activity: the ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles). PROF_PIPE_TENSOR_ACTIVE (ID: 1004)
  • Memory BW Utilization: the ratio of cycles the device memory interface is active sending or receiving data. PROF_DRAM_ACTIVE (ID: 1005)
  • Engine Activity: ratio of cycles the fp64 / fp32 / fp16 / HMMA / IMMA pipes are active. PROF_PIPE_FPXY_ACTIVE (ID: 1006 (FP64); 1007 (FP32); 1008 (FP16))
  • NVLink Activity: the number of bytes of active NVLink rx or tx data including both header and payload. DEV_NVLINK_BANDWIDTH_L0
  • PCIe Bandwidth (pci_bytes{rx, tx}): the number of bytes of active PCIe rx or tx data including both header and payload. PROF_PCIE_[TR]X_BYTES (ID: 1009 (TX); 1010 (RX))
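
These profiling fields can be sampled continuously with dcgmi dmon; a sketch, assuming dmon's -e (field IDs), -d (delay in ms), and -c (sample count) options:

# Sample graphics engine activity (1001), SM activity (1002) and SM occupancy (1003)
# once per second, five times
$ dcgmi dmon --host 127.0.0.1:39999 -e 1001,1002,1003 -d 1000 -c 5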

Integrating GPU Telemetry with k8s

System monitoring usually requires the following components:

  • A data-collection component (collector) that serves as the data source
  • A time-series database that stores the collected metrics
  • A visualization component that presents the collected data in a friendly UI

Prometheus, the leading solution of the cloud-native era, combined with components such as Grafana and Alertmanager, provides system monitoring for k8s clusters. Its component architecture is shown below; see my other blog post for more details.

Likewise, to obtain GPU monitoring data, NVIDIA released dcgm-exporter. It wraps DCGM and, much like node-exporter, exposes GPU data to Prometheus:

Deploying dcgm-exporter

dcgm-exporter runs as a DaemonSet on every node that has GPUs. A Service is created alongside it so that Prometheus can scrape the data it collects:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: kube-system
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: kube-system
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400

After this step, the metrics are available on every node:

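For example, on a node running the exporter, the endpoint can be queried directly (assuming the default :9400 listen address configured above); default metrics such as DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED should appear in the output:

$ curl -s http://localhost:9400/metrics | head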


After deployment, add a gpu-metrics job to scrape_configs in the Prometheus configuration; the kubernetes_sd_configs service-discovery mechanism locates the service backing dcgm-exporter.

- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
    selectors:
    - role: pod
      label: "app.kubernetes.io/name=dcgm-exporter"
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Monitoring with Grafana

NVIDIA provides a Grafana dashboard dedicated to GPU monitoring. After importing it into Grafana, the corresponding GPU monitoring dashboard is available:

OpenFalcon GPU Monitoring Plugin

OpenFalcon is an open-source monitoring solution from Xiaomi; its architecture is shown in the figure below. A falcon-agent daemon runs on every node and collects that node's data.

To support GPU monitoring, OpenFalcon has a dedicated GPU plugin that relies on DCGM for its metrics. Some commonly used metrics are:

GPUUtils             GPU utilization (%)
MemUtils             GPU memory utilization (%)
FBUsed               GPU framebuffer memory used (MB)
Performance          GPU performance state (0-15, where 0 is the highest)
DeviceTemperature    Current GPU device temperature (℃)
PowerUsed            GPU power draw
SingleBitError       Accumulated single-bit ECC errors
DoubleBitError       Accumulated double-bit ECC errors

Analyzing GPU Manager Monitoring Data

Unlike OpenFalcon, GPU Manager is built on the NVML library and collects Pod-level GPU monitoring data.

func (disp *Display) getDeviceUsage(pidsInCont []int, deviceIdx int) *displayapi.DeviceInfo {
	nvml.Init()
	defer nvml.Shutdown()

	dev, err := nvml.DeviceGetHandleByIndex(uint(deviceIdx))
	if err != nil {
		klog.Warningf("can't find device %d, error %s", deviceIdx, err)
		return nil
	}

	// Per-process utilization samples collected over the last second
	processSamples, err := dev.DeviceGetProcessUtilization(1024, time.Second)
	if err != nil {
		klog.Warningf("can't get processes utilization from device %d, error %s", deviceIdx, err)
		return nil
	}

	// Compute processes currently running on the device, with their memory usage
	processOnDevices, err := dev.DeviceGetComputeRunningProcesses(1024)
	if err != nil {
		klog.Warningf("can't get processes info from device %d, error %s", deviceIdx, err)
		return nil
	}

	busID, err := dev.DeviceGetPciInfo()
	if err != nil {
		klog.Warningf("can't get pci info from device %d, error %s", deviceIdx, err)
		return nil
	}

	// Sort the container's PIDs so they can be binary-searched below
	sort.Slice(pidsInCont, func(i, j int) bool {
		return pidsInCont[i] < pidsInCont[j]
	})

	usedMemory := uint64(0)
	usedPids := make([]int32, 0)
	usedGPU := uint(0)

	// Sum the GPU memory used by processes that belong to this container
	for _, info := range processOnDevices {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(info.Pid)
		})

		if idx < len(pidsInCont) && pidsInCont[idx] == int(info.Pid) {
			usedPids = append(usedPids, int32(pidsInCont[idx]))
			usedMemory += info.UsedGPUMemory
		}
	}

	// Sum the SM utilization of the container's processes from the samples
	for _, sample := range processSamples {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(sample.Pid)
		})

		if idx < len(pidsInCont) && pidsInCont[idx] == int(sample.Pid) {
			usedGPU += sample.SmUtil
		}
	}

	return &displayapi.DeviceInfo{
		Id:      busID.BusID,
		CardIdx: fmt.Sprintf("%d", deviceIdx),
		Gpu:     float32(usedGPU),
		Mem:     float32(usedMemory >> 20), // bytes -> MiB
		Pids:    usedPids,
	}
}

Discussion of GPU Monitoring Metrics

For GPU monitoring in k8s, which metrics do we actually need?

  • Cluster level (see the query sketch after this list)
    • How many GPUs are in the cluster, and of which models
    • Cluster-wide GPU compute usage (absolute) and utilization (relative)
    • Cluster-wide GPU memory usage (absolute) and utilization (relative)
  • Node level
    • How many GPUs are on the node, and of which models
    • Per-node GPU compute usage (absolute) and utilization (relative)
    • Per-node GPU memory usage (absolute) and utilization (relative)
  • Pod level
    • Which GPU a Pod runs on
    • Per-Pod GPU compute usage (absolute) and utilization (relative)
    • Per-Pod GPU memory usage (absolute) and utilization (relative)
  • Other related statistics
    • GPU power, temperature, clock frequency, fan speed, and so on
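
As a sketch of how some of these can be derived from the dcgm-exporter metrics, the queries below go through Prometheus' HTTP API. DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are default dcgm-exporter metrics, the kubernetes_node label comes from the relabel rule shown earlier, and the Prometheus address is a placeholder:

# Cluster-wide average GPU (SM) utilization, in percent
$ curl -sG 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)'

# Framebuffer memory used per node, grouped by the kubernetes_node label
$ curl -sG 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=sum by (kubernetes_node) (DCGM_FI_DEV_FB_USED)'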

References