Background

When using GPUs for deep-learning training and inference, we need visibility into how the GPUs in the cluster are being used:

From current GPU resource usage, decide whether new applications can still be deployed and whether the cluster needs to scale out, bringing capacity assurance for GPU services up to par with what we already provide for CPU and closing the GPU gap in capacity planning
From current GPU resource usage, identify bottlenecks and weak spots, drive optimization, and improve resource utilization and service performance
To obtain GPU monitoring data, NVIDIA provides three approaches:

NVML: NVIDIA Management Library, a C library for monitoring and managing GPUs; the nvidia-smi command is built on top of it
DCGM: Data Center GPU Manager, a complete suite of GPU monitoring and management tools built on NVML and CUDA
Third-party tools: monitoring tools built on DCGM or NVML that can integrate with systems such as Prometheus and add storage, UIs, and so on
Comparing the three:

NVML
Stateless queries: only current values can be read
A low-level API for controlling GPUs
Management tools built on NVML are cheap to run but expensive to develop
Management tools built on NVML must run on the same node as the GPUs
DCGM
Can query metrics going back several hours
Provides GPU health checks and diagnostics
Can query a group of GPUs in one batch
Can run in either remote or local mode
Third-party tools

The rest of this article focuses on DCGM.
DCGM

The figure below shows how DCGM runs in a cluster: DCGM is deployed as an agent on each compute node, and tools on the management node manage and monitor the GPUs through the APIs DCGM exposes.

DCGM provides the following four key features:

Active Health Monitoring
GPU Diagnostics
Policy and Alerting
Configuration Management
Installation

DCGM must be downloaded and installed separately. Get the package from the NVIDIA website; here an rpm package is used. After downloading:
$ yum remove datacenter-gpu-manager
$ rpm -ivh datacenter-gpu-manager-2.0.13-1-x86_64.rpm
The DCGM shared libraries are installed under /usr/lib64, and the Python bindings under /usr/local/dcgm/bindings.
DCGM is a cluster-oriented management tool, so before actually using it you need to start an agent, nv-hostengine, on the target machine:
$ nv-hostengine --port 39999 --bind-interface 127.0.0.1
Host Engine Listener Started
Started host engine version 2.0.13 using port number: 39999

$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:08.0                                         |
|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |
+--------+----------------------------------------------------------------------+
| 1      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |
+--------+----------------------------------------------------------------------+
| 2      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0A.0                                         |
|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |
+--------+----------------------------------------------------------------------+
| 3      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0B.0                                         |
|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

$ nv-hostengine -t
Host engine successfully terminated.
Here, --port and --bind-interface set the listening port and the bound IP address; communication over a UNIX_SOCKET is also supported.
Once nv-hostengine is running, we can operate on it with dcgmi.
Group operations

Unlike NVML, most of DCGM's functionality is group-oriented, so before using DCGM you first need to create a group; only then can you use the various features DCGM provides.
$ dcgmi group --host 127.0.0.1:39999 -c GPU_GROUP
Successfully created group "GPU_GROUP" with a group ID of 2

$ dcgmi group --host 127.0.0.1:39999 -l
+-------------------+----------------------------------------------------------+
| GROUPS            |                                                          |
| 1 group found.    |                                                          |
+===================+==========================================================+
| Groups            |                                                          |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | GPU_GROUP                                                |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+

$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:08.0                                         |
|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |
+--------+----------------------------------------------------------------------+
| 1      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |
+--------+----------------------------------------------------------------------+
| 2      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0A.0                                         |
|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |
+--------+----------------------------------------------------------------------+
| 3      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0B.0                                         |
|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

$ dcgmi group --host 127.0.0.1:39999 -g 2 -a 0,1
Add to group operation successful.

$ dcgmi group --host 127.0.0.1:39999 -g 2 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO        |                                                          |
+===================+==========================================================+
| 2                 |                                                          |
| -> Group ID       | 2                                                        |
| -> Group Name     | GPU_GROUP                                                |
| -> Entities       | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+

$ dcgmi group --host 127.0.0.1:39999 -g 2 -r 0,1
Remove from group operation successful.

$ dcgmi group --host 127.0.0.1:39999 -d 2
Note: groups and devices form a many-to-many relationship: a GPU can belong to several groups at once, and a group can contain several GPUs.
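The many-to-many relationship can be sketched in a few lines; this is a minimal, hypothetical model (group names and membership are made up, not DCGM's own bookkeeping) showing that one GPU may sit in several groups at once:

```python
# Minimal sketch of DCGM's group model: a group is just a named set of
# GPU IDs, and the same GPU may belong to any number of groups.
# (The group names and IDs here are illustrative only.)

groups = {}  # group name -> set of GPU IDs

def add_to_group(name, *gpu_ids):
    groups.setdefault(name, set()).update(gpu_ids)

add_to_group("GPU_GROUP", 0, 1)      # like: dcgmi group -g 2 -a 0,1
add_to_group("DIAG_GROUP", 1, 2, 3)  # GPU 1 is also in a second group

# Which groups contain GPU 1?
member_of = [g for g, ids in groups.items() if 1 in ids]
print(sorted(member_of))  # ['DIAG_GROUP', 'GPU_GROUP']
```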
Job Statistics

When a job is accelerated on GPUs, we want to know:

Which GPUs did my job run on?
How much GPU did my job use?
Were there any errors or warnings while my job was running?
Are all the system's GPUs healthy and ready for the next job?
$ dcgmi group --host 127.0.0.1:39999 -g 3 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO        |                                                          |
+===================+==========================================================+
| 3                 |                                                          |
| -> Group ID       | 3                                                        |
| -> Group Name     | GPU_GROUP                                                |
| -> Entities       | GPU 0, GPU 1, GPU 2, GPU 3                               |
+-------------------+----------------------------------------------------------+

$ dcgmi stats --host 127.0.0.1:39999 -g 3 --enable
Successfully started process watches.

$ dcgmi stats --host 127.0.0.1:39999 -g 3 -p 41861 -v
Successfully retrieved process info for PID: 41861. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 3                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time *                       | Wed Jan  6 16:54:16 2021                |
| End Time *                         | Still Running                           |
| Total Execution Time (sec) *       | Still Running                           |
| No. of Conflicting Processes *     | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 2985                                    |
| Max GPU Memory Used (bytes) *      | 12107907072                             |
| SM Clock (MHz)                     | Avg: 1590, Max: 1590, Min: 1590         |
| Memory Clock (MHz)                 | Avg: 5000, Max: 5000, Min: 5000         |
| SM Utilization (%)                 | Avg: 100, Max: 100, Min: 100            |
| Memory Utilization (%)             | Avg: 5, Max: 5, Min: 5                  |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | 0                                       |
|        - Board Limit (%)           | 0                                       |
|        - Low Utilization (%)       | 0                                       |
|        - Sync Boost (%)            | 0                                       |
+-----  Process Utilization  --------+-----------------------------------------+
| PID                                | 41861                                   |
| Avg SM Utilization (%)             | 99                                      |
| Avg Memory Utilization (%)         | 3                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+
(*) Represents a process statistic. Otherwise device statistic during process lifetime listed.
Configuration Management

DCGM can also change GPU settings; the supported settings are shown below. First, read the current configuration:
$ dcgmi config --host 127.0.0.1:39999 -g 3 --get
+--------------------------+--------------------------+--------------------------+
| GPU_GROUP                |                          | Group of 4 GPUs          |
+==========================+==========================+==========================+
| Field                    | Target                   | Current                  |
+--------------------------+--------------------------+--------------------------+
| Compute Mode             | Not Specified            | Unrestricted             |
| ECC Mode                 | Not Specified            | Enabled                  |
| Sync Boost               | Not Specified            | Not Supported            |
| Memory Application Clock | Not Specified            | 5001                     |
| SM Application Clock     | Not Specified            | 585                      |
| Power Limit              | Not Specified            | 70                       |
+--------------------------+--------------------------+--------------------------+
$ dcgmi config -h

 config -- Used to configure settings for groups of GPUs.

Usage: dcgmi config
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --enforce
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --get [-v] [-j]
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --set [-e <0/1>] [-s <0/1>]
        [-a <mem,proc>] [-P <limit>] [-c <mode>]
...
   -c  --compmode   mode      Configure Compute Mode. Can be any of the following:
                              0 - Unrestricted
                              1 - Prohibited
                              2 - Exclusive Process
   -P  --powerlimit limit     Configure Power Limit (Watts).
   -a  --appclocks  mem,proc  Configure Application Clocks. Must use memory,proc
                              clocks (csv) format (MHz).
   -s  --syncboost  0/1       Configure Syncboost. (1 to Enable, 0 to Disable)
   -e  --eccmode    0/1       Configure Ecc mode. (1 to Enable, 0 to Disable)

$ dcgmi config --host 127.0.0.1:39999 -g 3 --set -c 2

$ dcgmi config --host 127.0.0.1:39999 -g 3 --get
+--------------------------+--------------------------+--------------------------+
| GPU_GROUP                |                          | Group of 4 GPUs          |
+==========================+==========================+==========================+
| Field                    | Target                   | Current                  |
+--------------------------+--------------------------+--------------------------+
| Compute Mode             | E. Process               | E. Process               |
| ECC Mode                 | Not Specified            | Enabled                  |
| Sync Boost               | Not Specified            | Not Supported            |
| Memory Application Clock | Not Specified            | 5001                     |
| SM Application Clock     | Not Specified            | 585                      |
| Power Limit              | Not Specified            | 70                       |
+--------------------------+--------------------------+--------------------------+
Note that DCGM applies settings declaratively: the user specifies the desired target configuration through dcgmi, and nv-hostengine automatically adjusts the device until the current settings match the target.
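This declarative behavior can be illustrated with a small sketch. The field names mirror the dcgmi output above, but the reconcile step itself is illustrative (in reality nv-hostengine issues the corresponding driver calls):

```python
# Sketch of declarative configuration: the user states a *target*, and a
# reconcile step drives *current* toward it. "None" plays the role of
# "Not Specified" in the dcgmi output; the apply step is hypothetical.

def reconcile(current, target):
    """Return the new current state plus the changes that were applied."""
    changes = {}
    for field, wanted in target.items():
        if wanted is not None and current.get(field) != wanted:
            changes[field] = wanted  # in DCGM this would be a driver call
    new_current = {**current, **changes}
    return new_current, changes

current = {"Compute Mode": "Unrestricted", "ECC Mode": "Enabled", "Power Limit": 70}
target = {"Compute Mode": "Exclusive Process", "ECC Mode": None, "Power Limit": None}

current, changes = reconcile(current, target)
print(changes)  # {'Compute Mode': 'Exclusive Process'}
```

Only the fields the user actually specified are touched; unspecified fields keep their current values, exactly as in the `--set -c 2` example above.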
Policy and Alerting

DCGM also provides a policy feature. A policy is essentially a watch mechanism: you first define a violation condition, and then attach a handling strategy to run when that condition is violated. Typically you set a condition, register a listener, and wait for DCGM's notifications.

For example:
$ dcgmi policy --host 127.0.0.1:39999 -g 3 --set 0,0 -T 50

$ dcgmi policy --host 127.0.0.1:39999 -g 2 --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information          |                                                |
| GPU_GROUP                   |                                                |
+=============================+================================================+
| Violation conditions        | Max temperature threshold - 50                 |
| Isolation mode              | Manual                                         |
| Action on violation         | None                                           |
| Validation after action     | None                                           |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+

$ dcgmi policy --host 127.0.0.1:39999 -g 2 --reg
Listening for violations.

Timestamp: Wed Jan  6 17:02:27 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65

Timestamp: Wed Jan  6 17:02:37 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
...
The parameters:
--set   actn,val     Set the current violation policy. Use csv action,validation
                     (ie. 1,2)
                     ----- Action to take when any of the violations specified
                     occur.
                     0 - None
                     1 - GPU Reset
                     ----- Validation to take after the violation action has
                     been performed.
                     0 - None
                     1 - System Validation (short)
                     2 - System Validation (medium)
                     3 - System Validation (long)
-x  --xiderrors      Add XID errors to the policy conditions.
-n  --nvlinkerrors   Add NVLink errors to the policy conditions.
-p  --pcierrors      Add PCIe replay errors to the policy conditions.
-e  --eccerrors      Add ECC double bit errors to the policy conditions.
-P  --maxpower max   Specify the maximum power a group's GPUs can reach before
                     triggering a violation.
-T  --maxtemp max    Specify the maximum temperature a group's GPUs can reach
                     before triggering a violation.
-M  --maxpages max   Specify the maximum number of retired pages that will
                     trigger a violation.
Health check

DCGM's health checks are non-intrusive and provide both real-time and aggregated health data. The mechanism is:

Enable health watches and select the subsystems to check
DCGM monitors the configured components in the background
The user queries the errors found so far with the dcgmi health command
$ dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health: Healthy                                                    |
+==================+=========================================================+

$ dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------+
| Group 1          | Overall Health: Warning                           |
+==================+===================================================+
| GPU ID: 0        | Warning                                           |
|                  | PCIe system: Warning - Detected more than 8 PCIe  |
|                  | replays per minute for GPU 0: 13                  |
+------------------+---------------------------------------------------+
| GPU ID: 1        | Warning                                           |
|                  | InfoROM system: Warning - A corrupt InfoROM has   |
|                  | been detected in GPU 1.                           |
+------------------+---------------------------------------------------+
GPU Diagnostics

Diagnostics are an active-check mode with three levels of checks; each run executes the test programs corresponding to the chosen level to surface problems.
Run it as follows:
$ dcgmi diag --host 127.0.0.1:39999 -g 3 -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement           | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
Profiling

The profiling feature collects GPU utilization and per-process performance data at a small performance cost. It has some hard requirements on the driver version and card model:

DCGM version newer than 1.7
Driver version newer than 418.43
nv-hostengine started as root
Currently only Tesla V100 and Tesla T4 cards are supported
The available performance metrics are:

Graphics Engine Activity (PROF_GR_ENGINE_ACTIVE, ID 1001)
    Ratio of time the graphics engine is active. The graphics engine is active
    if a graphics/compute context is bound and the graphics pipe or compute
    pipe is busy.

SM Activity (PROF_SM_ACTIVE, ID 1002)
    The ratio of cycles an SM has at least 1 warp assigned (computed from the
    number of cycles and elapsed cycles).

SM Occupancy (PROF_SM_OCCUPANCY, ID 1003)
    The ratio of number of warps resident on an SM (number of resident warps
    as a percentage of the theoretical maximum number of warps per elapsed
    cycle).

Tensor Activity (PROF_PIPE_TENSOR_ACTIVE, ID 1004)
    The ratio of cycles the tensor (HMMA) pipe is active (off the peak
    sustained elapsed cycles).

Memory BW Utilization (PROF_DRAM_ACTIVE, ID 1005)
    The ratio of cycles the device memory interface is active sending or
    receiving data.

FP Engine Activity (PROF_PIPE_FP64_ACTIVE, ID 1006; PROF_PIPE_FP32_ACTIVE, ID 1007; PROF_PIPE_FP16_ACTIVE, ID 1008)
    Ratio of cycles the fp64 / fp32 / fp16 pipes are active.

NVLink Activity (DEV_NVLINK_BANDWIDTH_L0)
    The number of bytes of active NVLink rx or tx data including both header
    and payload.

PCIe Bandwidth (PROF_PCIE_TX_BYTES, ID 1009; PROF_PCIE_RX_BYTES, ID 1010)
    The number of bytes of active PCIe rx or tx data including both header
    and payload.
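When these fields are read programmatically (for example through the DCGM Python bindings or dcgm-exporter), they are addressed by the numeric field IDs above. A small lookup table makes the mapping explicit; the names follow the table above, so verify the exact identifiers against the dcgm_fields header for your DCGM version:

```python
# Profiling field IDs 1001-1010, as listed in the table above.
# Names mirror the table; treat the exact identifiers as indicative.
PROF_FIELDS = {
    1001: "PROF_GR_ENGINE_ACTIVE",    # graphics engine activity
    1002: "PROF_SM_ACTIVE",           # SM activity
    1003: "PROF_SM_OCCUPANCY",        # SM occupancy
    1004: "PROF_PIPE_TENSOR_ACTIVE",  # tensor (HMMA) pipe activity
    1005: "PROF_DRAM_ACTIVE",         # device memory bandwidth utilization
    1006: "PROF_PIPE_FP64_ACTIVE",
    1007: "PROF_PIPE_FP32_ACTIVE",
    1008: "PROF_PIPE_FP16_ACTIVE",
    1009: "PROF_PCIE_TX_BYTES",
    1010: "PROF_PCIE_RX_BYTES",
}

def field_name(field_id):
    """Map a numeric profiling field ID to its symbolic name."""
    return PROF_FIELDS.get(field_id, f"UNKNOWN_FIELD_{field_id}")

print(field_name(1002))  # PROF_SM_ACTIVE
```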
Integrating GPU telemetry in Kubernetes

A monitoring system usually needs the following components:

A collector that gathers the data and acts as the data source
A time-series database that stores the collected metrics
A visualization component that presents the collected data in a friendly UI
Prometheus, an excellent solution of the cloud-native era, combines with components such as Grafana and Alertmanager to monitor k8s clusters; its component architecture is shown below, and more detail can be found in another post of mine.
Likewise, to expose GPU monitoring data, NVIDIA released dcgm-exporter, which wraps DCGM and, much like node-exporter, exposes GPU data to Prometheus:
Deploying dcgm-exporter

dcgm-exporter runs as a DaemonSet on every node that has GPUs; a Service is created alongside it so that Prometheus can scrape the data it collects.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: kube-system
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: kube-system
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400
After this step, the metrics on each node become available:
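dcgm-exporter serves these metrics in the standard Prometheus text exposition format on port 9400 at /metrics. A quick sketch of parsing one gauge from such a scrape; the sample payload below is made up, but DCGM_FI_DEV_GPU_UTIL (GPU utilization in percent) is a real dcgm-exporter metric name:

```python
import re

# Illustrative scrape payload in Prometheus text exposition format.
payload = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-0bf43c76"} 93
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-c55a4e5e"} 0
"""

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def parse_gauge(text, metric):
    """Return {gpu_id: value} for one metric in the exposition text."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(1) == metric:
            labels = dict(kv.split("=", 1) for kv in m.group(2).split(","))
            out[labels["gpu"].strip('"')] = float(m.group(3))
    return out

print(parse_gauge(payload, "DCGM_FI_DEV_GPU_UTIL"))  # {'0': 93.0, '1': 0.0}
```

Note this simple regex assumes no commas inside label values; a real consumer would use a Prometheus client library instead.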
Once it is deployed, add a gpu-metrics job to scrape_configs in the Prometheus configuration, using the kubernetes_sd_configs service-discovery mechanism to locate the service backing dcgm-exporter.
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
    selectors:
    - role: pod
      label: "app.kubernetes.io/name=dcgm-exporter"
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
Monitoring with Grafana

NVIDIA provides a Grafana dashboard dedicated to GPU monitoring; after importing it into Grafana, the GPU dashboards are ready to use:
OpenFalcon GPU monitoring plugin

OpenFalcon is an open-source monitoring solution from Xiaomi; its architecture is shown in the figure below. A falcon-agent daemon runs on every node and collects that node's data.
To support GPU monitoring, OpenFalcon has a dedicated GPU plugin that relies on DCGM for its metrics. Some commonly used metrics:
GPUUtils            GPU utilization (%)
MemUtils            GPU memory utilization (%)
FBUsed              GPU framebuffer memory used (MB)
Performance         GPU performance state (0-15, where 0 is the highest)
DeviceTemperature   current GPU temperature (°C)
PowerUsed           GPU power draw
SingleBitError      total accumulated single-bit ECC errors
DoubleBitError      total accumulated double-bit ECC errors
GPU Manager monitoring data

Unlike OpenFalcon, GPU Manager is built on the NVML library and obtains GPU monitoring data at the pod level.
func (disp *Display) getDeviceUsage(pidsInCont []int, deviceIdx int) *displayapi.DeviceInfo {
	nvml.Init()
	defer nvml.Shutdown()

	dev, err := nvml.DeviceGetHandleByIndex(uint(deviceIdx))
	if err != nil {
		klog.Warningf("can't find device %d, error %s", deviceIdx, err)
		return nil
	}

	processSamples, err := dev.DeviceGetProcessUtilization(1024, time.Second)
	if err != nil {
		klog.Warningf("can't get processes utilization from device %d, error %s", deviceIdx, err)
		return nil
	}

	processOnDevices, err := dev.DeviceGetComputeRunningProcesses(1024)
	if err != nil {
		klog.Warningf("can't get processes info from device %d, error %s", deviceIdx, err)
		return nil
	}

	busID, err := dev.DeviceGetPciInfo()
	if err != nil {
		klog.Warningf("can't get pci info from device %d, error %s", deviceIdx, err)
		return nil
	}

	sort.Slice(pidsInCont, func(i, j int) bool {
		return pidsInCont[i] < pidsInCont[j]
	})

	usedMemory := uint64(0)
	usedPids := make([]int32, 0)
	usedGPU := uint(0)
	for _, info := range processOnDevices {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(info.Pid)
		})
		if idx < len(pidsInCont) && pidsInCont[idx] == int(info.Pid) {
			usedPids = append(usedPids, int32(pidsInCont[idx]))
			usedMemory += info.UsedGPUMemory
		}
	}

	for _, sample := range processSamples {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(sample.Pid)
		})
		if idx < len(pidsInCont) && pidsInCont[idx] == int(sample.Pid) {
			usedGPU += sample.SmUtil
		}
	}

	return &displayapi.DeviceInfo{
		Id:      busID.BusID,
		CardIdx: fmt.Sprintf("%d", deviceIdx),
		Gpu:     float32(usedGPU),
		Mem:     float32(usedMemory >> 20),
		Pids:    usedPids,
	}
}
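The heart of getDeviceUsage is attributing device-wide process data to one container: sort the container's PIDs once, then binary-search each on-device PID against that list. The same logic in a compact Python sketch (the sample data is made up):

```python
from bisect import bisect_left

def attribute_usage(pids_in_container, device_processes, device_samples):
    """Sum GPU memory and SM utilization only for PIDs in the container.

    device_processes: [(pid, used_gpu_memory_bytes)]
    device_samples:   [(pid, sm_util_percent)]
    """
    pids = sorted(pids_in_container)

    def in_container(pid):
        i = bisect_left(pids, pid)  # same role as sort.Search in the Go code
        return i < len(pids) and pids[i] == pid

    used_mem = sum(mem for pid, mem in device_processes if in_container(pid))
    used_sm = sum(util for pid, util in device_samples if in_container(pid))
    return used_mem, used_sm

mem, sm = attribute_usage(
    pids_in_container=[41861, 52000],
    device_processes=[(41861, 12107907072), (99999, 1 << 30)],  # 99999: other pod
    device_samples=[(41861, 99), (99999, 1)],
)
print(mem >> 20, sm)  # 11547 99  (memory in MiB, as in the Go Mem field)
```

As in the Go code, processes running on the device but belonging to other pods are filtered out, so the result reflects only this container's share.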
A discussion of GPU monitoring metrics

For GPU monitoring in k8s, which metrics do we actually need?

Cluster level
How many GPUs the cluster has, and of which models
Cluster-level GPU compute usage (absolute) and compute utilization (relative)
Cluster-level GPU memory usage (absolute) and memory utilization (relative)
Node level
How many GPUs the node has, and of which models
Node-level GPU compute usage (absolute) and compute utilization (relative)
Node-level GPU memory usage (absolute) and memory utilization (relative)
Pod level
Which GPUs the pod runs on
Pod-level GPU compute usage (absolute) and compute utilization (relative)
Pod-level GPU memory usage (absolute) and memory utilization (relative)
Other related statistics
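The three levels above are really just aggregations of the same per-GPU samples over different label sets. A sketch, with made-up sample data whose labels mimic what dcgm-exporter makes possible (node, gpu, pod):

```python
from collections import defaultdict

# Hypothetical per-GPU samples; memory values are in GiB.
samples = [
    {"node": "n1", "gpu": "0", "pod": "train-a", "util": 90, "mem_used": 12.0, "mem_total": 16.0},
    {"node": "n1", "gpu": "1", "pod": "",        "util": 0,  "mem_used": 0.0,  "mem_total": 16.0},
    {"node": "n2", "gpu": "0", "pod": "infer-b", "util": 40, "mem_used": 4.0,  "mem_total": 16.0},
]

def rollup(samples, key):
    """Aggregate absolute usage and relative utilization per value of `key`."""
    agg = defaultdict(lambda: {"gpus": 0, "util_sum": 0, "mem_used": 0.0, "mem_total": 0.0})
    for s in samples:
        if not s[key]:
            continue  # e.g. an idle GPU with no pod label
        a = agg[s[key]]
        a["gpus"] += 1
        a["util_sum"] += s["util"]
        a["mem_used"] += s["mem_used"]
        a["mem_total"] += s["mem_total"]
    return {k: {**v, "util_pct": v["util_sum"] / v["gpus"]} for k, v in agg.items()}

per_node = rollup(samples, "node")
print(per_node["n1"]["util_pct"])  # 45.0  (two GPUs: 90% and 0%)
```

Grouping by "node" gives node-level numbers, by "pod" gives pod-level numbers, and summing over everything gives the cluster-level view; in practice these would be PromQL sum/avg queries over the exporter's labels rather than Python.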
References