Data center managers aim to keep resource utilization high, so an ideal data center accelerator doesn’t just go big: it also efficiently accelerates many smaller workloads.
$ nvidia-smi
Wed Jan 13 11:42:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:08.0 Off |                    0 |
| N/A   26C    P0    43W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:08.0:In use by another client
00000000:00:08.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.
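Because the mode change can stay pending until all processes are stopped or the system is rebooted, it is worth confirming the current and pending MIG mode before going further. A quick check using standard nvidia-smi query fields:

$ nvidia-smi -i 0 --query-gpu=mig.mode.current,mig.mode.pending --format=csv

Once the change has taken effect, both values should read Enabled.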
After MIG mode is enabled, list the GPU instance profiles and the possible placements available for each:
# nvidia-smi mig -lgipp
GPU  0 Profile ID 19 Placements: {0,1,2,3,4,5,6}:1
GPU  0 Profile ID 14 Placements: {0,2,4}:2
GPU  0 Profile ID  9 Placements: {0,4}:4
GPU  0 Profile ID  5 Placement : {0}:4
GPU  0 Profile ID  0 Placement : {0}:8
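The profile IDs shown here map to named profiles used by the creation commands below, for example ID 19 is MIG 1g.5gb, ID 14 is MIG 2g.10gb and ID 9 is MIG 3g.20gb. If you want the full profile table for your GPU (names, memory sizes, available instance counts), you can also list the GPU instance profiles; output varies by GPU and driver version:

# nvidia-smi mig -lgip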
Creating GPU Instances
# nvidia-smi mig -cgi 9,14,19
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.20gb (ID  9)
Successfully created GPU instance ID  3 on GPU  0 using profile MIG 2g.10gb (ID 14)
Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)
# nvidia-smi mig -cci 0,1 -gi 2
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 1c.3g.20gb (ID  0)
Successfully created compute instance ID  1 on GPU  0 GPU instance ID  2 using profile MIG 2c.3g.20gb (ID  1)
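To double-check what exists at this point, the created GPU instances and compute instances can be listed as well; the IDs reported should match the creation messages above:

# nvidia-smi mig -lgi
# nvidia-smi mig -lci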
$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0 nvidia/cuda nvidia-smi
Wed Jan 13 11:30:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:00:08.0 Off |                   On |
| N/A   26C    P0    42W / 400W |                  N/A |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |      5MiB / 20096MiB | 14      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
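Since GPU instance 2 was split into two compute instances (CI 0 and CI 1) earlier, a second container can run concurrently on the other compute instance simply by changing the compute instance index in the device string. A minimal sketch reusing the same GPU UUID and image as the example above:

$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/1 nvidia/cuda nvidia-smi -L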
Thu Jan 14 16:35:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:00:08.0 Off |                   On |
| N/A   26C    P0    43W / 400W |      0MiB / 40536MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
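After saving the manifest, deploy it and confirm the plugin pod comes up on the node; the file name below is only a placeholder:

$ kubectl apply -f nvidia-device-plugin.yml
$ kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds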
$ nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
Successfully created GPU instance ID 13 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 13 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID 11 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 11 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID 12 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 12 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID  7 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  7 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID  8 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  8 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.5gb (ID  0)
Successfully created GPU instance ID 10 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 10 using profile MIG 1g.5gb (ID  0)

$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/10/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/11/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/12/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/13/0)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false", "--mig-strategy=single"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
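With --mig-strategy=single, the device plugin advertises each MIG device as a plain nvidia.com/gpu resource, so after the seven 1g.5gb instances above are created the node should report a capacity of 7. A quick way to verify (the node name is a placeholder):

$ kubectl describe node <node-name> | grep nvidia.com/gpu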
$ for i in $(seq 7); do
    kubectl run \
      --image=nvidia/cuda:11.0-base \
      --restart=Never \
      --limits=nvidia.com/gpu=1 \
      mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
  done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created
$ for i in $(seq 7); do
    echo "mig-single-example-${i}";
    kubectl logs mig-single-example-${i}
    echo "";
  done
$ for i in $(seq 7); do kubectl delete pod mig-single-example-${i}; done
pod "mig-single-example-1" deleted pod "mig-single-example-2" deleted ...
Mixed
After confirming that MIG is enabled on the node, create three GPU instances (GIs) of different sizes, each with a corresponding compute instance (CI):
$ nvidia-smi mig -cgi 9,14,19 -C
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.20gb (ID  9)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 3g.20gb (ID  2)
Successfully created GPU instance ID  3 on GPU  0 using profile MIG 2g.10gb (ID 14)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  3 using profile MIG 2g.10gb (ID  1)
Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.5gb (ID  0)

$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/3/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/9/0)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false", "--mig-strategy=mixed"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
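With --mig-strategy=mixed, each MIG profile is advertised as its own extended resource, e.g. nvidia.com/mig-3g.20gb, nvidia.com/mig-2g.10gb and nvidia.com/mig-1g.5gb, so a pod can request a specific slice. A sketch that follows the same kubectl run pattern as the single-strategy example (pod name is illustrative):

$ kubectl run mig-mixed-example \
    --image=nvidia/cuda:11.0-base \
    --restart=Never \
    --limits=nvidia.com/mig-3g.20gb=1 \
    -- bash -c "nvidia-smi -L; sleep infinity"

The pod should see only the requested 3g.20gb device in its nvidia-smi -L output.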
If you are using MIG inside a VM with GPU passthrough, then you may need to reboot the VM to allow the GPU to be in MIG mode, as in some cases GPU reset is not allowed via the hypervisor for security reasons. This can be seen in the following example:
$ sudo nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:03.0:Not Supported
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:00:03.0
All done.
$ sudo nvidia-smi --gpu-reset
Resetting GPU 00000000:00:03.0 is not supported.