
[Heterogeneous Computing] NVIDIA XID Message

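Xid messages are error reports from the NVIDIA driver, written to the operating system's kernel log or event log. An Xid message indicates that a general GPU error occurred, most often because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. The message may point to a hardware problem, an NVIDIA software problem, or a user application problem.

An Xid message can therefore have one of three sources: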

  • Hardware Problem
  • NVIDIA Software Problem
  • User Application Problem

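Xid messages are intended for error diagnosis and help with debugging the reported error. The meaning of each Xid message is consistent across NVIDIA driver versions.

Viewing Xid Errors

On Linux, Xid errors are recorded in /var/log/messages. The example below shows the log entries for XID 14: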

$ grep "NVRM: Xid" /var/log/messages
[…] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7
[…] NVRM: Xid (0000:03:00): 14, Channel 00000001
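
The log format above can also be parsed programmatically. The following is a minimal Go sketch, assuming the line layout matches the sample shown (the exact layout may differ between driver versions). The `XidEvent` type and the `ParseXid` function are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// xidLine matches kernel log lines such as
// "NVRM: Xid (0000:03:00): 14, Channel 00000001".
// Assumption: the PCI address sits in parentheses and the Xid number follows.
var xidLine = regexp.MustCompile(`NVRM: Xid \(([^)]+)\): (\d+)`)

// XidEvent is a hypothetical record for one parsed log line.
type XidEvent struct {
	PCIAddr string // PCI address as printed by the driver
	Code    int    // Xid number
}

// ParseXid extracts the PCI address and Xid number from one log line.
// The second return value is false when the line is not an Xid report.
func ParseXid(line string) (XidEvent, bool) {
	m := xidLine.FindStringSubmatch(line)
	if m == nil {
		return XidEvent{}, false
	}
	code, err := strconv.Atoi(m[2])
	if err != nil {
		return XidEvent{}, false
	}
	return XidEvent{PCIAddr: m[1], Code: code}, true
}

func main() {
	ev, ok := ParseXid("[...] NVRM: Xid (0000:03:00): 14, Channel 00000001")
	fmt.Println(ev.PCIAddr, ev.Code, ok)
}
```

Feeding each line of /var/log/messages through `ParseXid` would give a per-GPU list of Xid numbers that can be looked up in the tables later in this post.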

The NVML library provided by NVIDIA can also listen for GPU Xid errors. The following Go snippet shows such a listener:

// Create an event set and release it when the function returns.
eventSet := nvml.NewEventSet()
defer nvml.DeleteEventSet(eventSet)

// Subscribe to critical Xid events on every device.
for _, gpu := range devices {
	err = nvml.RegisterEventForDevice(eventSet, nvml.XidCriticalError, gpu)
	// ...
}

for {
	// Exit as soon as a stop signal arrives; otherwise keep polling.
	select {
	case <-stop:
		return
	default:
	}

	// Wait up to 5000 ms for the next event.
	e, err := nvml.WaitForEvent(eventSet, 5000)
	// Skip failed waits and anything that is not a critical Xid event.
	if err != nil || e.Etype != nvml.XidCriticalError {
		continue
	}

	// FIXME: formalize the full list and document it.
	// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
	// Application errors: the GPU should still be healthy
	if e.Edata == 31 || e.Edata == 43 || e.Edata == 45 {
		continue
	}

	// An event without a device UUID cannot be attributed to a single GPU.
	if e.UUID == nil || len(*e.UUID) == 0 {
		// All devices are unhealthy
		log.Printf("XidCriticalError: Xid=%d, All devices will go unhealthy.", e.Edata)
		for _, d := range devices {
			unhealthy <- d
		}
		continue
	}
	// ...
}
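
The filter on Xid 31, 43, and 45 in the listener above can be pulled out into a small helper so the list is easier to extend. This is only a sketch: `applicationXids` and `IsApplicationXid` are hypothetical names, and the `uint64` parameter type is an assumption about the type of `e.Edata`.

```go
package main

import "fmt"

// applicationXids lists the Xid codes that the listener above skips because
// they usually indicate an application error while the GPU stays healthy.
var applicationXids = map[uint64]bool{
	31: true, // GPU memory page fault
	43: true, // GPU stopped processing
	45: true, // Preemptive cleanup
}

// IsApplicationXid reports whether xid is in the skip list.
// Assumption: the event data carrying the Xid number is a uint64.
func IsApplicationXid(xid uint64) bool {
	return applicationXids[xid]
}

func main() {
	for _, xid := range []uint64{31, 48} {
		fmt.Printf("Xid %d application-level: %v\n", xid, IsApplicationXid(xid))
	}
}
```

Keeping the list in one map means a newly classified Xid only needs one added entry, which is one possible way to address the FIXME in the listener.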

Common Xid Errors

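XID 13: GR: SW Notify Error

XID 13 is a general user-process error, typically caused by an out-of-bounds array access, an illegal instruction, or an illegal register access. Only rarely is it a hardware or kernel driver problem; in most cases the user process is at fault.

When this error occurs, NVIDIA recommends the following steps:

  1. Run the application in cuda-gdb or cuda-memcheck, or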
  2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
  3. File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
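
Step 2 above can be scripted. The Go sketch below builds a command that starts an application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 set in its environment. `DebugCommand` is a hypothetical helper and `./my_cuda_app` is a placeholder path, neither comes from NVIDIA's documentation.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// DebugCommand builds (but does not start) a command that runs the given
// application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 added to the current
// environment, so cuda-gdb can be attached after an exception.
func DebugCommand(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Env = append(os.Environ(), "CUDA_DEVICE_WAITS_ON_EXCEPTION=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	// "./my_cuda_app" is a placeholder for the real application binary.
	cmd := DebugCommand("./my_cuda_app")
	fmt.Println("prepared:", cmd.Args)
	// Calling cmd.Run() would start the application; cuda-gdb can then be
	// attached to its process ID from another terminal.
}
```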

XID 31: Fifo: MMU Error

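XID 31 is reported by the MMU, for example when a user process accesses an illegal address. It is usually an application-level bug, but it can also be a driver or hardware bug.

When this error occurs, NVIDIA recommends the following steps:

  1. Run the application in cuda-gdb or cuda-memcheck, or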
  2. Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
  3. File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.

XID 32: PBDMA Error

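XID 32 is reported by the DMA controller, which handles communication between the NVIDIA driver and the GPU over the PCIe bus.

This error is usually caused by PCI quality issues and is generally not caused by the user program.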

XID 43: Reset Channel VERIF Error

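XID 43 is logged when a fault is detected in the user application and the application has to be terminated. The GPU remains in a healthy state.

In most cases, the error is caused by the user process rather than by a driver bug.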

XID 45: OS: Preemptive Channel Removal

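XID 45 is logged when the user process aborts and the kernel driver has to tear down the GPU application running on the GPU. Ctrl-C, a GPU reset, and SIGKILL are all examples of this scenario.

In most cases, the error is caused by the user process rather than by a driver bug.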

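XID 48: DBE (Double Bit Error) ECC Error

XID 48 is logged when the GPU detects an uncorrectable error on the GPU; the error is also reported to the user process. A GPU reset or a node reboot is needed to clear it. The nvidia-smi tool provides a summary of ECC errors.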
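
The common Xids described in this section can be collected into a lookup table for use in monitoring scripts. The Go sketch below is one possible shape: `XidInfo`, `commonXids`, and `Describe` are hypothetical names, and the action strings are loose paraphrases of the text above, not NVIDIA's wording.

```go
package main

import "fmt"

// XidInfo is a hypothetical summary of one Xid from this section.
type XidInfo struct {
	Name       string // heading used in this section
	LikelyUser bool   // true when the text says the user process is usually at fault
	Action     string // loose paraphrase of the suggested next step
}

// commonXids covers only the six Xids discussed in this section.
var commonXids = map[int]XidInfo{
	13: {Name: "GR: SW Notify Error", LikelyUser: true, Action: "debug with cuda-gdb or cuda-memcheck"},
	31: {Name: "Fifo: MMU Error", LikelyUser: true, Action: "debug with cuda-gdb or cuda-memcheck"},
	32: {Name: "PBDMA Error", LikelyUser: false, Action: "check PCI quality"},
	43: {Name: "Reset Channel VERIF Error", LikelyUser: true, Action: "inspect the application; the GPU remains healthy"},
	45: {Name: "OS: Preemptive Channel Removal", LikelyUser: true, Action: "inspect why the application aborted"},
	48: {Name: "DBE (Double Bit Error) ECC Error", LikelyUser: false, Action: "reset the GPU or reboot the node"},
}

// Describe returns a one-line summary for xid, or a fallback for codes
// that this section does not cover.
func Describe(xid int) string {
	info, ok := commonXids[xid]
	if !ok {
		return fmt.Sprintf("Xid %d: not covered in this section", xid)
	}
	return fmt.Sprintf("Xid %d (%s): %s", xid, info.Name, info.Action)
}

func main() {
	for _, xid := range []int{13, 48, 99} {
		fmt.Println(Describe(xid))
	}
}
```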

Xid Error Listing

The following table lists all Xid errors:

XID  Failure  Causes (an X marks each applicable cause)
Cause columns: HW Error, Driver Error, User App Error, System Memory Corruption, Bus Error, Thermal Issue, FB Corruption
1 Invalid or corrupted push buffer stream X X X X
2 Invalid or corrupted push buffer stream X X X X
3 Invalid or corrupted push buffer stream X X X X
4 Invalid or corrupted push buffer stream X X X X
4 GPU semaphore timeout X X X X X
5 Unused
6 Invalid or corrupted push buffer stream X X X X
7 Invalid or corrupted push buffer address X X X
8 GPU stopped processing X X X X
9 Driver error programming GPU X
10 Unused
11 Invalid or corrupted push buffer stream X X X X
12 Driver error handling GPU exception X
13 Graphics Engine Exception X X X X X X
14 Unused
15 Unused
16 Display engine hung X
17 Unused
18 Bus mastering disabled in PCI Config Space X
19 Display Engine error X
20 Invalid or corrupted Mpeg push buffer X X X X
21 Invalid or corrupted Motion Estimation push buffer X X X X
22 Invalid or corrupted Video Processor push buffer X X X X
23 Unused
24 GPU semaphore timeout X X X X X X
25 Invalid or illegal push buffer stream X X X X X
26 Framebuffer timeout X
27 Video processor exception X
28 Video processor exception X
29 Video processor exception X
30 GPU semaphore access error X
31 GPU memory page fault X X
32 Invalid or corrupted push buffer stream X X X X X
33 Internal micro-controller error X
34 Video processor exception X
35 Video processor exception X
36 Video processor exception X
37 Driver firmware error X X X
38 Driver firmware error X
39 Unused
40 Unused
41 Unused
42 Video processor exception X
43 GPU stopped processing X X
44 Graphics Engine fault during context switch X
45 Preemptive cleanup, due to previous errors — Most likely to see when running multiple cuda applications and hitting a DBE X
46 GPU stopped processing X
47 Video processor exception X
48 Double Bit ECC Error X
49 Unused
50 Unused
51 Unused
52 Unused
53 Unused
54 Auxiliary power is not connected to the GPU board
55 Unused
56 Display Engine error X X
57 Error programming video memory interface X X X
58 Unstable video memory interface detected X X
58 EDC error – clarified in printout X
59 Internal micro-controller error(older drivers) X
60 Video processor exception X
61 Internal micro-controller breakpoint/warning(newer drivers)
62 Internal micro-controller halt(newer drivers) X X X
63 ECC page retirement recording event X X X
64 ECC page retirement recording failure X X
65 Video processor exception X X
66 Illegal access by driver X X
67 Illegal access by driver X X
68 Video processor exception X X
69 Graphics Engine class error X X
70 CE3: Unknown Error X X
71 CE4: Unknown Error X X
72 CE5: Unknown Error X X
73 NVENC2 Error X X
74 NVLINK Error X X X
75 Reserved
76 Reserved
77 Reserved
78 vGPU Start Error X
79 GPU has fallen off the bus X X X X X
80 Corrupted data sent to GPU X X X X X
81 VGA Subsystem Error X
82 Reserved
83 Reserved
84 Reserved
85 Reserved
86 Reserved
87 Reserved
88 Reserved
89 Reserved
90 Reserved
91 Reserved
92 High single-bit ECC error rate X X

References