Xid Message
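An Xid message is an error report from the NVIDIA driver, usually written to the operating system's kernel log or event log. It indicates that a general GPU error occurred, most often because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. These messages can point to a hardware problem, an NVIDIA software problem, or a user application problem.
An Xid message can originate from any of the following three sources: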
- Hardware Problem
- NVIDIA Software Problem
- User Application Problem
Xid messages are meant for error diagnosis and help in debugging reported errors. The meaning of each Xid message is consistent across NVIDIA driver versions.
Viewing Xid Errors
On Linux, Xid errors are recorded in /var/log/messages. The example below is for an XID 14 error:
    $ grep "NVRM: Xid" /var/log/messages
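The same search can be done programmatically. The following is a minimal Go sketch, assuming log lines carry the prefix `NVRM: Xid (<PCI address>): <number>`; the names `XidEvent`, `parseXid`, and `countXids` are hypothetical, and the sample lines are illustrative rather than real driver output.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// xidLine matches the "NVRM: Xid (<device>): <number>" part of a log line.
// The exact trailing text differs between driver versions, so it is ignored.
var xidLine = regexp.MustCompile(`NVRM: Xid \(([^)]*)\): (\d+)`)

// XidEvent is a hypothetical record for one parsed log line.
type XidEvent struct {
	Device string // text inside the parentheses, e.g. "PCI:0000:3b:00"
	Xid    int    // the Xid number that follows the device
}

// parseXid extracts the device and Xid number from a single log line.
// ok is false when the line is not an Xid message.
func parseXid(line string) (ev XidEvent, ok bool) {
	m := xidLine.FindStringSubmatch(line)
	if m == nil {
		return XidEvent{}, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return XidEvent{}, false
	}
	return XidEvent{Device: m[1], Xid: n}, true
}

// countXids scans log text line by line and counts each Xid number seen.
func countXids(log string) map[int]int {
	counts := make(map[int]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if ev, ok := parseXid(sc.Text()); ok {
			counts[ev.Xid]++
		}
	}
	return counts
}

func main() {
	// Illustrative lines only; real messages vary by driver version.
	sample := "kernel: NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, Ch 00000008\n" +
		"kernel: usb 1-1: new high-speed USB device\n" +
		"kernel: NVRM: Xid (PCI:0000:3b:00): 13, Graphics Exception\n" +
		"kernel: NVRM: Xid (PCI:0000:5e:00): 31, pid=99, Ch 00000010\n"
	fmt.Println(countXids(sample))
}
```

In practice the input would come from /var/log/messages instead of a string literal; counting per Xid number gives a quick view of which errors dominate on a node.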
The NVML library provided by NVIDIA can also be used to listen for Xid errors on a GPU. The following is example Go code for listening:
    eventSet := nvml.NewEventSet()
Common Xid Errors
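XID 13: GR: SW Notify Error
XID 13 is a general user-application error, typically caused by an out-of-bounds array access, an illegal instruction, or an illegal register access. Only rarely does it turn out to be a hardware or kernel-driver problem; in most cases the user process is at fault.
When this error occurs, NVIDIA recommends the following steps:
- Run the application in cuda-gdb or cuda-memcheck, or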
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
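The second step above can be sketched as a small Go launcher that starts the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and prints its PID so that cuda-gdb can be attached later. This is only a sketch under assumptions: the helper name `withWaitOnException` and the launcher itself are hypothetical and not part of any NVIDIA tooling.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// waitVar is the environment variable named in the recommended steps.
const waitVar = "CUDA_DEVICE_WAITS_ON_EXCEPTION"

// withWaitOnException returns a copy of env in which waitVar is set to 1,
// replacing any value that was already present.
func withWaitOnException(env []string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if !strings.HasPrefix(kv, waitVar+"=") {
			out = append(out, kv)
		}
	}
	return append(out, waitVar+"=1")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: launcher <cuda-app> [args...]")
		return
	}
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Env = withWaitOnException(os.Environ())
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start failed:", err)
		os.Exit(1)
	}
	// Report the PID so a debugger can be attached to the waiting process.
	fmt.Fprintln(os.Stderr, "started pid", cmd.Process.Pid)
	if err := cmd.Wait(); err != nil {
		fmt.Fprintln(os.Stderr, "exited with error:", err)
	}
}
```

Setting the variable in the shell before starting the application achieves the same thing; a launcher is mainly useful when the job is started by a scheduler rather than by hand.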
XID 31: Fifo: MMU Error
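XID 31 is reported by the MMU, for example when a user process accesses an illegal address. It is usually an application-level bug, but it can also be a driver or hardware bug.
When this error occurs, NVIDIA recommends the following steps:
- Run the application in cuda-gdb or cuda-memcheck, or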
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
XID 32: PBDMA Error
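XID 32 is reported by the DMA controller, which manages communication between the NVIDIA driver and the GPU over the PCIe bus.
It is generally caused by quality issues on the PCI bus and is usually not caused by the user application.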
XID 43: Reset Channel VERIF Error
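XID 43 is logged when a fault in the user application is detected and the application has to be terminated. The GPU itself remains in a healthy state.
In most cases this error is caused by the user process rather than by a driver bug.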
XID 45: OS: Preemptive Channel Removal
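XID 45 is logged when a user process aborts and the kernel driver has to terminate the GPU application running on the GPU. Ctrl-C, a GPU reset, and SIGKILL are all examples of this.
In most cases this error is caused by the user process rather than by a driver bug.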
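XID 48: DBE (Double Bit Error) ECC Error
XID 48 is logged when the GPU detects an uncorrectable error; the error is also reported to the user process. A GPU reset or a node reboot is needed to clear it. The nvidia-smi tool provides a summary of ECC errors.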
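The six errors described in this section can be condensed into a lookup table for first-pass triage. The following Go sketch encodes only what the text above states; the type names `Origin` and `Triage`, the `NeedsReset` field, and the `lookup` helper are hypothetical, and the "likely origin" is the most common cause, not the only possible one.

```go
package main

import "fmt"

// Origin is a hypothetical coarse label for the most likely source of an Xid.
type Origin string

const (
	UserApp  Origin = "user application"
	Bus      Origin = "PCI bus"
	Hardware Origin = "GPU hardware"
	Unknown  Origin = "unknown"
)

// Triage holds the name of an Xid, its most likely origin, and whether the
// text above says a GPU reset or node reboot is needed to clear it.
type Triage struct {
	Name       string
	Likely     Origin
	NeedsReset bool
}

// common covers only the Xids described in the "Common Xid Errors" section.
var common = map[int]Triage{
	13: {"GR: SW Notify Error", UserApp, false},
	31: {"Fifo: MMU Error", UserApp, false},
	32: {"PBDMA Error", Bus, false},
	43: {"Reset Channel VERIF Error", UserApp, false},
	45: {"OS: Preemptive Channel Removal", UserApp, false},
	48: {"DBE (Double Bit Error) ECC Error", Hardware, true},
}

// lookup returns the triage entry for xid, or a placeholder when the Xid
// is not one of the common errors listed above.
func lookup(xid int) Triage {
	if t, ok := common[xid]; ok {
		return t
	}
	return Triage{Name: "not in the common list", Likely: Unknown}
}

func main() {
	for _, x := range []int{13, 48, 79} {
		t := lookup(x)
		fmt.Printf("XID %d: %s (likely %s, reset needed: %v)\n",
			x, t.Name, t.Likely, t.NeedsReset)
	}
}
```

A table like this is only a starting point: XID 13 and XID 31 can still turn out to be driver or hardware bugs when the debugging steps come back inconclusive.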
Xid Error Listing
The table below lists all Xid errors:
XID | Failure | Causes (HW Error, Driver Error, User App Error, System Memory Corruption, Bus Error, Thermal Issue, FB Corruption) | ||||||
---|---|---|---|---|---|---|---|---|
1 | Invalid or corrupted push buffer stream | X | X | X | X | |||
2 | Invalid or corrupted push buffer stream | X | X | X | X | |||
3 | Invalid or corrupted push buffer stream | X | X | X | X | |||
4 | Invalid or corrupted push buffer stream | X | X | X | X | |||
4 | GPU semaphore timeout | X | X | X | X | X | ||
5 | Unused | |||||||
6 | Invalid or corrupted push buffer stream | X | X | X | X | |||
7 | Invalid or corrupted push buffer address | X | X | X | ||||
8 | GPU stopped processing | X | X | X | X | |||
9 | Driver error programming GPU | X | ||||||
10 | Unused | |||||||
11 | Invalid or corrupted push buffer stream | X | X | X | X | |||
12 | Driver error handling GPU exception | X | ||||||
13 | Graphics Engine Exception | X | X | X | X | X | X | |
14 | Unused | |||||||
15 | Unused | |||||||
16 | Display engine hung | X | ||||||
17 | Unused | |||||||
18 | Bus mastering disabled in PCI Config Space | X | ||||||
19 | Display Engine error | X | ||||||
20 | Invalid or corrupted Mpeg push buffer | X | X | X | X | |||
21 | Invalid or corrupted Motion Estimation push buffer | X | X | X | X | |||
22 | Invalid or corrupted Video Processor push buffer | X | X | X | X | |||
23 | Unused | |||||||
24 | GPU semaphore timeout | X | X | X | X | X | X | |
25 | Invalid or illegal push buffer stream | X | X | X | X | X | ||
26 | Framebuffer timeout | X | ||||||
27 | Video processor exception | X | ||||||
28 | Video processor exception | X | ||||||
29 | Video processor exception | X | ||||||
30 | GPU semaphore access error | X | ||||||
31 | GPU memory page fault | X | X | |||||
32 | Invalid or corrupted push buffer stream | X | X | X | X | X | ||
33 | Internal micro-controller error | X | ||||||
34 | Video processor exception | X | ||||||
35 | Video processor exception | X | ||||||
36 | Video processor exception | X | ||||||
37 | Driver firmware error | X | X | X | ||||
38 | Driver firmware error | X | ||||||
39 | Unused | |||||||
40 | Unused | |||||||
41 | Unused | |||||||
42 | Video processor exception | X | ||||||
43 | GPU stopped processing | X | X | |||||
44 | Graphics Engine fault during context switch | X | ||||||
45 | Preemptive cleanup due to previous errors; most likely seen when running multiple CUDA applications and hitting a DBE | X | ||||||
46 | GPU stopped processing | X | ||||||
47 | Video processor exception | X | ||||||
48 | Double Bit ECC Error | X | ||||||
49 | Unused | |||||||
50 | Unused | |||||||
51 | Unused | |||||||
52 | Unused | |||||||
53 | Unused | |||||||
54 | Auxiliary power is not connected to the GPU board | |||||||
55 | Unused | |||||||
56 | Display Engine error | X | X | |||||
57 | Error programming video memory interface | X | X | X | ||||
58 | Unstable video memory interface detected | X | X | |||||
58 | EDC error – clarified in printout | X | ||||||
59 | Internal micro-controller error (older drivers) | X | ||||||
60 | Video processor exception | X | ||||||
61 | Internal micro-controller breakpoint/warning (newer drivers) | |||||||
62 | Internal micro-controller halt (newer drivers) | X | X | X | ||||
63 | ECC page retirement recording event | X | X | X | ||||
64 | ECC page retirement recording failure | X | X | |||||
65 | Video processor exception | X | X | |||||
66 | Illegal access by driver | X | X | |||||
67 | Illegal access by driver | X | X | |||||
68 | Video processor exception | X | X | |||||
69 | Graphics Engine class error | X | X | |||||
70 | CE3: Unknown Error | X | X | |||||
71 | CE4: Unknown Error | X | X | |||||
72 | CE5: Unknown Error | X | X | |||||
73 | NVENC2 Error | X | X | |||||
74 | NVLINK Error | X | X | X | ||||
75 | Reserved | |||||||
76 | Reserved | |||||||
77 | Reserved | |||||||
78 | vGPU Start Error | X | ||||||
79 | GPU has fallen off the bus | X | X | X | X | X | ||
80 | Corrupted data sent to GPU | X | X | X | X | X | ||
81 | VGA Subsystem Error | X | ||||||
82 | Reserved | |||||||
83 | Reserved | |||||||
84 | Reserved | |||||||
85 | Reserved | |||||||
86 | Reserved | |||||||
87 | Reserved | |||||||
88 | Reserved | |||||||
89 | Reserved | |||||||
90 | Reserved | |||||||
91 | Reserved | |||||||
92 | High single-bit ECC error rate | X | X |