Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 535.129.03                                     |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - GPUs: 0, 1, 2, 3, 4, 5, 6               |
|                           | Fail - GPU: 7                                  |
| Warning                   | GPU 7 Temperature 91 of GPU 7 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           |  is set correctly. If it is, check the coolin  |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 7 Clocks are being throttled for GPU 7 be  |
|                           | cause of clock throttling starting 370.5 seco  |
|                           | nds into the test. clocks_throttle_reason_sw_  |
|                           | thermal_slowdown: the GPU or its memory have   |
|                           | reached unsafe temperatures. Check DCGM and s  |
|                           | ystem logs for errors. Reset GPU. Restart DCG  |
|                           | M. Rerun diagnostics.                          |
| Info                      | GPU 0 GPU to Host bandwidth: 55.15 GB/s, GPU   |
|                           | 0 Host to GPU bandwidth: 54.95 GB/s, GPU 0     |
|                           | bidirectional bandwidth: 100.71 GB/s, GPU 0 G  |
|                           | PU to Host latency: 1.933 us, GPU 0 Host to    |
|                           | GPU latency: 2.170 us, GPU 0 bidirectional l   |
|                           | atency: 3.331 us                               |
| Info                      | GPU 1 GPU to Host bandwidth: 55.14 GB/s, GPU   |
|                           | 1 Host to GPU bandwidth: 54.93 GB/s, GPU 1     |
|                           | bidirectional bandwidth: 100.70 GB/s, GPU 1 G  |
|                           | PU to Host latency: 1.935 us, GPU 1 Host to    |
|                           | GPU latency: 2.167 us, GPU 1 bidirectional l   |
|                           | atency: 3.380 us                               |
| Info                      | GPU 2 GPU to Host bandwidth: 55.16 GB/s, GPU   |
|                           | 2 Host to GPU bandwidth: 54.96 GB/s, GPU 2     |
|                           | bidirectional bandwidth: 100.72 GB/s, GPU 2 G  |
|                           | PU to Host latency: 1.942 us, GPU 2 Host to    |
|                           | GPU latency: 2.180 us, GPU 2 bidirectional l   |
|                           | atency: 3.298 us                               |
| Info                      | GPU 3 GPU to Host bandwidth: 55.15 GB/s, GPU   |
|                           | 3 Host to GPU bandwidth: 54.93 GB/s, GPU 3     |
|                           | bidirectional bandwidth: 100.67 GB/s, GPU 3 G  |
|                           | PU to Host latency: 1.934 us, GPU 3 Host to    |
|                           | GPU latency: 2.167 us, GPU 3 bidirectional l   |
|                           | atency: 3.268 us                               |
| Info                      | GPU 4 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 4 Host to GPU bandwidth: 54.74 GB/s, GPU 4     |
|                           | bidirectional bandwidth: 100.49 GB/s, GPU 4 G  |
|                           | PU to Host latency: 1.877 us, GPU 4 Host to    |
|                           | GPU latency: 2.166 us, GPU 4 bidirectional l   |
|                           | atency: 2.941 us                               |
| Info                      | GPU 5 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 5 Host to GPU bandwidth: 54.72 GB/s, GPU 5     |
|                           | bidirectional bandwidth: 100.48 GB/s, GPU 5 G  |
|                           | PU to Host latency: 1.884 us, GPU 5 Host to    |
|                           | GPU latency: 2.173 us, GPU 5 bidirectional l   |
|                           | atency: 2.937 us                               |
| Info                      | GPU 6 GPU to Host bandwidth: 55.11 GB/s, GPU   |
|                           | 6 Host to GPU bandwidth: 54.77 GB/s, GPU 6     |
|                           | bidirectional bandwidth: 100.53 GB/s, GPU 6 G  |
|                           | PU to Host latency: 1.882 us, GPU 6 Host to    |
|                           | GPU latency: 2.197 us, GPU 6 bidirectional l   |
|                           | atency: 2.926 us                               |
| Info                      | GPU 7 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 7 Host to GPU bandwidth: 54.75 GB/s, GPU 7     |
|                           | bidirectional bandwidth: 100.51 GB/s, GPU 7 G  |
|                           | PU to Host latency: 1.880 us, GPU 7 Host to    |
|                           | GPU latency: 2.171 us, GPU 7 bidirectional l   |
|                           | atency: 3.059 us                               |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
| Pulse Test                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - GPUs: 0, 2, 3, 4, 5, 6                  |
|                           | Fail - GPUs: 1, 7                              |
| Warning                   | GPU 1 Clocks are being throttled for GPU 1 be  |
|                           | cause of clock throttling starting 9.9 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           |  Rerun diagnostics.                            |
| Warning                   | GPU 7 Max power of 513.9 did not reach desire  |
|                           | d power minimum target_power_min_ratio of 525  |
|                           | .0 for GPU 7 Verify that the clock speeds and  |
|                           |  GPU utilization are high.                     |
| Warning                   | GPU 7 Temperature 89 of GPU 7 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           |  is set correctly. If it is, check the coolin  |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 7 Clocks are being throttled for GPU 7 be  |
|                           | cause of clock throttling starting 5.0 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           |  Rerun diagnostics.                            |
| Info                      | GPU 0 GPU 0 max power: 656.6 W average power   |
|                           | usage: 644.7 W                                 |
| Info                      | GPU 1 GPU 1 max power: 578.7 W average power   |
|                           | usage: 466.0 W                                 |
| Info                      | GPU 2 GPU 2 max power: 661.3 W average power   |
|                           | usage: 646.9 W                                 |
| Info                      | GPU 3 GPU 3 max power: 646.5 W average power   |
|                           | usage: 640.3 W                                 |
| Info                      | GPU 4 GPU 4 max power: 659.9 W average power   |
|                           | usage: 645.6 W                                 |
| Info                      | GPU 5 GPU 5 max power: 668.8 W average power   |
|                           | usage: 655.4 W                                 |
| Info                      | GPU 6 GPU 6 max power: 664.0 W average power   |
|                           | usage: 652.0 W                                 |
| Memory Bandwidth          | Pass - All                                     |
| Memtest                   | Pass - GPUs: 0, 1, 2, 3, 4, 5, 6               |
|                           | Fail - GPU: 7                                  |
| Warning                   | GPU 7 Temperature 96 of GPU 7 exceeded user-s  |
|                           | pecified maximum allowed temperature 95 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           |  is set correctly. If it is, check the coolin  |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 7 Clocks are being throttled for GPU 7 be  |
|                           | cause of clock throttling starting 15.0 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+