Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.7                                          |
| Driver Version Detected   | 535.129.03                                     |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
| Pulse Test                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - GPUs: 0, 1, 3, 4, 6, 7                  |
|                           | Fail - GPUs: 2, 5                              |
| Warning                   | GPU 2 Max power of 462.1 did not reach desire  |
|                           | d power minimum target_power_min_ratio of 525  |
|                           | .0 for GPU 2 Verify that the clock speeds and  |
|                           |  GPU utilization are high.                     |
| Warning                   | GPU 2 Clocks are being throttled for GPU 2 be  |
|                           | cause of clock throttling starting 5.0 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           |  Rerun diagnostics.                            |
| Warning                   | GPU 5 Max power of 514.5 did not reach desire  |
|                           | d power minimum target_power_min_ratio of 525  |
|                           | .0 for GPU 5 Verify that the clock speeds and  |
|                           |  GPU utilization are high.                     |
| Warning                   | GPU 5 Clocks are being throttled for GPU 5 be  |
|                           | cause of clock throttling starting 4.8 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           |  Rerun diagnostics.                            |
| Info                      | GPU 0 GPU 0 max power: 654.9 W average power   |
|                           | usage: 642.7 W                                 |
| Info                      | GPU 1 GPU 1 max power: 674.2 W average power   |
|                           | usage: 652.6 W                                 |
| Info                      | GPU 3 GPU 3 max power: 663.8 W average power   |
|                           | usage: 655.1 W                                 |
| Info                      | GPU 4 GPU 4 max power: 646.1 W average power   |
|                           | usage: 632.2 W                                 |
| Info                      | GPU 6 GPU 6 max power: 647.6 W average power   |
|                           | usage: 637.8 W                                 |
| Info                      | GPU 7 GPU 7 max power: 663.6 W average power   |
|                           | usage: 650.7 W                                 |
| Memory Bandwidth          | Pass - All                                     |
| Memtest                   | Pass - All                                     |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+