# dcgmi diag -r 4
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.7                                          |
| Driver Version Detected   | 535.129.03                                     |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - GPUs: 1, 2, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 3                              |
| Warning                   | GPU 0 Thermal violations totaling 668.6 secon  |
|                           | ds started at 344.6 seconds into the test for  |
|                           | GPU 0 Verify that the cooling on this machin   |
|                           | e is functional, including external, thermal   |
|                           | material interface, fans, and any other compo  |
|                           | nents.                                         |
| Warning                   | GPU 0 Temperature 89 of GPU 0 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           | is set correctly. If it is, check the coolin   |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 344.6 seco  |
|                           | nds into the test. clocks_throttle_reason_hw_  |
|                           | slowdown: either the temperature is too high   |
|                           | or there is a power supply problem (the power  |
|                           | brake assertion has been tripped). Check DCG   |
|                           | M and system logs for errors. Reset GPU. Rest  |
|                           | art DCGM. Rerun diagnostics.                   |
| Warning                   | GPU 3 Temperature 88 of GPU 3 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           | is set correctly. If it is, check the coolin   |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 3 Clocks are being throttled for GPU 3 be  |
|                           | cause of clock throttling starting 345.2 seco  |
|                           | nds into the test. clocks_throttle_reason_sw_  |
|                           | thermal_slowdown: the GPU or its memory have   |
|                           | reached unsafe temperatures. Check DCGM and s  |
|                           | ystem logs for errors. Reset GPU. Restart DCG  |
|                           | M. Rerun diagnostics.                          |
| Info                      | GPU 0 GPU to Host bandwidth: 55.17 GB/s, GPU   |
|                           | 0 Host to GPU bandwidth: 55.09 GB/s, GPU 0     |
|                           | bidirectional bandwidth: 100.90 GB/s, GPU 0 G  |
|                           | PU to Host latency: 1.906 us, GPU 0 Host to    |
|                           | GPU latency: 2.196 us, GPU 0 bidirectional l   |
|                           | atency: 3.061 us                               |
| Info                      | GPU 1 GPU to Host bandwidth: 55.15 GB/s, GPU   |
|                           | 1 Host to GPU bandwidth: 55.15 GB/s, GPU 1     |
|                           | bidirectional bandwidth: 100.96 GB/s, GPU 1 G  |
|                           | PU to Host latency: 1.920 us, GPU 1 Host to    |
|                           | GPU latency: 2.194 us, GPU 1 bidirectional l   |
|                           | atency: 3.274 us                               |
| Info                      | GPU 2 GPU to Host bandwidth: 55.17 GB/s, GPU   |
|                           | 2 Host to GPU bandwidth: 55.11 GB/s, GPU 2     |
|                           | bidirectional bandwidth: 100.88 GB/s, GPU 2 G  |
|                           | PU to Host latency: 1.886 us, GPU 2 Host to    |
|                           | GPU latency: 2.176 us, GPU 2 bidirectional l   |
|                           | atency: 3.031 us                               |
| Info                      | GPU 3 GPU to Host bandwidth: 55.14 GB/s, GPU   |
|                           | 3 Host to GPU bandwidth: 55.15 GB/s, GPU 3     |
|                           | bidirectional bandwidth: 100.94 GB/s, GPU 3 G  |
|                           | PU to Host latency: 1.917 us, GPU 3 Host to    |
|                           | GPU latency: 2.183 us, GPU 3 bidirectional l   |
|                           | atency: 3.017 us                               |
| Info                      | GPU 4 GPU to Host bandwidth: 55.18 GB/s, GPU   |
|                           | 4 Host to GPU bandwidth: 55.14 GB/s, GPU 4     |
|                           | bidirectional bandwidth: 100.93 GB/s, GPU 4 G  |
|                           | PU to Host latency: 1.944 us, GPU 4 Host to    |
|                           | GPU latency: 2.182 us, GPU 4 bidirectional l   |
|                           | atency: 3.441 us                               |
| Info                      | GPU 5 GPU to Host bandwidth: 55.17 GB/s, GPU   |
|                           | 5 Host to GPU bandwidth: 55.15 GB/s, GPU 5     |
|                           | bidirectional bandwidth: 100.96 GB/s, GPU 5 G  |
|                           | PU to Host latency: 1.944 us, GPU 5 Host to    |
|                           | GPU latency: 2.172 us, GPU 5 bidirectional l   |
|                           | atency: 3.388 us                               |
| Info                      | GPU 6 GPU to Host bandwidth: 55.17 GB/s, GPU   |
|                           | 6 Host to GPU bandwidth: 55.16 GB/s, GPU 6     |
|                           | bidirectional bandwidth: 100.97 GB/s, GPU 6 G  |
|                           | PU to Host latency: 1.947 us, GPU 6 Host to    |
|                           | GPU latency: 2.211 us, GPU 6 bidirectional l   |
|                           | atency: 3.438 us                               |
| Info                      | GPU 7 GPU to Host bandwidth: 55.15 GB/s, GPU   |
|                           | 7 Host to GPU bandwidth: 55.15 GB/s, GPU 7     |
|                           | bidirectional bandwidth: 100.98 GB/s, GPU 7 G  |
|                           | PU to Host latency: 1.921 us, GPU 7 Host to    |
|                           | GPU latency: 2.160 us, GPU 7 bidirectional l   |
|                           | atency: 3.269 us                               |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - GPUs: 1, 2, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 3                              |
| Warning                   | GPU 0 Thermal violations totaling 2.1 seconds  |
|                           | started at 9.8 seconds into the test for GPU   |
|                           | 0 Verify that the cooling on this machine is   |
|                           | functional, including external, thermal mate   |
|                           | rial interface, fans, and any other component  |
|                           | s.                                             |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 9.8 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           | Rerun diagnostics.                             |
| Warning                   | GPU 3 Clocks are being throttled for GPU 3 be  |
|                           | cause of clock throttling starting 20.0 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Info                      | GPU 0 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 0 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 0 GPU 0 calculated at approximately 816  |
|                           | .31 gigaflops during this test                 |
| Info                      | GPU 1 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 1 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 1 GPU 1 calculated at approximately 235  |
|                           | 1.46 gigaflops during this test                |
| Info                      | GPU 2 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 2 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 2 GPU 2 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 3 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 3 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 3 GPU 3 calculated at approximately 938  |
|                           | .14 gigaflops during this test                 |
| Info                      | GPU 4 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 4 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 4 GPU 4 calculated at approximately 235  |
|                           | 1.46 gigaflops during this test                |
| Info                      | GPU 5 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 5 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 5 GPU 5 calculated at approximately 235  |
|                           | 1.46 gigaflops during this test                |
| Info                      | GPU 6 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 6 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 6 GPU 6 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 7 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 7 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 7 GPU 7 calculated at approximately 235  |
|                           | 1.46 gigaflops during this test                |
| Pulse Test                | Pass - GPUs: 1, 2, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 3                              |
| Warning                   | GPU 0 Thermal violations totaling 6.3 seconds  |
|                           | started at 1500.7 seconds into the test for    |
|                           | GPU 0 Verify that the cooling on this machine  |
|                           | is functional, including external, thermal m   |
|                           | aterial interface, fans, and any other compon  |
|                           | ents.                                          |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 1500.7 sec  |
|                           | onds into the test. clocks_throttle_reason_sw  |
|                           | _thermal_slowdown: the GPU or its memory have  |
|                           | reached unsafe temperatures. Check DCGM and    |
|                           | system logs for errors. Reset GPU. Restart DC  |
|                           | GM. Rerun diagnostics.                         |
| Warning                   | GPU 3 Clocks are being throttled for GPU 3 be  |
|                           | cause of clock throttling starting 1500.8 sec  |
|                           | onds into the test. clocks_throttle_reason_sw  |
|                           | _thermal_slowdown: the GPU or its memory have  |
|                           | reached unsafe temperatures. Check DCGM and    |
|                           | system logs for errors. Reset GPU. Restart DC  |
|                           | GM. Rerun diagnostics.                         |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 There were thermal violations totaling   |
|                           | 57.2 seconds for GPU 0 Verify that the coolin  |
|                           | g on this machine is functional, including ex  |
|                           | ternal, thermal material interface, fans, and  |
|                           | any other components.                          |
| Info                      | GPU 0 GPU 0 relative stress level 5527         |
| Info                      | GPU 1 GPU 1 relative stress level 5475         |
| Info                      | GPU 2 GPU 2 relative stress level 5570         |
| Info                      | GPU 3 GPU 3 relative stress level 5655         |
| Info                      | GPU 4 GPU 4 relative stress level 4260         |
| Info                      | GPU 5 GPU 5 relative stress level 4260         |
| Info                      | GPU 6 GPU 6 relative stress level 4260         |
| Info                      | GPU 7 GPU 7 relative stress level 4260         |
| Targeted Power            | Pass - GPUs: 1, 2, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 3                              |
| Warning                   | GPU 0 Max power of 450.6 did not reach desire  |
|                           | d power minimum target_power_min_ratio of 525  |
|                           | .0 for GPU 0 Verify that the clock speeds and  |
|                           | GPU utilization are high.                      |
| Warning                   | GPU 0 Thermal violations totaling 201.5 secon  |
|                           | ds started at 4.9 seconds into the test for G  |
|                           | PU 0 Verify that the cooling on this machine   |
|                           | is functional, including external, thermal ma  |
|                           | terial interface, fans, and any other compone  |
|                           | nts.                                           |
| Warning                   | GPU 0 Temperature 89 of GPU 0 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           | is set correctly. If it is, check the coolin   |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 4.9 second  |
|                           | s into the test. clocks_throttle_reason_sw_th  |
|                           | ermal_slowdown: the GPU or its memory have re  |
|                           | ached unsafe temperatures. Check DCGM and sys  |
|                           | tem logs for errors. Reset GPU. Restart DCGM.  |
|                           | Rerun diagnostics.                             |
| Warning                   | GPU 3 Max power of 519.3 did not reach desire  |
|                           | d power minimum target_power_min_ratio of 525  |
|                           | .0 for GPU 3 Verify that the clock speeds and  |
|                           | GPU utilization are high.                      |
| Warning                   | GPU 3 Clocks are being throttled for GPU 3 be  |
|                           | cause of clock throttling starting 5.0 second  |
|                           | s into the test. clocks_throttle_reason_hw_sl  |
|                           | owdown: either the temperature is too high or  |
|                           | there is a power supply problem (the power b   |
|                           | rake assertion has been tripped). Check DCGM   |
|                           | and system logs for errors. Reset GPU. Restar  |
|                           | t DCGM. Rerun diagnostics.                     |
| Info                      | GPU 1 GPU 1 max power: 668.5 W average power   |
|                           | usage: 657.3 W                                 |
| Info                      | GPU 2 GPU 2 max power: 698.6 W average power   |
|                           | usage: 684.3 W                                 |
| Info                      | GPU 4 GPU 4 max power: 656.2 W average power   |
|                           | usage: 647.4 W                                 |
| Info                      | GPU 5 GPU 5 max power: 688.5 W average power   |
|                           | usage: 672.3 W                                 |
| Info                      | GPU 6 GPU 6 max power: 671.8 W average power   |
|                           | usage: 657.5 W                                 |
| Info                      | GPU 7 GPU 7 max power: 656.8 W average power   |
|                           | usage: 646.5 W                                 |
| Memory Bandwidth          | Pass - All                                     |
| Memtest                   | Pass - GPUs: 1, 2, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 3                              |
| Warning                   | GPU 0 Thermal violations totaling 6.3 seconds  |
|                           | started at 94.9 seconds into the test for GP   |
|                           | U 0 Verify that the cooling on this machine i  |
|                           | s functional, including external, thermal mat  |
|                           | erial interface, fans, and any other componen  |
|                           | ts.                                            |
| Warning                   | GPU 0 Temperature 96 of GPU 0 exceeded user-s  |
|                           | pecified maximum allowed temperature 95 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           | is set correctly. If it is, check the coolin   |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 94.9 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Warning                   | GPU 3 Temperature 88 of GPU 3 exceeded user-s  |
|                           | pecified maximum allowed temperature 87 Verif  |
|                           | y that the user-specified temperature maximum  |
|                           | is set correctly. If it is, check the coolin   |
|                           | g for this GPU and node: Verify that the cool  |
|                           | ing on this machine is functional, including   |
|                           | external, thermal material interface, fans, a  |
|                           | nd any other components.                       |
| Warning                   | GPU 3 Clocks are being throttled for GPU 3 be  |
|                           | cause of clock throttling starting 95.1 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+