Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 535.129.03                                     |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 There were thermal violations totaling   |
|                           | 2016084.3 seconds for GPU 0 Verify that the c  |
|                           | ooling on this machine is functional, includi  |
|                           | ng external, thermal material interface, fans  |
|                           | , and any other components.                    |
| Info                      | GPU 0 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 0 Host to GPU bandwidth: 54.75 GB/s, GPU 0     |
|                           | bidirectional bandwidth: 100.53 GB/s, GPU 0 G  |
|                           | PU to Host latency: 1.905 us, GPU 0 Host to    |
|                           | GPU latency: 2.192 us, GPU 0 bidirectional l   |
|                           | atency: 3.126 us                               |
| Info                      | GPU 1 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 1 Host to GPU bandwidth: 54.71 GB/s, GPU 1     |
|                           | bidirectional bandwidth: 100.49 GB/s, GPU 1 G  |
|                           | PU to Host latency: 1.880 us, GPU 1 Host to    |
|                           | GPU latency: 2.164 us, GPU 1 bidirectional l   |
|                           | atency: 2.977 us                               |
| Info                      | GPU 2 GPU to Host bandwidth: 55.10 GB/s, GPU   |
|                           | 2 Host to GPU bandwidth: 54.71 GB/s, GPU 2     |
|                           | bidirectional bandwidth: 100.48 GB/s, GPU 2 G  |
|                           | PU to Host latency: 1.883 us, GPU 2 Host to    |
|                           | GPU latency: 2.183 us, GPU 2 bidirectional l   |
|                           | atency: 3.144 us                               |
| Info                      | GPU 3 GPU to Host bandwidth: 55.08 GB/s, GPU   |
|                           | 3 Host to GPU bandwidth: 54.76 GB/s, GPU 3     |
|                           | bidirectional bandwidth: 100.52 GB/s, GPU 3 G  |
|                           | PU to Host latency: 1.895 us, GPU 3 Host to    |
|                           | GPU latency: 2.187 us, GPU 3 bidirectional l   |
|                           | atency: 3.128 us                               |
| Info                      | GPU 4 GPU to Host bandwidth: 55.08 GB/s, GPU   |
|                           | 4 Host to GPU bandwidth: 54.74 GB/s, GPU 4     |
|                           | bidirectional bandwidth: 100.47 GB/s, GPU 4 G  |
|                           | PU to Host latency: 1.935 us, GPU 4 Host to    |
|                           | GPU latency: 2.167 us, GPU 4 bidirectional l   |
|                           | atency: 3.347 us                               |
| Info                      | GPU 5 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 5 Host to GPU bandwidth: 54.75 GB/s, GPU 5     |
|                           | bidirectional bandwidth: 100.49 GB/s, GPU 5 G  |
|                           | PU to Host latency: 1.944 us, GPU 5 Host to    |
|                           | GPU latency: 2.178 us, GPU 5 bidirectional l   |
|                           | atency: 3.329 us                               |
| Info                      | GPU 6 GPU to Host bandwidth: 55.09 GB/s, GPU   |
|                           | 6 Host to GPU bandwidth: 54.75 GB/s, GPU 6     |
|                           | bidirectional bandwidth: 100.53 GB/s, GPU 6 G  |
|                           | PU to Host latency: 1.934 us, GPU 6 Host to    |
|                           | GPU latency: 2.169 us, GPU 6 bidirectional l   |
|                           | atency: 3.391 us                               |
| Info                      | GPU 7 GPU to Host bandwidth: 55.11 GB/s, GPU   |
|                           | 7 Host to GPU bandwidth: 54.73 GB/s, GPU 7     |
|                           | bidirectional bandwidth: 100.50 GB/s, GPU 7 G  |
|                           | PU to Host latency: 1.926 us, GPU 7 Host to    |
|                           | GPU latency: 2.191 us, GPU 7 bidirectional l   |
|                           | atency: 3.197 us                               |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Thermal violations totaling 128005.4 se  |
|                           | conds started at 9.8 seconds into the test fo  |
|                           | r GPU 0 Verify that the cooling on this machi  |
|                           | ne is functional, including external, thermal  |
|                           | material interface, fans, and any other comp   |
|                           | onents.                                        |
| Info                      | GPU 0 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 1 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 2 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 3 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 4 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 5 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 6 Allocated 83537716779 bytes (98.3%)      |
| Info                      | GPU 7 Allocated 83537716779 bytes (98.3%)      |
| Diagnostic                | Pass - GPUs: 2, 3, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 1                              |
| Warning                   | GPU 0 Thermal violations totaling 688028.8 se  |
|                           | conds started at 14.8 seconds into the test f  |
|                           | or GPU 0 Verify that the cooling on this mach  |
|                           | ine is functional, including external, therma  |
|                           | l material interface, fans, and any other com  |
|                           | ponents.                                       |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 29.8 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Warning                   | GPU 1 Clocks are being throttled for GPU 1 be  |
|                           | cause of clock throttling starting 34.9 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Info                      | GPU 0 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 0 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 0 GPU 0 calculated at approximately 229  |
|                           | 0.54 gigaflops during this test                |
| Info                      | GPU 1 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 1 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 1 GPU 1 calculated at approximately 221  |
|                           | 7.43 gigaflops during this test                |
| Info                      | GPU 2 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 2 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 2 GPU 2 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 3 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 3 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 3 GPU 3 calculated at approximately 235  |
|                           | 1.46 gigaflops during this test                |
| Info                      | GPU 4 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 4 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 4 GPU 4 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 5 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 5 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 5 GPU 5 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 6 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 6 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 6 GPU 6 calculated at approximately 230  |
|                           | 2.72 gigaflops during this test                |
| Info                      | GPU 7 Allocated space for 137 output matricie  |
|                           | s from 75945266380 bytes available., GPU 7 Ru  |
|                           | nning with precisions: FP64 1, FP32 1, FP16 1  |
|                           | , GPU 7 GPU 7 calculated at approximately 238  |
|                           | 8.01 gigaflops during this test                |
| Pulse Test                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 There were thermal violations totaling   |
|                           | 2016084.3 seconds for GPU 0 Verify that the c  |
|                           | ooling on this machine is functional, includi  |
|                           | ng external, thermal material interface, fans  |
|                           | , and any other components.                    |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 There were thermal violations totaling   |
|                           | 112004.7 seconds for GPU 0 Verify that the co  |
|                           | oling on this machine is functional, includin  |
|                           | g external, thermal material interface, fans,  |
|                           | and any other components.                      |
| Info                      | GPU 0 GPU 0 relative stress level 4081         |
| Info                      | GPU 1 GPU 1 relative stress level 4132         |
| Info                      | GPU 2 GPU 2 relative stress level 4095         |
| Info                      | GPU 3 GPU 3 relative stress level 4131         |
| Info                      | GPU 4 GPU 4 relative stress level 5616         |
| Info                      | GPU 5 GPU 5 relative stress level 5614         |
| Info                      | GPU 6 GPU 6 relative stress level 5531         |
| Info                      | GPU 7 GPU 7 relative stress level 5534         |
| Targeted Power            | Pass - GPUs: 2, 3, 4, 5, 6, 7                  |
|                           | Fail - GPUs: 0, 1                              |
| Warning                   | GPU 0 Thermal violations totaling 384016.1 se  |
|                           | conds started at 9.8 seconds into the test fo  |
|                           | r GPU 0 Verify that the cooling on this machi  |
|                           | ne is functional, including external, thermal  |
|                           | material interface, fans, and any other comp   |
|                           | onents.                                        |
| Warning                   | GPU 0 Clocks are being throttled for GPU 0 be  |
|                           | cause of clock throttling starting 24.9 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Warning                   | GPU 1 Clocks are being throttled for GPU 1 be  |
|                           | cause of clock throttling starting 19.9 secon  |
|                           | ds into the test. clocks_throttle_reason_sw_t  |
|                           | hermal_slowdown: the GPU or its memory have r  |
|                           | eached unsafe temperatures. Check DCGM and sy  |
|                           | stem logs for errors. Reset GPU. Restart DCGM  |
|                           | . Rerun diagnostics.                           |
| Info                      | GPU 0 GPU 0 max power: 698.9 W average power   |
|                           | usage: 655.8 W                                 |
| Info                      | GPU 1 GPU 1 max power: 699.0 W average power   |
|                           | usage: 606.7 W                                 |
| Info                      | GPU 2 GPU 2 max power: 664.3 W average power   |
|                           | usage: 651.4 W                                 |
| Info                      | GPU 3 GPU 3 max power: 656.2 W average power   |
|                           | usage: 647.8 W                                 |
| Info                      | GPU 4 GPU 4 max power: 678.3 W average power   |
|                           | usage: 664.5 W                                 |
| Info                      | GPU 5 GPU 5 max power: 670.4 W average power   |
|                           | usage: 657.6 W                                 |
| Info                      | GPU 6 GPU 6 max power: 663.2 W average power   |
|                           | usage: 651.8 W                                 |
| Info                      | GPU 7 GPU 7 max power: 643.2 W average power   |
|                           | usage: 632.1 W                                 |
| Memory Bandwidth          | Pass - All                                     |
| Memtest                   | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 There were thermal violations totaling   |
|                           | 1984083.0 seconds for GPU 0 Verify that the c  |
|                           | ooling on this machine is functional, includi  |
|                           | ng external, thermal material interface, fans  |
|                           | , and any other components.                    |
| EUD Test                  | Skip - All                                     |
+---------------------------+------------------------------------------------+