==================================================
Host SN : S6S0MD0000ZT
Command Executed : dcgmi diag -r 4
Start Time : 2026-03-19 10:02:54
==================================================
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 590.48.01                                      |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| diagnostic                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| pulse_test                | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Fail                                     |
| Warning: GPU2             | Temperature 90 of GPU 2 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           | the user-specified temperature maximum is se   |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           | this machine is functional, including extern   |
|                           | al, thermal material interface, fans, and any  |
|                           | other components.                              |
| Warning: GPU2             | Clocks event for GPU 2 because of clocks even  |
|                           | t starting 369.9 seconds into the test. clock  |
|                           | s_event_reason_sw_thermal_slowdown: the GPU o  |
|                           | r its memory have reached unsafe temperatures  |
|                           | . Check DCGM and system logs for errors. Rese  |
|                           | t GPU. Restart DCGM. Rerun diagnostics.        |
| Warning: GPU2             | Thermal violations totaling 13761.1 seconds s  |
|                           | tarted at 369.9 seconds into the test for GPU  |
|                           | 2 Verify that the cooling on this machine is   |
|                           | functional, including external, thermal mate   |
|                           | rial interface, fans, and any other component  |
|                           | s.                                             |
|                           | GPU3: Pass                                     |
|                           | GPU4: Fail                                     |
| Warning: GPU4             | There were thermal violations totaling 2107.9  |
|                           | seconds for GPU 4 Verify that the cooling on   |
|                           | this machine is functional, including extern   |
|                           | al, thermal material interface, fans, and any  |
|                           | other components.                              |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Fail                                     |
| Warning: GPU7             | There were thermal violations totaling 156.8   |
|                           | seconds for GPU 7 Verify that the cooling on   |
|                           | this machine is functional, including externa  |
|                           | l, thermal material interface, fans, and any   |
|                           | other components.                              |
+-----  Stress  ------------+------------------------------------------------+
| memtest                   | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Fail                                     |
| Warning: GPU2             | Temperature 96 of HBM Memory on GPU 2 exceede  |
|                           | d user-specified maximum allowed temperature   |
|                           | 95 Verify that the user-specified temperature  |
|                           | maximum is set correctly. If it is, check th   |
|                           | e cooling for this GPU and node: Verify that   |
|                           | the cooling on this machine is functional, in  |
|                           | cluding external, thermal material interface,  |
|                           | fans, and any other components.                |
| Warning: GPU2             | Clocks event for GPU 2 because of clocks even  |
|                           | t starting 75.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           | its memory have reached unsafe temperatures.   |
|                           | Check DCGM and system logs for errors. Reset   |
|                           | GPU. Restart DCGM. Rerun diagnostics.          |
| Warning: GPU2             | Thermal violations totaling 391.7 seconds sta  |
|                           | rted at 75.0 seconds into the test for GPU 2   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU3: Pass                                     |
|                           | GPU4: Fail                                     |
| Warning: GPU4             | There were thermal violations totaling 381.3   |
|                           | seconds for GPU 4 Verify that the cooling on   |
|                           | this machine is functional, including externa  |
|                           | l, thermal material interface, fans, and any   |
|                           | other components.                              |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Fail                                     |
| Warning: GPU7             | There were thermal violations totaling 10.7 s  |
|                           | econds for GPU 7 Verify that the cooling on t  |
|                           | his machine is functional, including external  |
|                           | , thermal material interface, fans, and any o  |
|                           | ther components.                               |
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| targeted_power            | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Fail                                     |
| Warning: GPU2             | Max power of 352.5 did not reach desired powe  |
|                           | r minimum target_power_min_ratio of 525.0 for  |
|                           | GPU 2 Verify that the clock speeds and GPU u   |
|                           | tilization are high.                           |
| Warning: GPU2             | Temperature 88 of GPU 2 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           | the user-specified temperature maximum is se   |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           | this machine is functional, including extern   |
|                           | al, thermal material interface, fans, and any  |
|                           | other components.                              |
| Warning: GPU2             | Clocks event for GPU 2 because of clocks even  |
|                           | t starting 5.0 seconds into the test. clocks_  |
|                           | event_reason_sw_thermal_slowdown: the GPU or   |
|                           | its memory have reached unsafe temperatures.   |
|                           | Check DCGM and system logs for errors. Reset   |
|                           | GPU. Restart DCGM. Rerun diagnostics.          |
| Warning: GPU2             | Thermal violations totaling 2765.7 seconds st  |
|                           | arted at 5.0 seconds into the test for GPU 2   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU3: Pass                                     |
|                           | GPU4: Fail                                     |
| Warning: GPU4             | Clocks event for GPU 4 because of clocks even  |
|                           | t starting 25.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           | its memory have reached unsafe temperatures.   |
|                           | Check DCGM and system logs for errors. Reset   |
|                           | GPU. Restart DCGM. Rerun diagnostics.          |
| Warning: GPU4             | Thermal violations totaling 401.5 seconds sta  |
|                           | rted at 10.0 seconds into the test for GPU 4   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Fail                                     |
| Warning: GPU7             | Thermal violations totaling 29.9 seconds star  |
|                           | ted at 10.0 seconds into the test for GPU 7 V  |
|                           | erify that the cooling on this machine is fun  |
|                           | ctional, including external, thermal material  |
|                           | interface, fans, and any other components.     |
+---------------------------+------------------------------------------------+
==================================================
End Time : 2026-03-19 11:24:01
==================================================