==================================================
Host SN : S6S0MD0000X9
Command Executed : dcgmi diag -r 4
Start Time : 2026-04-08 13:15:42
==================================================
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 4.5.2                                          |
| Driver Version Detected   | 590.48.01                                      |
| GPU Device IDs Detected   | 2330, 2330, 2330, 2330, 2330, 2330, 2330, 2330 |
|-----  Deployment  --------+------------------------------------------------|
| software                  | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Hardware  ----------+------------------------------------------------+
| memory                    | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| diagnostic                | Fail                                           |
|                           | GPU0: Fail                                     |
| Warning: GPU0             | Temperature 90 of GPU 0 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           |  the user-specified temperature maximum is se  |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           |  this machine is functional, including extern  |
|                           | al, thermal material interface, fans, and any  |
|                           |  other components.                             |
| Warning: GPU0             | Clocks event for GPU 0 because of clocks even  |
|                           | t starting 10.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           |  its memory have reached unsafe temperatures.  |
|                           |  Check DCGM and system logs for errors. Reset  |
|                           |  GPU. Restart DCGM. Rerun diagnostics.         |
| Warning: GPU0             | Thermal violations totaling 8.5 seconds start  |
|                           | ed at 5.0 seconds into the test for GPU 0 Ver  |
|                           | ify that the cooling on this machine is funct  |
|                           | ional, including external, thermal material i  |
|                           | nterface, fans, and any other components.      |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Clocks event for GPU 3 because of clocks even  |
|                           | t starting 15.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           |  its memory have reached unsafe temperatures.  |
|                           |  Check DCGM and system logs for errors. Reset  |
|                           |  GPU. Restart DCGM. Rerun diagnostics.         |
| Warning: GPU3             | Thermal violations totaling 122.6 seconds sta  |
|                           | rted at 15.0 seconds into the test for GPU 3   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| nvbandwidth               | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| pulse_test                | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Temperature 92 of GPU 3 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           |  the user-specified temperature maximum is se  |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           |  this machine is functional, including extern  |
|                           | al, thermal material interface, fans, and any  |
|                           |  other components.                             |
| Warning: GPU3             | Clocks event for GPU 3 because of clocks even  |
|                           | t starting 1380.6 seconds into the test. cloc  |
|                           | ks_event_reason_sw_thermal_slowdown: the GPU   |
|                           | or its memory have reached unsafe temperature  |
|                           | s. Check DCGM and system logs for errors. Res  |
|                           | et GPU. Restart DCGM. Rerun diagnostics.       |
| Warning: GPU3             | Thermal violations totaling 9768.5 seconds st  |
|                           | arted at 1380.6 seconds into the test for GPU  |
|                           |  3 Verify that the cooling on this machine is  |
|                           |  functional, including external, thermal mate  |
|                           | rial interface, fans, and any other component  |
|                           | s.                                             |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Integration  -------+------------------------------------------------+
| pcie                      | Fail                                           |
|                           | GPU0: Fail                                     |
| Warning: GPU0             | Temperature 88 of GPU 0 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           |  the user-specified temperature maximum is se  |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           |  this machine is functional, including extern  |
|                           | al, thermal material interface, fans, and any  |
|                           |  other components.                             |
| Warning: GPU0             | Clocks event for GPU 0 because of clocks even  |
|                           | t starting 349.8 seconds into the test. clock  |
|                           | s_event_reason_sw_thermal_slowdown: the GPU o  |
|                           | r its memory have reached unsafe temperatures  |
|                           | . Check DCGM and system logs for errors. Rese  |
|                           | t GPU. Restart DCGM. Rerun diagnostics.        |
| Warning: GPU0             | Thermal violations totaling 2078.6 seconds st  |
|                           | arted at 349.8 seconds into the test for GPU   |
|                           | 0 Verify that the cooling on this machine is   |
|                           | functional, including external, thermal mater  |
|                           | ial interface, fans, and any other components  |
|                           | .                                              |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Temperature 90 of GPU 3 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           |  the user-specified temperature maximum is se  |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           |  this machine is functional, including extern  |
|                           | al, thermal material interface, fans, and any  |
|                           |  other components.                             |
| Warning: GPU3             | Clocks event for GPU 3 because of clocks even  |
|                           | t starting 349.9 seconds into the test. clock  |
|                           | s_event_reason_sw_thermal_slowdown: the GPU o  |
|                           | r its memory have reached unsafe temperatures  |
|                           | . Check DCGM and system logs for errors. Rese  |
|                           | t GPU. Restart DCGM. Rerun diagnostics.        |
| Warning: GPU3             | Thermal violations totaling 11853.3 seconds s  |
|                           | tarted at 349.9 seconds into the test for GPU  |
|                           |  3 Verify that the cooling on this machine is  |
|                           |  functional, including external, thermal mate  |
|                           | rial interface, fans, and any other component  |
|                           | s.                                             |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+-----  Stress  ------------+------------------------------------------------+
| memtest                   | Fail                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Temperature 96 of HBM Memory on GPU 3 exceede  |
|                           | d user-specified maximum allowed temperature   |
|                           | 95 Verify that the user-specified temperature  |
|                           |  maximum is set correctly. If it is, check th  |
|                           | e cooling for this GPU and node: Verify that   |
|                           | the cooling on this machine is functional, in  |
|                           | cluding external, thermal material interface,  |
|                           |  fans, and any other components.               |
| Warning: GPU3             | Clocks event for GPU 3 because of clocks even  |
|                           | t starting 50.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           |  its memory have reached unsafe temperatures.  |
|                           |  Check DCGM and system logs for errors. Reset  |
|                           |  GPU. Restart DCGM. Rerun diagnostics.         |
| Warning: GPU3             | Thermal violations totaling 371.0 seconds sta  |
|                           | rted at 50.0 seconds into the test for GPU 3   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| memory_bandwidth          | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| targeted_stress           | Pass                                           |
|                           | GPU0: Pass                                     |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Pass                                     |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
| targeted_power            | Fail                                           |
|                           | GPU0: Fail                                     |
| Warning: GPU0             | Clocks event for GPU 0 because of clocks even  |
|                           | t starting 10.0 seconds into the test. clocks  |
|                           | _event_reason_sw_thermal_slowdown: the GPU or  |
|                           |  its memory have reached unsafe temperatures.  |
|                           |  Check DCGM and system logs for errors. Reset  |
|                           |  GPU. Restart DCGM. Rerun diagnostics.         |
| Warning: GPU0             | Thermal violations totaling 395.9 seconds sta  |
|                           | rted at 5.0 seconds into the test for GPU 0 V  |
|                           | erify that the cooling on this machine is fun  |
|                           | ctional, including external, thermal material  |
|                           |  interface, fans, and any other components.    |
|                           | GPU1: Pass                                     |
|                           | GPU2: Pass                                     |
|                           | GPU3: Fail                                     |
| Warning: GPU3             | Max power of 395.3 did not reach desired powe  |
|                           | r minimum target_power_min_ratio of 525.0 for  |
|                           |  GPU 3 Verify that the clock speeds and GPU u  |
|                           | tilization are high.                           |
| Warning: GPU3             | Temperature 89 of GPU 3 exceeded user-specifi  |
|                           | ed maximum allowed temperature 87 Verify that  |
|                           |  the user-specified temperature maximum is se  |
|                           | t correctly. If it is, check the cooling for   |
|                           | this GPU and node: Verify that the cooling on  |
|                           |  this machine is functional, including extern  |
|                           | al, thermal material interface, fans, and any  |
|                           |  other components.                             |
| Warning: GPU3             | Clocks event for GPU 3 because of clocks even  |
|                           | t starting 5.0 seconds into the test. clocks_  |
|                           | event_reason_sw_thermal_slowdown: the GPU or   |
|                           | its memory have reached unsafe temperatures.   |
|                           | Check DCGM and system logs for errors. Reset   |
|                           | GPU. Restart DCGM. Rerun diagnostics.          |
| Warning: GPU3             | Thermal violations totaling 2443.9 seconds st  |
|                           | arted at 5.0 seconds into the test for GPU 3   |
|                           | Verify that the cooling on this machine is fu  |
|                           | nctional, including external, thermal materia  |
|                           | l interface, fans, and any other components.   |
|                           | GPU4: Pass                                     |
|                           | GPU5: Pass                                     |
|                           | GPU6: Pass                                     |
|                           | GPU7: Pass                                     |
+---------------------------+------------------------------------------------+
==================================================
End Time : 2026-04-08 14:36:16
==================================================