VASP Job Failure with Soft Lockup on HPC Nodes | Vasp5.4.4

Posted: Tue Nov 12, 2024 9:02 am
by n_sukumar1

We are experiencing persistent problems with VASP 5.4.4 compiled with Intel oneAPI on our HPC system. Compute nodes frequently go down while VASP is running, causing jobs to fail or terminate unexpectedly. Below is a typical error message from the kernel log:

[1822049.022978] watchdog: BUG: soft lockup - CPU#101 stuck for 22s! [vasp_std:1145679]
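
To see whether the lockups cluster on particular cores or processes, lines of this form can be tallied from a saved kernel log. The following is only a minimal sketch: the regular expression is modelled on the single message quoted above, and the function names (parse_soft_lockups, summarize) are hypothetical helpers, not part of VASP or of any kernel tool. Other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this post, e.g.
# [1822049.022978] watchdog: BUG: soft lockup - CPU#101 stuck for 22s! [vasp_std:1145679]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft-lockup message found in the given lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "timestamp": float(m["ts"]),       # seconds since boot
                "cpu": int(m["cpu"]),              # logical CPU that was stuck
                "stuck_seconds": int(m["secs"]),   # duration reported by the watchdog
                "command": m["comm"],              # process name, e.g. vasp_std
                "pid": int(m["pid"]),
            })
    return events


def summarize(events):
    """Count lockups per (command, cpu) pair."""
    return Counter((e["command"], e["cpu"]) for e in events)


if __name__ == "__main__":
    sample = [
        "[1822049.022978] watchdog: BUG: soft lockup - CPU#101 "
        "stuck for 22s! [vasp_std:1145679]",
    ]
    for (command, cpu), count in summarize(parse_soft_lockups(sample)).items():
        print(f"{command} on CPU {cpu}: {count} lockup(s)")
```

In practice the input would be the lines of a saved dmesg or journal dump from an affected node. If the same few CPUs recur across nodes, that would point toward pinning or scheduling rather than a random hardware fault.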

Has anyone encountered similar issues with VASP (particularly version 5.4.4) on systems with high core counts?
Could this problem be related to how Intel oneAPI handles MPI or thread scheduling?
What compilation flags or settings are recommended for stability and performance on systems with high CPU core counts?

Troubleshooting Steps Taken:

We've tried modifying runtime parameters to reduce CPU load but still encounter the issue sporadically.
Standard diagnostic tools have not indicated memory leaks, but CPU utilization spikes are observed.

Seeking Input:

Any insights on how to adjust the VASP configuration, compilation options, or node settings to avoid soft lockups would be greatly appreciated. Suggestions for kernel tuning or Intel oneAPI settings that could mitigate the problem would also be helpful.


Re: VASP Job Failure with Soft Lockup on HPC Nodes | Vasp5.4.4

Posted: Tue Nov 12, 2024 3:41 pm
by henrique_miranda

Firstly, thank you for your report!

I have never seen this issue, and unfortunately it is hard for us to reproduce it.
Here are a few things you could try:

  1. Compile with a different toolchain: perhaps an earlier version of the Intel compiler, or even the GNU compiler.
  2. Try compiling with a different MPI implementation, for example Open MPI.
  3. Compile your own version of ScaLAPACK and link VASP against it.