VASP Job Failure with Soft Lockup on HPC Nodes | VASP 5.4.4
We are experiencing persistent issues with VASP 5.4.4 compiled with Intel oneAPI on our HPC system. Compute nodes frequently go down while running VASP, causing jobs to fail or terminate unexpectedly. Below is a typical kernel message seen in the logs:
[1822049.022978] watchdog: BUG: soft lockup - CPU#101 stuck for 22s! [vasp_std:1145679]
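To help compare occurrences across nodes, here is a minimal sketch that pulls the CPU number, stall duration, process name, and PID out of messages like the one above. It assumes every soft-lockup message follows exactly the layout of that single line; the names `LOCKUP_RE`, `parse_soft_lockup`, and `count_by_cpu` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Assumption: messages match the layout of the single log line quoted above.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft-lockup message, or None if the line does not match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "uptime": float(match.group("uptime")),  # seconds since boot
        "cpu": int(match.group("cpu")),          # CPU that was stuck
        "seconds": int(match.group("secs")),     # reported stall duration
        "comm": match.group("comm"),             # process name, e.g. vasp_std
        "pid": int(match.group("pid")),
    }


def count_by_cpu(lines):
    """Count soft-lockup reports per CPU number across an iterable of log lines."""
    hits = (parse_soft_lockup(line) for line in lines)
    return Counter(hit["cpu"] for hit in hits if hit is not None)


if __name__ == "__main__":
    sample = ("[1822049.022978] watchdog: BUG: soft lockup - "
              "CPU#101 stuck for 22s! [vasp_std:1145679]")
    print(parse_soft_lockup(sample))
```

Feeding the saved kernel log of each node through `count_by_cpu` should show whether the lockups cluster on particular cores or are spread evenly, which may help separate a hardware or pinning problem from a general one.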
Has anyone encountered similar issues with VASP (particularly version 5.4.4) on systems with high core counts?
Could this problem be related to how Intel oneAPI handles MPI or thread scheduling?
What compilation flags or settings are recommended for stability and performance on systems with high CPU core counts?
Troubleshooting Steps Taken:
We've tried modifying runtime parameters to reduce CPU load, but the issue still occurs sporadically.
Standard diagnostic tools have not indicated any memory leaks, but we do observe spikes in CPU utilization.
Seeking Input: Any insights on how to adjust the VASP configuration, compilation options, or node settings to avoid soft lockups would be greatly appreciated. Suggestions for kernel tuning or Intel oneAPI settings that could mitigate the problem would also be helpful.
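For reference when suggesting kernel tuning: as far as I understand the Linux kernel sysctl documentation, the soft-lockup threshold is twice the value of `kernel.watchdog_thresh`, so the 22s in the message above would be consistent with a threshold of 20s. Below is a minimal sketch for reading the current value on a node. The helper name `soft_lockup_threshold` is hypothetical, and the path assumes a Linux node with the watchdog sysctl exposed under `/proc`.

```python
from pathlib import Path


def soft_lockup_threshold(path="/proc/sys/kernel/watchdog_thresh"):
    """Return the soft-lockup threshold in seconds, or None if it cannot be read.

    Assumption: the threshold is 2 * watchdog_thresh, per the kernel sysctl docs.
    """
    try:
        thresh = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not an integer (e.g. non-Linux system).
        return None
    return 2 * thresh


if __name__ == "__main__":
    seconds = soft_lockup_threshold()
    if seconds is None:
        print("watchdog_thresh not available on this system")
    else:
        print(f"soft lockup is reported after about {seconds}s")
```

Raising this value would probably only silence the warning rather than address whatever keeps the CPU from being rescheduled, so I am mainly interested in the underlying cause.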