For many complex problems, a single core is not enough to finish the calculation in a reasonable time.
VASP makes use of parallel machines by splitting the calculation into many tasks that communicate with each other using MPI.
By default, VASP distributes the number of bands ({{TAG|NBANDS}}) over the available cores.
But it is often beneficial to add parallelization of the FFTs ({{TAG|NCORE}}), parallelization over '''k''' points ({{TAG|KPAR}}), and parallelization over separate calculations ({{TAG|IMAGES}}).
All these tags default to 1 and divide the number of cores among the parallelization options.
There are also additional parallelization options for some algorithms in VASP, so that the cores are distributed as
::<math>
\text{total cores} = \text{cores parallelizing bands} \times \text{NCORE} \times \text{KPAR} \times \text{IMAGES} \times \text{other algorithm-dependent tags}
</math>
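As a minimal sketch of this bookkeeping (the core counts and tag values are made up for illustration), consider a job launched on 128 cores, e.g., with <code>mpirun -np 128 vasp_std</code>, and the following INCAR settings:
<pre>
NCORE  = 4   ! 4 cores share the FFTs within each group of bands
KPAR   = 4   ! 4 groups of k points are treated in parallel
IMAGES = 1   ! a single calculation (the default)
</pre>
With these hypothetical values, 128 = (cores parallelizing bands) × 4 × 4 × 1, so 8 groups of cores remain for the band parallelization.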
In addition to the parallelization using MPI, VASP can make use of [[Hybrid_MPI/OpenMP_parallelization|OpenMP]] and/or [[OpenACC_GPU_port_of_VASP|OpenACC (for the GPU-port)]].
Note that running on multiple OpenMP threads and/or GPUs switches off the {{TAG|NCORE}} parallelization.

==Optimizing the parallelization==

{{NB|tip|We offer only general advice here; the performance for a specific system may differ significantly. However, one is often interested in a series of similar calculations. In that case, run a few of them with different parallel setups and use the optimal choice of parameters for the rest.}}

When optimizing the parallelization, try to stay as close as possible to the actual production setup.
This includes both the physical system (atoms, cell size, cutoff, ...) and the computational hardware (CPUs, interconnect, number of nodes, ...).
If too many parameters are different, the parallel configuration may not be transferable to the production calculation.
Nevertheless, a few steps of a repetitive task give a good idea of the optimal choice for the full calculation.
For example, run only a few electronic or ionic self-consistency steps instead of converging the calculation completely.
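A minimal sketch of such a truncated benchmark (the values of {{TAG|NELM}} and {{TAG|NSW}} are illustrative, not recommendations) limits the number of steps in the INCAR file:
<pre>
NELM = 5   ! stop after at most 5 electronic steps
NSW  = 3   ! perform only 3 ionic steps
</pre>
Repeating this short run with different parallel setups and comparing the timings (e.g., the LOOP+ times reported in the OUTCAR file) indicates which configuration to use for the full calculation.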

Often, combining multiple parallelization options yields the fastest results because the parallel efficiency of each level drops near its limit.
For the default option (band parallelization), the limit is {{TAG|NBANDS}} divided by a small integer.
Note that VASP will increase {{TAG|NBANDS}} to match the number of cores.
Choose {{TAG|NCORE}} as a factor of the number of cores per node to avoid communication between nodes for the FFTs.
Recall that OpenMP and OpenACC enforce that {{TAG|NCORE}} is not set.
The '''k'''-point parallelization is efficient but requires additional memory.
Given sufficient memory, increase {{TAG|KPAR}} up to the number of irreducible '''k''' points.
Keep in mind that {{TAG|KPAR}} should be a divisor of the number of '''k''' points.
Finally, {{TAG|IMAGES}} is required to split several VASP runs into separate calculations.
The limit is dictated by the number of desired calculations.
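As a sketch, assume a hypothetical machine with 2 nodes of 64 cores each (128 cores in total) and a calculation with 8 irreducible '''k''' points. A reasonable starting point could be:
<pre>
NCORE = 8   ! a factor of the 64 cores per node, so each FFT group stays within one node
KPAR  = 4   ! divides the 8 irreducible k points; requires memory for 4 k-point groups
</pre>
This leaves 128 / (8 × 4) = 4 groups of cores for the band parallelization. Varying {{TAG|NCORE}} and {{TAG|KPAR}} around such values in a short benchmark run reveals the best-performing combination.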

==Caveat about the MPI setup==

The MPI setup determines the placement of the MPI ranks onto the nodes.
VASP assumes that the ranks first fill up a node before the next node is occupied.
As an example, when running with 8 cores on two nodes, VASP expects ranks 1–4 on node 1 and ranks 5–8 on node 2.
If the ranks are placed differently, communication between the nodes occurs for every parallel FFT.
Because FFTs are essential to VASP's speed, this degrades the performance of the calculation.
A typical symptom is an increase in computing time when the number of nodes is increased from 1 to 2.
If {{TAG|NCORE}} is not used, this issue is less severe but still reduces the performance.

To address this issue, please check the setup of the MPI library and the submitted job script.
It is usually possible to override the placement by setting environment variables or command-line arguments.
When in doubt, contact the HPC administration of your machine to investigate the behavior.
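As a sketch (assuming Open MPI and the SLURM scheduler; the exact options differ between MPI libraries and sites), the rank placement can be inspected and enforced as follows:
<pre>
# Open MPI: print the binding of every rank and fill one node before the next
mpirun -np 128 --map-by core --report-bindings vasp_std

# SLURM: request a fixed number of ranks per node (block distribution by default)
srun --nodes=2 --ntasks-per-node=64 vasp_std
</pre>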

==Additional parallelization options==

; {{TAG|KPAR}}: For Laplace-transformed MP2, this tag [[LTMP2_-_Tutorial#Parallelization|has a different meaning]].
; {{TAG|NCORE_IN_IMAGE1}}: Defines how many cores work on the first image in the thermodynamic coupling-constant integration ({{TAG|VCAIMAGES}}).
; {{TAG|NOMEGAPAR}}: Parallelize over imaginary frequency points in GW and RPA calculations.
; {{TAG|NTAUPAR}}: Parallelize over imaginary time points in GW and RPA calculations.
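As an illustrative sketch for a low-scaling GW or RPA run (the values are arbitrary, and the remaining tags, such as {{TAG|ALGO}}, are assumed to be set as usual):
<pre>
NTAUPAR   = 4   ! parallelize over the imaginary time points in 4 groups
NOMEGAPAR = 2   ! parallelize over the imaginary frequency points in 2 groups
</pre>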

==OpenMP/OpenACC==

Both [[Hybrid_MPI/OpenMP_parallelization|OpenMP]] and [[OpenACC_GPU_port_of_VASP|OpenACC]] parallelize the FFTs and therefore disregard any conflicting specification of {{TAG|NCORE}}.
When combining these methods, OpenACC takes precedence, but any code not ported to OpenACC benefits from the additional OpenMP threads.
This approach is relevant because the recommended NVIDIA Collective Communications Library requires a single MPI rank per GPU.
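A minimal sketch of a hybrid MPI/OpenMP launch (assuming Open MPI and a hypothetical node with 32 cores, used by 4 MPI ranks with 8 OpenMP threads each; adjust to your hardware):
<pre>
export OMP_NUM_THREADS=8     # 8 OpenMP threads per MPI rank
export OMP_PLACES=cores      # pin the threads to physical cores
export OMP_PROC_BIND=close   # keep the threads of one rank close together
mpirun -np 4 --map-by ppr:4:node:PE=8 vasp_std
</pre>
Note that {{TAG|NCORE}} is left unset in this case, as explained above.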