
Distributed Memory Parallelization using MPI

This is the standard parallelization scheme used in CPMD. It is based on an MPI message passing library and is the generally recommended way to compile and run CPMD. When run on a single processor, the MPI version of the code typically incurs an overhead of about 10% with respect to the serially compiled code. This overhead is due to additional copy and sort operations during the FFTs.

All the basic system data and many matrices whose size scales with the number of electrons are replicated on all processors. This leads to considerable additional memory usage (calculated as the sum over the memory of all processors compared to the memory needed on a single processor). For large systems distributed over many processors the replicated data can dominate the memory usage.
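The effect of replication on memory can be illustrated with a minimal sketch. It assumes a simple model in which each processor holds a full copy of the replicated data plus an equal share of the distributed data; the function names and the numbers in the usage lines are hypothetical and are not taken from CPMD.

```python
def aggregate_memory(replicated_mb, distributed_mb, nproc):
    """Memory summed over all processors (assumed model, not CPMD output).

    replicated_mb  : data held in full on every processor
    distributed_mb : total amount of data that is split across processors
    """
    return nproc * replicated_mb + distributed_mb


def replicated_fraction(replicated_mb, distributed_mb, nproc):
    """Share of one processor's memory that is taken by replicated data."""
    per_processor = replicated_mb + distributed_mb / nproc
    return replicated_mb / per_processor


if __name__ == "__main__":
    # Hypothetical sizes: 200 MB replicated, 8000 MB distributed in total.
    for nproc in (1, 16, 256):
        total = aggregate_memory(200, 8000, nproc)
        share = replicated_fraction(200, 8000, nproc)
        print(nproc, total, round(share, 3))
```

In this model the replicated share of each processor's memory grows towards one as the processor count increases, which is the sense in which replicated data can dominate.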

The efficiency of the parallelization depends on the system being calculated (e.g. cutoff and number of electrons) and on the hardware platform, mainly the latency and bandwidth of the communication system. The most important bottleneck in the distributed memory parallelization of CPMD is the load-balancing problem in the FFT. The real space grids are distributed over the first dimension alone (see the line REAL SPACE MESH: in the output). As the mesh sizes only vary between 20 (very small systems, low cutoff) and 300 (large systems, high cutoff), the parallelization is rather coarse grained. To avoid load imbalance the number of processors should be a divisor of the mesh size. It is therefore evident that, even for large systems, additional speedup beyond a few hundred processors can only be obtained by parallelizing not only across data but also across tasks.
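The divisor rule can be checked with a small sketch. It assumes that the planes along the first mesh dimension are handed out as evenly as possible and that the run time of the FFT step is set by the processor with the most planes; the mesh size of 120 and the function names are hypothetical.

```python
def planes_per_processor(mesh_x, nproc):
    """Number of planes (first mesh dimension) owned by each processor.

    Any remainder is spread over the first processors, one extra plane each.
    """
    base, extra = divmod(mesh_x, nproc)
    return [base + 1 if rank < extra else base for rank in range(nproc)]


def load_balance_efficiency(mesh_x, nproc):
    """Ideal efficiency when the busiest processor determines the run time."""
    busiest = max(planes_per_processor(mesh_x, nproc))
    return mesh_x / (nproc * busiest)


def divisors(mesh_x):
    """Processor counts that divide the mesh size without remainder."""
    return [p for p in range(1, mesh_x + 1) if mesh_x % p == 0]


if __name__ == "__main__":
    mesh_x = 120  # hypothetical first dimension from REAL SPACE MESH:
    print("balanced processor counts:", divisors(mesh_x))
    for nproc in (40, 48, 60, 100, 120, 240):
        print(nproc, round(load_balance_efficiency(mesh_x, nproc), 3))
```

With these assumptions, processor counts that divide 120 give an efficiency of 1, while 48 or 100 processors leave some processors with fewer planes than others, and any count above 120 leaves processors without a plane at all.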

A partial solution to this problem is provided by the keyword TASKGROUPS. This technique, together with optimal mapping, allows CPMD to scale to thousands of processors on modern supercomputers such as the IBM BG/L.
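Why task groups help can be sketched with a simplified model. It assumes that the processors are divided into equally sized groups, that each group holds the full real space mesh and works on a different subset of states, and that the number of groups must divide the number of processors. These are modelling assumptions made for illustration; the function name and the numbers are hypothetical, and the TASKGROUPS entry in the keyword reference remains the authority on the actual behaviour.

```python
def planes_per_processor_in_group(mesh_x, nproc, ngroups):
    """Planes owned by each processor of one task group (assumed model).

    The mesh is split over the processors of a single group only, so the
    processor count seen by the mesh is nproc // ngroups.
    """
    if ngroups < 1 or nproc % ngroups != 0:
        raise ValueError("ngroups must be a positive divisor of nproc")
    group_size = nproc // ngroups
    base, extra = divmod(mesh_x, group_size)
    return [base + 1 if rank < extra else base for rank in range(group_size)]


if __name__ == "__main__":
    mesh_x, nproc = 128, 1024  # hypothetical mesh size and processor count
    for ngroups in (1, 2, 4, 8):
        planes = planes_per_processor_in_group(mesh_x, nproc, ngroups)
        idle = planes.count(0)
        print(ngroups, "groups:", len(planes), "processors per group,", idle, "without a plane")
```

In this model a single group of 1024 processors leaves 896 of them without a plane of a 128-plane mesh, whereas 8 groups of 128 processors give every processor exactly one plane.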

When selecting NSTBLK for BLOCKSIZE STATES it is important to take into account the granularity of the problem at hand. For example, in cases where the number of STATES is smaller than the total number of available processors, one must choose a value for NSTBLK such that only a subgroup of the processors participates in the distributed linear algebra calculations. The same argument is also relevant when the number of STATES is only moderately larger than the number of processors.
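The granularity argument can be made concrete with a short sketch. It assumes that NSTBLK acts as the minimum number of states assigned to each participating processor, which should be checked against the BLOCKSIZE STATES entry in the keyword reference; the function name and the numbers are hypothetical.

```python
def participating_processors(nstates, nproc, nstblk):
    """Processors that take part in the distributed linear algebra.

    Assumes each participating processor receives at least nstblk states,
    so at most nstates // nstblk processors can be used.
    """
    if nstblk < 1:
        raise ValueError("nstblk must be at least 1")
    return max(1, min(nproc, nstates // nstblk))


if __name__ == "__main__":
    nproc = 256  # hypothetical processor count
    # Fewer states than processors: only a subgroup can participate.
    print(participating_processors(100, nproc, 20))
    # Only moderately more states than processors: a block size of 1 would
    # spread one or two states over every processor; a larger block
    # restricts the work to a subgroup.
    print(participating_processors(300, nproc, 1))
    print(participating_processors(300, nproc, 50))
```

Under this assumption a larger NSTBLK trades the number of participating processors for more work per processor, which is the choice the paragraph above asks the user to make.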

To learn more about the distributed memory parallelization of CPMD consult D. Marx and J. Hutter, "Modern Methods and Algorithms of Quantum Chemistry", Forschungszentrum Jülich, NIC Series, Vol. 1 (2000), 301-449. For recent developments and for a perspective see [5].


Costas Bekas 2008-09-04