All basic system data and many matrices whose dimension is the number of electrons are replicated on all processors. This leads to considerable additional memory usage, measured as the sum of the memory used on all processors compared with the memory needed on a single processor. For large systems distributed over many processors, the replicated data can dominate the memory usage.
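The effect of replication on the summed memory can be made concrete with a small model. The following Python sketch is only an illustration: the function names and the example sizes are hypothetical and not taken from CPMD, and it assumes that every processor holds a full copy of the replicated data plus an equal share of the distributed data.

```python
def total_memory(replicated, distributed, n_procs):
    """Memory summed over all processors (same unit as the inputs).

    Assumption: each processor stores a full copy of the replicated data
    and a 1/n_procs share of the distributed data.
    """
    return n_procs * replicated + distributed


def replicated_fraction(replicated, distributed, n_procs):
    """Fraction of the summed memory that is taken up by replicated data."""
    return n_procs * replicated / total_memory(replicated, distributed, n_procs)


if __name__ == "__main__":
    # Hypothetical sizes in GB, chosen only for illustration.
    replicated, distributed = 0.5, 100.0
    for n_procs in (1, 64, 512):
        total = total_memory(replicated, distributed, n_procs)
        share = replicated_fraction(replicated, distributed, n_procs)
        print(f"{n_procs:4d} processors: total {total:7.1f} GB, replicated share {share:.2f}")
```

In this model the replicated share grows with the number of processors even though the replicated data is small on any single processor.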
The efficiency of the parallelization depends on the calculated system (e.g. cutoff and number of electrons) and on the hardware platform, mostly the latency and bandwidth of the communication system. The most important bottleneck in the distributed memory parallelization of CPMD is the load-balancing problem in the FFT. The real space grids are distributed over the first dimension alone (see the line REAL SPACE MESH: in the output). As the mesh sizes only vary between 20 (very small systems, low cutoffs) and 300 (large systems, high cutoffs), the parallelization is rather coarse-grained. To avoid load imbalance, the number of processors should be a divisor of the mesh size. It is therefore evident that, even for large systems, additional speedup beyond a few hundred processors can only be achieved by parallelizing not only across data but also across tasks.
A partial solution to this problem is provided with the keyword TASKGROUPS. This technique, together with optimal mapping, allows CPMD to scale to thousands of processors on modern supercomputers such as IBM BG/L.
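The load-balancing argument for the real space mesh can be illustrated with a short Python sketch. It is a simplified model, not the distribution algorithm used in CPMD: it assumes that the planes along the first mesh dimension are spread as evenly as possible over the processors and that the FFT time is set by the processor holding the most planes. All function names and the mesh size of 120 are hypothetical.

```python
def plane_distribution(mesh_size, n_procs):
    """Number of planes per processor when planes are spread as evenly as possible."""
    base, extra = divmod(mesh_size, n_procs)
    return [base + 1 if rank < extra else base for rank in range(n_procs)]


def parallel_efficiency(mesh_size, n_procs):
    """Ideal efficiency if the busiest processor determines the run time."""
    busiest = max(plane_distribution(mesh_size, n_procs))
    return mesh_size / (n_procs * busiest)


def balanced_processor_counts(mesh_size):
    """Processor counts that divide the mesh size and so give perfect balance."""
    return [p for p in range(1, mesh_size + 1) if mesh_size % p == 0]


if __name__ == "__main__":
    mesh_size = 120  # hypothetical first dimension of the real space mesh
    print("balanced counts:", balanced_processor_counts(mesh_size))
    for n_procs in (24, 32, 40, 50, 60, 128):
        busiest = max(plane_distribution(mesh_size, n_procs))
        eff = parallel_efficiency(mesh_size, n_procs)
        print(f"{n_procs:4d} processors: max planes {busiest}, efficiency {eff:.2f}")
```

In this model, going from 40 to 50 processors on a mesh of 120 planes gives no speedup, because the busiest processor still holds 3 planes, and any processor count above the mesh size leaves some processors without planes.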
When selecting NSTBLK for BLOCKSIZE STATES it is important to take into account the granularity of the problem at hand. For example, when the number of STATES is smaller than the total number of available processors, one must choose a value for NSTBLK such that only a subgroup of the processors participates in the distributed linear algebra calculations. The same argument also applies when the number of STATES is only moderately larger than the number of processors.
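The granularity argument can be sketched in Python as follows. This is an assumption-laden illustration, not CPMD's internal logic: it assumes that NSTBLK acts as a minimum number of states per participating processor, so that the number of participating processors is at most the number of states divided by NSTBLK. The function names and example numbers are hypothetical.

```python
def participating_processors(n_states, n_procs, nstblk):
    """Processors that take part in the distributed linear algebra.

    Assumption: each participating processor holds at least nstblk states.
    """
    if nstblk < 1:
        raise ValueError("nstblk must be at least 1")
    return max(1, min(n_procs, n_states // nstblk))


def states_per_processor(n_states, n_procs, nstblk):
    """States per processor; processors outside the subgroup hold 0 states."""
    active = participating_processors(n_states, n_procs, nstblk)
    base, extra = divmod(n_states, active)
    shares = [base + 1 if rank < extra else base for rank in range(active)]
    return shares + [0] * (n_procs - active)


if __name__ == "__main__":
    # Hypothetical case: fewer states than processors.
    n_states, n_procs = 64, 256
    for nstblk in (1, 4, 16):
        active = participating_processors(n_states, n_procs, nstblk)
        print(f"NSTBLK={nstblk:2d}: {active} of {n_procs} processors participate")
```

Under this assumption, 64 states on 256 processors with a block size of 4 would involve 16 processors with 4 states each, rather than 64 processors with a single state each.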
To learn more about the distributed memory parallelization of CPMD, consult D. Marx and J. Hutter, "Modern Methods and Algorithms of Quantum Chemistry", Forschungszentrum Jülich, NIC Series, Vol. 1 (2000), 301-449. For recent developments and for a perspective see [5].