test for 4096 cpus. was RE: [CPMD-list] Empty ENERGIES file on x86-64 systems

Bernd Kallies kallies at zib.de
Mon May 8 20:00:40 CEST 2006


Dear all,

I've set up a medium-to-large CPMD benchmark, which we want to use to
rate machines regarding suitability for CPMD. The terms "medium" to
"large" depend on what kind of machine people have at hand and what
system sizes they are commonly using. For us, the benchmark is in between
"medium" and "large", since we only have 512 IBM 1.3 GHz Power4 CPUs,
which are 4 years old now.

The benchmark idea is to time an MD run, which is set up as a restart
from a previous MD run regarding wavefunction and velocities. One may
either use a provided restart file or recalculate the wavefunction in a
preceding step from given ionic positions. Thus, different convergence
behaviour on different platforms is avoided for benchmarking. I/O can be
switched off completely, except reading and writing the restart file.
However, the number of timesteps one benchmarks has to ensure that
reading/writing the restart file does not dominate the benchmark. In my
opinion it is not a good idea to mix calculation and I/O for
benchmarking, since the two have to be tuned in completely different
ways.
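
The requirement that restart I/O must not dominate can be put into a
small formula: with a fixed I/O time t_io and a time per MD step t_step,
the I/O share of an n-step run is t_io / (t_io + n * t_step). A minimal
Python sketch that solves this for n follows. The function name and the
example numbers are placeholders for illustration, not measurements from
this benchmark.

```python
import math


def min_timesteps(io_seconds, seconds_per_step, max_io_fraction):
    """Smallest step count n with io / (io + n * step) <= max_io_fraction."""
    if not 0.0 < max_io_fraction < 1.0:
        raise ValueError("max_io_fraction must lie strictly between 0 and 1")
    # Rearranging io / (io + n * step) <= f gives n >= io * (1 - f) / (f * step).
    n = io_seconds * (1.0 - max_io_fraction) / (max_io_fraction * seconds_per_step)
    return max(1, math.ceil(n))


# Placeholder inputs (assumed, not measured): 300 s to read plus write the
# restart file, 10 s per MD step, and at most 25% of the wall time spent in I/O.
print(min_timesteps(io_seconds=300.0, seconds_per_step=10.0, max_io_fraction=0.25))
```

Measuring t_io and t_step once on the target machine and plugging them
in gives a lower bound on the number of time steps to benchmark.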

My benchmark system consists of an Fe3+ ion, which is complexed with an
organic tetradentate chelate ligand named tris-carboxymethylamine (NTA).
The whole thing is solvated in 229 waters, yielding a cubic box with
about 36 a.u. box length. The system is a real-world problem and not
artificial. Since Fe3+ is an open-shell system (sextet), one has to use
LSD. The Fe-PP I use for the benchmark uses NLCC and was developed for
Fe2+/Fe3+ with PBE. It performs well with a cutoff of 70 Ry and above,
judged by the corresponding aqua ions. For 70 Ry the system needs
nearly 2 million plane waves. The calculation needs about 47 GByte of
memory in total, and the restart file is about 7 GByte (!) on usual
platforms (bigger on a NEC). Systems like this one are among those that
people at our site want to study.

The benchmark needs about 35 minutes for 15 time steps when running on
our platform with CPMD v3.9.2 using 32 MPI tasks and 4 OpenMP threads
for each task. With twice the number of tasks it needs 22 minutes, and
going up to 128 tasks with 4 threads per task (our whole machine) it
needs about 15 minutes. So scaling is not very good here, which is
probably due to our network (IBM HPS). I also ran it on an SGI Altix
3700 with 1.3 GHz Itanium 2 on 64 CPUs (which is all I have got so far).
There it needed 50 minutes for 15 time steps.
Using 8 tasks one gets nearly 2 GFlop/s per CPU on both platforms with
this benchmark, which is quite good relative to peak performance.
Increasing the number of tasks to the numbers given above yields a drop
in performance (of course) down to 1 GFlop/s per CPU, which is still
quite good for tightly coupled and structured codes.
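
The Power4 timings above can be turned into speedup and parallel
efficiency relative to the 32-task run. A minimal Python sketch follows.
It uses only the wall-clock numbers quoted in this mail, and the function
name is made up for illustration.

```python
# Wall-clock minutes for 15 MD steps on the Power4 system, as quoted above.
# Each MPI task runs 4 OpenMP threads, so CPUs = 4 * tasks.
timings_min = {32: 35.0, 64: 22.0, 128: 15.0}


def scaling_table(timings, base_tasks):
    """Return (tasks, speedup, efficiency) tuples relative to base_tasks."""
    base_time = timings[base_tasks]
    rows = []
    for tasks in sorted(timings):
        speedup = base_time / timings[tasks]
        ideal = tasks / base_tasks  # perfect scaling would give this speedup
        rows.append((tasks, speedup, speedup / ideal))
    return rows


for tasks, speedup, eff in scaling_table(timings_min, base_tasks=32):
    print(f"{tasks:4d} tasks ({4 * tasks:3d} CPUs): "
          f"speedup {speedup:4.2f}, efficiency {eff:4.0%}")
```

Relative to 32 tasks, the efficiency comes out at roughly 80% for 64
tasks and a bit under 60% for 128 tasks.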

All in all I believe that benchmarks like this one are suitable to
measure CPMD performance on a variety of "big" platforms. Smaller
problems make no sense in my opinion.
The benchmark run time can be tuned easily and does not suffer from
different convergence behaviour, since wavefunction convergence is
achieved outside of the benchmark. One also might play with plane wave
cutoffs. To stay physical and increase system size, one might add
more water. LSD seemed a good idea to me since it doubles the
computational work to do. However, all this works only if one gets a
converged wavefunction prior to the benchmark, which is not easy with
systems like this one.

Any comments are welcome.

-- 
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies at zib.de



