test for 4096 cpus. was RE: [CPMD-list] Empty ENERGIES file on x86-64 systems
Axel Kohlmeyer
akohlmey at vitae.cmm.upenn.edu
Mon May 8 17:40:17 CEST 2006
On 5/8/06, Kozin, I (Igor) <i.kozin at dl.ac.uk> wrote:
igor,
[...]
> Sometime ago CPMD developers promised to make available
> the test cases which could scale to a large number of cpus.
> Given that yours has 4096, would not it be nice
> to run something suitable on it?
please be patient. providing consistent examples requires
people to have a continuous block of free time, and for one
reason or another this has not happened very often in
the last two years. since none of us is financed to work on
CPMD, other requirements take precedence.
> Is there a _reliable_ standard test case which could
> fit the bill?
this totally depends on whether you want to test for correctness
or speed. wavefunction optimization with the default DIIS
optimizer is very sensitive to the initial guess and in some notorious
cases can go totally wrong. switching to a more reliable optimizer such
as PCG or PCG MINIMIZE usually improves the situation. however,
if the wavefunction is converged, it should in all cases yield the same
total energy (to within the convergence parameter).
if you want to test for speed, you can just set
ODIIS NO_RESET=200
5
MAXSTEP 20
provided the run needs at least 20 steps even in the best case.
this way you get comparable times. The best way to evaluate
performance is to do both (a fixed-step timing run and a fully
converged run), since the average number of steps needed to reach
convergence can also be important information, e.g. to single out
'performance' libraries that have a lower accuracy or compilers
that 'optimize' too aggressively.
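A minimal sketch of how the two kinds of results could be put side by side. The record layout, the field names and the energy tolerance are assumptions made for illustration; none of this is CPMD output format, and the values have to be collected by hand or by your own scripts.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One machine/compiler/library combination (hypothetical record layout)."""
    machine: str               # free-form label chosen by the person benchmarking
    total_energy: float        # total energy of the fully converged run (a.u.)
    steps_to_converge: int     # optimizer steps the converged run needed
    fixed_run_seconds: float   # wall time of the fixed-length timing run
    fixed_run_steps: int = 20  # matches the step limit of the timing run


def summarize(runs, reference, energy_tol=1e-6):
    """Compare every run against the run labelled `reference`.

    energy_tol is an assumed tolerance; pick it to match the
    convergence criterion that was actually used.
    """
    ref = next(r for r in runs if r.machine == reference)
    rows = []
    for r in runs:
        rows.append({
            "machine": r.machine,
            # comparable timing: wall time per optimizer step
            "sec_per_step": r.fixed_run_seconds / r.fixed_run_steps,
            # extra steps hint at reduced accuracy somewhere in the stack
            "extra_steps": r.steps_to_converge - ref.steps_to_converge,
            # converged runs should agree on the total energy
            "energy_ok": abs(r.total_energy - ref.total_energy) <= energy_tol,
        })
    return rows
```

A run that is fast per step but needs noticeably more steps, or fails the energy check, is the kind of case the converged run is meant to expose.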
also keep in mind that performance and scaling differ a lot
between job types, combinations of pseudopotentials, and job
sizes on different machines.
to ease cooperation on building a consistent library of benchmark
inputs and on collecting and discussing benchmark results, i've just
created a (private) project on biocore.
http://www.ks.uiuc.edu/Research/biocore/
anybody willing to contribute who has time and access to relevant
resources, please set up a user-id and send it to me by e-mail,
so i can add you to the project. The general idea is to run benchmarks
in two ways: first use the inputs 'as is', the way they would run on
any machine, and then do a best-effort run that uses additional
tricks, modified sources, etc.
> [e.g. it might be tempting to put a quick but not
> robust wavefunction optimization method in the test.
> But then you end up having different number of iterations
> on different machines and how do you compare those?]
>
> Best,
> Igor
>
> PS Axel, you are more than welcome to post any hints
> which improve performance! XT3 or otherwise.
the major trick to get good CPMD performance on machines
with many nodes is to avoid serial overhead and reduce the
effect of communication latencies.
this is nicely summarized in the hutter/curioni paper in
the parrinello festschrift (ChemPhysChem 2005, 6, 1788-1793).
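To get a feeling for why serial overhead matters so much at this scale, here is a small sketch using Amdahl's law. This is the standard textbook model, brought in here for illustration; it is not taken from the paper cited above, and the serial fraction of a real CPMD run has to be measured.

```python
def amdahl_speedup(serial_fraction, ncpus):
    """Ideal speedup on ncpus when serial_fraction of the work cannot be parallelized.

    Communication cost is ignored, so real numbers will be worse.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / ncpus)


if __name__ == "__main__":
    # illustrative serial fractions only, not measurements
    for frac in (0.0, 0.001, 0.01):
        print(f"serial fraction {frac}: speedup on 4096 cpus = "
              f"{amdahl_speedup(frac, 4096):.1f}")
```

In this model a serial fraction of 1% caps the speedup below 100 no matter how many cpus are used, which is why unneeded file output and other serial work become the bottleneck on thousands of nodes.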
in practice, you want to change the code so that it does not write
files you don't need (e.g. in QM/MM) or writes them less frequently
(e.g. GEOMETRY/GEOMETRY.xyz, by adding a test similar to
the one for TRAJEC.xyz/dcd). for large systems using:
TRAJECTORY SAMPLE DCD
-20
could reduce the I/O latency enormously. note that with
lustre (xt3) or gpfs (bg/l) you have a networked file system
across thousands of nodes.
once you have a large enough number of nodes and thus enough
aggregate memory you can use
REALSPACE WFN KEEP
to reduce the number of FFTs, which are extremely sensitive to
the latencies of the interconnect (so it should be useful for people
with gigabit ethernet clusters, too).
with a huge number of nodes, you can use TASKGROUPS as
well to distribute the FFTs for the calculation of the KS-states.
(see the 32 water results in the CPC article).
finally, a special trick on the XT3 (and perhaps other machines
as well) would be to increase the buffer size for buffered i/o.
on the XT3 there is sarah anderson's iobuf.o module. i am
currently trying to figure out a way to make this more generic,
so that it works at least across linux (derived) machines, but
it is a bit tricky and would still require a special hack in the
fileopen code.
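The effect of a larger i/o buffer can be illustrated outside of CPMD with a small sketch. This is plain Python and has nothing to do with iobuf.o or the CPMD file-open code; the CountingRaw class and the record sizes are made up for the demonstration. It only counts how often the buffering layer hands data to the underlying stream, which on a networked file system corresponds to the expensive operation.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a file that counts the writes reaching it."""

    def __init__(self):
        super().__init__()
        self.calls = 0    # number of writes passed down by the buffer layer
        self.nbytes = 0   # total payload received

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        n = len(b)
        self.nbytes += n
        return n          # report that everything was accepted


def count_raw_writes(buffer_size, record=b"x" * 100, nrecords=1000):
    """Write nrecords small records through a buffer of the given size."""
    raw = CountingRaw()
    with io.BufferedWriter(raw, buffer_size=buffer_size) as out:
        for _ in range(nrecords):
            out.write(record)
    return raw.calls, raw.nbytes


if __name__ == "__main__":
    for size in (512, 65536):
        calls, nbytes = count_raw_writes(size)
        print(f"buffer {size:6d} bytes: {calls} low-level writes, {nbytes} bytes")
```

The same amount of data arrives in both cases, but the larger buffer needs far fewer low-level writes.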
> PPS is there a noticeable variation among XT3s?
available compiler and library versions were quite different
a while back, but it seems they are converging now.
best regards,
axel.
> I. Kozin (i.kozin at dl.ac.uk)
> CCLRC Daresbury Laboratory
> tel: 01925 603308
> http://www.cse.clrc.ac.uk/disco
>
> > AK> on what xt3 are you running on? i might have some extra hints
> > AK> for you to get even more out of it...
> >
> > It would be great! I am working with this machine:
> > http://top500.org/system/details/7654
> > Unfortunately, I did not yet make a lot of benchmarks there. However,
> > rough tests suggest that performance and scaling are VERY GOOD!
>
>
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.