[CPMD-list] error in BO-PIMD
Axel Kohlmeyer
akohlmey at cmm.chem.upenn.edu
Tue Jul 3 03:49:25 CEST 2007
dear qianfan zhang,
i finally found some time to trace down the deadlock with pimd-bo
that you were reporting.
the deadlock occurs in the testex() subroutine, which is responsible
for handling the so-called soft-exit, i.e. orderly termination after
creating a file called EXIT in the current working directory.
a simple workaround is to disable the softexit feature for
processor groups by changing lines 31-33 of the file testex.F
from:
      IF (TPATH) THEN
        T_IO_NODE = GRANDPARENT
        MPIGROUP = SUPERGROUP
to:
      IF (TPATH) THEN
        IF(PC_GROUPS.GT.1) RETURN
        T_IO_NODE = GRANDPARENT
        MPIGROUP = SUPERGROUP
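a toy model of this failure mode, written in python rather than CPMD's fortran. it is only a sketch under two assumptions that are not taken from the CPMD source: that the soft-exit check is a collective call across all processor groups, and that it is made once per wavefunction optimization step. the names run_replicas, wfopt_steps and softexit are made up for the illustration, and a threading.Barrier with a timeout stands in for the MPI collective, which would block forever instead of timing out.

```python
import threading


def run_replicas(wfopt_steps, softexit=True, timeout=0.5):
    """Return the ranks of the replicas that hang in the soft-exit check.

    wfopt_steps[i] is the (hypothetical) number of wavefunction
    optimization steps replica i needs in one MD step. Each wfopt step
    is assumed to perform one collective soft-exit check.
    """
    # Stand-in for a collective over all processor groups.
    barrier = threading.Barrier(len(wfopt_steps))
    hung = []

    def replica(rank, nsteps):
        for _ in range(nsteps):
            if not softexit:
                continue  # workaround: skip the check entirely
            try:
                # A real MPI collective would wait here forever.
                barrier.wait(timeout)
            except threading.BrokenBarrierError:
                hung.append(rank)
                return

    threads = [
        threading.Thread(target=replica, args=(rank, nsteps))
        for rank, nsteps in enumerate(wfopt_steps)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(hung)


if __name__ == "__main__":
    print(run_replicas([3, 3]))                  # [] : every call is matched
    print(run_replicas([3, 5]))                  # [1]: replica 1 waits alone
    print(run_replicas([3, 5], softexit=False))  # [] : no collective, no hang
```

as long as all replicas need the same number of steps, every collective call finds its partners. once one replica needs more steps, its extra calls are never matched and it waits indefinitely. returning early from the check, as the workaround above does for PC_GROUPS.GT.1, removes the collective and with it the hang, at the price of losing the softexit feature.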
i'm currently working on a more elaborate fix that
will recover the softexit feature and eliminate a
race-condition i found on top of the deadlock.
best regards,
axel.
p.s.: out of curiosity, i traced the relevant change back to a cvs commit
on nov 16, 2004. it seems that nobody has seriously tried PIMD with
BO with a version newer than 3.9.1, or reported the bug, or handed
in a bugfix... ;-)
On 6/10/07, Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu> wrote:
> On Sun, 10 Jun 2007, qfzhang wrote:
>
> QZ> Hi,
> QZ> Thanks for your advice! Can I solve the problem by rewriting some part of the
> QZ> program? I have added some "CALL MY_SYNC(SUPERGROUP)" statements in pi_diag.F, but
>
> if it were that simple to fix, i'd have done it already. :)
> the problem seems to arise from different replica needing to
> do a different number of wfopt steps and some implicit synchronization
> because of that. the code in pi_diag.F is just the starting point...
>
> QZ> it seems not to work. And you also mentioned the compiler. Can I solve the
> QZ> problem by some change during compiling?
>
> no. please read my reply more carefully. the remark about the compilers
> (actually, it is about the default behavior of the runtime library of
> the compiler implementation) was only meant to explain why you saw
> files with 0 byte length.
>
> to fix this problem, you need to trace the MPI calls and
> find the exact place, where the code deadlocks. how to do
> this cannot be explained in a few lines and is also very
> MPI library specific, so if you want to fix it, you have to
> dig out the corresponding information from various places
> (MPI library documentation, the web, tutorial literature etc.).
>
> cheers,
> axel.
>
> QZ>
> QZ> Best wishes
> QZ> Qianfan Zhang
> QZ>
> QZ> Axel Kohlmeyer wrote:
> QZ>
> QZ> > On Sat, 9 Jun 2007, qfzhang wrote:
> QZ> >
> QZ> > QZ> Hi,
> QZ> > QZ> So sorry for that. It is very strange that when running a BO-PIMD job, nothing is
> QZ> > QZ> written to the output file after "force initialization", and nothing to the files
> QZ> > QZ> TRAJECTORY and ENERGY. But the job will not stop until the walltime limit, and
> QZ> >
> QZ> > hi,
> QZ> > this is not strange, you just discovered a 'deadlock' bug due to
> QZ> > a so-called race condition. this can happen with PI-MD, when the
> QZ> > individual replica take significantly different time to do some
> QZ> > work yet the code is written in a way that expects about the same
> QZ> > time spent.
> QZ> >
> QZ> > you don't see any output to the files, since your compiler defaults
> QZ> > to buffered output (indeed something is written, but the first MD
> QZ> > step stalls, at least when trying to reproduce it on my machine).
> QZ> >
> QZ> > QZ> no error message. But when I specify PROCESSOR GROUP=1, there is no problem. When I use CP-PI
> QZ> >
> QZ> > with no processor groups there is no parallelization over replica,
> QZ> > and it seems that exactly that is causing the problems. with CP-MD
> QZ> > all operations take about the same time per replica, but with BO-MD
> QZ> > this is not always the case (different number of WF-opt steps for
> QZ> > different replica).
> QZ> >
> QZ> > QZ> MD for calculations, no such problem. So I really don't know what's wrong with
> QZ> > QZ> it. The output file is as below.
> QZ> >
> QZ> > the cause can probably be found or narrowed down by tracing
> QZ> > the parallelization in pi_diag.F.
> QZ> >
> QZ> > please note, that even though your job appears to be working, all
> QZ> > it does, is checking for the other parts to communicate which are
> QZ> > waiting for the first nodes in return (=> deadlock).
> QZ> >
> QZ> > cheers,
> QZ> > axel.
> QZ> >
> QZ> > [...]
> QZ> >
> QZ> > QZ> > --
> QZ> > QZ> > =======================================================================
> QZ> > QZ> > Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
> QZ> > QZ> > Center for Molecular Modeling -- University of Pennsylvania
> QZ> > QZ> > Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> QZ> > QZ> > tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> QZ> > QZ> > =======================================================================
> QZ> > QZ> > If you make something idiot-proof, the universe creates a better idiot.
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.