[CPMD-list] help on parallel computing
axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Thu May 16 12:28:32 CEST 2002
>>> "WZ" == weiz <weizhuang> writes:
WZ> Hi, friends:
WZ> I was running the CPMD program on a Linux cluster, with the restart file set
WZ> to be saved every 5 steps. However, every time the machine is about to save
WZ> the restart file, the job crashes with the messages below. Could anybody give
WZ> me a clue about what is wrong here and suggest how to solve it?
WZ> Thanks a lot.
WZ> wei zhuang
WZ> -----------------
WZ> MPI_Recv: process in local group is dead (rank 4, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 8, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
hiho!
it looks as if you have overloaded your network and the ethernet
driver dropped a few packets (or the pci bus lost some interrupts).
some combinations of pc hardware get severely overloaded if you run large
parallel cpmd jobs, especially with TCP/IP networking.
can you describe the kind of network (network cards, switch) and
driver software (linux kernel, network driver) you are using?
it would also be helpful to know whether the same job runs on a single
cpu and how 'large' the job is, i.e. how big the restart file is.
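one way to look for dropped packets on a linux node is to read the per-interface error and drop counters that the kernel exposes in /proc/net/dev. the following is only a minimal sketch under stated assumptions: the function name parse_net_dev and the selection of counters are illustrative, not something cpmd or the mpi library provides, and nonzero counters are a hint at network trouble rather than proof.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev (Linux).

    Returns {interface: {"rx_errs", "rx_drop", "tx_errs", "tx_drop"}}.
    parse_net_dev is a hypothetical helper name used for illustration.
    """
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        # Split on the first colon; older kernels print "eth0:1234" with
        # no space between the interface name and the first counter.
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        values = [int(f) for f in fields[:16]]
        # Receive columns occupy indices 0-7, transmit columns 8-15;
        # errs and drop are the 3rd and 4th column of each half.
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters


def read_net_dev(path="/proc/net/dev"):
    """Read and parse the counters on the local node (Linux only)."""
    with open(path) as handle:
        return parse_net_dev(handle.read())
```

calling read_net_dev() on each node before and after a crashing run and comparing the numbers would show whether an interface accumulated errors or drops while the restart file was being written.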
cheers,
axel.
--
=======================================================================
Axel Kohlmeyer e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53 Fax: ++49 (0)234/32-14045
D-44780 Bochum http://www.theochem.ruhr-uni-bochum.de
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.