[CPMD-list] help on parallel computing

axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Thu May 16 12:28:32 CEST 2002


>>> "WZ" == weiz  <weizhuang> writes:

WZ> Hi, friends:

WZ> I was running the CPMD program on a linux cluster and set the restart file to be 
WZ> saved every 5 steps. however, every time the machine is about to save 
WZ> the restart file, the job crashes. the output is shown below. could 
WZ> anybody give me a clue about what is wrong here and any suggestions on how 
WZ> to solve it? thanks a lot.

WZ> wei zhuang

WZ> -----------------
WZ> MPI_Recv: process in local group is dead (rank 4, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 8, MPI_COMM_WORLD)
WZ> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)

hiho!

it looks as if you have overloaded your network and the ethernet
driver dropped a few packets (or the pci bus lost some interrupts).
some combinations of pc hardware get severely overloaded if you run large
parallel cpmd jobs, especially with TCP/IP networking.

can you describe the kind of network (network cards, switch) and 
driver software (linux kernel, network driver) you are using?

also, it would be helpful to know whether the same job runs on a single
cpu and how 'large' the job is, i.e. how big the restart file is.
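
to collect most of this in one go, something along these lines may help.
it is only a minimal sketch, not a tested diagnostic tool: it assumes a
linux node that exposes /proc/net/dev, and that the restart files in the
working directory match RESTART* (adjust the pattern if yours are named
differently). parse_net_dev and report are hypothetical helper names made
up for this example. it prints the kernel release, the error/drop counters
per network interface, and the size of each restart file.

```python
import glob
import os
import platform


def parse_net_dev(text):
    """Return {interface: {counter: int}} from /proc/net/dev-style text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:  # 8 receive + 8 transmit columns expected
            continue
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


def report(restart_pattern="RESTART*", net_dev="/proc/net/dev"):
    """Build a short text report: kernel, interface counters, file sizes."""
    lines = ["kernel: " + platform.release()]
    if os.path.exists(net_dev):
        with open(net_dev) as fh:
            for iface, counters in sorted(parse_net_dev(fh.read()).items()):
                lines.append("%s: %s" % (iface, counters))
    # restart_pattern is an assumption about how the restart files are named
    for path in sorted(glob.glob(restart_pattern)):
        lines.append("%s: %.1f MB" % (path, os.path.getsize(path) / 1e6))
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

run it on each node before and after a crash. counters that jump between
the two runs would point towards the network rather than cpmd itself.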

cheers,
        axel.



--

=======================================================================
Axel Kohlmeyer       e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum                   http://www.theochem.ruhr-uni-bochum.de
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


