[CPMD-list] MPI jobs hang...for ever.

Axel Kohlmeyer axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Fri Oct 4 11:02:57 CEST 2002


>>> "HW" == HW Sheng <hwsheng at jhu.edu> writes:


HW> Dear all,

HW> On my Linux/Alpha cluster (10 dual alpha processors with LAM/MPI), MPI
HW> programs often crash without warnings. Each time, a different node gets
HW> crashed. The program (cpmd), however, runs without any problem on
HW> singular machines (at least, it appears to be running okay). Could
HW> someone point me to a tutorial so that I might get a feel of MPI/LAM and
HW> shoot the problems?

HW> I am sorry being a pest.

HW> Howard

hi howard,

for tutorials on LAM/MPI just have a closer look at the lam homepage
under the url http://www.lam-mpi.org/.

but i doubt that your problem is lam related (provided the lam package
has been compiled properly). hardware or driver problems are more likely.

so: what kind of network connection are you using exactly? 
and what brand of ethernet cards (try running '/sbin/lspci | grep -i ether')?
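
if you would rather collect this from a script than read the lspci
output by eye, the same filter can be sketched in python. this is only
a minimal sketch: the function names ethernet_devices and
local_ethernet_devices are made up for this example, and it assumes
that lspci lives in /sbin as in the command above.

```python
import subprocess


def ethernet_devices(lspci_output):
    """Return the lspci lines that mention 'ether' (case-insensitive),
    i.e. the same lines that `grep -i ether` would keep."""
    return [line.strip()
            for line in lspci_output.splitlines()
            if "ether" in line.lower()]


def local_ethernet_devices(lspci="/sbin/lspci"):
    """Run lspci on this machine and filter its output.

    The default path is an assumption; adjust it if lspci is
    installed somewhere else on your nodes.
    """
    result = subprocess.run([lspci], capture_output=True,
                            text=True, check=True)
    return ethernet_devices(result.stdout)
```

running local_ethernet_devices() on each node should list the same
controllers as the shell pipeline, one string per card.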

the reason for this question is that some ethernet cards (and their
linux drivers) are not well suited to the extreme load that parallel
cpmd jobs create. we have had bad experiences with 3com 3c905 cards
(in pc's though) and especially with the intel chipset based ethernet
cards that originally came with our linux/alphas (i replaced them with
dec tulip chipset cards that were already over 3 years old at the time,
and they are very reliable).

also, if you only have a 100 MBit connection, you are better off
running 10 jobs, each confined to a single smp node; otherwise you
will waste most of the available cpu power.

if you cannot do this, you should seriously consider hooking up those
machines with a small SCI or myrinet network, and you will probably more
than double the 'usable' cpu power for large jobs. compared to the cost
of the machines themselves, the high-speed interconnect will come rather
cheap.

all the best,
    axel kohlmeyer.

--

=======================================================================
Axel Kohlmeyer       e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum                   http://www.theochem.ruhr-uni-bochum.de
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


