[CPMD-list] Trying to get MPI to work
Axel Kohlmeyer
axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Wed Aug 6 12:44:23 CEST 2003
>>> "MK" == mkosmows <mkosmows at mailbox.syr.edu> writes:
MK> Dear CPMD community:
hello mark,
MK> I have gotten the MPI version of CPMD 3.7.2 (PGI f77, GCC cc) to run.
MK> However, I get an error, described below. I am running MPICH 1.2.5 and
MK> CPMD 3.7.2. I have not tried other versions of CPMD. I am using a "cluster"
MK> of two Athlon 1.3 GHz workstations with 1 GB RAM each. Both computers are
MK> running Mandrake Linux 9.1.
MK> In the terminal where the program was running:
MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
MK> p4_error: latest msg from perror: No route to host
MK> Killed by signal 2.
MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 8167 Broken pipe
MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI8082 -p4wd
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
MK> p4_error: latest msg from perror: No route to host
that means that your second machine has crashed hard.
this is most likely due to an overload of the memory
management (tcp/ip is very tough on the mm subsystem)
or of the ethernet driver (some of them are pretty fragile).
also, some gcc versions occasionally miscompile parts of
the kernel or the drivers.
what kernel version are you using (cat /proc/version)?
and what ethernet card(s) do you have (/sbin/lspci -v | grep -A6 Ether)?
MK> Killed by signal 2.
MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 10028 Broken pipe
MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI9943 -p4wd
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
MK> [mark at linux SP]$
MK> And in the output file:
MK> NFI GEMAX CNORM ETOT DETOT TCPU
MK> 1 2.112E-01 1.667E-02 -29.677204 0.000E+00 88.83
MK> 2 1.074E-01 6.307E-03 -31.056725 -1.380E+00 90.80
MK> 3 4.252E-02 2.703E-03 -31.253852 -1.971E-01 91.64
MK> 4 3.083E-02 1.413E-03 -31.281939 -2.809E-02 93.56
MK> p0_10028: (5378.632967) net_send: could not write to fd=5, errno = 113
MK> p0_10028: p4_error: net_send write: -1
MK> The first time this happened, only two lines after the NFI line were printed.
MK> Also, the second workstation (not the one that mpirun ... cpmd.x was invoked on)
MK> stops giving a video signal and is unresponsive to ssh or webmin from the
MK> first workstation. Is this something that mpi is doing, or should I be
MK> looking for a hardware problem?
well, it may also be a hardware problem (wrong bios settings, weak
memory, etc.). you may want to run memtest86 <http://www.memtest86.com/>
on all machines to rule out weak memory.
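as a cross-check on the two error messages quoted above: the "errno = 113" in the net_send line and the "No route to host" text from perror describe the same condition. a minimal python sketch for looking up such a number follows. the helper name describe_errno is made up for illustration, and the numeric value 113 is specific to linux; other systems number their errno values differently.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message text) for a numeric errno value."""
    # errno.errorcode maps the numbers known on this platform to their names.
    name = errno.errorcode.get(code, "unknown")
    # os.strerror gives the C library's message text for the same number.
    return name, os.strerror(code)


# 113 is the value from the "net_send ... errno = 113" line (linux numbering).
print(describe_errno(113))
```

on linux the symbolic name for 113 is EHOSTUNREACH, i.e. the sending node could no longer reach its peer, which fits a second machine that has stopped responding.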
MK> Thank you,
MK> Mark Kosmowski
MK> Chemistry Department
MK> Syracuse University
MK> mkosmows at syr.edu
>>> "MK" == mkosmows <mkosmows at mailbox.syr.edu> writes:
MK> Dear CPMD community:
MK> I just got that same broken pipe error when using my notebook as the second
MK> processor, so I am inclined to rule out hardware problem. However, the
MK> notebook did not hang and is still perfectly functional.
hmm, another question: do you use direct cabling, a switch, or a hub?
if it is one of the latter two, you may want to look up the switch
statistics for the ports you are using (if you have access to that info,
that is). perhaps there is a cabling problem. please also look at the output of
'netstat -i'. it should produce something like this (gigabit ethernet
machine, uptime 9 days):
Kernel Interface table
Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 2147483647      0      0      0 2147483647      1      0      0 BRU
lo    16436   0       1837      0      0      0       1837      0      0      0 LRU
or this (100mbit ethernet machine, uptime 9 days):
Kernel Interface table
Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 935317809      0      0     58 915489568      0      0      0 BRU
lo    16436   0     26270      0      0      0     26270      0      0      0 LRU
both machines have been heavily used for tcp/ip based parallel jobs.
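if you want to check several nodes quickly, the error columns can be picked out with a few lines of python. this is only a sketch: it assumes the 'netstat -i' column layout shown above, the function names are made up for illustration, and the sample text is the 100mbit output from this message. counters that are large compared to RX-OK/TX-OK would point towards a cable, hub/switch, or driver problem.

```python
# Sketch: flag interfaces whose `netstat -i` error counters are non-zero.
# Assumes the column layout shown in this message.

ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def parse_netstat_i(text):
    """Return {iface: {column: int}} for the error columns of `netstat -i` output."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The header row is the one whose first field is "Iface".
    header_idx = next(i for i, cols in enumerate(rows) if cols[0] == "Iface")
    header = rows[header_idx]
    stats = {}
    for cols in rows[header_idx + 1:]:
        row = dict(zip(header, cols))
        stats[row["Iface"]] = {name: int(row[name]) for name in ERROR_COLUMNS}
    return stats


def nonzero_counters(stats):
    """Keep only the counters that are above zero, per interface."""
    flagged = {}
    for iface, counters in stats.items():
        bad = {name: n for name, n in counters.items() if n > 0}
        if bad:
            flagged[iface] = bad
    return flagged


SAMPLE = """\
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 935317809 0 0 58 915489568 0 0 0 BRU
lo 16436 0 26270 0 0 0 26270 0 0 0 LRU
"""

print(nonzero_counters(parse_netstat_i(SAMPLE)))
```

for the sample this reports only the 58 receive overruns on eth0, which is negligible next to roughly 935 million received packets.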
hope this helps,
axel.
MK> I will try using both desktops and the notebook together to see what will
MK> happen. Hopefully someone knows what to do.
MK> Thank you,
MK> Mark Kosmowski
MK> Chemistry Department
MK> Syracuse University
MK> mkosmows at syr.edu
--
=======================================================================
Axel Kohlmeyer e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53 Fax: ++49 (0)234/32-14045
D-44780 Bochum http://www.theochem.ruhr-uni-bochum.de
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.