[CPMD-list] Trying to get MPI to work
mkosmows
mkosmows at mailbox.syr.edu
Thu Aug 7 18:01:43 CEST 2003
Dear Dr. Kohlmeyer:
I am using a 10/100 cable router/switch for my network backbone. I am
planning to invest in a gigabit network in a month or two, as well as
purchasing some additional hardware, but wanted to get MPI running on my
current equipment before spending money. It looks like there may only be a
10Mb ethernet card in the machine that crashed - but my notebook has a 10/100
on-board NIC, and while the notebook didn't crash, mpirun cpmd did.
I have attached the results of all of the tests you recommended below in the
file testresults. I ran memtest86 2.9 on both desktop machines and saw no
errors after 2 passes. Any advice would be greatly appreciated, even if it is
only to tell me that the problem will go away with server-level equipment.
Have a good day, and thank you very much for your help and concern,
Mark Kosmowski
>===== Original Message From axel.kohlmeyer at theochem.ruhr-uni-bochum.de =====
>>> "MK" == mkosmows <mkosmows at mailbox.syr.edu> writes:
>
>MK> Dear CPMD community:
>
>hello mark,
>
>MK> I have gotten the MPI version of CPMD 3.7.2 (PGI f77, GCC cc) to run.
>MK> However, I get an error, described below. I am running MPICH 1.2.5 and
>MK> CPMD 3.7.2. I have not tried other versions of CPMD. I am using a
>MK> "cluster" of two Athlon 1.3GHz workstations with 1Gb RAM each. Both
>MK> computers are running Mandrake 9.1 linux.
>
>MK> In the terminal where the program was running:
>
>MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
>MK> p4_error: latest msg from perror: No route to host
>MK> Killed by signal 2.
>MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 8167 Broken pipe
>MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI8082 -p4wd
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
>MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
>MK> p4_error: latest msg from perror: No route to host
>
>that means that your second machine has crashed hard.
>this is most likely due to an overloading of the memory
>management (tcp/ip is very tough on the mm-subsystem)
>or the ethernet driver (some of them are pretty fragile).
>also some gcc versions occasionally miscompile some parts of
>the kernel or the drivers.
>
>what kernel version are you using (cat /proc/version)?
>and what ethernet card(s) do you have (/sbin/lspci -v | grep -A6 Ether)?
>
>MK> Killed by signal 2.
>MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 10028 Broken pipe
>MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI9943 -p4wd
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
>MK> [mark at linux SP]$
>
>MK> And in the output file:
>
>MK> NFI GEMAX CNORM ETOT DETOT TCPU
>MK> 1 2.112E-01 1.667E-02 -29.677204 0.000E+00 88.83
>MK> 2 1.074E-01 6.307E-03 -31.056725 -1.380E+00 90.80
>MK> 3 4.252E-02 2.703E-03 -31.253852 -1.971E-01 91.64
>MK> 4 3.083E-02 1.413E-03 -31.281939 -2.809E-02 93.56
>MK> p0_10028: (5378.632967) net_send: could not write to fd=5, errno = 113
>MK> p0_10028: p4_error: net_send write: -1
>
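The errno value in the net_send line above can be decoded without looking it up by hand. A minimal sketch, assuming Python is available on the node; describe_errno is a hypothetical helper name, not part of MPICH or CPMD. On Linux, 113 should come back as EHOSTUNREACH ("No route to host"), which matches the p4_error message printed in the terminal.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform.

    Hypothetical helper for illustration; errno numbering is platform
    specific, so run this on the machine that produced the error.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 113 is the value reported by net_send in the output file quoted above.
print(describe_errno(113))
```

The name only says that the peer could not be reached at the moment of the write; it does not say whether the cause was a crashed host, the driver, or the cabling.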
>MK> The first time this happened, only two lines after the NFI line were
>MK> printed. Also, the second workstation (not the one that mpirun ...
>MK> cpmd.x was invoked on) stops giving a video signal and is unresponsive
>MK> to ssh or webmin from the first workstation. Is this something that mpi
>MK> is doing, or should I be looking for a hardware problem?
>
>well, it may also be a hardware problem (wrong bios settings, weak
>memory, etc.). you may want to run memtest <http://www.memtest86.com/>
>on all machines to rule out weak memory.
>
>MK> Thank you,
>
>MK> Mark Kosmowski
>
>MK> Chemistry Department
>MK> Syracuse University
>MK> mkosmows at syr.edu
>
>
>>> "MK" == mkosmows <mkosmows at mailbox.syr.edu> writes:
>
>MK> Dear CPMD community:
>
>MK> I just got that same broken pipe error when using my notebook as the
>MK> second processor, so I am inclined to rule out a hardware problem.
>MK> However, the notebook did not hang and is still perfectly functional.
>
>hmm, another question: do you use direct cabling or a switch or a hub?
>if it is one of the latter two, you may want to look up the switch
>statistics for the ports you are using (if you have access to that info,
>that is). perhaps there is a cabling problem. please also look at the
>output of 'netstat -i'. it should produce something like this (gigabit
>ethernet machine, uptime 9 days):
>
>Kernel Interface table
>Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
>eth0   1500   0 2147483647      0      0      0 2147483647      1      0      0 BRU
>lo    16436   0       1837      0      0      0       1837      0      0      0 LRU
>
>or this (100mbit ethernet machine, uptime 9 days):
>Kernel Interface table
>Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
>eth0   1500   0  935317809      0      0     58  915489568      0      0      0 BRU
>lo    16436   0      26270      0      0      0      26270      0      0      0 LRU
>
>both machines have been heavily used for tcp/ip based parallel jobs.
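The 'netstat -i' check described above can be sketched as a small script that flags any interface with nonzero error, drop, or overrun counters. This is only a sketch under assumptions: it expects the twelve-column layout shown in this thread (other netstat versions may print different columns), and nonzero_error_counters is a hypothetical helper name. The sample text is the 100mbit table quoted above.

```python
# Column names as printed by `netstat -i` in the tables quoted above.
# Assumption: other netstat versions may use a different layout.
COLUMNS = ["Iface", "MTU", "Met", "RX-OK", "RX-ERR", "RX-DRP", "RX-OVR",
           "TX-OK", "TX-ERR", "TX-DRP", "TX-OVR", "Flg"]
ERROR_COLUMNS = ["RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR"]


def nonzero_error_counters(netstat_text):
    """Return {iface: {counter: value}} for every nonzero error counter."""
    problems = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip the title line, the header row, and rows of unexpected width.
        if len(fields) != len(COLUMNS) or fields[0] == "Iface":
            continue
        row = dict(zip(COLUMNS, fields))
        bad = {name: int(row[name])
               for name in ERROR_COLUMNS if int(row[name]) != 0}
        if bad:
            problems[row["Iface"]] = bad
    return problems


SAMPLE = """\
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 935317809 0 0 58 915489568 0 0 0 BRU
lo 16436 0 26270 0 0 0 26270 0 0 0 LRU
"""

print(nonzero_error_counters(SAMPLE))  # {'eth0': {'RX-OVR': 58}}
```

A handful of overruns against hundreds of millions of packets, as in the sample, is the healthy case. Counters that climb quickly during a short parallel run would point toward the card, driver, cable, or switch port.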
>
>hope this helps,
> axel.
>
>MK> I will try using both desktops and the notebook together to see what will
>MK> happen. Hopefully someone knows what to do.
>
>MK> Thank you,
>
>MK> Mark Kosmowski
>
>MK> Chemistry Department
>MK> Syracuse University
>MK> mkosmows at syr.edu
>
>MK> _______________________________________________
>MK> CPMD-list mailing list
>MK> CPMD-list at cpmd.org
>MK> http://www.cpmd.org/mailman/listinfo/cpmd-list
>
>
>
>--
>
>=======================================================================
>Axel Kohlmeyer e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
>Lehrstuhl fuer Theoretische Chemie Phone: ++49 (0)234/32-26673
>Ruhr-Universitaet Bochum - NC 03/53 Fax: ++49 (0)234/32-14045
>D-44780 Bochum http://www.theochem.ruhr-uni-bochum.de
>=======================================================================
>If you make something idiot-proof, the universe creates a better idiot.
Chemistry Department
Syracuse University
mkosmows at syr.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testresults
Type: application/octet-stream
Size: 2298 bytes
Desc: not available
Url : http://cpmd.org/pipermail/cpmd-list/attachments/20030807/08abb672/attachment.obj