[CPMD-list] Trying to get MPI to work

mkosmows mkosmows at mailbox.syr.edu
Thu Aug 7 18:01:43 CEST 2003


Dear Dr. Kohlmeyer:

I am using a 10/100 cable router/switch for my network backbone.  I am 
planning to invest in a gigabit network in a month or two, as well as 
purchasing some additional hardware, but wanted to get MPI running on my 
current equipment before spending money.  It looks like there may only be a 
10Mb ethernet card in the machine that crashed - but my notebook has a 10/100 
on-board NIC, and while the notebook didn't crash, mpirun cpmd did.

I have attached the results of all of the tests you recommended below in the 
file testresults.  I ran memtest86 2.9 on both desktop machines and found no 
errors after 2 passes.  Any advice would be greatly appreciated, even if it is 
only to tell me that the problem will go away with server-level equipment.

Have a good day, and thank you very much for your help and concern,

Mark Kosmowski

>===== Original Message From axel.kohlmeyer at theochem.ruhr-uni-bochum.de =====
>>> "MK" == mkosmows  <mkosmows at mailbox.syr.edu> writes:
>
>MK> Dear CPMD community:
>
>hello mark,
>
>MK> I have gotten the MPI version of CPMD 3.7.2 (PGI f77, GCC cc) to run.
>MK> However, I get an error, described below.  I am running MPICH 1.2.5 and
>MK> CPMD 3.7.2.  I have not tried other versions of CPMD.  I am using a "cluster"
>MK> of two Athlon 1.3GHz workstations with 1Gb RAM each.  Both computers are
>MK> running Mandrake 9.1 linux.
>
>MK> In the terminal where the program was running:
>
>MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
>MK>     p4_error: latest msg from perror: No route to host
>MK> Killed by signal 2.
>MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1:  8167 Broken pipe
>MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI8082 -p4wd
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
>MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
>MK>     p4_error: latest msg from perror: No route to host
>
>that means, that your second machine has crashed hard.
>this is most likely due to an overloading of the memory
>management (tcp/ip is very tough on the mm-subsystem)
>or the ethernet driver (some of them are pretty fragile).
>also some gcc versions occasionally miscompile some parts of
>the kernel or the drivers.
>
>what kernel version are you using (cat /proc/version)?
>and what ethernet card(s) do you have (/sbin/lspci -v | grep -A6 Ether)?
>
>MK> Killed by signal 2.
>MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 10028 Broken pipe
>MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI9943 -p4wd
>MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP
>MK> [mark at linux SP]$
>
>MK> And in the output file:
>
>MK>  NFI      GEMAX       CNORM           ETOT        DETOT      TCPU
>MK>    1  2.112E-01   1.667E-02     -29.677204    0.000E+00     88.83
>MK>    2  1.074E-01   6.307E-03     -31.056725   -1.380E+00     90.80
>MK>    3  4.252E-02   2.703E-03     -31.253852   -1.971E-01     91.64
>MK>    4  3.083E-02   1.413E-03     -31.281939   -2.809E-02     93.56
>MK> p0_10028: (5378.632967) net_send: could not write to fd=5, errno = 113
>MK> p0_10028:  p4_error: net_send write: -1
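The `errno = 113` in the net_send line and the "No route to host" text from perror are the same error seen from two places. A minimal sketch for looking up a numeric errno on the machine where the job ran is given here; the helper name `describe_errno` is only illustrative, and errno numbers are platform-specific (on Linux, 113 is EHOSTUNREACH).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 is the value reported by net_send in the output above.
    # The mapping printed here is whatever the local platform defines.
    print(113, *describe_errno(113))
```

Run on one of the Linux nodes, this should report the same message that perror printed in the terminal, which confirms that the first machine could no longer reach the second one at the network level.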
>
>MK> The first time this happened, only two lines after the NFI line were printed.
>MK> Also, the second workstation (not the one that mpirun ... cpmd.x was invoked on)
>MK> stops giving a video signal and is unresponsive to ssh or webmin from the first
>MK> workstation.  Is this something that mpi is doing, or should I be looking for a
>MK> hardware problem?
>
>well, it may also be a hardware problem (wrong bios settings, weak
>memory, etc.). you may want to run memtest <http://www.memtest86.com/>
>on all machines to rule out weak memory.
>
>MK> Thank you,
>
>MK> Mark Kosmowski
>
>MK> Chemistry Department
>MK> Syracuse University
>MK> mkosmows at syr.edu
>
>
>>> "MK" == mkosmows  <mkosmows at mailbox.syr.edu> writes:
>
>MK> Dear CPMD community:
>
>MK> I just got that same broken pipe error when using my notebook as the second
>MK> processor, so I am inclined to rule out hardware problem.  However, the
>MK> notebook did not hang and is still perfectly functional.
>
>hmm, another question: do you use direct cabling or a switch or a hub?
>if it is one of the latter two, you may want to look up the switch
>statistics for the ports you are using (if you have access to that info,
>that is). perhaps there is a cabling problem. please also look at the output of
>'netstat -i'. it should produce something like this (gigabit ethernet
>machine, uptime 9 days):
>
>Kernel Interface table
>Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
>eth0   1500   0 2147483647      0      0      0 2147483647      1      0      0 BRU
>lo    16436   0       1837      0      0      0       1837      0      0      0 LRU
>
>or this (100mbit ethernet machine, uptime 9 days):
>Kernel Interface table
>Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
>eth0   1500   0 935317809      0      0     58 915489568      0      0      0 BRU
>lo    16436   0     26270      0      0      0     26270      0      0      0 LRU
>
>both machines have been heavily used for tcp/ip based parallel jobs.
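To compare a machine against the healthy tables above, the error, drop, and overrun columns are the ones to watch. The following is a rough sketch that parses `netstat -i` text in the column layout shown above and lists any nonzero counters. The function names are made up for illustration, and the sample text is the 100mbit table quoted above; other netstat versions may print different columns.

```python
SAMPLE = """\
Kernel Interface table
Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 935317809      0      0     58 915489568      0      0      0 BRU
lo    16436   0     26270      0      0      0     26270      0      0      0 LRU
"""


def parse_netstat_i(text):
    """Map interface name -> {column name: value string} from 'netstat -i' text."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column header is the line whose first field is "Iface".
    start = next(i for i, fields in enumerate(rows) if fields and fields[0] == "Iface")
    header = rows[start]
    table = {}
    for fields in rows[start + 1:]:
        if len(fields) != len(header):
            continue  # skip lines that do not match the header layout
        table[fields[0]] = dict(zip(header[1:], fields[1:]))
    return table


def nonzero_error_counters(table):
    """Return {iface: {counter: value}} for ERR/DRP/OVR counters above zero."""
    problems = {}
    for iface, cols in table.items():
        bad = {
            name: int(value)
            for name, value in cols.items()
            if name.endswith(("-ERR", "-DRP", "-OVR")) and value.isdigit() and int(value) > 0
        }
        if bad:
            problems[iface] = bad
    return problems


if __name__ == "__main__":
    print(nonzero_error_counters(parse_netstat_i(SAMPLE)))
```

For a live check, the text could be taken from the actual command output on each node (for example by saving `netstat -i` to a file) instead of the SAMPLE string. A handful of overruns against hundreds of millions of packets, as in the table above, is unremarkable; large or quickly growing counts would point toward the cabling, switch, or driver.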
>
>hope this helps,
>     axel.
>
>MK> I will try using both desktops and the notebook together to see what will
>MK> happen.  Hopefully someone knows what to do.
>
>MK> Thank you,
>
>MK> Mark Kosmowski
>
>MK> Chemistry Department
>MK> Syracuse University
>MK> mkosmows at syr.edu
>
>MK> _______________________________________________
>MK> CPMD-list mailing list
>MK> CPMD-list at cpmd.org
>MK> http://www.cpmd.org/mailman/listinfo/cpmd-list
>
>
>
>--
>
>=======================================================================
>Axel Kohlmeyer       e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
>Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
>Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
>D-44780 Bochum                   http://www.theochem.ruhr-uni-bochum.de
>=======================================================================
>If you make something idiot-proof, the universe creates a better idiot.

Chemistry Department
Syracuse University
mkosmows at syr.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testresults
Type: application/octet-stream
Size: 2298 bytes
Desc: not available
Url : http://cpmd.org/pipermail/cpmd-list/attachments/20030807/08abb672/attachment.obj 

