[CPMD-list] Trying to get MPI to work

Axel Kohlmeyer axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Wed Aug 6 12:44:23 CEST 2003


>>> "MK" == mkosmows  <mkosmows at mailbox.syr.edu> writes:

MK> Dear CPMD community:

hello mark,
 
MK> I have gotten the MPI version of CPMD 3.7.2 (PGI f77, GCC cc) to run.  
MK> However, I get an error, described below.  I am running MPICH 1.2.5 and 
MK> CPMD 3.7.2.  I have not tried other versions of CPMD.  I am using a "cluster" 
MK> of two Athlon 1.3GHz workstations with 1Gb RAM each.  Both computers are 
MK> running Mandrake 9.1 linux.
 
MK> In the terminal where the program was running:
 
MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
MK>     p4_error: latest msg from perror: No route to host 
MK> Killed by signal 2. 
MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1:  8167 Broken pipe             
MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg 
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI8082 -p4wd 
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP 
MK> [mark at linux SP]$ mpirun -np 2 ~/work/bin/cpmd.x BH3NH3.in >BH3NH3.out
MK>     p4_error: latest msg from perror: No route to host 

that means that your second machine has crashed hard.
this is most likely due to an overloading of the memory
management (tcp/ip is very tough on the mm-subsystem)
or the ethernet driver (some of them are pretty fragile).
also some gcc versions occasionally miscompile some parts of
the kernel or the drivers.

what kernel version are you using (cat /proc/version)?
and what ethernet card(s) do you have (/sbin/lspci -v | grep -A6 Ether)?

MK> Killed by signal 2. 
MK> /home/mark/work/mpich-1.2.5/bin/mpirun: line 1: 10028 Broken pipe             
MK> /home/mark/work/bin/cpmd.x "BH3NH3.in" -p4pg 
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP/PI9943 -p4wd 
MK> /home/mark/work/cpmd/BH3NH3/lda/25K8.8.8/SP 
MK> [mark at linux SP]$
 
MK> And in the output file:

MK>  NFI      GEMAX       CNORM           ETOT        DETOT      TCPU
MK>    1  2.112E-01   1.667E-02     -29.677204    0.000E+00     88.83
MK>    2  1.074E-01   6.307E-03     -31.056725   -1.380E+00     90.80
MK>    3  4.252E-02   2.703E-03     -31.253852   -1.971E-01     91.64
MK>    4  3.083E-02   1.413E-03     -31.281939   -2.809E-02     93.56 
MK> p0_10028: (5378.632967) net_send: could not write to fd=5, errno = 113 
MK> p0_10028:  p4_error: net_send write: -1
 
MK> The first time this happened, only two lines after the NFI line were printed.  
MK> Also, the second workstation (not the one that mpirun ... cpmd.x was invoked 
MK> on) stops giving a video signal and is unresponsive to ssh or webmin from the 
MK> first workstation.  Is this something that mpi is doing, or should I be 
MK> looking for a hardware problem?

well, it may also be a hardware problem (wrong bios settings, weak
memory, etc.). you may want to run memtest <http://www.memtest86.com/>
on all machines to rule out weak memory.
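
The `errno = 113` in the net_send message quoted above can be decoded by hand. On Linux, 113 is EHOSTUNREACH ("No route to host"), which matches the perror text from mpirun. The following is a minimal sketch, not part of CPMD or MPICH: the function name `decode_p4_errno` and the small lookup table are assumptions made for illustration, and the numbers are the Linux values only (other platforms number these errors differently).

```python
import re

# Subset of the Linux errno table. These numbers are Linux-specific;
# this table is an illustrative assumption, not a complete list.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# Matches the "errno = N" fragment in a p4 net_send message.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def decode_p4_errno(line):
    """Return (number, name, message) for the errno in a log line, or None."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    name, message = LINUX_ERRNO.get(number, ("UNKNOWN", "not in this table"))
    return number, name, message


log_line = "p0_10028: (5378.632967) net_send: could not write to fd=5, errno = 113"
print(decode_p4_errno(log_line))
```

The result only says that the sending node could no longer reach its peer. It does not say whether the peer crashed or the network path failed.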
 
MK> Thank you,
 
MK> Mark Kosmowski

MK> Chemistry Department
MK> Syracuse University
MK> mkosmows at syr.edu


>>> "MK" == mkosmows  <mkosmows at mailbox.syr.edu> writes:

MK> Dear CPMD community:
 
MK> I just got that same broken pipe error when using my notebook as the second 
MK> processor, so I am inclined to rule out a hardware problem.  However, the 
MK> notebook did not hang and is still perfectly functional.

hmm, another question: do you use direct cabling, a switch, or a hub?
if it is one of the latter two, you may want to look up the switch
statistics for the ports you are using (if you have access to that info,
that is). perhaps there is a cabling problem. please also look at the output of
'netstat -i'. it should produce something like this (gigabit ethernet
machine, uptime 9 days):

Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 2147483647      0      0      0 2147483647      1      0      0 BRU
lo    16436   0     1837      0      0      0     1837      0      0      0 LRU

or this (100mbit ethernet machine, uptime 9 days):
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 935317809      0      0     58 915489568      0      0      0 BRU
lo    16436   0    26270      0      0      0    26270      0      0      0 LRU

both machines have been heavily used for tcp/ip based parallel jobs.
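
If the `netstat -i` tables get long, the error columns can be checked with a short script. This is a rough sketch that assumes the column layout shown in the two samples above (newer netstat versions may print different columns). The helper names `parse_netstat_i` and `nonzero_error_counters` are made up for illustration. The sample text is the 100mbit table from this message.

```python
# Error counters worth looking at; names taken from the header line above.
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def parse_netstat_i(text):
    """Parse `netstat -i` output into {interface: {column: value}}."""
    header = None
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields[1:]  # column names after the interface name
            continue
        if header is None or len(fields) != len(header) + 1:
            continue  # skip the title line and anything that does not fit
        row = {}
        for column, value in zip(header, fields[1:]):
            row[column] = int(value) if value.isdigit() else value
        table[fields[0]] = row
    return table


def nonzero_error_counters(table):
    """Return {interface: {column: count}} for error counters above zero."""
    report = {}
    for iface, row in table.items():
        bad = {}
        for column in ERROR_COLUMNS:
            count = row.get(column)
            if isinstance(count, int) and count > 0:
                bad[column] = count
        if bad:
            report[iface] = bad
    return report


sample = """Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 935317809      0      0     58 915489568      0      0      0 BRU
lo    16436   0    26270      0      0      0    26270      0      0      0 LRU
"""
print(nonzero_error_counters(parse_netstat_i(sample)))
```

A handful of overruns against roughly 900 million packets, as in this sample, is nothing to worry about. Error or drop counts that climb quickly during a parallel run would point toward the cabling, the switch, or the driver.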

hope this helps,
     axel.

MK> I will try using both desktops and the notebook together to see what will 
MK> happen.  Hopefully someone knows what to do.
 




--

=======================================================================
Axel Kohlmeyer       e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum                   http://www.theochem.ruhr-uni-bochum.de
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


