[CPMD-list] script for PGI-LAMMPI

mjensen at fysik.dtu.dk
Fri Oct 11 20:26:48 CEST 2002


Hi

Sorry for interfering. Carme Rovira is teaching me CPMD,
and I'm involved in the problem with running CPMD using LAM MPI.


When running CPMD on two nodes (0 and 1) 

mpirun N -x PP_LIBRARY_PATH cpmd-mpi-lam.x test.inp

we get this error:

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 525 failed on node n1 with exit status 1.
-----------------------------------------------------------------------------

Running on node 0 only, or on node 1 only, works without problems.

A small MPI send-receive program run in exactly the same way also works,
whether on n0 alone, on n1 alone, or on both (n0,1, i.e. option N to
mpirun). The cluster has a single processor per node.

Since the Configure script has no LAM-MPI option for generating a
LAM Linux version of CPMD, we just used the LAM MPI wrappers around
the PGI compilers (mpif77 and mpicc), and in the .tcsh we have

    setenv PGI /usr/local/lib/PGI
    setenv LAMHOME /usr/local/lib/LAM/
    setenv PATH ${PATH}:/usr/local/lib/LAM/bin


but had little luck when running. Any suggestions are highly appreciated.
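
One possibility that may be worth ruling out (a guess, not a confirmed diagnosis): the setenv line above appends the LAM bin directory to the end of PATH, so an mpirun or mpif77 from another MPI installation earlier in PATH would be found first. Below is a minimal POSIX sh sketch that lists every match for a command in PATH order. The function name all_in_path is made up for illustration.

```sh
# List every executable called $1 found in PATH, in search order.
# The first line printed is the one the shell would actually run.
# The body runs in a subshell, so the IFS and globbing changes stay local.
all_in_path() (
    name=$1
    IFS=:       # split PATH on colons
    set -f      # no filename globbing on PATH entries
    for dir in $PATH; do
        # an empty PATH entry means the current directory
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running `all_in_path mpirun` and `all_in_path mpif77` on each node and comparing the first line of the output would show which binaries are really picked up there.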


Furthermore, is there any experience with compiling CPMD with
large file support on Linux? 

Adding in this case to a MPICH-PGI CPMD Makefile 

-Mlsf 

as FFLAG 

and

-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE

as CFLAG

resulted in a perfectly clean compilation, but
the executable shows this problem:

-----------------------------------------------------------------------------

 PARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARAPARA
 LOADPA| PROCESSOR    1 HAS NO G COMPONENT.


 PROGRAM STOPS IN SUBROUTINE LOADPA| TOO MANY PROCESSORS [PROC=   0]
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p0_1356:  p4_error: : 999
Broken pipe

-----------------------------------------------------------------------------

Running the same job with a small (MPICH CPMD) executable works fine.

Any way out of this?
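
As a cross-check on the C side only, the flags in the Makefile can be compared with what the system itself reports for large file support through `getconf LFS_CFLAGS`. The sketch below is POSIX sh. The helper name missing_flags is hypothetical, and it says nothing about the Fortran flag or about the LOADPA error itself.

```sh
# Print each required flag that is absent from the given flag string.
#   $1  flags currently used, e.g. the CFLAGS value from the Makefile
#   $2  required flags (optional; defaults to `getconf LFS_CFLAGS`,
#       which is empty on systems that need no extra flags)
# The body runs in a subshell so its variables do not leak.
missing_flags() (
    have=$1
    want=${2-$(getconf LFS_CFLAGS 2>/dev/null)}
    for flag in $want; do
        case " $have " in
            *" $flag "*) ;;                  # flag is present
            *) printf '%s\n' "$flag" ;;      # flag is missing
        esac
    done
)
```

For example, `missing_flags "-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE"` prints nothing when the two defines cover everything the system asks for.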

Thanks -

Morten Jensen

> 
> >>> "CR" == Carme Rovira <crovira at pcb.ub.es> writes:
> 
> CR> Dear Axel,
> 
> dear carme,
> 
> CR> You are absolutely right. Here are some details
> CR> of how we proceed:
> 
> [setup detail deleted]
> 
> that is perfect so far.
> 
> >> 
> >> if this works, you usually just have to create a file
> >> (e.g. hostlist) with all the machines you want to use
> >> in the lam-parallel-machine (name hosts where you want to use
> >> 2 cpus twice) and then initialize lam with
> 
> CR> This is done using PBS batch system as  (lines ripped from the
> CR> batch script):
> 
> ok, pbs works well with lam. one more question, are you using
> dual cpu nodes? if yes, then 'pbsnodes -a' should give you something
> like this:
> 
> dust
>      state = free
>      np = 2
>      properties = dust,dualamd
>      ntype = cluster
> 
> important are 'np=2' and 'ntype=cluster'
> for single cpu accordingly:
> 
> vivaldi
>      state = job-exclusive
>      np = 1
>      properties = athlon,vivaldi,server,medium
>      ntype = cluster
>      jobs = 0/7945.monteverdi.theochem.ruhr-uni-bochum.de
> 
> again, important are 'np=1' and 'ntype=cluster'.
> but this only determines how you use the cpus and how many
> and should not affect the running of a parallel job.
> 
> 
> 
> CR> #create nodelist
> CR> set nodelist = `cat $PBS_NODEFILE`
> 
> CR> # calc number of nodes
> CR> set N = `wc $PBS_NODEFILE | awk '{print $1}'`
> 
> CR> # create lamhost file
> 
> CR> cat $PBS_NODEFILE > lamhosts
> 
> >> lamboot -v hostlist
> 
> 
> ok, your script assumes a csh/tcsh syntax. have you verified, that
> this is actually the case? pbs usually passes the batch script to
> /bin/sh, if i remember correctly.
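
If the batch script does end up being run by /bin/sh, the csh lines quoted above could be written roughly as follows. This is a sketch only: make_lamhosts is a hypothetical helper name, and the node file is assumed to end with a newline so that `wc -l` counts every entry.

```sh
# Bourne-shell counterpart of the quoted csh lines:
# copy the PBS node file to a LAM host file and print the entry count.
#   $1  node file (normally "$PBS_NODEFILE")
#   $2  host file to write (e.g. lamhosts)
make_lamhosts() {
    nodefile=$1
    out=$2
    # create the LAM host file
    cat "$nodefile" > "$out" || return 1
    # number of entries (lines); tr strips the padding some wc versions add
    wc -l < "$nodefile" | tr -d ' '
}
```

In the job script this would be used as `N=$(make_lamhosts "$PBS_NODEFILE" lamhosts)`, followed by `lamboot -v lamhosts`.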
> 
> 
> CR> Works on the nodes (i.e. lamboot -v lamhosts)
> 
> >> 
> >> then you can start parallel cpmd by
> >> 
> >> mpirun C cpmd.x inputfile > outputfile
> 
> CR> What is "C" doing (is it equivalent to "c")?
> 
> no. C is like N but starts multiple copies if you 
> have hosts with multiple cpus.
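
To see how many processes N and C would start for a given host file, one can count distinct hosts versus total entries. This sketch assumes that dual-cpu hosts are listed twice in the file, as described earlier in the thread. The function name count_targets is hypothetical.

```sh
# Report how many processes "mpirun N" and "mpirun C" would start,
# assuming a host file that names multi-cpu hosts once per cpu.
count_targets() {
    nodefile=$1
    # N: one process per distinct host
    nodes=$(sort -u "$nodefile" | wc -l | tr -d ' ')
    # C: one process per listed cpu, i.e. per line
    cpus=$(wc -l < "$nodefile" | tr -d ' ')
    printf 'N=%s C=%s\n' "$nodes" "$cpus"
}
```

For a file that lists dust twice and vivaldi once, this prints `N=2 C=3`. On a cluster with one processor per node the two numbers are the same.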
> 
> 
> >> 
> 
> CR> We tried  this as well as
> 
> CR> mpirun -O -s n0 N cpmd-mpi-lam-large.x test.inp > test.out
> 
> CR> and
> 
> CR> mpirun N cpmd-mpi-lam-large.x test.inp > test.out
> 
> CR> but no success...
> 
> 
> all in all you can simplify that (and make it shell syntax independent)
> by just using the following script.
> 
> cd $PBS_O_WORKDIR
> 
> lamboot -v $PBS_NODEFILE
> 
> mpirun -O C cpmd-mpi-lam-large.x test.inp > test.out
> 
> lamhalt -v
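
The same sequence can be wrapped in a small sh function, so that a failed lamboot stops the job early and lamhalt still runs when cpmd exits with an error. This is a sketch under assumptions: run_cpmd_job is a made-up name, and the executable name is the one used in this thread.

```sh
# Boot LAM, run CPMD on all cpus, and always shut LAM down again.
#   $1  host file (normally "$PBS_NODEFILE")
#   $2  CPMD input file
#   $3  file that receives the CPMD output
run_cpmd_job() {
    hosts=$1
    input=$2
    output=$3
    # give up early if the LAM daemons cannot be started
    lamboot -v "$hosts" || { echo "lamboot failed" >&2; return 1; }
    # remember the exit status of the parallel run
    status=0
    mpirun -O C cpmd-mpi-lam-large.x "$input" > "$output" || status=$?
    # stop the LAM daemons whether or not the run succeeded
    lamhalt -v
    return "$status"
}
```

After `cd $PBS_O_WORKDIR`, the job script would then call `run_cpmd_job "$PBS_NODEFILE" test.inp test.out`. The return value is the exit status of mpirun, which makes a failed run visible to the batch system.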
> 
> 
> CR> Should we copy cpmd.x and the input to the remote nodes?
> CR> We tried adding the following to the pbs script:
> 
> CR> #
> CR> shift nodelist
> CR> foreach node ($nodelist)
> CR>   rcp /scratch/{test.inp,cpmd-mpi-lam-large.x} ${node}:/scratch
> CR> end
> 
> CR> Note that the calculation is performed on /scratch.
> CR> First everything (cpmd.x, input) is copied to /scratch,
> CR> then one cd's to /scratch and possibly remote-copies
> CR> cpmd.x and the input to the nodes.
> 
> ok, but if you have a shared, nfs mounted home directory, you
> could put the pseudopotentials and the cpmd executables say in 
> $HOME/cpmd and run it with:
> 
> mpirun -O C $HOME/cpmd/cpmd-mpi-lam-large.x test.inp $HOME/cpmd > test.out
> 
> 
> >> or however you would run a serial cpmd job.
> >> after your job is finished you can stop the
> >> lam infrastructure with
> >> 
> >> lamhalt -v
> >> 
> >> or
> >> 
> >> wipe -v hostlist
> >> 
> CR> Also works fine
> 
> >> if you have to submit your script to a batch system,
> >> then you have to determine how you get the list of
> >> allocated hosts from the batch system. with e.g.
> >> openpbs you have to use $PBS_NODEFILE instead of the
> >> file 'hostlist'.
> 
> CR> Hope this is clear from the lines above
> 
> 
> yes, that was very helpful. if you still can not get it to work, you 
> should also look into the stdout/stderr logs of the batch system.
> those are usually files with the name of the job script and an
> .e<jobid> or .o<jobid> appended.
> 
> 
> good luck,
>         axel.
> 
> >> 
> >> i hope this helps.
> >> 
> >> cheers,
> >> axel.
> >> 
> >> >
> >> > Saludos,
> >> >
> >> >    Carme
> >> >
> >> > -------------------------------------------------------------
> >> > Carme Rovira i Virgili                  Tel: +34 93 4037112
> >> > Centre de Recerca en Química Teòrica    Fax: +34 93 4037225
> >> > Parc Científic de Barcelona             (http://www.pcb.ub.es)
> >> > Josep Samitier 1-5 Annex A              E-mail: crovira at pcb.ub.es
> >> > 08028 Barcelona, Spain   URL:http://www.qf.ub.es/personal/crovira
> >> > --------------------------------------------------------------
> >> > _______________________________________________
> >> > CPMD-list mailing list
> >> > CPMD-list at cpmd.org
> >> > http://www.cpmd.org/mailman/listinfo/cpmd-list
> >> >
> >> 
> >> --
> >> 
> >> =======================================================================
> >> Axel Kohlmeyer      e-mail:  axel.kohlmeyer at theochem.ruhr-uni-bochum.de
> >> Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
> >> Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
> >> D-44780 Bochum                   http://www.theochem.ruhr-uni-bochum.de
> >> =======================================================================
> >> If you make something idiot-proof, the universe creates a better idiot.
> 