[CPMD-list] Random "exit code #137" on BG/L

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Sun Aug 5 19:07:50 CEST 2007


On Sun, 5 Aug 2007, Matteo Guglielmi wrote:

MG> Hello world,

hi matteo,

MG> Does anyone of you know what's that exit code about?

CPMD is a fortran program, so exit codes have no program
specific meaning, i.e. they are compiler/platform specific.
so you have to consult the documentation of the machine
and/or the user support staff or system administrators of 
that machine.
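
as an illustration of why the platform matters: many unix-like
shells and job launchers report a process that was killed by
signal N as exit code 128+N. under that convention, 137 would be
128+9, i.e. signal 9 (SIGKILL on linux). whether the BG/L job
launcher follows this convention is an assumption that only the
machine documentation or support staff can confirm. a minimal
python sketch (the function name is made up for this example):

```python
import signal

def describe_exit_code(code):
    """Interpret an exit code, ASSUMING the common 128+N convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal names are those of the machine running this script,
            # not necessarily those of the machine where the job died
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"exit code {code}: possibly killed by signal {signum} ({name})"
    return f"exit code {code}: set by the program or its runtime"

print(describe_exit_code(137))
```

this only narrows down where to look; it does not tell you who
sent the signal or why.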

MG> I'm experiencing this behavior of cpmd only on BG/L
MG> where none of the output files is gracefully closed.

this is an even stronger hint that your problem is
most likely a problem of the machine and not of CPMD,
so the machine user support should be the first people
to contact. if they are worth their money, they should
be able to help you find the problem (that is what they
get paid for after all).

that being said, you also have to keep in mind, that 
a BG/L is not your common PC cluster, so you have to
make adjustments to the 'nature' of the machines in
how and what you run on it (and this is coming from a
person that has a couple million cpu hours on a BG/L
under his belt). the fact, that you _can_ run across
thousands of nodes with only occasional little hiccups
(say 1 in 10 8hour/1024-node jobs) with excellent scaling
is simply astounding. i'm currently benchmarking a
dual quad-core-cpu pc-cluster with infiniband and 
for over 100 nodes it is simply no longer possible to
run a communication intensive application
such as CPMD reliably for even an hour, unless you
leave some of the cpu cores idle.

also you have to keep in mind, that the amount of
memory available per node on a BG/L is _very_ limited
and, for the sake of simplicity and speed, there is
no sophisticated memory management subsystem (i.e.
no swap space). so you have to pay _very_ close attention
to the memory requirements of your jobs.

if you want useful replies to your report from your
sysadmin/user-support staff, i suggest you provide
more detailed information. e.g.
- are the crashes reproducible (or random)?
- do they happen only with certain job types?
- do they happen only with certain systems?
- do they happen only on a certain partition of the machine?
- if the crashes are intermittent, provide a record of
  when the jobs crashed, so that it can be traced in the
  system logs (perhaps there are indications of what has
  happened in the system logs).
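
the questions above can be answered from a simple record of your
jobs. a minimal python sketch, assuming you keep one entry per job
by hand or from your batch scripts; the field names, partition
names and job entries below are all hypothetical and not taken
from any real BG/L log:

```python
from collections import Counter

# hypothetical job records; replace with your own bookkeeping
jobs = [
    {"end": "2007-08-01 22:10", "partition": "P1", "job_type": "md",      "exit_code": 0},
    {"end": "2007-08-02 06:15", "partition": "P2", "job_type": "md",      "exit_code": 137},
    {"end": "2007-08-03 14:40", "partition": "P1", "job_type": "wfn-opt", "exit_code": 0},
    {"end": "2007-08-04 03:05", "partition": "P2", "job_type": "md",      "exit_code": 137},
]

def summarize(jobs):
    """Collect what a sysadmin would ask for: how often, where, what, when."""
    crashed = [j for j in jobs if j["exit_code"] != 0]
    return {
        "crash_rate": len(crashed) / len(jobs) if jobs else 0.0,
        "by_partition": Counter(j["partition"] for j in crashed),
        "by_job_type": Counter(j["job_type"] for j in crashed),
        # timestamps of the crashes, to be matched against the system logs
        "crash_times": sorted(j["end"] for j in crashed),
    }

summary = summarize(jobs)
for key, value in summary.items():
    print(key, value)
```

a summary like this, together with the input files of a crashing
job, gives the support staff something concrete to start from.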

you have to remember, that in general the help you get
from people can only be as good, detailed and accurate
as the information you provide them. ...and you should
also keep in mind, that most sysadmin people do not know
CPMD or what it does and what its special demands on the
machine are.

hope that helps to get you the help you need.

ciao,
   axel.

MG> 
MG> Thanks,
MG> MG.

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


