[CPMD-list] Weird behavior of specific processor optimization

Axel Kohlmeyer axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Wed Jun 22 22:28:43 CEST 2005


On Wed, 22 Jun 2005, Huiqun Zhou wrote:

HZ> Dear CPMD users,

dear Huiqun Zhou,

HZ> I recently built a serial version of CPMD 3.9.2 based on PC-IFC-P4
HZ> with tiny modifications for using Intel MKL 7.0.1, and I used dynamic
HZ> linking because the parallel version I wanted to create will run on
HZ> ROCKS cluster, which enables very easy distribution of Intel compiler
HZ> to all compute nodes. I modified the compiler option -O to -O3, and

dynamic linking may be easier, but to be able to use a shared library
you lose a general purpose register. on a register-starved architecture
like the 32-bit x86 platform (a.k.a. ia32), this can result in up to
10% loss of performance. also, just because something is easy to do
does not mean that you have to do it that way. i generally
prefer semi-static linking, i.e. link every intel-provided library
statically and only libc, libm and - if needed - libpthread dynamically.
that avoids the problem of version skew (e.g. when for some reason you
don't have identical intel compiler packages on all your nodes).
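
a minimal sketch of how one could check the result of such a
semi-static link: pipe the output of ldd through a small filter and
see whether anything besides libc, libm, libpthread and the loader is
still resolved dynamically. the function name and the executable name
cpmd.x are only assumptions for illustration.

```shell
#!/bin/sh
# read `ldd` output on stdin and print every shared library that is NOT
# in the small set we accept as dynamic (libc, libm, libpthread, the
# dynamic loader and the kernel-provided pseudo library).
unexpected_dynamic_libs() {
    awk 'NF > 0 &&
         $0 !~ /not a dynamic executable|statically linked/ &&
         $1 !~ /^(libc|libm|libpthread)\.so/ &&
         $1 !~ /ld-linux|linux-gate|linux-vdso/ { print $1 }'
}

# usage (cpmd.x is an assumed executable name):
#   ldd ./cpmd.x | unexpected_dynamic_libs
```

an empty result means that only the system libraries are left as
dynamic dependencies.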

HZ> everything went OK. I tested the build with Si64 example, the elapsed
HZ> time of run is 2'40.29'' on a P4 2.4GHz 1GB compute node. I think the
HZ> result is acceptable (please correct me if I'm wrong).

to get comparable benchmark results for wavefunction optimizations 
with the DIIS optimization you should modify the input to have 
something like

ODIIS NO_RESET=100
 10
MAXSTEP
 50

this way you always have the same amount of computation to do.
if you have resets this is not the same, since the DIIS vector
is deleted during a reset.
on top of that, the Si64 example (actually it is Si63)
seems to have a peculiar problem converging the wavefunction
with the DIIS optimizer. this shows up on several platforms
and depends on whether you are running in parallel or serially
(and also on the plane wave cutoff).

anyway, since the CP molecular dynamics is the more commonly
used feature in CPMD, you may want to take one of those examples
as reference.

HZ> Then, I changed the compiler option for specific processor
HZ> optimization, that is, added -xN option (-lsvml in LDFLAGS), and
HZ> everything went OK again. But, when I tested the build with same Si64
HZ> example, the running time became 18'31.75'', over 8 times longer than
HZ> the previous less optimized version! I noticed in the output that the
HZ> further optimized version (with -O3 -xN) reset much more times in
HZ> diagonization (DIIS) while the less optimized version reset only one
HZ> time.

as i hinted before, the si64 example seems to trigger some numerical
instabilities. when optimizing heavily, the optimizer may rearrange
code, so that there are small numerical differences, which may cause
the additional resets and thus the longer execution time. this is,
however, very atypical for 'standard' cpmd jobs. you should modify
the input file as described above and compare the timings again.
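
for comparing two runs it helps to convert the wall times to seconds
first. the following is only a sketch (the helper names are made up);
with the two wall times quoted above it gives a slowdown of roughly a
factor of 7.

```shell
#!/bin/sh
# convert a wall time written as M'SS.ss'' into seconds
to_seconds() {
    printf '%s\n' "$1" | awk -F"'" '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# ratio of two wall times (slow run first, fast run second), one decimal
slowdown() {
    awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

to_seconds "2'40.29''"               # the -O3 build:     160.29
to_seconds "18'31.75''"              # the -O3 -xN build: 1111.75
slowdown "18'31.75''" "2'40.29''"    # prints 6.9
```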

HZ> What's wrong here? I noticed that the distributed makefile templates
HZ> are all lack of further optimization, so the weird behavior may have
HZ> been a known fact. Please confirm me whether or not we can use any
HZ> complex optimization options in compiling CPMD. Please give me a

there are two reasons why the distributed templates are so
conservative: 1.) without those very specific optimizations
the executable can be compiled and run on more machines without
adapting the makefile (the manual states that in most cases, but
especially on linux machines, you _have_ to adapt the makefile).

2.) those 'heuristic' vectorizations and aggressive optimizations
seem to hurt more than they help. some time critical parts in CPMD
are already hand-optimized, i.e. written in a way that most
compilers generate the proper code with modest optimization.
if the optimizing algorithm changes too much, it
will hurt performance. also another part of the performance critical
code is offloaded to the BLAS/LAPACK libraries, so the compiler
flags don't matter much there. in the case of the SIMD instructions
it seems that switching between the SSE unit and the floating-point
unit costs more than the SIMD instructions gain.

for demonstration please try the two attached configurations
for serial compilation using ATLAS. you'll find a ready-to-use
ATLAS binary at:
http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/cpmd-linux.html#atlas

of course you can adapt them for MKL as well. the use of the
-i-static flag needs a rather recent update release of the intel
8.1 compilers (i have intel-icc8-8.1-029 intel-ifort8-8.1-025).
some time ago we made some tests and found, that the more optimized
and vectorized version was actually over 10% slower. if you have
the patience, you can try the -ip/-ipo as well, which should
cost about another 5%.
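
the two attached configurations differ only in FFLAGS. as a small
sketch (the helper name is made up), this lists the flags that are in
one FFLAGS setting but not in the other:

```shell
#!/bin/sh
# print every flag from the first list that does not occur in the second
flags_only_in_first() {
    for f in $1; do
        case " $2 " in
            *" $f "*) ;;                    # also present in the second list
            *) printf '%s\n' "$f" ;;
        esac
    done
}

# FFLAGS copied from the two attached configurations
opt='-c -r8 -w95 -O3 -pc64 -xKWN -tpp7 -unroll -cm -tune pn4 -arch pn4'
ref='-c -r8 -w95 -O2 -pc64 -tpp7 -unroll -cm -tune pn4 -arch pn4'

flags_only_in_first "$opt" "$ref"    # prints -O3 and -xKWN
flags_only_in_first "$ref" "$opt"    # prints -O2
```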

HZ> pointer on how to compile an efficient parallel version because the
HZ> parallel version I created based on the less optimized version (with
HZ> -O3, and add MPICH stuff) cost over 15 minutes on one compute node on
HZ> Si64 example, too.

how efficiently the parallel version works depends largely on
the bandwidth and latency of the interconnect between your compute
nodes. CPMD needs a fair amount of communication bandwidth,
but for really good scaling you need very low communication
latencies.
for reasonable results you need at least gigabit ethernet,
but in most cases that does not scale well beyond 4-6 nodes.
for any larger parallel job you need something better like
infiniband, myrinet, SCI, or quadrics.
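
to judge how well a given interconnect scales, one can compute the
usual speedup (one-node time divided by n-node time) and parallel
efficiency (speedup divided by the number of nodes) from measured
wall times. a minimal sketch, with a made-up helper name and no
numbers from any real benchmark:

```shell
#!/bin/sh
# speedup and parallel efficiency from measured wall times (in seconds)
#   $1 = wall time on one node
#   $2 = number of nodes
#   $3 = wall time on $2 nodes
scaling() {
    awk -v t1="$1" -v n="$2" -v tn="$3" \
        'BEGIN { s = t1 / tn; printf "speedup %.2f efficiency %.2f\n", s, s / n }'
}

# usage, with your own measured numbers in place of the variables:
#   scaling "$t_one_node" "$nodes" "$t_n_nodes"
```

an efficiency that drops well below 1 when adding nodes is the sign
that communication, not computation, has become the limiting factor.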

please see the mailing list archives for many, many
discussions of this topic. also some horribly outdated
(>2 years old) benchmark numbers and some discussion of
the results are at:
http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/cpmd-bench.html

best regards,
	axel kohlmeyer.

HZ> 
HZ> Will static version be a cure?
HZ> 
HZ> 
HZ> Please HELP.
HZ> 
HZ> 
HZ> ==========================
HZ> Huiqun Zhou, Doctor of Science
HZ> Earth Sciences
HZ> Nanjing University
HZ> 22 Hankou Road, Gulou
HZ> Nanjing, 210093
HZ> China
HZ> 
HZ> e-mail: hqzhou at nju.edu.cn
HZ> Tel.: 86(25)8368-6750
HZ> mobil: 86-13182856800
HZ> ==========================
HZ> 

-- 

=======================================================================
Dr. Axel Kohlmeyer   e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum  http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
-------------- next part --------------
#INFO#
#INFO# Configuration to build a serial cpmd executable for
#INFO# a Pentium 4 machine with the intel fortran compiler 'ifort'.
#INFO#
#INFO#  $Id: BOCHUM-P4,v 1.1 2004/06/03 11:58:13 akohlmey Exp $
#INFO#

     IRAT=2
     CFLAGS='-c -O2 -Wall'
     CPP='/lib/cpp -P -C -traditional'
     CPPFLAGS='-D__Linux -D__PGI -DLAPACK -DFFT_DEFAULT -DLINUX_IFC'
     FFLAGS='-c -r8 -w95 -O3 -pc64 -xKWN -tpp7 -unroll -cm -tune pn4 -arch pn4'
     LFLAGS='-L. -latlas_p4 -lsvml -Vaxlib'
     FFLAGS_GROMOS=' ' 
     if [ $debug ]; then
       FC='ifort -g'
       CC='gcc -g'
       LD='ifort -g'
     else
       FC='ifort '
       CC='gcc'
       LD='ifort -i-static'
     fi
 
-------------- next part --------------
#INFO#
#INFO# Configuration to build a serial cpmd executable for
#INFO# a Pentium 4 machine with the intel fortran compiler 'ifort'.
#INFO#
#INFO#  $Id: BOCHUM-P4,v 1.1 2004/06/03 11:58:13 akohlmey Exp $
#INFO#

     IRAT=2
     CFLAGS='-c -O2 -Wall'
     CPP='/lib/cpp -P -C -traditional'
     CPPFLAGS='-D__Linux -D__PGI -DLAPACK -DFFT_DEFAULT -DLINUX_IFC'
     FFLAGS='-c -r8 -w95 -O2 -pc64 -tpp7 -unroll -cm -tune pn4 -arch pn4'
     LFLAGS='-L. -latlas_p4 -lsvml -Vaxlib'
     FFLAGS_GROMOS=' ' 
     if [ $debug ]; then
       FC='ifort -g'
       CC='gcc -g'
       LD='ifort -g'
     else
       FC='ifort '
       CC='gcc'
       LD='ifort -i-static'
     fi
 

