[CPMD-list] CPMD on 64-bit Linux (openSUSE 10.3) Intel Xeon

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue Apr 8 17:36:16 CEST 2008


On Tue, 8 Apr 2008, Vladimir Stegailov wrote:

VS> Hi, Axel!

hi vladimir,

VS> 
VS> > FFLAGS= -O2 -unroll -pc64 -march=pentium3 -mtune=core2
VS> > LFLAGS= -L/opt/intel/mkl/10.0.2.018/lib/em64t -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
VS> 
VS> A couple of maybe stupid questions. Is it normal to mix options for
VS> different types of cpus ('pentium3' and 'core2' options)?

yes. this may sound strange, but it makes sense. -march defines
the instruction set, so the resulting executable will a) run on
all cpus from pentium3 onward and b) not use any SSE/SSE2 instructions.
-mtune only affects how the code is scheduled and arranged, not
which instructions are used. in all tests that i have done so far
on x86 cpus with large package programs (not small special purpose
and/or benchmark codes!) it always turned out that including automatic
SSE/SSE2 dispatch actually _slowed_ the code down. this seems to be
related to whether you have "pure" linear algebra code or also
irrational functions in there. so with this setting i get compact
and simple code that is still optimally arranged for the latest
cpu architecture.

i'm currently investigating how much the memory alignment is
affecting this (SSE requires 16-byte alignment, but CPMD 
currently has only 8-byte alignments except for BG/L).
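
to illustrate the alignment point, here is a minimal python sketch. it is not taken from CPMD: the addresses are made up and the helper names are hypothetical. it only shows that data guaranteed to be 8-byte aligned sits on a 16-byte boundary for only every other possible address.

```python
def is_aligned(address: int, boundary: int) -> bool:
    """Return True if address is a multiple of boundary (both in bytes)."""
    return address % boundary == 0


def padding_needed(address: int, boundary: int) -> int:
    """Number of bytes to skip so that address reaches the next boundary."""
    return (-address) % boundary


# made-up start addresses of arrays that are all 8-byte aligned
addresses = (0x1000, 0x1008, 0x1010, 0x1018)

for addr in addresses:
    print(hex(addr),
          "8-byte:", is_aligned(addr, 8),
          "16-byte:", is_aligned(addr, 16),
          "padding to 16:", padding_needed(addr, 16))
```

every address in the list passes the 8-byte check, but only every second one passes the 16-byte check. the others would need 8 bytes of padding.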

VS> Is 'core2' relevant to Xeons? Not only to Core2 cpus?

yes. newer core2 cpus are also "rebranded" as xeon:

[akohlmey@vitriol ~]$ cat /proc/cpuinfo | grep model\ name
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz

for pentium4 derived xeons (and pentium4), of course you want
to change this to -mtune=pentium4. actually, the difference between
pentium3, core/core2, and athlon/opteron in how to optimally arrange
executable code is rather small (apart from the instruction set support)
compared to pentium4 (which is a quite different cpu and is what the
intel compilers optimize for by default). on pentium4, having the proper
code placement and alignment is even more important since it has
a much longer pipeline. i also found that for pentium4 type cpus
the penalty for SSE is even higher (up to 20% for early generation
models), so it is worth testing the impact of those features.

of course by using SIMD enhanced FFT and BLAS/LAPACK libraries i'm
still using those instruction sets where they _do_ provide a benefit.

VS> > of the FFT different implementations, but the fact that FFTW
VS> > supports more grids, so you may find a smaller grid that
VS> > fits your problem and thus reduces the amount of data to be
VS> > processed (and the memory needed).
VS> 
VS> Is it a general recommendation for any platform?
VS> Can it improve the scalability of the code (since less amount of data should
VS> be communicated)?

that depends a bit on the cpu and on the resulting grid. it seems that
for power-of-two grids most vendor tuned FFTs can outrun FFTW2, but
if your system needs a grid that is slightly larger, and the
vendor FFT (or FFT_DEFAULT) needs to use a much larger grid than FFTW
does, you'll get better times with FFTW.
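
a rough python sketch of why the supported grid sizes matter. the assumptions are mine and only for illustration: one hypothetical library accepts only power-of-two sizes, the other accepts sizes built from the factors 2, 3 and 5, and the minimum grid of 130 points per dimension is made up. the real sets of supported sizes depend on the FFT library and its version.

```python
def is_supported(n: int, radices: tuple) -> bool:
    """True if n factors completely into the given radices."""
    for r in radices:
        while n % r == 0:
            n //= r
    return n == 1


def next_supported(n: int, radices: tuple) -> int:
    """Smallest size >= n that factors completely into the given radices."""
    size = n
    while not is_supported(size, radices):
        size += 1
    return size


needed = 130  # made-up minimum number of grid points per dimension

restricted = next_supported(needed, (2,))       # power-of-two only: 256
flexible = next_supported(needed, (2, 3, 5))    # 135 = 3^3 * 5

# data volume of a 3d grid grows with the cube of the grid size
ratio = (restricted / flexible) ** 3

print("restricted grid:", restricted)
print("flexible grid:  ", flexible)
print("3d data volume ratio: %.1f" % ratio)
```

in this made-up case the restricted library has to process almost seven times as much data, which a faster per-point transform can hardly make up for.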

formally, having a larger grid _increases_ the "scalability", since
you have more data to distribute. but it _decreases_ the efficiency,
since you need more cpus to do the same job. the same goes for
slower/faster cpus: a faster cpu decreases scalability.
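
a toy model of this in python. the model is my assumption and not a measurement: the compute time is work / (speed * cpus), and every parallel run pays one fixed communication overhead. all numbers are made up.

```python
def run_time(work, cpus, speed=1.0, comm=1.0):
    """Wall time: perfectly divided compute plus a fixed parallel overhead."""
    compute = work / (speed * cpus)
    overhead = comm if cpus > 1 else 0.0
    return compute + overhead


def parallel_efficiency(work, cpus, speed=1.0, comm=1.0):
    """Speedup relative to one cpu, divided by the number of cpus."""
    speedup = run_time(work, 1, speed, comm) / run_time(work, cpus, speed, comm)
    return speedup / cpus


cpus = 10
base = parallel_efficiency(work=100.0, cpus=cpus)
larger_grid = parallel_efficiency(work=200.0, cpus=cpus)
faster_cpu = parallel_efficiency(work=100.0, cpus=cpus, speed=2.0)

print("efficiency, base case:   %.3f" % base)
print("efficiency, larger grid: %.3f" % larger_grid)
print("efficiency, faster cpu:  %.3f" % faster_cpu)
print("wall time, base case:    %.1f" % run_time(100.0, cpus))
print("wall time, larger grid:  %.1f" % run_time(200.0, cpus))
```

in this model the larger grid "scales" better and the faster cpu "scales" worse, but the larger grid still takes longer on the same number of cpus. that is why a scaling number on its own says little.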

as you can see, the world of benchmarking and performance comparisons
is a bit of a mine-field where numbers can be easily misleading and
you always have to verify whether certain claims apply to your case.

e.g. a common claim is that on opteron cpus you _have_ to use the PGI
compilers, since intel compilers do not optimize for opteron. while
that is a correct statement, we also have to consider that we are
dealing with a code where the generic pentium3 optimization seems to
give the best performance (provided you use vendor tuned libraries
for BLAS/LAPACK). so what matters is not how good the "relative"
optimization for opteron is compared to the generic case, but how well
the compiler can optimize code in general. this is where the intel
compilers beat the PGI compilers for CPMD on Intel _and_ on AMD cpus
(not to mention the many, many compiler bugs and miscompilations of PGI
at high optimization levels. ok, intel is not that reliable at the
highest optimization level either, but luckily that doesn't matter since
that code executes slower than the less optimized code.).

cheers,
    axel.


VS> 
VS> best regards,
VS> Vladimir
VS> 
VS> 
VS> >
-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.

