[CPMD-list] CPMD on 64-bit Linux (openSUSE 10.3) Intel Xeon

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue Apr 8 19:29:17 CET 2008


On Tue, 8 Apr 2008, Vladimir Stegailov wrote:

VS> Thanks a lot for the explanation!

you're welcome.

VS> One more question about performance. MKL offers its own FFT.
VS> Is it useful on Xeons in comparison with native FFT and FFTW?

CPMD does not support the native MKL fft directly, but intel
provides an FFTW2 compatible wrapper and i've tried that one
and it works fine with CPMD without any change to the code.
initially, it looked as if it was faster, but that was only
because of the version 10 MKL defaulting to use multi-threading
across all cpu cores if OMP_NUM_THREADS is not set. when linking
with the serial MKL, it was significantly slower than FFTW.
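to keep such a comparison fair, the thread count can be pinned explicitly before timing anything. a minimal sketch, assuming a python driver script: pinned_env and run_pinned are hypothetical helper names, and the child process is only a stand-in for a real benchmark binary. OMP_NUM_THREADS is the variable mentioned above; MKL_NUM_THREADS is MKL's own variable for the same purpose.

```python
import os
import subprocess
import sys


def pinned_env(n_threads, base_env=None):
    """Return a copy of the environment with the thread count set explicitly."""
    env = dict(os.environ if base_env is None else base_env)
    # OMP_NUM_THREADS is read by OpenMP runtimes;
    # MKL_NUM_THREADS is MKL's own variable for the same purpose.
    env["OMP_NUM_THREADS"] = str(n_threads)
    env["MKL_NUM_THREADS"] = str(n_threads)
    return env


def run_pinned(cmd, n_threads):
    """Run cmd (a list of strings) with a fixed thread count, return its stdout."""
    result = subprocess.run(cmd, env=pinned_env(n_threads),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # stand-in for a real benchmark binary: a child process that just
    # reports the thread count it was started with.
    child = [sys.executable, "-c",
             "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(run_pinned(child, 1).strip())
```

running the same benchmark once with one thread and once with all cores makes it visible whether a speedup comes from the library itself or only from threading.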

that being said, if you compile an OpenMP or hybrid MPI/OpenMP
parallel executable, the multi-threading is desired, and then the
OpenMP support in MKL is more efficient than the one in FFTW2 and
thus the total performance is better with the fftw-wrapped MKL fft.

the multi-threading support in fftw3 is on par with MKL,
so with CPMD supporting fftw3, i'd expect that it would
outrun the MKL fft as well.

cheers,
   axel.

VS> 
VS> best,
VS> vladimir
VS> 
VS> ----- Original Message ----- From: "Axel Kohlmeyer"
VS> <akohlmey at cmm.chem.upenn.edu>
VS> To: "Vladimir Stegailov" <stegailov at ihed.ras.ru>
VS> Cc: <cpmd-list at cpmd.org>
VS> Sent: Tuesday, April 08, 2008 7:36 PM
VS> Subject: Re: [CPMD-list] CPMD on 64-bit Linux (openSUSE 10.3) Intel Xeon
VS> 
VS> 
VS> > On Tue, 8 Apr 2008, Vladimir Stegailov wrote:
VS> >
VS> > VS> Hi, Axel!
VS> >
VS> > hi vladimir,
VS> >
VS> > VS>
VS> > VS> > FFLAGS= -O2 -unroll -pc64 -march=pentium3 -mtune=core2
VS> > VS> > LFLAGS= -L/opt/intel/mkl/10.0.2.018/lib/em64t -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
VS> > VS>
VS> > VS> A couple of maybe stupid questions. Is it usually normal to mix
VS> > VS> options for different types of cpus ('pentium3' and 'core2' options)?
VS> >
VS> > yes. this may sound strange, but it makes sense. -march defines
VS> > the instruction set, so the resulting executable will a) run on
VS> > all cpus from pentium3 onward and b) not use any SSE/SSE2 instructions.
VS> > with all tests that i did so far on x86 cpus on large package
VS> > programs (not small special purpose and/or benchmark codes!) it
VS> > always turned out that including automatic SSE/SSE2 dispatch
VS> > actually _slowed_ the code down. this seems to be related to whether
VS> > you have "pure" linear algebra code or also irrational functions
VS> > in there. so with this setting i get a compact and simple code
VS> > that is still optimally arranged for the latest cpu architecture.
VS> >
VS> > i'm currently investigating how much the memory alignment is
VS> > affecting this (SSE requires 16-byte alignment, but CPMD
VS> > currently has only 8-byte alignments except for BG/L).
VS> >
VS> > VS> Is 'core2' relevant to Xeons? Not only to Core2 cpus?
VS> >
VS> > yes. newer core2 cpus are also "rebranded" as xeon:
VS> >
VS> > [akohlmey at vitriol ~]$ cat /proc/cpuinfo | grep model\ name
VS> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
VS> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
VS> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
VS> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
VS> >
VS> > for pentium4 derived xeons (and pentium4), of course you want
VS> > to change this to -mtune=pentium4. actually, the difference between
VS> > pentium3,core/core2,and athlon/opteron in how to optimally arrange
VS> > executable code is rather small (apart from the instruction set support)
VS> > compared to pentium4 (which is a quite different CPU and what the
VS> > intel compilers optimize by default for). there having the proper
VS> > code placements and alignments is even more important since you have
VS> > a much longer pipeline. i also found that for pentium4 type cpus
VS> > the penalty for SSE is even higher (up to 20% for early generation
VS> > models), so it is worth testing the impact of those features.
VS> >
VS> > of course by using SIMD enhanced FFT and BLAS/LAPACK libraries i'm
VS> > still using those instruction sets where they _do_ provide a benefit.
VS> >
VS> > VS> > of the FFT different implementations, but the fact that FFTW
VS> > VS> > supports more grids, so you may find a smaller grid that
VS> > VS> > fits your problem and thus reduces the amount of data to be
VS> > VS> > processed (and the memory needed).
VS> > VS>
VS> > VS> Is it a general recommendation for any platform?
VS> > VS> Can it improve the scalability of the code (since less data
VS> > VS> should be communicated)?
VS> >
VS> > that depends a bit on the cpu and on the resulting grid. it seems that
VS> > for power-of-two grids most vendor tuned FFTs can outrun FFTW2, but
VS> > if your system needs a grid that is slightly larger and where the
VS> > vendor FFT (or FFT_DEFAULT) needs to use a much larger grid than FFTW,
VS> > you'll get better times with FFTW.
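the effect of the supported grid sizes can be sketched with a few lines of python. this is only an illustration under stated assumptions: the radix set (2, 3, 5) is a guess at what a mixed-radix FFT accepts, the minimum of 130 points is made up, and next_grid is a hypothetical helper, not a CPMD routine. the real list of allowed sizes depends on the FFT library and on how CPMD was built.

```python
def is_smooth(n, radices):
    """True if n factors completely into the given small primes."""
    for p in radices:
        while n % p == 0:
            n //= p
    return n == 1


def next_grid(n_min, radices):
    """Smallest grid size >= n_min that is a product of the given radices."""
    n = max(n_min, 1)
    while not is_smooth(n, radices):
        n += 1
    return n


if __name__ == "__main__":
    n_min = 130                          # hypothetical minimum points per dimension
    pow2 = next_grid(n_min, (2,))        # power-of-two only: 256
    mixed = next_grid(n_min, (2, 3, 5))  # assumed mixed-radix set: 135
    print(pow2, mixed)
    # ratio of total points on a cubic 3d grid
    print((pow2 / mixed) ** 3)
```

with these made-up numbers the power-of-two grid holds several times more points than the mixed-radix one, which can outweigh a faster per-point transform.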
VS> >
VS> > formally having a larger grid _increases_ the "scalability", since
VS> > you have more data to distribute. but it _decreases_ the efficiency
VS> > since you need more cpus to do the same job (the same goes for
VS> > slower/faster cpus. a faster cpu decreases scalability).
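a toy model makes the distinction concrete. this is a sketch, not a measurement: it assumes perfectly divisible work plus a fixed per-step overhead, and all numbers and function names are hypothetical.

```python
def step_time(work, n_cpus, overhead):
    """Toy model: perfectly divisible work plus a fixed per-step overhead."""
    return work / n_cpus + overhead


def speedup(work, n_cpus, overhead):
    """Speedup relative to a single cpu within the same toy model."""
    return step_time(work, 1, overhead) / step_time(work, n_cpus, overhead)


if __name__ == "__main__":
    overhead = 1.0               # hypothetical fixed cost per step
    for work in (100.0, 200.0):  # hypothetical smaller and larger grid
        for n in (8, 64):
            t = step_time(work, n, overhead)
            s = speedup(work, n, overhead)
            print(f"work={work:6.1f} cpus={n:3d} time={t:7.3f} speedup={s:6.2f}")
```

in this model the larger problem shows the better speedup at every cpu count, yet it still takes longer than the smaller problem on the same number of cpus. a good scaling curve alone says nothing about time to solution.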
VS> >
VS> > as you can see, the world of benchmarking and performance comparisons
VS> > is a bit of a mine-field where numbers can be easily misleading and
VS> > you always have to verify whether certain claims apply to your case.
VS> >
VS> > e.g. a common claim is that on opteron cpus you _have_ to use the PGI
VS> > compilers, since intel compilers do not optimize for opteron. while
VS> > that is a correct statement we also have to consider that we are
VS> > dealing with a code where the generic pentium3 optimization seems to
VS> > give the best performance (provided you use vendor tuned libraries
VS> > for BLAS/LAPACK). so it doesn't matter how well the "relative"
VS> > optimization for opteron is compared to the generic case, but how
VS> > well the compiler can optimize code in general, and this is where
VS> > the intel compilers
VS> > beat the PGI compilers for CPMD on Intel _and_ on AMD cpus (not to
VS> > mention the many, many compiler bugs and miscompilations of PGI
VS> > at high optimization levels. ok, intel is not that reliable at the
VS> > highest optimization level either, but luckily that doesn't matter since
VS> > that code executes slower than the less optimized one.).
VS> >
VS> > cheers,
VS> >    axel.
VS> >
VS> >
VS> > VS>
VS> > VS> best regards,
VS> > VS> Vladimir

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.

