[CPMD-list] CPMD on 64-bit Linux (openSUSE 10.3) Intel Xeon

Vladimir Stegailov stegailov at ihed.ras.ru
Tue Apr 8 20:04:20 CEST 2008


Thanks a lot for the explanation!

One more question about performance. MKL offers its own FFT.
Is it useful on Xeons in comparison with the native FFT and FFTW?

best,
vladimir

----- Original Message ----- 
From: "Axel Kohlmeyer" <akohlmey at cmm.chem.upenn.edu>
To: "Vladimir Stegailov" <stegailov at ihed.ras.ru>
Cc: <cpmd-list at cpmd.org>
Sent: Tuesday, April 08, 2008 7:36 PM
Subject: Re: [CPMD-list] CPMD on 64-bit Linux (openSUSE 10.3) Intel Xeon


> On Tue, 8 Apr 2008, Vladimir Stegailov wrote:
>
> VS> Hi, Axel!
>
> hi vladimir,
>
> VS>
> VS> > FFLAGS= -O2 -unroll -pc64 -march=pentium3 -mtune=core2
> VS> > LFLAGS= -L/opt/intel/mkl/10.0.2.018/lib/em64t -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
> VS>
> VS> A couple of maybe stupid questions. Is it normal to mix options for
> VS> different types of cpus (the 'pentium3' and 'core2' options)?
>
> yes. this may sound strange, but it makes sense. -march defines
> the instruction set, so the resulting executable will a) run on
> all cpus from pentium3 onward and b) not use any SSE/SSE2 instructions.
> with all tests that i did so far on x86 cpus on large package
> programs (not small special purpose and/or benchmark codes!) it
> always turned out that including automatic SSE/SSE2 dispatch
> actually _slowed_ the code down. this seems to be related to whether
> you have "pure" linear algebra code or also irrational functions
> in there. so with this setting i get a compact and simple code
> that is still optimally arranged for the latest cpu architecture.
>
> i'm currently investigating how much the memory alignment is
> affecting this (SSE requires 16-byte alignment, but CPMD
> currently has only 8-byte alignments except for BG/L).
>
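
To make the alignment point concrete, here is a minimal Python sketch of what "16-byte aligned" versus "8-byte aligned" means for an address. It is only an illustration: the helper names (`is_aligned`, `padding_to`, `aligned_view`) are made up for this example and have nothing to do with how CPMD itself allocates memory.

```python
import ctypes

def is_aligned(address, boundary):
    """True if `address` is a multiple of `boundary` (in bytes)."""
    return address % boundary == 0

def padding_to(address, boundary):
    """Bytes to skip from `address` to reach the next multiple of `boundary`."""
    return (-address) % boundary

def aligned_view(nbytes, boundary=16):
    """Over-allocate a raw buffer and return (buffer, offset), where
    buffer address + offset is aligned to `boundary` and at least
    `nbytes` bytes remain usable after the offset."""
    raw = ctypes.create_string_buffer(nbytes + boundary)
    offset = padding_to(ctypes.addressof(raw), boundary)
    return raw, offset
```

An address such as 24 is fine for 8-byte alignment but not for 16-byte alignment; `padding_to(24, 16)` gives the 8 bytes that would have to be skipped to reach the next 16-byte boundary.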
> VS> Is 'core2' relevant to Xeons? Not only to Core2 cpus?
>
> yes. newer core2 cpus are also "rebranded" as xeon:
>
> [akohlmey at vitriol ~]$ cat /proc/cpuinfo | grep model\ name
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
>
> for pentium4 derived xeons (and pentium4), of course you want
> to change this to -mtune=pentium4. actually, the difference between
> pentium3,core/core2,and athlon/opteron in how to optimally arrange
> executable code is rather small (apart from the instruction set support)
> compared to pentium4 (which is a quite different CPU and is what the
> intel compilers optimize for by default). there, having the proper
> code placement and alignment is even more important since you have
> a much longer pipeline. i also found that for pentium4 type cpus
> the penalty for SSE is even higher (up to 20% for early generation
> models), so it is worth testing the impact of those features.
>
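
One simple way to test the impact of such flags is to build the code twice and time both executables on the same input. The following is a rough Python sketch of such a comparison, not a proper benchmarking tool; the function names and any executable or input file names you pass in are placeholders of this example.

```python
import subprocess
import time

def best_wall_time(cmd, repeats=3):
    """Run `cmd` (a list of arguments) `repeats` times and return the
    smallest wall-clock time in seconds. Taking the minimum reduces
    noise from other activity on the machine."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best

def compare_builds(builds, args, repeats=3):
    """`builds` maps a label to an executable path; every executable is
    run with the same argument list `args`. Returns {label: seconds}."""
    return {label: best_wall_time([exe] + list(args), repeats)
            for label, exe in builds.items()}
```

For example, `compare_builds({"no-sse": "./cpmd-p3.x", "sse2": "./cpmd-sse2.x"}, ["test.inp"])` would time two hypothetical builds on the same hypothetical input. Use an input that resembles your production runs, since small benchmark cases can behave quite differently.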
> of course by using SIMD enhanced FFT and BLAS/LAPACK libraries i'm
> still using those instruction sets where they _do_ provide a benefit.
>
> VS> > of the FFT different implementations, but the fact that FFTW
> VS> > supports more grids, so you may find a smaller grid that
> VS> > fits your problem and thus reduces the amount of data to be
> VS> > processed (and the memory needed).
> VS>
> VS> Is it a general recommendation for any platform?
> VS> Can it improve the scalability of the code (since less data needs to
> VS> be communicated)?
>
> that depends a bit on the cpu and on the resulting grid. it seems that
> for power-of-two grids most vendor tuned FFTs can outrun FFTW2, but
> if your system needs a grid that is slightly larger than that, and the
> vendor FFT (or FFT_DEFAULT) has to use a much larger grid than FFTW,
> you'll get better times with FFTW.
>
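
The grid-size argument can be illustrated with a small Python sketch. It assumes, purely for illustration, one FFT that only accepts power-of-two lengths and another that accepts any length whose prime factors are 2, 3 or 5. Which lengths a particular library (or the built-in FFT in CPMD) really supports is not taken from this thread, so check its documentation.

```python
def is_smooth(n, primes):
    """True if n has no prime factors other than those in `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_grid(n_min, primes):
    """Smallest grid length >= n_min whose prime factors all lie in `primes`."""
    n = max(n_min, 1)
    while not is_smooth(n, primes):
        n += 1
    return n

def grid_volume_ratio(n_min, restricted, flexible):
    """Grid lengths chosen by a restricted and a more flexible FFT, plus
    the ratio of grid points for a cubic 3d grid (length ratio cubed)."""
    a = next_grid(n_min, restricted)
    b = next_grid(n_min, flexible)
    return a, b, (a / b) ** 3
```

With a minimum of 130 points per dimension, `next_grid(130, (2,))` gives 256 while `next_grid(130, (2, 3, 5))` gives 135, so under these assumptions the restricted FFT has to process several times more grid points in 3d, which can easily outweigh a faster transform per point.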
> formally having a larger grid _increases_ the "scalability", since
> you have more data to distribute. but it _decreases_ the efficiency
> since you need more cpus to do the same job (the same goes for
> slower/faster cpus. a faster cpu decreases scalability).
>
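
The difference between scalability and efficiency can be seen in a toy model. The Python sketch below assumes that the compute part divides perfectly over the cpus and that a parallel run pays a fixed overhead (communication etc.). The model, its parameters and the numbers are assumptions of this example and are not measurements from CPMD.

```python
def wall_time(work, ncpu, rate=1.0, overhead=1.0):
    """Toy model: compute time work / (rate * ncpu), plus a fixed
    overhead that is only paid when running on more than one cpu."""
    compute = work / (rate * ncpu)
    return compute + (overhead if ncpu > 1 else 0.0)

def speedup(work, ncpu, rate=1.0, overhead=1.0):
    """Serial wall time divided by parallel wall time."""
    return wall_time(work, 1, rate, overhead) / wall_time(work, ncpu, rate, overhead)

def parallel_efficiency(work, ncpu, rate=1.0, overhead=1.0):
    """Speedup per cpu; 1.0 would be ideal scaling."""
    return speedup(work, ncpu, rate, overhead) / ncpu
```

In this model, doubling `work` (a larger grid) raises the speedup on 16 cpus, yet the job still takes longer than the smaller one on the same 16 cpus. Doubling `rate` (a faster cpu) lowers the speedup, because the fixed overhead becomes a larger share of the run time, even though the job finishes sooner.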
> as you can see, the world of benchmarking and performance comparisons
> is a bit of a mine-field where numbers can be easily misleading and
> you always have to verify whether certain claims apply to your case.
>
> e.g. a common claim is that on opteron cpus you _have_ to use the PGI
> compilers, since intel compilers do not optimize for opteron. while
> that is a correct statement we also have to consider that we are
> dealing with a code where the generic pentium3 optimization seems to
> give the best performance (provided you use vendor tuned libraries
> for BLAS/LAPACK). so what matters is not how good the "relative"
> optimization for opteron is compared to the generic case, but how well
> the compiler can optimize code in general, and this is where the intel compilers
> beat the PGI compilers for CPMD on Intel _and_ on AMD cpus (not to
> mention the many, many compiler bugs and miscompilations of PGI
> at high optimization levels. ok. intel is not that reliable at the
> highest optimization level, too. but luckily that doesn't matter since
> that code executes slower than the less optimized code.).
>
> cheers,
>    axel.
>
>
> VS>
> VS> best regards,
> VS> Vladimir
> VS>
> VS>
> VS> >
> -- 
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>   Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
> 


