[CPMD-list] NCACHE size

Philippe Blaise philippe.blaise at cea.fr
Fri Aug 27 09:12:31 CEST 2004


Axel Kohlmeyer wrote:

>On Thu, 26 Aug 2004, Philippe Blaise wrote:
>
>PB> Hello,
>
>hi!
>
>PB> 
>PB> just to say that a value of ncache = 4 * 1024 is optimal on my opteron 
>PB> machine,
>PB> but the difference between the default value 2 * 1024 and 4 * 1024 is 
>PB> below 10% in time.
>
>well, even 5% saving means a lot, if you need to run your job for 
>weeks or months.
>
>PB> More generally, it seems to me that when you play with the ncache value, 
>PB> you don't always obtain
>PB> huge differences; that is probably due to compiler optimizations and the 
>PB> large caches of today's cpus.
>PB> So, was the effect of the ncache value larger some years ago?
>PB> Anyway, the fft that comes with cpmd is very good, and I would be 
>PB> surprised if you
>PB> obtained better performance with another one. For example, on a NEC 
>PB> vector machine and
>PB> on an alpha superscalar machine, the vendor implementations are not better.
>
>that depends on the architecture. if your machine has little memory
>bandwidth and few registers (like in the case of pentium-IV/Xeon/athlon)
>clever use of prefetch instructions and exploiting the SIMD units
>can be a large gain. e.g. on an athlon XP machine i found a 2.5x cpmd
>speedup using ATLAS over a (fully optimized) BLAS. compared to a
>generic ATLAS (not using any cpu specific instructions) there was still
>a 10% speedup (so this is mainly the effect of prefetching, since the
>SIMD unit on an athlon XP does not support double precision).
>
>  
>

I made some tests on my opteron machine, and cpmd linked to atlas is 
approx. 1.5 times more efficient than
cpmd linked to acml. So that's right: with cpmd the first thing to do is 
to find an optimized blas/lapack lib
for your cpu, and only after that should you play with the ncache value.
Many thanks to the guys who made atlas (even if the mkl is more 
efficient on the itanium2).
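
To judge whether an ncache change is worth keeping, it helps to put the 
timings side by side as relative differences. The lines below are only a 
minimal sketch: the helper name relative_slowdowns is made up for 
illustration, and the sample numbers are the wat32 timings Igor quotes 
further down for Itanium2, assuming they map in order to N = 8, 16, 32.

```python
def relative_slowdowns(timings):
    """Map each NCACHE factor N to its slowdown in percent versus the fastest run."""
    best = min(timings.values())
    return {n: 100.0 * (t - best) / best for n, t in timings.items()}


# wat32 wall times in seconds, keyed by N in NCACHE = 1024*N.
# Assumption: the three quoted times correspond, in order, to N = 8, 16, 32.
wat32_seconds = {8: 1372.0, 16: 1363.0, 32: 1388.0}

slowdowns = relative_slowdowns(wat32_seconds)
best_n = min(wat32_seconds, key=wat32_seconds.get)

print(f"fastest: NCACHE = 1024*{best_n}")
for n in sorted(slowdowns):
    print(f"NCACHE = 1024*{n:<3d} {wat32_seconds[n]:7.0f} s  +{slowdowns[n]:.2f}%")
```

With these numbers the spread stays below 2%, and each value was timed 
only once, so repeated runs would be needed before calling any of them 
a real gain.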

Au revoir, 
 Philippe.

>salut,
>	axel kohlmeyer.
>
>PB> Good luck,
>PB> 
>PB>   Philippe Blaise
>PB>  
>PB> 
>PB> Axel Kohlmeyer wrote:
>PB> 
>PB> >>>>"IK" == I Kozin <Kozin> writes:
>PB> >>>>        
>PB> >>>>
>PB> >
>PB> >hello,
>PB> >
>PB> >IK> Hello,
>PB> >IK> I'd appreciate any comments on varying the NCACHE size 
>PB> >IK> (as given in mltfft.F) on various platforms.
>PB> >IK> Particularly Xeon, Itanium2, Opteron, Power4,5, NEC sx6.
>PB> >
>PB> >IK> In the file above NCACHE is given as 1024*N where N varies.
>PB> >IK> So I'd guess it has to do with L1 cache.
>PB> >IK> For any i386 NCACHE=1024*10.
>PB> >IK> Xeon's L1 cache is 8 KB. So should N be 8?
>PB> >
>PB> >well. i don't know whether there was historically an explicit 
>PB> >relation to the L1 cache. but due to the way modern cpus and compilers 
>PB> >work, (e.g. by automatic prefetching) there is none anymore.
>PB> >for instance the i386 value came about by running a series of
>PB> >calculations with different values for NCACHE and picking the
>PB> >value that gave the best performance on average. since the various
>PB> >x86 platforms have quite different characteristics internally,
>PB> >this always has to be a compromise. if you want to squeeze out
>PB> >the last bit of performance, you may want to tune it for your
>PB> >specific machine and your specific example. 
>PB> >
>PB> >btw: the best way to optimize the fft is to use it as little as 
>PB> >possible (e.g. by using the REAL SPACE WFN KEEP keyword, provided
>PB> >there is enough memory on your machine).
>PB> >
>PB> >IK> Itanium2: NCACHE=1024*8 but L1 cache size is 16 KB.
>PB> >IK> This is the only machine I've experimented with so far.
>PB> >IK> Taking N = 8, 16, 32 I found that N = 16 is marginally quicker
>PB> >IK> (wat32 benchmark, run 1: 1372 s, 1363 s, 1388 s).
>PB> >IK> So basically no need to bother.
>PB> >
>PB> >you should also try 'in-between' values.
>PB> >
>PB> >IK> More interesting cases:
>PB> >IK> Itanium2 + HPUX: NCACHE=1024*64 
>PB> >IK> (obviously the L1 cache is the same as above)
>PB> >
>PB> >IK> Opteron: NCACHE=1024*2 (default) but L1 cache is 64 KB.
>PB> >
>PB> >this has not been tuned yet. if you find an optimal value,
>PB> >please let us know.
>PB> >
>PB> >IK> BTW, are there any efforts in attempting to use ACML or MKL
>PB> >IK> on AMD and Intel respectively for FFT instead of default FFT?
>PB> >
>PB> >not that i know of. feel free to try. the default fft is quite
>PB> >competitive, considering that it is portable fortran code.
>PB> >IMO it would be more interesting to see, how well a port 
>PB> >to fftw3 performs.
>PB> >
>PB> >regards,
>PB> >        axel kohlmeyer.
>PB> >
>PB> >IK> Thanks,
>PB> >
>PB> >IK> Igor Kozin
>PB> >IK> Computational Science & Engineering Dept.
>PB> >IK> CCLRC Daresbury Laboratory
>PB> >IK> Keckwick Lane
>PB> >IK> Warrington
>PB> >IK> WA4 4AD
>PB> >IK> UK
>PB> >
>PB> >IK> i. kozin at dl.ac.uk
>PB> >IK> +44 (0) 1925 603308
>PB> >IK> http://www.cse.clrc.ac.uk/disco
>PB> >IK> _______________________________________________
>PB> >IK> CPMD-list mailing list
>PB> >IK> CPMD-list at cpmd.org
>PB> >IK> http://cpmd.org/mailman/listinfo/cpmd-list
>PB> >
>PB> >
>PB> >
>PB> >--
>PB> >
>PB> >=======================================================================
>PB> >Axel Kohlmeyer       e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
>PB> >Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
>PB> >Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
>PB> >D-44780 Bochum  http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/
>PB> >=======================================================================
>PB> >If you make something idiot-proof, the universe creates a better idiot.
>PB> >
>PB> >  
>PB> >
>PB> 
>PB> 
>PB> 
>
>  
>



