Tuning FFTW For Win32 Compilers
Jonathan Hardwick, September 1998
As part of an ongoing project, I've been investigating
the performance of Fast Fourier Transforms on Win32/x86 systems. This document
has two main purposes:
- Describe the stack-alignment hacks necessary
to get good performance out of 64-bit floating-point code on the x86 architecture
using the gcc, Visual C, and Intel C compilers.
- Present a comprehensive set of x86 FFT benchmarks.
The Problem
The FFT implementation of choice is FFTW
("the Fastest Fourier Transform in the West"),
from Matteo Frigo and Steven G. Johnson at MIT. As well as being general-purpose,
portable, well-documented, parallelised, and freely-available, it has also been
shown to be faster than existing implementations.
However, on the x86 architecture there are performance problems. To quote FFTW's
README.hacks file:
Pentium-type processors impose a huge performance
penalty if double-precision values are not aligned to 8-byte boundaries in
memory. (We have seen factors of 3 or more in tight loops.) Unfortunately,
the Intel ABI specifies that variables on the stack need only be aligned to
4-byte boundaries. Even more unfortunately, this convention is followed by
Linux/x86 and gcc.
Thus, a given function invocation (stack frame)
has a 50-50 chance of having all of its double-precision local variables misaligned
at run-time. FFTW supplies a macro to spot this: here's a slightly extended
version, which can be called at the end of a list of 64-bit variable declarations
to check their alignment.
/* Needs <stdio.h> (printf, fflush) and <stdlib.h> (exit). */
#define ASSERT_ALIGNED_DOUBLE() { \
    double __foo; \
    if ((((long) &__foo) & 0x7)) { \
        printf ("Unaligned at line %d in file %s\n", __LINE__, __FILE__); \
        fflush (stdout); \
        exit(1); \
    } \
}
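As an illustration of what the macro tests, here is a minimal sketch of the same check written as ordinary functions. The names `addr_is_aligned_8` and `report_double_alignment` are hypothetical and not part of FFTW, and the sketch uses C99's `uintptr_t` rather than the `long` cast used above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of FFTW): nonzero if addr is a multiple
   of 8, i.e. a suitable address for a 64-bit double. */
static int addr_is_aligned_8(uintptr_t addr)
{
    return (addr & 0x7u) == 0;
}

/* Print and return the alignment status of a double, for example a local
   variable of the calling function. */
static int report_double_alignment(const double *p, const char *name)
{
    int ok;
    assert(p != NULL);
    ok = addr_is_aligned_8((uintptr_t) p);
    printf("%s at %p is %s\n", name, (const void *) p,
           ok ? "8-byte aligned" : "misaligned");
    return ok;
}
```

Unlike the macro, this version reports the result instead of exiting, which may be more convenient when surveying many functions at once.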
Compiling the "test_threads" and "time_threads"
executables of FFTW with this macro enabled reveals that gcc 2.7.2, Visual
C++ 5.0, and the Intel C++ compiler v2.4 all suffer from the 64-bit-misalignment
problem, with the following caveats:
- Although the stock gcc has the misalignment
problem, experimental versions such as pgcc provide new compiler flags:
-mstack-align-double
This switch tries to align doubles on the stack (i.e. auto variables in C)
to an 8 byte boundary.
- Visual C++ generates code that automatically
performs run-time stack alignment if a function contains a "sufficient" number
of accesses to double variables. Unfortunately, my experiments indicate that
FFTW fails to trigger this heuristic. (Update, December 1998: Visual
C++ 6.0 has fixed the problem, and therefore doesn't need the hacks described
below.)
- The Intel C++ compiler will generate 64-bit-aligned
code for FFTW if given the -Qipo (interprocedural optimization) flag,
even though this is not documented as an effect of the flag. However, using
-Qipo requires access to all of the source code (i.e. no separate compilation),
and a lot more memory and time to complete the compile, so it might not always
be a feasible solution.
The Hacks
Since the compilers cannot guarantee alignment,
the code must adjust stack alignment as necessary at run-time. FFTW includes
two gcc macros to do this. They are invoked immediately before calling a computationally
expensive function, and ensure that 64-bit local variables defined at the top level
of that function are properly aligned. The choice of macro depends on the total
size of the arguments being passed to the function.
#define HACK_ALIGN_STACK_EVEN() { \
if ((((long) (__builtin_alloca(0))) & 0x7)) __builtin_alloca(4); \
}
#define HACK_ALIGN_STACK_ODD() { \
if (!(((long) (__builtin_alloca(0))) & 0x7)) __builtin_alloca(4); \
}
These macros are trivially portable to Visual
C, where the equivalent of __builtin_alloca is _alloca (declared in <malloc.h>):
#define HACK_ALIGN_STACK_EVEN() { \
if ((((long) (_alloca(0))) & 0x7)) _alloca(4); \
}
#define HACK_ALIGN_STACK_ODD() { \
if (!(((long) (_alloca(0))) & 0x7)) _alloca(4); \
}
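The need for two variants can be sketched by modelling the stack pointer as a plain integer. This is only a model under stated assumptions: that `alloca(4)` lowers the stack pointer by exactly 4 bytes, that arguments are pushed as 4-byte words, and that the return address and saved frame pointer together occupy 8 bytes. The names `model_align_stack` and `enter_callee` are illustrative and not part of FFTW.

```c
#include <assert.h>
#include <stdint.h>

/* Model of HACK_ALIGN_STACK_EVEN (odd == 0) and HACK_ALIGN_STACK_ODD
   (odd != 0) acting on a simulated 32-bit stack pointer. */
static uint32_t model_align_stack(uint32_t esp, int odd)
{
    int misaligned;
    assert(esp % 4u == 0);            /* the ABI guarantees 4-byte alignment */
    misaligned = (esp & 0x7u) != 0;
    if (odd ? !misaligned : misaligned)
        esp -= 4u;                    /* assumed effect of alloca(4) */
    return esp;
}

/* Assumed call sequence: push arg_words 4-byte arguments, then the return
   address and the saved frame pointer (4 bytes each). The result is the
   address below which the callee lays out its locals. */
static uint32_t enter_callee(uint32_t esp, unsigned arg_words)
{
    esp -= 4u * arg_words;
    esp -= 8u;
    return esp;
}
```

In this model the EVEN variant suits calls that push an even number of argument words and the ODD variant suits an odd number: in both cases `enter_callee` returns a multiple of 8. Real frame layouts may differ, so any such choice would still need confirming with the alignment check.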
The Intel compiler is a little harder to fool.
Although it claims to be plug-compatible with the Visual C compiler, this is not
quite true. In particular, the Intel alloca() function always returns multiples
of 8 bytes, which means it can't be used to adjust our stack pointer by 4 bytes.
Instead, we have to resort to assembly code (thanks to Raymond Chen for the idea
behind this):
#define HACK_ALIGN_STACK_EVEN() \
_asm { \
mov ebx, esp \
and esp, 0xfffffff8 \
sub esp, 4 \
push ebx \
}
#define HACK_ALIGN_STACK_ODD() \
_asm { \
mov ebx, esp \
and esp, 0xfffffff8 \
sub esp, 4 \
push ebx \
}
Note that we leave the stack in the same state
irrespective of the total size of the arguments being pushed. This
was unexpected, given that the Intel and Visual C compilers should be "plug compatible",
but it was verified by experiment. Note also that we have to save the old stack pointer
on the stack itself, because the Intel compiler will happily tread all over any
registers you use in assembly code. We therefore need to call another macro after
the function call to restore the old stack pointer:
#define HACK_CLEANUP_STACK() \
_asm { \
pop ebx \
mov esp, ebx \
}
Thus, using inline assembly is more complex than
using the alloca-based method. It will also be faster to execute, although for FFTW
the difference is immaterial, since the vast majority of its execution time is spent within
the called functions rather than in their calling sequences.
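The arithmetic performed by the two assembly macros can be traced with a small simulation. This is a sketch only: the `machine` struct, the `STACK_TOP` address, and the `sim_*` function names are hypothetical, and the model covers just the stack pointer, one register, and a few words of stack memory.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_TOP   0x1000u   /* hypothetical address just above the stack */
#define STACK_WORDS 64u       /* simulated stack size in 4-byte words */

typedef struct {
    uint32_t esp;
    uint32_t ebx;
    uint32_t mem[STACK_WORDS];
} machine;

/* Map a simulated stack address to its storage word. */
static uint32_t *word_at(machine *m, uint32_t addr)
{
    assert(addr % 4u == 0 && addr < STACK_TOP &&
           addr >= STACK_TOP - 4u * STACK_WORDS);
    return &m->mem[(STACK_TOP - addr) / 4u - 1u];
}

static void sim_push(machine *m, uint32_t v)
{
    m->esp -= 4u;
    *word_at(m, m->esp) = v;
}

static uint32_t sim_pop(machine *m)
{
    uint32_t v = *word_at(m, m->esp);
    m->esp += 4u;
    return v;
}

/* Mirrors HACK_ALIGN_STACK_EVEN / _ODD for the Intel compiler. */
static void sim_align_stack(machine *m)
{
    m->ebx = m->esp;          /* mov ebx, esp        */
    m->esp &= 0xfffffff8u;    /* and esp, 0xfffffff8 */
    m->esp -= 4u;             /* sub esp, 4          */
    sim_push(m, m->ebx);      /* push ebx            */
}

/* Mirrors HACK_CLEANUP_STACK. */
static void sim_cleanup_stack(machine *m)
{
    m->ebx = sim_pop(m);      /* pop ebx      */
    m->esp = m->ebx;          /* mov esp, ebx */
}
```

In this model the stack pointer ends up a multiple of 8 whatever its starting value, with the old value stored at the top of the stack, so the cleanup step restores it even if `ebx` has been overwritten in between.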
Benchmark Summary
I present results from the "time_threads" program
supplied with FFTW, which gives a quick comparison of the performance
of sequential and threaded FFT code. The more comprehensive benchFFT benchmarks
take too long to run for a short experiment of this sort. There are several axes
along which to compare the results:
- Hardware: in this case, a quad-processor
PPro-200 versus a dual-processor PII-300
- Compiler: Visual C++ 5.0 vs Intel C++ 2.4
vs gcc 2.7.2
- Alignment: with and without the alignment
hacks. For one experiment I also reversed the alignment hacks, so the stack
is always pessimally misaligned.
- Precision: 32-bit versus 64-bit arithmetic.
All results are for 64-bit unless otherwise stated.
- Dimensionality: 2D versus 3D FFTs.
The results are in "FFTW MFLOPS" (higher is better),
as provided by time_threads, and are shown in full at the end of this page. Here's
the executive summary:
- In general, performance trends are as we
would expect:
- Small problems that fit in cache run
faster (up to twice as fast for 3D FFTs).
- Adding more processors speeds things
up if the problem is large enough (typically 1.7-1.9 times for 2 processors,
and 2.3-3.8 times for 4 processors).
- 32-bit floats are faster than 64-bit
floats (typically 1.1-1.2 times faster for large problems).
- For the Visual C++ and gcc compilers, alignment
hacks are always a win. For example, for a 3D serial FFT of size 128x128x128
running on a PII-300:
- Visual C++ improves from 44 to 50 MFLOPS
- gcc improves from 38 to 45 MFLOPS
- For the Intel compiler, even though alignment
checks show that misalignments are being avoided, the alignment hacks offer
very little additional performance gain for single-processor code (53 MFLOPS
for the previous example, with or without them). As noted previously, the -Qipo flag can also be
used to generate aligned code.
- However, the Intel compiler does comparatively
poorly on threaded code unless either alignment hacks or -Qipo are
used. For example, the relative speedups of using a second PII-300 processor
on a 2D FFT of size 1500x1500:
- Visual C++ achieves a speedup of about
1.92 on both normal and aligned code.
- gcc achieves a speedup of 1.97 on normal
code and 1.91 on aligned code.
- Intel C++ achieves a speedup of only
1.44 on normal code, but 1.92 on aligned code.
- For 64-bit arithmetic, the Intel compiler
typically generates the fastest code, followed by Visual C++, followed by gcc.
For 32-bit arithmetic, gcc does much better, especially for small problem sizes.
- Flipping the alignment hacks so that stack
frames are always misaligned suggests that compilers do indeed have a 50-50
shot at getting things right by themselves. For example, using Visual C++ on
a 3D FFT of size 64x64x64:
- Optimal alignment: 56 MFLOPS
- Default alignment: 41 MFLOPS
- Pessimal alignment: 36 MFLOPS
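The speedup figures quoted above appear to be simply the threaded rate divided by the serial rate. A minimal sketch of that calculation follows; the function names are illustrative, and the parallel efficiency figure is a derived metric that does not appear in the results tables.

```c
#include <assert.h>

/* Speedup: threaded MFLOPS divided by serial MFLOPS. */
static double speedup(double serial_mflops, double threaded_mflops)
{
    assert(serial_mflops > 0.0);
    return threaded_mflops / serial_mflops;
}

/* Parallel efficiency: speedup per thread (1.0 would be ideal scaling).
   Illustrative only; not reported by time_threads. */
static double efficiency(double serial_mflops, double threaded_mflops,
                         int nthreads)
{
    assert(nthreads > 0);
    return speedup(serial_mflops, threaded_mflops) / nthreads;
}
```

For example, the unaligned Visual C++ run of the 1500x1500 FFT on the PII-300 (43.3575 MFLOPS serial, 83.2007 MFLOPS on two threads) gives a speedup of about 1.92 and an efficiency of about 0.96.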
The Results
Dual-processor PII-300, running Windows NT Server 4.0
SP3
Visual C++ 5.0, /O2 /G6
No alignment
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 46.4468 | 78.7668 | 1.69585 |
| 1024x1024 | 44.8196 | 77.4163 | 1.72729 |
| 1500x1500 | 43.3575 | 83.2007 | 1.91895 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 79.4211 | 60.8555 | 0.766238 |
| 32x32x32 | 56.84 | 82.5112 | 1.45164 |
| 64x64x64 | 40.9153 | 69.1853 | 1.69094 |
| 80x80x80 | 49.1572 | 83.3896 | 1.69638 |
| 100x100x100 | 53.2076 | 81.4856 | 1.53146 |
| 128x128x128 | 43.8305 | 76.53 | 1.74605 |
Aligned (/DFFTW_ENABLE_I386_HACKS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 54.8358 | 89.175 | 1.62622 |
| 1024x1024 | 51.8411 | 84.669 | 1.63324 |
| 1500x1500 | 49.9237 | 96.1163 | 1.92526 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 103.726 | 79.7006 | 0.768379 |
| 32x32x32 | 102.054 | 131.121 | 1.28482 |
| 64x64x64 | 56.0358 | 90.2351 | 1.61031 |
| 80x80x80 | 55.0976 | 100.416 | 1.82251 |
| 100x100x100 | 57.6435 | 106.59 | 1.84912 |
| 128x128x128 | 49.5991 | 85.9075 | 1.73204 |
Pessimally aligned (/DFFTW_ENABLE_I386_HACKS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 43.8663 | 77.1122 | 1.75789 |
| 1024x1024 | 44.001 | 73.1384 | 1.6622 |
| 1500x1500 | 42.7456 | 82.0431 | 1.91933 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 52.5601 | 57.0314 | 1.08507 |
| 32x32x32 | 45.8269 | 77.5542 | 1.69233 |
| 64x64x64 | 36.4027 | 67.2586 | 1.84763 |
| 80x80x80 | 46.6312 | 85.8723 | 1.84152 |
| 100x100x100 | 43.6408 | 81.5308 | 1.86823 |
| 128x128x128 | 39.512 | 75.006 | 1.89831 |
32-bit arithmetic (/DFFTW_ENABLE_FLOATS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 63.9779 | 99.3764 | 1.55329 |
| 1024x1024 | 65.1618 | 104.819 | 1.60859 |
| 1500x1500 | 61.6628 | 122.045 | 1.97924 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 113.641 | 89.6102 | 0.78854 |
| 32x32x32 | 115.265 | 176.516 | 1.53139 |
| 64x64x64 | 62.046 | 97.341 | 1.56885 |
| 80x80x80 | 69.9391 | 133.142 | 1.90369 |
| 100x100x100 | 72.61 | 140.658 | 1.93717 |
| 128x128x128 | 53.8866 | 94.0386 | 1.74512 |
Intel C++ compiler v 2.4, -O2 -G6 -Qxi
No alignment
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 56.8583 | 79.0642 | 1.39055 |
| 1024x1024 | 54.982 | 76.7122 | 1.39523 |
| 1500x1500 | 57.7356 | 83.2823 | 1.44248 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 87.7032 | 72.243 | 0.823721 |
| 32x32x32 | 98.3866 | 102.976 | 1.04665 |
| 64x64x64 | 53.3607 | 71.6307 | 1.34239 |
| 80x80x80 | 62.566 | 91.737 | 1.46624 |
| 100x100x100 | 65.5439 | 88.1877 | 1.34548 |
| 128x128x128 | 52.9488 | 79.4848 | 1.50116 |
No alignment, -Qmem -Qipo (whole-program
optimized)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 56.8811 | 93.9206 | 1.65117 |
| 1024x1024 | 55.4744 | 92.6221 | 1.66964 |
| 1500x1500 | 57.4747 | 109.958 | 1.91316 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 87.4744 | 76.0967 | 0.869932 |
| 32x32x32 | 97.4344 | 130.181 | 1.33608 |
| 64x64x64 | 53.859 | 85.9767 | 1.59633 |
| 80x80x80 | 63.0622 | 116.15 | 1.84183 |
| 100x100x100 | 65.6054 | 122.153 | 1.86193 |
| 128x128x128 | 53.0509 | 91.3189 | 1.72134 |
Aligned
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 56.6581 | 93.1216 | 1.64357 |
| 1024x1024 | 54.899 | 92.1629 | 1.67877 |
| 1500x1500 | 57.4493 | 110.188 | 1.918 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 87.5528 | 75.7572 | 0.865274 |
| 32x32x32 | 97.905 | 129.279 | 1.32045 |
| 64x64x64 | 53.1379 | 85.3389 | 1.60599 |
| 80x80x80 | 60.5689 | 111.317 | 1.83786 |
| 100x100x100 | 64.9277 | 120.689 | 1.85882 |
| 128x128x128 | 52.6955 | 90.579 | 1.71891 |
Aligned, -Qmem -Qipo
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 56.8825 | 93.6472 | 1.64633 |
| 1024x1024 | 54.8021 | 92.2492 | 1.68331 |
| 1500x1500 | 57.3423 | 108.533 | 1.89272 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 87.2975 | 76.0747 | 0.871441 |
| 32x32x32 | 98.6525 | 129.719 | 1.31491 |
| 64x64x64 | 53.8646 | 86.6275 | 1.60824 |
| 80x80x80 | 62.5237 | 115.384 | 1.84545 |
| 100x100x100 | 64.7789 | 121.609 | 1.87729 |
| 128x128x128 | 53.1022 | 91.0974 | 1.71551 |
Pessimally aligned
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 47.2322 | 82.0113 | 1.73634 |
| 1024x1024 | 48.6876 | 85.087 | 1.74761 |
| 1500x1500 | 48.7475 | 93.6218 | 1.92055 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 65.2711 | 65.67 | 1.00611 |
| 32x32x32 | 65.5549 | 98.8316 | 1.50762 |
| 64x64x64 | 41.851 | 69.9433 | 1.67124 |
| 80x80x80 | 54.2883 | 98.7294 | 1.81861 |
| 100x100x100 | 54.5995 | 97.6763 | 1.78896 |
| 128x128x128 | 44.9831 | 79.8495 | 1.7751 |
32-bit arithmetic
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 65.8038 | 86.9999 | 1.32211 |
| 1024x1024 | 66.2495 | 95.2089 | 1.43713 |
| 1500x1500 | 69.1847 | 100.411 | 1.45135 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 109.215 | 84.458 | 0.773317 |
| 32x32x32 | 103.825 | 130.871 | 1.26049 |
| 64x64x64 | 56.7951 | 73.3917 | 1.29222 |
| 80x80x80 | 73.1616 | 111.353 | 1.52201 |
| 100x100x100 | 77.0388 | 104.487 | 1.35629 |
| 128x128x128 | 55.4671 | 85.822 | 1.54726 |
32-bit arithmetic, -Qmem -Qipo
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 65.265 | 104.482 | 1.60088 |
| 1024x1024 | 66.3592 | 109.348 | 1.64782 |
| 1500x1500 | 69.1597 | 136.019 | 1.96674 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 109.076 | 91.5887 | 0.839675 |
| 32x32x32 | 103.993 | 164.826 | 1.58497 |
| 64x64x64 | 56.7376 | 88.9229 | 1.56726 |
| 80x80x80 | 72.5244 | 138.136 | 1.90469 |
| 100x100x100 | 77.4831 | 148.308 | 1.91407 |
| 128x128x128 | 55.7762 | 97.181 | 1.74234 |
Dual-processor PII-300, running RedHat Linux 5.0
gcc 2.7.2, -O6 -fomit-frame-pointer -malign-double
No alignment
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 40.4177 | 66.5506 | 1.64657 |
| 1024x1024 | 40.2611 | 65.6595 | 1.63084 |
| 1500x1500 | 38.3901 | 75.5683 | 1.96843 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 72.9391 | 79.3052 | 1.08728 |
| 32x32x32 | 70.73 | 108.963 | 1.54054 |
| 64x64x64 | 41.4734 | 64.5481 | 1.55637 |
| 80x80x80 | 44.2654 | 78.6183 | 1.77607 |
| 100x100x100 | 46.0124 | 79.69 | 1.73193 |
| 128x128x128 | 37.6332 | 62.1941 | 1.65264 |
Aligned (/DFFTW_ENABLE_I386_HACKS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 50.6498 | 77.4055 | 1.52825 |
| 1024x1024 | 48.7196 | 77.0829 | 1.58217 |
| 1500x1500 | 54.5134 | 104.143 | 1.91042 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 96.8322 | 85.6645 | 0.884669 |
| 32x32x32 | 93.7424 | 130.213 | 1.38905 |
| 64x64x64 | 50.0981 | 72.4086 | 1.44534 |
| 80x80x80 | 54.9132 | 105.489 | 1.92102 |
| 100x100x100 | 60.7076 | 113.748 | 1.8737 |
| 128x128x128 | 45.2486 | 74.8915 | 1.6551 |
32-bit arithmetic (/DFFTW_ENABLE_FLOATS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 70.5227 | 113.461 | 1.60885 |
| 1024x1024 | 69.4126 | 113.994 | 1.64227 |
| 1500x1500 | 69.2059 | 135.086 | 1.95194 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 124.403 | 100.614 | 0.80877 |
| 32x32x32 | 131.643 | 199.338 | 1.51423 |
| 64x64x64 | 72.154 | 110.576 | 1.5325 |
| 80x80x80 | 76.7718 | 149.369 | 1.94562 |
| 100x100x100 | 78.1413 | 152.053 | 1.94588 |
| 128x128x128 | 57.9987 | 101.25 | 1.74574 |
Quad-processor PPro-200, running Windows NT Server
4.0 SP3
Visual C++ 5.0, /O2 /G6
No alignment
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 33.3899 | 58.2365 | 1.74414 |
| 1024x1024 | 31.5184 | 59.3668 | 1.88356 |
| 1500x1500 | 32.495 | 62.6291 | 1.92735 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 33.756 | 87.7342 | 2.59907 |
| 1024x1024 | 33.9538 | 93.122 | 2.74261 |
| 1500x1500 | 32.7109 | 114.979 | 3.51501 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 32.2374 | 38.8463 | 1.20501 |
| 32x32x32 | 35.9523 | 56.4447 | 1.56999 |
| 64x64x64 | 30.2042 | 52.0695 | 1.72392 |
| 80x80x80 | 34.1322 | 64.4705 | 1.88884 |
| 100x100x100 | 32.6096 | 61.8373 | 1.89629 |
| 128x128x128 | 31.9402 | 57.163 | 1.78969 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 32.6752 | 35.4585 | 1.08518 |
| 32x32x32 | 36.8107 | 83.3606 | 2.26457 |
| 64x64x64 | 30.7961 | 76.6096 | 2.48764 |
| 80x80x80 | 34.6859 | 111.107 | 3.20322 |
| 100x100x100 | 33.1145 | 111.569 | 3.3692 |
| 128x128x128 | 31.9756 | 86.1691 | 2.69484 |
Aligned (/DFFTW_ENABLE_I386_HACKS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 40.57 | 68.249 | 1.68225 |
| 1024x1024 | 37.2883 | 68.1408 | 1.8274 |
| 1500x1500 | 42.4182 | 81.026 | 1.91017 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 40.7392 | 93.5986 | 2.29751 |
| 1024x1024 | 41.018 | 98.2431 | 2.39512 |
| 1500x1500 | 42.5523 | 146.639 | 3.44608 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 85.1925 | 62.5818 | 0.734592 |
| 32x32x32 | 76.9068 | 90.9157 | 1.18215 |
| 64x64x64 | 37.6152 | 64.2844 | 1.709 |
| 80x80x80 | 42.2172 | 78.793 | 1.86637 |
| 100x100x100 | 40.5209 | 76.3298 | 1.88372 |
| 128x128x128 | 41.2653 | 69.5499 | 1.68543 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 86.3327 | 42.8769 | 0.496648 |
| 32x32x32 | 78.7973 | 113.882 | 1.44525 |
| 64x64x64 | 38.0106 | 86.1048 | 2.26529 |
| 80x80x80 | 42.8227 | 132.362 | 3.09092 |
| 100x100x100 | 41.0811 | 132.569 | 3.22701 |
| 128x128x128 | 41.053 | 96.4073 | 2.34836 |
Pessimally aligned (/DFFTW_ENABLE_I386_HACKS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 32.2017 | 55.3866 | 1.71999 |
| 1024x1024 | 32.7021 | 56.1799 | 1.71793 |
| 1500x1500 | 31.5052 | 60.3094 | 1.91426 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 32.3148 | 85.0938 | 2.63328 |
| 1024x1024 | 32.5576 | 89.9746 | 2.76355 |
| 1500x1500 | 31.4971 | 113.558 | 3.60533 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 27.947 | 35.9526 | 1.28646 |
| 32x32x32 | 32.2533 | 51.2054 | 1.5876 |
| 64x64x64 | 28.0968 | 49.0044 | 1.74412 |
| 80x80x80 | 31.5337 | 58.7588 | 1.86337 |
| 100x100x100 | 29.8807 | 57.3176 | 1.91821 |
| 128x128x128 | 29.6938 | 53.2519 | 1.79337 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 27.9497 | 35.4039 | 1.2667 |
| 32x32x32 | 31.8826 | 79.7991 | 2.50291 |
| 64x64x64 | 28.0653 | 74.4849 | 2.65399 |
| 80x80x80 | 30.9942 | 105.083 | 3.39041 |
| 100x100x100 | 29.9655 | 104.792 | 3.4971 |
| 128x128x128 | 29.5436 | 83.2133 | 2.81663 |
32-bit arithmetic (/DFFTW_ENABLE_FLOATS)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 44.939 | 74.5564 | 1.65906 |
| 1024x1024 | 47.3524 | 78.3161 | 1.6539 |
| 1500x1500 | 50.3582 | 98.7201 | 1.96036 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 44.9217 | 98.2019 | 2.18607 |
| 1024x1024 | 47.3702 | 109.582 | 2.31332 |
| 1500x1500 | 50.3493 | 187.88 | 3.73153 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 94.6326 | 74.2902 | 0.785038 |
| 32x32x32 | 91.5342 | 133.254 | 1.45578 |
| 64x64x64 | 42.2776 | 68.8599 | 1.62876 |
| 80x80x80 | 51.7307 | 98.5845 | 1.90573 |
| 100x100x100 | 49.1778 | 96.7585 | 1.96752 |
| 128x128x128 | 45.4205 | 75.9071 | 1.67121 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 94.6322 | 46.9839 | 0.49649 |
| 32x32x32 | 91.5842 | 179.153 | 1.95616 |
| 64x64x64 | 42.285 | 87.9033 | 2.07883 |
| 80x80x80 | 51.79 | 179.604 | 3.46792 |
| 100x100x100 | 49.2607 | 177.037 | 3.59388 |
| 128x128x128 | 45.5543 | 106.719 | 2.34268 |
Intel C++ compiler v 2.4, -O2 -G6 -Qxi
No alignment
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 43.4952 | 55.333 | 1.27217 |
| 1024x1024 | 39.0919 | 57.9617 | 1.4827 |
| 1500x1500 | 45.747 | 46.1656 | 1.00915 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 42.8586 | 79.6357 | 1.8581 |
| 1024x1024 | 43.0969 | 87.7847 | 2.03692 |
| 1500x1500 | 45.5663 | 87.9138 | 1.92936 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 80.9314 | 47.6881 | 0.589242 |
| 32x32x32 | 79.9746 | 57.094 | 0.713902 |
| 64x64x64 | 41.4909 | 46.532 | 1.1215 |
| 80x80x80 | 45.9014 | 50.3729 | 1.09741 |
| 100x100x100 | 45.8433 | 44.1373 | 0.962788 |
| 128x128x128 | 45.1381 | 52.9254 | 1.17252 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 80.9302 | 39.0644 | 0.482693 |
| 32x32x32 | 79.8851 | 83.6522 | 1.04716 |
| 64x64x64 | 41.3638 | 66.9941 | 1.61963 |
| 80x80x80 | 45.8951 | 92.6891 | 2.01958 |
| 100x100x100 | 45.8827 | 84.8491 | 1.84926 |
| 128x128x128 | 44.7655 | 82.0377 | 1.83261 |
No alignment, -Qmem -Qipo (whole-program
optimized)
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 41.8402 | 68.8055 | 1.64448 |
| 1024x1024 | 37.4827 | 71.2355 | 1.90049 |
| 1500x1500 | 45.4093 | 87.8273 | 1.93413 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 42.2349 | 84.4022 | 1.9984 |
| 1024x1024 | 37.9486 | 95.8772 | 2.5265 |
| 1500x1500 | 46.1121 | 158.589 | 3.43921 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 79.8164 | 60.2316 | 0.754628 |
| 32x32x32 | 80.1718 | 90.5531 | 1.12949 |
| 64x64x64 | 40.5786 | 64.727 | 1.5951 |
| 80x80x80 | 46.1979 | 86.0449 | 1.86253 |
| 100x100x100 | 45.9611 | 86.1206 | 1.87377 |
| 128x128x128 | 45.1941 | 75.7427 | 1.67594 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 79.8168 | 42.5397 | 0.532967 |
| 32x32x32 | 80.1517 | 112.246 | 1.40042 |
| 64x64x64 | 40.2844 | 77.7492 | 1.93001 |
| 80x80x80 | 46.0684 | 138.013 | 2.99582 |
| 100x100x100 | 45.8895 | 142.459 | 3.10439 |
| 128x128x128 | 44.795 | 99.4168 | 2.21937 |
Aligned
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 42.6217 | 70.0699 | 1.644 |
| 1024x1024 | 38.8828 | 73.3712 | 1.88698 |
| 1500x1500 | 45.6918 | 87.1022 | 1.9063 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 43.1915 | 89.4052 | 2.06997 |
| 1024x1024 | 43.4361 | 100.65 | 2.31719 |
| 1500x1500 | 45.7277 | 153.981 | 3.36735 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 80.0944 | 59.4118 | 0.741772 |
| 32x32x32 | 80.5123 | 90.6849 | 1.12635 |
| 64x64x64 | 41.4145 | 65.7292 | 1.58711 |
| 80x80x80 | 46.2679 | 84.493 | 1.82617 |
| 100x100x100 | 45.4719 | 85.6731 | 1.88409 |
| 128x128x128 | 45.0108 | 75.066 | 1.66773 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 80.0953 | 42.8406 | 0.53487 |
| 32x32x32 | 80.5359 | 113.824 | 1.41333 |
| 64x64x64 | 41.3527 | 81.517 | 1.97126 |
| 80x80x80 | 46.1736 | 137.094 | 2.9691 |
| 100x100x100 | 45.4402 | 143.205 | 3.1515 |
| 128x128x128 | 44.5505 | 100.686 | 2.26003 |
Aligned, -Qmem -Qipo
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 43.358 | 70.7148 | 1.63095 |
| 1024x1024 | 38.7767 | 73.037 | 1.88353 |
| 1500x1500 | 45.5634 | 86.9225 | 1.90772 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 43.3482 | 90.0088 | 2.07641 |
| 1024x1024 | 43.3224 | 99.5012 | 2.29676 |
| 1500x1500 | 45.356 | 155.401 | 3.42626 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 79.0579 | 61.1223 | 0.773134 |
| 32x32x32 | 78.4386 | 90.6996 | 1.15631 |
| 64x64x64 | 41.7961 | 66.8261 | 1.59886 |
| 80x80x80 | 46.1989 | 85.4321 | 1.84922 |
| 100x100x100 | 44.7491 | 85.6325 | 1.91361 |
| 128x128x128 | 44.3598 | 74.5734 | 1.6811 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 80.1954 | 42.3156 | 0.527656 |
| 32x32x32 | 80.6296 | 113.479 | 1.40742 |
| 64x64x64 | 42.2715 | 82.8768 | 1.96058 |
| 80x80x80 | 46.765 | 139.305 | 2.97882 |
| 100x100x100 | 45.2863 | 143.925 | 3.17811 |
| 128x128x128 | 44.3641 | 100.204 | 2.25867 |
Pessimally aligned
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 32.3068 | 61.2201 | 1.89496 |
| 1024x1024 | 32.4185 | 63.1438 | 1.94777 |
| 1500x1500 | 33.1689 | 64.6584 | 1.94937 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 32.4271 | 88.5288 | 2.73009 |
| 1024x1024 | 35.2717 | 93.0972 | 2.63943 |
| 1500x1500 | 33.1774 | 120.657 | 3.63672 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 41.7387 | 43.9083 | 1.05198 |
| 32x32x32 | 38.9408 | 58.6015 | 1.50489 |
| 64x64x64 | 33.1686 | 56.3004 | 1.6974 |
| 80x80x80 | 36.5222 | 68.9123 | 1.88686 |
| 100x100x100 | 32.9632 | 63.4848 | 1.92593 |
| 128x128x128 | 33.8606 | 61.0494 | 1.80296 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 42.3317 | 37.8494 | 0.894116 |
| 32x32x32 | 39.8781 | 82.108 | 2.05897 |
| 64x64x64 | 33.585 | 78.1175 | 2.32596 |
| 80x80x80 | 36.9998 | 116.703 | 3.15416 |
| 100x100x100 | 33.1756 | 116.491 | 3.51134 |
| 128x128x128 | 33.9556 | 87.2379 | 2.56918 |
32-bit arithmetic
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 45.7732 | 58.3599 | 1.27498 |
| 1024x1024 | 47.8791 | 56.1051 | 1.17181 |
| 1500x1500 | 52.9941 | 51.4422 | 0.970715 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 45.178 | 80.9349 | 1.79147 |
| 1024x1024 | 47.71 | 86.1603 | 1.80592 |
| 1500x1500 | 53.1794 | 100.65 | 1.89265 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 88.9634 | 51.2476 | 0.576053 |
| 32x32x32 | 91.7443 | 69.5582 | 0.758174 |
| 64x64x64 | 41.4929 | 47.5948 | 1.14706 |
| 80x80x80 | 52.7152 | 60.5488 | 1.1486 |
| 100x100x100 | 51.5715 | 51.1475 | 0.991778 |
| 128x128x128 | 45.0236 | 54.6663 | 1.21417 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 | 90.2187 | 42.6391 | 0.47262 |
| 32x32x32 | 93.1208 | 110.411 | 1.18567 |
| 64x64x64 | 41.6996 | 69.3236 | 1.66245 |
| 80x80x80 | 52.4418 | 111.654 | 2.12911 |
| 100x100x100 | 52.1397 | 100.007 | 1.91805 |
| 128x128x128 | 45.1681 | 84.4793 | 1.87033 |
32-bit arithmetic, -Qmem -Qipo
| 2D array size | fftw | fftw_threads (2) | speedup |
| 512x512 | 45.1246 | 72.9839 | 1.61739 |
| 1024x1024 | 47.2389 | 79.5551 | 1.6841 |
| 1500x1500 | 52.3838 | 102.992 | 1.96611 |

| 2D array size | fftw | fftw_threads (4) | speedup |
| 512x512 | 45.4932 | 90.9095 | 1.99831 |
| 1024x1024 | 47.718 | 105.674 | 2.21455 |
| 1500x1500 | 52.4809 | 197.162 | 3.75684 |

| 3D array size | fftw | fftw_threads (2) | speedup |
| 16x16x16 | 90.5735 | 71.5021 | 0.789438 |
| 32x32x32 | 93.3124 | 131.275 | 1.40683 |
| 64x64x64 | 43.1575 | 70.1784 | 1.6261 |
| 80x80x80 | 53.0547 | 101.444 | 1.91207 |
| 100x100x100 | 52.6358 | 101.885 | 1.93565 |
| 128x128x128 | 45.6622 | 77.9548 | 1.70721 |

| 3D array size | fftw | fftw_threads (4) | speedup |
| 16x16x16 |
| |