Tuning FFTW For Win32 Compilers

Jonathan Hardwick, September 1998

As part of an ongoing project, I've been investigating the performance of Fast Fourier Transforms on Win32/x86 systems. This document has two main purposes:

  1. Describe the stack-alignment hacks necessary to get good performance out of 64-bit floating-point code on the x86 architecture using the gcc, Visual C, and Intel C compilers.
  2. Present a comprehensive set of x86 FFT benchmarks.

The Problem

The FFT implementation of choice is FFTW ("the Fastest Fourier Transform in the West"), from Matteo Frigo and Steven G. Johnson at MIT. As well as being general-purpose, portable, well-documented, parallelised, and freely-available, it has also been shown to be faster than existing implementations. However, on the x86 architecture there are performance problems. To quote FFTW's README.hacks file:

Pentium-type processors impose a huge performance penalty if double-precision values are not aligned to 8-byte boundaries in memory. (We have seen factors of 3 or more in tight loops.) Unfortunately, the Intel ABI specifies that variables on the stack need only be aligned to 4-byte boundaries. Even more unfortunately, this convention is followed by Linux/x86 and gcc.

Thus, a given function invocation (stack frame) has a 50-50 chance of having all of its double-precision local variables misaligned at run-time. FFTW supplies a macro to spot this: here's a slightly extended version, which can be called at the end of a list of 64-bit variable declarations to check their alignment.

#define ASSERT_ALIGNED_DOUBLE() {                                       \
   double __foo;                                                        \
   if ((((long) &__foo) & 0x7)) {                                       \
      printf ("Unaligned at line %d in file %s\n", __LINE__, __FILE__); \
      fflush (stdout);                                                  \
      exit(1);                                                          \
   }                                                                    \
}
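
To make the check concrete, here is a minimal sketch of the same test written as a function, plus a helper that counts how many 4-byte-aligned offsets are also 8-byte aligned. The names is_aligned_double and count_aligned_slots are hypothetical and not part of FFTW. The sketch uses uintptr_t from <stdint.h> instead of long, which is an assumption about the compiler and not something the original macro relies on.

```c
#include <stdint.h>

/* Returns 1 if p lies on an 8-byte boundary, 0 otherwise.
   Same bit test as ASSERT_ALIGNED_DOUBLE, but without the exit(). */
int is_aligned_double(const void *p)
{
    return ((uintptr_t) p & 0x7) == 0;
}

/* Counts the 4-byte-aligned offsets in [0, span) that are also
   8-byte aligned. Exactly every second offset qualifies, which is
   where the 50-50 figure comes from. */
unsigned count_aligned_slots(unsigned span)
{
    unsigned offset, hits = 0;
    for (offset = 0; offset < span; offset += 4) {
        if ((offset & 0x7) == 0)
            hits++;
    }
    return hits;
}
```

For a span of 64 bytes there are 16 candidate offsets and count_aligned_slots(64) returns 8, so a frame placed on an arbitrary 4-byte boundary is misaligned half the time.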

Compiling the "test_threads" and "time_threads" executables of FFTW with this macro enabled reveals that gcc 2.7.2, Visual C++ 5.0, and the Intel C++ compiler v2.4 all suffer from the 64-bit-misalignment problem, with the following caveats:

  • Although the stock gcc has the misalignment problem, experimental versions such as pgcc provide a new compiler flag:
    -mstack-align-double
    This switch tries to align doubles on the stack (i.e. auto variables in C) to an 8 byte boundary.
  • Visual C++ generates code that automatically performs run-time stack alignment if a function contains a "sufficient" number of accesses to double variables. Unfortunately, my experiments indicate that FFTW fails to trigger this heuristic. December 1998: Visual C++ 6.0 has fixed the problem, and therefore doesn't need the hacks described below.
  • The Intel C++ compiler will generate 64-bit-aligned code for FFTW if given the -Qipo (interprocedural optimization) flag, even though this is not documented as an effect of the flag. However, using -Qipo requires access to all of the source code (i.e. no separate compilation), and a lot more memory and time to complete the compile, so it might not always be a feasible solution.

The Hacks

Since the compilers cannot guarantee alignment, the code must adjust stack alignment as necessary at run-time. FFTW includes two gcc macros to do this. They are invoked immediately before calling a computationally expensive function, and ensure that 64-bit local variables defined at the top level of that function are properly aligned. The choice of macro depends on the total size of the arguments being passed to the function.

#define HACK_ALIGN_STACK_EVEN() {                                      \
     if ((((long) (__builtin_alloca(0))) & 0x7)) __builtin_alloca(4);  \
}

#define HACK_ALIGN_STACK_ODD() {                                       \
     if (!(((long) (__builtin_alloca(0))) & 0x7)) __builtin_alloca(4); \
}

These macros are trivially portable to Visual C:

#define HACK_ALIGN_STACK_EVEN() {                    \
     if ((((long) (_alloca(0))) & 0x7)) _alloca(4);  \
}

#define HACK_ALIGN_STACK_ODD() {                     \
     if (!(((long) (_alloca(0))) & 0x7)) _alloca(4); \
}
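
The decision made by the gcc and Visual C macros above can be modelled with plain integer arithmetic. This is only a sketch: model_align_even and model_align_odd are hypothetical names, and the model assumes that alloca(0) returns the current stack pointer, that alloca(4) moves it by exactly 4 bytes, and that the stack pointer is 4-byte aligned on entry.

```c
#include <stdint.h>

/* Models HACK_ALIGN_STACK_EVEN: if sp is NOT on an 8-byte boundary,
   consume 4 bytes so that it is. The x86 stack grows downwards. */
uint32_t model_align_even(uint32_t sp)
{
    if (sp & 0x7)
        sp -= 4;
    return sp;
}

/* Models HACK_ALIGN_STACK_ODD: if sp IS on an 8-byte boundary,
   consume 4 bytes so that it sits 4 bytes past one instead. */
uint32_t model_align_odd(uint32_t sp)
{
    if (!(sp & 0x7))
        sp -= 4;
    return sp;
}
```

Under these assumptions the EVEN variant always leaves the stack pointer at 0 modulo 8 and the ODD variant always leaves it at 4 modulo 8, so whatever is pushed next starts from a known alignment.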

The Intel compiler is a little harder to fool. Although it claims to be plug-compatible with the Visual C compiler, this is not quite true. In particular, the Intel alloca() function always returns multiples of 8 bytes, which means it can't be used to adjust our stack pointer by 4 bytes. Instead, we have to resort to assembly code (thanks to Raymond Chen for the idea behind this):

#define HACK_ALIGN_STACK_EVEN() \
_asm {                          \
    mov ebx, esp                \
    and esp, 0xfffffff8         \
    sub esp, 4                  \
    push ebx                    \
}

#define HACK_ALIGN_STACK_ODD()  \
_asm {                          \
    mov ebx, esp                \
    and esp, 0xfffffff8         \
    sub esp, 4                  \
    push ebx                    \
}

Note that we leave the stack in the same state irrespective of the total size of the arguments we are pushing, so the two macros are identical. This was unexpected given that the Intel and Visual C compilers should be "plug compatible", but it was verified by experiment. Note also that we have to save the old stack pointer on the stack itself, because the Intel compiler will happily tread all over any registers you use in assembly code. We therefore need to call another macro after the function call to restore the old stack pointer:

#define HACK_CLEANUP_STACK()    \
_asm {                          \
    pop ebx                     \
    mov esp, ebx                \
}
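
The save-and-restore sequence can be traced with a small simulation. This is a sketch under stated assumptions, not the code the compiler emits: sim_stack, sim_align and sim_cleanup are hypothetical names, the stack is modelled as an array of 32-bit slots, and esp is a byte offset into it.

```c
#include <stdint.h>

#define STACK_WORDS 64

typedef struct {
    uint32_t mem[STACK_WORDS];  /* simulated stack, one slot per 4 bytes */
    uint32_t esp;               /* simulated stack pointer (byte offset) */
} sim_stack;

/* Mirrors HACK_ALIGN_STACK_EVEN / HACK_ALIGN_STACK_ODD. */
void sim_align(sim_stack *s)
{
    uint32_t ebx = s->esp;        /* mov ebx, esp                  */
    s->esp &= 0xfffffff8u;        /* and esp, 0xfffffff8           */
    s->esp -= 4;                  /* sub esp, 4                    */
    s->esp -= 4;                  /* push ebx: decrement esp ...   */
    s->mem[s->esp / 4] = ebx;     /* ... then store the old value  */
}

/* Mirrors HACK_CLEANUP_STACK. */
void sim_cleanup(sim_stack *s)
{
    uint32_t ebx = s->mem[s->esp / 4];  /* pop ebx: load the saved value */
    s->esp += 4;                        /* ... and release its slot      */
    s->esp = ebx;                       /* mov esp, ebx                  */
}
```

Starting from either esp = 128 or esp = 132, sim_align leaves esp at 120, which is a multiple of 8, and sim_cleanup puts the original value back. The saved copy lives in simulated memory, not in a register, which matches the reason given above for pushing ebx.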

Thus, using inline assembly is more complex than using the alloca-based method. It will also be faster to execute, but for FFTW the cost of the hack itself isn't an issue, since the vast majority of its execution time is spent within the called functions, rather than in their calling sequences.

Benchmark Summary

I present results from the "time_threads" program provided with FFTW, which provides a quick comparison of the performance of sequential and threaded FFT code. The more comprehensive benchmarks of benchFFT take too long to run for a short experiment of this sort. There are several axes along which to compare the results:
  • Hardware: in this case, a quad-processor PPro-200 versus a dual-processor PII-300
  • Compiler: Visual C++ 5.0 vs Intel C++ 2.4 vs gcc 2.7.2
  • Alignment: with and without the alignment hacks. For one experiment I also reversed the alignment hacks, so the stack is always pessimally misaligned.
  • Precision: 32-bit versus 64-bit arithmetic. All results are for 64-bit unless otherwise stated.
  • Dimensionality: 2D versus 3D FFTs.

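The results are in "FFTW MFLOPS" (higher is better), as provided by time_threads, and are shown in full at the end of this page. In each table, the fftw column is the serial result, the fftw_threads (N) column is the result using N threads, and speedup is the ratio of the two. Here's the executive summary: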

  1. In general, performance trends are as we would expect:
    • Small problems that fit in cache run faster (up to twice as fast for 3D FFTs).
    • Adding more processors speeds things up if the problem is large enough (typically 1.7-1.9 times for 2 processors, and 2.3-3.8 times for 4 processors).
    • 32-bit floats are faster than 64-bit floats (typically 1.1-1.2 times faster for large problems).
  2. For the Visual C++ and gcc compilers, alignment hacks are always a win. For example, for a 3D serial FFT of size 128x128x128 running on a PII-300:
    • Visual C++ improves from 44 to 50 MFLOPS
    • gcc improves from 37 to 45 MFLOPS
  3. For the Intel compiler, even though alignment checks show that misalignments are being avoided, the alignment hacks offer very little additional performance gain for single-processor code (53 MFLOPS for the previous example). As noted previously, the -Qipo flag can also be used to generate aligned code.
  4. However, the Intel compiler does comparatively poorly on threaded code unless either alignment hacks or -Qipo are used. For example, the relative speedups of using a second PII-300 processor on a 2D FFT of size 1500x1500:
    • Visual C++ achieves a speedup of about 1.92 on both normal and aligned code.
    • gcc achieves a speedup of 1.97 on normal code and 1.91 on aligned code.
    • Intel C++ achieves a speedup of only 1.44 on normal code, but 1.91 on aligned code.
  5. For 64-bit arithmetic, the Intel compiler typically generates the fastest code, followed by Visual C++, followed by gcc. For 32-bit arithmetic, gcc does much better, especially for small problem sizes.
  6. Flipping the alignment hacks so that stack frames are always misaligned suggests that compilers do indeed have a 50-50 shot at getting things right by themselves. For example, using Visual C++ on a 3D FFT of size 64x64x64:
    • Optimal alignment: 56 MFLOPS
    • Default alignment: 41 MFLOPS
    • Pessimal alignment: 36 MFLOPS
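
The figures quoted in this summary can be rederived from the tables below with two small helpers. This is a minimal sketch. The function names speedup and relative_gain are my own and are not part of time_threads.

```c
/* Speedup as reported in the tables: threaded MFLOPS over serial MFLOPS. */
double speedup(double serial_mflops, double threaded_mflops)
{
    return threaded_mflops / serial_mflops;
}

/* Fractional improvement when going from one configuration to another,
   e.g. from unaligned to aligned code. 0.10 means 10% faster. */
double relative_gain(double before_mflops, double after_mflops)
{
    return (after_mflops - before_mflops) / before_mflops;
}
```

For the Visual C++ 512x512 unaligned case on the PII-300, speedup(46.4468, 78.7668) gives about 1.696, matching the table. For the 128x128x128 serial case, relative_gain(43.8305, 49.5991) gives about 0.13, which is the 44 to 50 MFLOPS improvement quoted above.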

The Results

Dual-processor PII-300, running Windows NT Server 4.0 SP3

Visual C++ 5.0, /O2 /G6

No alignment

2D array size fftw fftw_threads (2) speedup
512x512 46.4468 78.7668 1.69585
1024x1024 44.8196 77.4163 1.72729
1500x1500 43.3575 83.2007 1.91895

3D array size fftw fftw_threads (2) speedup
16x16x16 79.4211 60.8555 0.766238
32x32x32 56.84 82.5112 1.45164
64x64x64 40.9153 69.1853 1.69094
80x80x80 49.1572 83.3896 1.69638
100x100x100 53.2076 81.4856 1.53146
128x128x128 43.8305 76.53 1.74605

Aligned (/DFFTW_ENABLE_I386_HACKS)

2D array size fftw fftw_threads (2) speedup
512x512 54.8358 89.175 1.62622
1024x1024 51.8411 84.669 1.63324
1500x1500 49.9237 96.1163 1.92526

3D array size fftw fftw_threads (2) speedup
16x16x16 103.726 79.7006 0.768379
32x32x32 102.054 131.121 1.28482
64x64x64 56.0358 90.2351 1.61031
80x80x80 55.0976 100.416 1.82251
100x100x100 57.6435 106.59 1.84912
128x128x128 49.5991 85.9075 1.73204

Pessimally aligned (/DFFTW_ENABLE_I386_HACKS)

2D array size fftw fftw_threads (2) speedup
512x512 43.8663 77.1122 1.75789
1024x1024 44.001 73.1384 1.6622
1500x1500 42.7456 82.0431 1.91933

3D array size fftw fftw_threads (2) speedup
16x16x16 52.5601 57.0314 1.08507
32x32x32 45.8269 77.5542 1.69233
64x64x64 36.4027 67.2586 1.84763
80x80x80 46.6312 85.8723 1.84152
100x100x100 43.6408 81.5308 1.86823
128x128x128 39.512 75.006 1.89831

32-bit arithmetic (/DFFTW_ENABLE_FLOATS)

2D array size fftw fftw_threads (2) speedup
512x512 63.9779 99.3764 1.55329
1024x1024 65.1618 104.819 1.60859
1500x1500 61.6628 122.045 1.97924

3D array size fftw fftw_threads (2) speedup
16x16x16 113.641 89.6102 0.78854
32x32x32 115.265 176.516 1.53139
64x64x64 62.046 97.341 1.56885
80x80x80 69.9391 133.142 1.90369
100x100x100 72.61 140.658 1.93717
128x128x128 53.8866 94.0386 1.74512

Intel C++ compiler v 2.4, -O2 -G6 -Qxi

No alignment

2D array size fftw fftw_threads (2) speedup
512x512 56.8583 79.0642 1.39055
1024x1024 54.982 76.7122 1.39523
1500x1500 57.7356 83.2823 1.44248

3D array size fftw fftw_threads (2) speedup
16x16x16 87.7032 72.243 0.823721
32x32x32 98.3866 102.976 1.04665
64x64x64 53.3607 71.6307 1.34239
80x80x80 62.566 91.737 1.46624
100x100x100 65.5439 88.1877 1.34548
128x128x128 52.9488 79.4848 1.50116

No alignment, -Qmem -Qipo (whole-program optimized)

2D array size fftw fftw_threads (2) speedup
512x512 56.8811 93.9206 1.65117
1024x1024 55.4744 92.6221 1.66964
1500x1500 57.4747 109.958 1.91316

3D array size fftw fftw_threads (2) speedup
16x16x16 87.4744 76.0967 0.869932
32x32x32 97.4344 130.181 1.33608
64x64x64 53.859 85.9767 1.59633
80x80x80 63.0622 116.15 1.84183
100x100x100 65.6054 122.153 1.86193
128x128x128 53.0509 91.3189 1.72134

Aligned

2D array size fftw fftw_threads (2) speedup
512x512 56.6581 93.1216 1.64357
1024x1024 54.899 92.1629 1.67877
1500x1500 57.4493 110.188 1.918

3D array size fftw fftw_threads (2) speedup
16x16x16 87.5528 75.7572 0.865274
32x32x32 97.905 129.279 1.32045
64x64x64 53.1379 85.3389 1.60599
80x80x80 60.5689 111.317 1.83786
100x100x100 64.9277 120.689 1.85882
128x128x128 52.6955 90.579 1.71891

Aligned, -Qmem -Qipo

2D array size fftw fftw_threads (2) speedup
512x512 56.8825 93.6472 1.64633
1024x1024 54.8021 92.2492 1.68331
1500x1500 57.3423 108.533 1.89272

3D array size fftw fftw_threads (2) speedup
16x16x16 87.2975 76.0747 0.871441
32x32x32 98.6525 129.719 1.31491
64x64x64 53.8646 86.6275 1.60824
80x80x80 62.5237 115.384 1.84545
100x100x100 64.7789 121.609 1.87729
128x128x128 53.1022 91.0974 1.71551

Pessimally aligned

2D array size fftw fftw_threads (2) speedup
512x512 47.2322 82.0113 1.73634
1024x1024 48.6876 85.087 1.74761
1500x1500 48.7475 93.6218 1.92055

3D array size fftw fftw_threads (2) speedup
16x16x16 65.2711 65.67 1.00611
32x32x32 65.5549 98.8316 1.50762
64x64x64 41.851 69.9433 1.67124
80x80x80 54.2883 98.7294 1.81861
100x100x100 54.5995 97.6763 1.78896
128x128x128 44.9831 79.8495 1.7751

32-bit arithmetic

2D array size fftw fftw_threads (2) speedup
512x512 65.8038 86.9999 1.32211
1024x1024 66.2495 95.2089 1.43713
1500x1500 69.1847 100.411 1.45135

3D array size fftw fftw_threads (2) speedup
16x16x16 109.215 84.458 0.773317
32x32x32 103.825 130.871 1.26049
64x64x64 56.7951 73.3917 1.29222
80x80x80 73.1616 111.353 1.52201
100x100x100 77.0388 104.487 1.35629
128x128x128 55.4671 85.822 1.54726

32-bit arithmetic, -Qmem -Qipo

2D array size fftw fftw_threads (2) speedup
512x512 65.265 104.482 1.60088
1024x1024 66.3592 109.348 1.64782
1500x1500 69.1597 136.019 1.96674

3D array size fftw fftw_threads (2) speedup
16x16x16 109.076 91.5887 0.839675
32x32x32 103.993 164.826 1.58497
64x64x64 56.7376 88.9229 1.56726
80x80x80 72.5244 138.136 1.90469
100x100x100 77.4831 148.308 1.91407
128x128x128 55.7762 97.181 1.74234

Dual-processor PII-300, running RedHat Linux 5.0

gcc 2.7.2, -O6 -fomit-frame-pointer -malign-double

No alignment

2D array size fftw fftw_threads (2) speedup
512x512 40.4177 66.5506 1.64657
1024x1024 40.2611 65.6595 1.63084
1500x1500 38.3901 75.5683 1.96843

3D array size fftw fftw_threads (2) speedup
16x16x16 72.9391 79.3052 1.08728
32x32x32 70.73 108.963 1.54054
64x64x64 41.4734 64.5481 1.55637
80x80x80 44.2654 78.6183 1.77607
100x100x100 46.0124 79.69 1.73193
128x128x128 37.6332 62.1941 1.65264

Aligned (/DFFTW_ENABLE_I386_HACKS)

2D array size fftw fftw_threads (2) speedup
512x512 50.6498 77.4055 1.52825
1024x1024 48.7196 77.0829 1.58217
1500x1500 54.5134 104.143 1.91042

3D array size fftw fftw_threads (2) speedup
16x16x16 96.8322 85.6645 0.884669
32x32x32 93.7424 130.213 1.38905
64x64x64 50.0981 72.4086 1.44534
80x80x80 54.9132 105.489 1.92102
100x100x100 60.7076 113.748 1.8737
128x128x128 45.2486 74.8915 1.6551

32-bit arithmetic (/DFFTW_ENABLE_FLOATS)

2D array size fftw fftw_threads (2) speedup
512x512 70.5227 113.461 1.60885
1024x1024 69.4126 113.994 1.64227
1500x1500 69.2059 135.086 1.95194

3D array size fftw fftw_threads (2) speedup
16x16x16 124.403 100.614 0.80877
32x32x32 131.643 199.338 1.51423
64x64x64 72.154 110.576 1.5325
80x80x80 76.7718 149.369 1.94562
100x100x100 78.1413 152.053 1.94588
128x128x128 57.9987 101.25 1.74574

Quad-processor PPro-200, running Windows NT Server 4.0 SP3

Visual C++ 5.0, /O2 /G6

No alignment

2D array size fftw fftw_threads (2) speedup
512x512 33.3899 58.2365 1.74414
1024x1024 31.5184 59.3668 1.88356
1500x1500 32.495 62.6291 1.92735

2D array size fftw fftw_threads (4) speedup
512x512 33.756 87.7342 2.59907
1024x1024 33.9538 93.122 2.74261
1500x1500 32.7109 114.979 3.51501

3D array size fftw fftw_threads (2) speedup
16x16x16 32.2374 38.8463 1.20501
32x32x32 35.9523 56.4447 1.56999
64x64x64 30.2042 52.0695 1.72392
80x80x80 34.1322 64.4705 1.88884
100x100x100 32.6096 61.8373 1.89629
128x128x128 31.9402 57.163 1.78969

3D array size fftw fftw_threads (4) speedup
16x16x16 32.6752 35.4585 1.08518
32x32x32 36.8107 83.3606 2.26457
64x64x64 30.7961 76.6096 2.48764
80x80x80 34.6859 111.107 3.20322
100x100x100 33.1145 111.569 3.3692
128x128x128 31.9756 86.1691 2.69484

Aligned (/DFFTW_ENABLE_I386_HACKS)

2D array size fftw fftw_threads (2) speedup
512x512 40.57 68.249 1.68225
1024x1024 37.2883 68.1408 1.8274
1500x1500 42.4182 81.026 1.91017

2D array size fftw fftw_threads (4) speedup
512x512 40.7392 93.5986 2.29751
1024x1024 41.018 98.2431 2.39512
1500x1500 42.5523 146.639 3.44608

3D array size fftw fftw_threads (2) speedup
16x16x16 85.1925 62.5818 0.734592
32x32x32 76.9068 90.9157 1.18215
64x64x64 37.6152 64.2844 1.709
80x80x80 42.2172 78.793 1.86637
100x100x100 40.5209 76.3298 1.88372
128x128x128 41.2653 69.5499 1.68543

3D array size fftw fftw_threads (4) speedup
16x16x16 86.3327 42.8769 0.496648
32x32x32 78.7973 113.882 1.44525
64x64x64 38.0106 86.1048 2.26529
80x80x80 42.8227 132.362 3.09092
100x100x100 41.0811 132.569 3.22701
128x128x128 41.053 96.4073 2.34836

Pessimally aligned (/DFFTW_ENABLE_I386_HACKS)

2D array size fftw fftw_threads (2) speedup
512x512 32.2017 55.3866 1.71999
1024x1024 32.7021 56.1799 1.71793
1500x1500 31.5052 60.3094 1.91426

2D array size fftw fftw_threads (4) speedup
512x512 32.3148 85.0938 2.63328
1024x1024 32.5576 89.9746 2.76355
1500x1500 31.4971 113.558 3.60533

3D array size fftw fftw_threads (2) speedup
16x16x16 27.947 35.9526 1.28646
32x32x32 32.2533 51.2054 1.5876
64x64x64 28.0968 49.0044 1.74412
80x80x80 31.5337 58.7588 1.86337
100x100x100 29.8807 57.3176 1.91821
128x128x128 29.6938 53.2519 1.79337

3D array size fftw fftw_threads (4) speedup
16x16x16 27.9497 35.4039 1.2667
32x32x32 31.8826 79.7991 2.50291
64x64x64 28.0653 74.4849 2.65399
80x80x80 30.9942 105.083 3.39041
100x100x100 29.9655 104.792 3.4971
128x128x128 29.5436 83.2133 2.81663

32-bit arithmetic (/DFFTW_ENABLE_FLOATS)

2D array size fftw fftw_threads (2) speedup
512x512 44.939 74.5564 1.65906
1024x1024 47.3524 78.3161 1.6539
1500x1500 50.3582 98.7201 1.96036

2D array size fftw fftw_threads (4) speedup
512x512 44.9217 98.2019 2.18607
1024x1024 47.3702 109.582 2.31332
1500x1500 50.3493 187.88 3.73153

3D array size fftw fftw_threads (2) speedup
16x16x16 94.6326 74.2902 0.785038
32x32x32 91.5342 133.254 1.45578
64x64x64 42.2776 68.8599 1.62876
80x80x80 51.7307 98.5845 1.90573
100x100x100 49.1778 96.7585 1.96752
128x128x128 45.4205 75.9071 1.67121

3D array size fftw fftw_threads (4) speedup
16x16x16 94.6322 46.9839 0.49649
32x32x32 91.5842 179.153 1.95616
64x64x64 42.285 87.9033 2.07883
80x80x80 51.79 179.604 3.46792
100x100x100 49.2607 177.037 3.59388
128x128x128 45.5543 106.719 2.34268

Intel C++ compiler v 2.4, -O2 -G6 -Qxi

No alignment

2D array size fftw fftw_threads (2) speedup
512x512 43.4952 55.333 1.27217
1024x1024 39.0919 57.9617 1.4827
1500x1500 45.747 46.1656 1.00915

2D array size fftw fftw_threads (4) speedup
512x512 42.8586 79.6357 1.8581
1024x1024 43.0969 87.7847 2.03692
1500x1500 45.5663 87.9138 1.92936

3D array size fftw fftw_threads (2) speedup
16x16x16 80.9314 47.6881 0.589242
32x32x32 79.9746 57.094 0.713902
64x64x64 41.4909 46.532 1.1215
80x80x80 45.9014 50.3729 1.09741
100x100x100 45.8433 44.1373 0.962788
128x128x128 45.1381 52.9254 1.17252

3D array size fftw fftw_threads (4) speedup
16x16x16 80.9302 39.0644 0.482693
32x32x32 79.8851 83.6522 1.04716
64x64x64 41.3638 66.9941 1.61963
80x80x80 45.8951 92.6891 2.01958
100x100x100 45.8827 84.8491 1.84926
128x128x128 44.7655 82.0377 1.83261

No alignment, -Qmem -Qipo (whole-program optimized)

2D array size fftw fftw_threads (2) speedup
512x512 41.8402 68.8055 1.64448
1024x1024 37.4827 71.2355 1.90049
1500x1500 45.4093 87.8273 1.93413

2D array size fftw fftw_threads (4) speedup
512x512 42.2349 84.4022 1.9984
1024x1024 37.9486 95.8772 2.5265
1500x1500 46.1121 158.589 3.43921

3D array size fftw fftw_threads (2) speedup
16x16x16 79.8164 60.2316 0.754628
32x32x32 80.1718 90.5531 1.12949
64x64x64 40.5786 64.727 1.5951
80x80x80 46.1979 86.0449 1.86253
100x100x100 45.9611 86.1206 1.87377
128x128x128 45.1941 75.7427 1.67594

3D array size fftw fftw_threads (4) speedup
16x16x16 79.8168 42.5397 0.532967
32x32x32 80.1517 112.246 1.40042
64x64x64 40.2844 77.7492 1.93001
80x80x80 46.0684 138.013 2.99582
100x100x100 45.8895 142.459 3.10439
128x128x128 44.795 99.4168 2.21937

Aligned

2D array size fftw fftw_threads (2) speedup
512x512 42.6217 70.0699 1.644
1024x1024 38.8828 73.3712 1.88698
1500x1500 45.6918 87.1022 1.9063

2D array size fftw fftw_threads (4) speedup
512x512 43.1915 89.4052 2.06997
1024x1024 43.4361 100.65 2.31719
1500x1500 45.7277 153.981 3.36735

3D array size fftw fftw_threads (2) speedup
16x16x16 80.0944 59.4118 0.741772
32x32x32 80.5123 90.6849 1.12635
64x64x64 41.4145 65.7292 1.58711
80x80x80 46.2679 84.493 1.82617
100x100x100 45.4719 85.6731 1.88409
128x128x128 45.0108 75.066 1.66773

3D array size fftw fftw_threads (4) speedup
16x16x16 80.0953 42.8406 0.53487
32x32x32 80.5359 113.824 1.41333
64x64x64 41.3527 81.517 1.97126
80x80x80 46.1736 137.094 2.9691
100x100x100 45.4402 143.205 3.1515
128x128x128 44.5505 100.686 2.26003

Aligned, -Qmem -Qipo

2D array size fftw fftw_threads (2) speedup
512x512 43.358 70.7148 1.63095
1024x1024 38.7767 73.037 1.88353
1500x1500 45.5634 86.9225 1.90772

2D array size fftw fftw_threads (4) speedup
512x512 43.3482 90.0088 2.07641
1024x1024 43.3224 99.5012 2.29676
1500x1500 45.356 155.401 3.42626

3D array size fftw fftw_threads (2) speedup
16x16x16 79.0579 61.1223 0.773134
32x32x32 78.4386 90.6996 1.15631
64x64x64 41.7961 66.8261 1.59886
80x80x80 46.1989 85.4321 1.84922
100x100x100 44.7491 85.6325 1.91361
128x128x128 44.3598 74.5734 1.6811

3D array size fftw fftw_threads (4) speedup
16x16x16 80.1954 42.3156 0.527656
32x32x32 80.6296 113.479 1.40742
64x64x64 42.2715 82.8768 1.96058
80x80x80 46.765 139.305 2.97882
100x100x100 45.2863 143.925 3.17811
128x128x128 44.3641 100.204 2.25867

Pessimally aligned

2D array size fftw fftw_threads (2) speedup
512x512 32.3068 61.2201 1.89496
1024x1024 32.4185 63.1438 1.94777
1500x1500 33.1689 64.6584 1.94937

2D array size fftw fftw_threads (4) speedup
512x512 32.4271 88.5288 2.73009
1024x1024 35.2717 93.0972 2.63943
1500x1500 33.1774 120.657 3.63672

3D array size fftw fftw_threads (2) speedup
16x16x16 41.7387 43.9083 1.05198
32x32x32 38.9408 58.6015 1.50489
64x64x64 33.1686 56.3004 1.6974
80x80x80 36.5222 68.9123 1.88686
100x100x100 32.9632 63.4848 1.92593
128x128x128 33.8606 61.0494 1.80296

3D array size fftw fftw_threads (4) speedup
16x16x16 42.3317 37.8494 0.894116
32x32x32 39.8781 82.108 2.05897
64x64x64 33.585 78.1175 2.32596
80x80x80 36.9998 116.703 3.15416
100x100x100 33.1756 116.491 3.51134
128x128x128 33.9556 87.2379 2.56918

32-bit arithmetic

2D array size fftw fftw_threads (2) speedup
512x512 45.7732 58.3599 1.27498
1024x1024 47.8791 56.1051 1.17181
1500x1500 52.9941 51.4422 0.970715

2D array size fftw fftw_threads (4) speedup
512x512 45.178 80.9349 1.79147
1024x1024 47.71 86.1603 1.80592
1500x1500 53.1794 100.65 1.89265

3D array size fftw fftw_threads (2) speedup
16x16x16 88.9634 51.2476 0.576053
32x32x32 91.7443 69.5582 0.758174
64x64x64 41.4929 47.5948 1.14706
80x80x80 52.7152 60.5488 1.1486
100x100x100 51.5715 51.1475 0.991778
128x128x128 45.0236 54.6663 1.21417

3D array size fftw fftw_threads (4) speedup
16x16x16 90.2187 42.6391 0.47262
32x32x32 93.1208 110.411 1.18567
64x64x64 41.6996 69.3236 1.66245
80x80x80 52.4418 111.654 2.12911
100x100x100 52.1397 100.007 1.91805
128x128x128 45.1681 84.4793 1.87033

32-bit arithmetic, -Qmem -Qipo

2D array size fftw fftw_threads (2) speedup
512x512 45.1246 72.9839 1.61739
1024x1024 47.2389 79.5551 1.6841
1500x1500 52.3838 102.992 1.96611

2D array size fftw fftw_threads (4) speedup
512x512 45.4932 90.9095 1.99831
1024x1024 47.718 105.674 2.21455
1500x1500 52.4809 197.162 3.75684

3D array size fftw fftw_threads (2) speedup
16x16x16 90.5735 71.5021 0.789438
32x32x32 93.3124 131.275 1.40683
64x64x64 43.1575 70.1784 1.6261
80x80x80 53.0547 101.444 1.91207
100x100x100 52.6358 101.885 1.93565
128x128x128 45.6622 77.9548 1.70721

3D array size fftw fftw_threads (4) speedup
16x16x16