Here is multithreaded code for the two-dimensional discrete Fourier transform, computed with the fast Fourier transform and using more than one processor.

sz is the length of a side of the square array; it must be a power of 2.
The code can use up to cpus processors. If the machine has fewer processors than cpus, the code still works, and the speedup is nearly what it would be had cpus matched the actual processor count.
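
One way the rows could be shared among cpus threads is sketched below. This is my guess at the general technique, not a copy of the code in ft.c. Each worker claims the next unprocessed row with __sync_fetch_and_add (a gcc and clang builtin), so the work is finished whether or not the machine really has cpus processors. The names row_job, worker, run_rows and MAXCPUS are hypothetical, and the per-row work is only a placeholder.

```c
#include <assert.h>
#include <pthread.h>

enum { MAXCPUS = 64 };            /* hypothetical upper bound on threads */

typedef struct {
    int next;                     /* next unclaimed row, shared by all workers */
    int sz;                       /* number of rows */
    int *done;                    /* done[r] counts how often row r was processed */
} row_job;

/* Each worker repeatedly claims one row until none are left. */
void *worker(void *arg) {
    row_job *job = arg;
    for (;;) {
        int r = __sync_fetch_and_add(&job->next, 1);  /* atomic claim */
        if (r >= job->sz) break;
        job->done[r] += 1;        /* placeholder for "transform row r" */
    }
    return 0;
}

/* Process all sz rows with cpus threads; the calling thread is one of them. */
void run_rows(int sz, int cpus, int *done) {
    assert(cpus >= 1 && cpus <= MAXCPUS);
    pthread_t th[MAXCPUS];
    row_job job = { 0, sz, done };
    for (int i = 1; i < cpus; i++) pthread_create(&th[i], 0, worker, &job);
    worker(&job);
    for (int i = 1; i < cpus; i++) pthread_join(th[i], 0);
}
```

Because every row is claimed exactly once, no two threads ever touch the same row, and a thread that finishes early simply claims more rows. That is why an inaccurate cpus costs little in a scheme like this.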

a is the 2D array.
wave is a switch: wave=0 for spike input; wave=1 for plane wave.
cexp (the library complex exponential) is used in main to initialize the grid to a plane wave.
fft is the routine introduced here but for floats instead of double values.
The routine psam just prints a small select rectangle from the mesh.
d2Dft causes the transform of the values in a to replace those values.
fp(th, m) finds and reports the values in a with the greatest magnitude.

The first transform leaves one peak, which reflects the single-frequency initial wave. The peak is spread a bit because the wave is not periodic on the mesh; it is discontinuous at the boundary. After two transforms we have the mesh reversed about the center. After four transforms we are back to the initial values.

I compile on the Mac with "clang ft.c fft.c -Wall -O3" and run with "./a.out".
On some Linux boxes "gcc ft.c -lm -Wall fft.c -pthread -std=c99" works.

Nascent Thread Nexus

Demo of __sync_fetch_and_add; thread stuff