I'm doing some tests with FFTW using threads, the time of 1d transform(forward and backward) of large double complex values is always better for 1 thread than 2-3 or 4 threads. Can someone help me to solve this problem ? Thank you!!
An output for 1 thread is:
time N
0.001515 16384
0.003364 32768
0.002625 65536
0.006060 131072
0.016190 262144
0.042389 524288
0.091719 1048576
0.209468 2097152
0.523317 4194304
1.196903 8388608
while for 4 threads (results is similar for 2 or 3 threads...) :
time N
0.002071 16384
0.004009 32768
0.007989 65536
0.008715 131072
0.020615 262144
0.055483 524288
0.159392 1048576
0.322355 2097152
0.761479 4194304
1.647288 8388608
I test my code on two different machine with same results. machine 1 :
Ubuntu 10.04.1 LTS
2.6.32-24-generic x86_64 GNU/Linux
gcc version 4.4.3
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
ram 4gb
machine 2 :
Ubuntu 10.04.1 LTS
2.6.32-21-server x86_64 GNU/Linux
gcc version 4.4.3
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
ram 8gb
I have code that generate random complex values and make forward and backward transform and take time of this two operation without take into account the calls for plan or memory allocations.
FFTW is configurated as :
./configure --prefix=/home/.... --enable-threads
I try with -sse2
option too but with same results, 1 thread is always better.
I compile with :
gcc 1DFFTW.c -o 1DFFTW -I/$HOME/opt/fftw-3.3.2/include -L/$HOME/opt/fftw-3.3.2/lib -lrt -lfftw3_threads -lfftw3 -lpthread -lm
The important part of code is :
if(nThreads>1){
int err=fftw_init_threads();
if (err==0)
printf("thread creation error : %d\n",err);
else
fftw_plan_with_nthreads(nThreads);
}
int i;
fftw_complex *in;
fftw_complex *in2;
fftw_complex *out;
fftw_plan plan_backward;
fftw_plan plan_forward;
struct timespec start, stop;
printf ( "\n" );
printf ( "N= %d \n",n);
in = fftw_malloc ( sizeof ( fftw_complex ) * n );
srand ( time(NULL) );
for ( i = 0; i < n; i++ )
{
in[i][0] = rand() / (double)RAND_MAX;
in[i][1] = rand() / (double)RAND_MAX;
}
out = fftw_malloc ( sizeof ( fftw_complex ) * n );
in2 = fftw_malloc ( sizeof ( fftw_complex ) * n );
plan_forward = fftw_plan_dft_1d ( n, in, out, FFTW_FORWARD, FFTW_ESTIMATE );
plan_backward = fftw_plan_dft_1d ( n, out, in2, FFTW_BACKWARD, FFTW_ESTIMATE );
clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&start);
fftw_execute ( plan_forward );
fftw_execute ( plan_backward );
clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&stop);
The multithreaded algorithm has overhead associated with distributing the task among multiple CPUs and consolidating the results of the individual sub-problem. You are measuring the CPU time, not the wall clock time.
If you want to minimize CPU time, use one thread. That way, there's no threading overhead. If you want to minimize wall time, use more threads.