c multithreading fft performance-testing fftw

FFTW 1 thread is always better than many threads

I'm doing some tests with FFTW using threads, the time of 1d transform(forward and backward) of large double complex values is always better for 1 thread than 2-3 or 4 threads. Can someone help me to solve this problem ? Thank you!!

An output for 1 thread is:

time           N
0.001515        16384   
0.003364        32768   
0.002625        65536   
0.006060        131072  
0.016190        262144  
0.042389        524288  
0.091719        1048576         
0.209468        2097152         
0.523317        4194304        
1.196903        8388608

while for 4 threads (results is similar for 2 or 3 threads...) :

time            N
0.002071        16384   
0.004009        32768  
0.007989        65536   
0.008715        131072  
0.020615        262144  
0.055483        524288  
0.159392        1048576         
0.322355        2097152         
0.761479        4194304         
1.647288        8388608

I test my code on two different machine with same results. machine 1 :

Ubuntu 10.04.1 LTS
2.6.32-24-generic  x86_64 GNU/Linux
gcc version 4.4.3 
Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
ram 4gb

machine 2 :

Ubuntu 10.04.1 LTS
2.6.32-21-server  x86_64 GNU/Linux
gcc version 4.4.3 
Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
ram 8gb

I have code that generate random complex values and make forward and backward transform and take time of this two operation without take into account the calls for plan or memory allocations.

FFTW is configurated as :

./configure --prefix=/home/.... --enable-threads

I try with -sse2 option too but with same results, 1 thread is always better.

I compile with :

gcc 1DFFTW.c -o 1DFFTW -I/$HOME/opt/fftw-3.3.2/include -L/$HOME/opt/fftw-3.3.2/lib -lrt -lfftw3_threads -lfftw3 -lpthread -lm

The important part of code is :

  if(nThreads>1){
     int err=fftw_init_threads();
     if (err==0) 
        printf("thread creation error : %d\n",err);
     else 
        fftw_plan_with_nthreads(nThreads);
  }
  int i;
  fftw_complex *in;
  fftw_complex *in2;

  fftw_complex *out;

  fftw_plan plan_backward;
  fftw_plan plan_forward;

  struct timespec start, stop;
  printf ( "\n" );
  printf ( "N= %d \n",n);

  in = fftw_malloc ( sizeof ( fftw_complex ) * n );

  srand ( time(NULL) );

  for ( i = 0; i < n; i++ )
  {
    in[i][0] = rand() / (double)RAND_MAX;
    in[i][1] = rand() / (double)RAND_MAX;
  }


  out = fftw_malloc ( sizeof ( fftw_complex ) * n );

  in2 = fftw_malloc ( sizeof ( fftw_complex ) * n );


  plan_forward = fftw_plan_dft_1d ( n, in, out, FFTW_FORWARD, FFTW_ESTIMATE );


  plan_backward = fftw_plan_dft_1d ( n, out, in2, FFTW_BACKWARD, FFTW_ESTIMATE );

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&start);

  fftw_execute ( plan_forward );


  fftw_execute ( plan_backward );

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&stop);

Solution

The multithreaded algorithm has overhead associated with distributing the task among multiple CPUs and consolidating the results of the individual sub-problem. You are measuring the CPU time, not the wall clock time.

If you want to minimize CPU time, use one thread. That way, there's no threading overhead. If you want to minimize wall time, use more threads.