Search code examples
c++multithreadingopenmpfftw

How to enable OpenMP with Multi-threaded FFTW in C/C++?


I was struggling with slow implemetnation of my FFTW in C/C++ when I read in their documentation that you should create all fftw plans once and execute many times, which I was able to implement correctly. Now, I am trying to use built-in openmp functionality with Parallel FFTW. I followed with their documentation regarding fftw thread-saftey:

The only thread-safe routine in FFTW is fftw_execute.
All other routines (e.g. the planner) should only be called from one thread at a time.
So, for example, you can wrap a semaphore lock around any calls to the planner.

And that's what I did using std::lock_guard<std::mutex> lock(... However, the issue is now I am not sure how or where to call the following:

Third, before creating a plan that you want to parallelize, you should call:

void fftw_plan_with_nthreads(int nthreads);

Whatever example code I have right now does not seem to work or be ''parallelized''!

Code Example:

#include"omp.h"  
#include <mutex>
#include <thread>
#include "fftw3.h"

static const int nx = 128;
static const int ny = 128;
static const int ncomp = 2;
static const int nyk = ny/2 + 1;

class MyFftwClass {
public:
    MyFftwClass(void);
    ~MyFftwClass(void);
    void execute(double rArr[], double cArr[][ncomp]);

private:
    static fftw_plan s_plan; // <-- shared by all instances
    double *m_buffer_in;
    fftw_complex *m_buffer_out;
};


class MyFftwClass1 {
public:
    MyFftwClass1(void);
    ~MyFftwClass1(void);
    void execute1(double cArr[][ncomp],double rArr[]);

private:
    static fftw_plan s_plan1; // <-- shared by all instances
    fftw_complex *m_buffer_in1;
    double *m_buffer_out1;
};

int main(){ 
    
fftw_init_threads(); //before calling any FFTW routines, you should call the function
//This function, which should only be called once (probably in your main() function), performs any one-time initialization required to use threads on your system.
int nThreads =1;//4;
omp_set_num_threads(nThreads);

double *Function;
Function= (double*) fftw_malloc(nx*ny*sizeof(double)); 
    for (int i = 0; i < nx; i++){
        for (int j = 0; j < ny; j++){
          Function[j + ny*i]  = //some initialization; 
            
        }
    }

fftw_complex *Functionk;
Functionk= (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex)); 
memset(Functionk, 42, nx*nyk* sizeof(fftw_complex));

MyFftwClass r2c1; //declare r2c1 of type MyFftwClass
r2c1.execute(Function,Functionk);

double *Out;
Out = (double*) fftw_malloc(nx*ny*sizeof(double)); 
memset(Out, 42, nx*ny* sizeof(double));

MyFftwClass1 c2r1; //declare r2c1 of type MyFftwClass
c2r1.execute1(Functionk,Out);
//fftw_free stuff
}
std::mutex g_my_fftw_mutex; // <-- must lock before using any fftw_*() function, except for fftw_execute*()
MyFftwClass1::MyFftwClass1(void)
{
    // serialize the initialization!
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);

    // allocate separate buffers for each instance
    m_buffer_in1 = fftw_alloc_complex(nx * nyk);
    m_buffer_out1 = fftw_alloc_real(nx * ny); 
    if (!(m_buffer_in1 && m_buffer_out1))
    {
        throw new std::runtime_error("Failed to allocate memory!");
    }

    // create plan *once* for all instances/threads
    if (!s_plan1)
    {
        s_plan1 = fftw_plan_dft_c2r_2d(nx, ny, m_buffer_in1, m_buffer_out1, FFTW_PATIENT);
        if (!s_plan1)
        {
            throw new std::runtime_error("Failed to create plan!");
        }
    }

}
MyFftwClass1::~MyFftwClass1(void)
{
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);
    fftw_destroy_plan(s_plan1);
    fftw_free(m_buffer_in1);
    fftw_free(m_buffer_out1);
}

void MyFftwClass1::execute1(double cArr[][ncomp],double rArr[]) //void MyFftwClass::execute(double *const in, fftw_complex *const out)
{
    // No serialization is needed here
    memcpy(m_buffer_in1,cArr, sizeof(fftw_complex) * nx*(nyk));
    fftw_execute_dft_c2r(s_plan1, m_buffer_in1, m_buffer_out1); //instead of fftw_excute(plan)
    memcpy(rArr, m_buffer_out1,sizeof(double) * nx*ny);
    //(rArr, 1.0 / (nx*ny), rArr); //renormalize 
}

fftw_plan MyFftwClass1::s_plan1 = NULL;

std::mutex g_my_fftw_mutex; // <-- must lock before using any fftw_*() function, except for fftw_execute*()
MyFftwClass::MyFftwClass(void)
//int r2cfft_initialize(r2cfft_t *const ctx, const std::size_t nx, const std::size_t ny)
{
    // serialize the initialization!
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);

    //allocating buffers first before plan creating gets rid off seg fault error

    // allocate separate buffers for each instance
    m_buffer_in = fftw_alloc_real(nx * ny); 
    m_buffer_out = fftw_alloc_complex(nx * nyk);
    if (!(m_buffer_in && m_buffer_out))
    {
        throw new std::runtime_error("Failed to allocate memory!");
    }

    // create plan *once* for all instances/threads
   if (!s_plan)
    {
        s_plan = fftw_plan_dft_r2c_2d(nx, ny, m_buffer_in, m_buffer_out, FFTW_PATIENT);
        if (!s_plan)
        {
            throw new std::runtime_error("Failed to create plan!");
        }
    }

    
}

MyFftwClass::~MyFftwClass(void)
{
    std::lock_guard<std::mutex> lock(g_my_fftw_mutex);
    fftw_free(m_buffer_in);
    fftw_free(m_buffer_out);
    fftw_destroy_plan(s_plan);
}

void MyFftwClass::execute(double rArr[], double cArr[][ncomp]) //void MyFftwClass::execute(double *const in, fftw_complex *const out)
{
    // No serialization is needed here
    memcpy(m_buffer_in, rArr,  sizeof(double) * nx*ny);
    fftw_execute_dft_r2c(s_plan, m_buffer_in, m_buffer_out); //instead of fftw_excute(plan)
    memcpy(cArr, m_buffer_out, sizeof(fftw_complex) * nx*(nyk));
}

fftw_plan MyFftwClass::s_plan = NULL;

Does this look thread-safe and/or parallelized? Thanks!


Solution

  • Two things:

    1. You plan with FFTW_PATIENT. That is very expensive, doubly so with threads. You need to either reuse plans often enough doing many, many transformations; or you need to save the planned wisdom so that the cost is only incurred once

    2. Your FFT is rather small. It may not be possible to parallelize it

    Minimal fix

    Here is a minimally modified version. There is plenty more issues that I consider outdated practise, or which could be done better. But Let's focus on this first:

    int main(){
    
        fftw_init_threads();
        fftw_import_system_wisdom();
        fftw_import_wisdom_from_filename("/var/tmp/my_application_fftw.wisdom");
        int nThreads =1;//4;
        omp_set_num_threads(nThreads);
        fftw_plan_with_nthreads(nThreads);
    
    
        double *Function;
        Function= (double*) fftw_malloc(nx*ny*sizeof(double));
        for (int i = 0; i < nx; i++){
            for (int j = 0; j < ny; j++){
                Function[j + ny*i]  = 1.;//some initialization;
            }
        }
    
        fftw_complex *Functionk;
        Functionk= (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex));
        memset(Functionk, 42, nx*nyk* sizeof(fftw_complex));
    
        double *Out;
        Out = (double*) fftw_malloc(nx*ny*sizeof(double));
        memset(Out, 42, nx*ny* sizeof(double));
        MyFftwClass r2c1; //declare r2c1 of type MyFftwClass
        MyFftwClass1 c2r1; //declare r2c1 of type MyFftwClass
        for(int i = 0; i < 100000; ++i) {
            r2c1.execute(Function,Functionk);
            c2r1.execute1(Functionk,Out);
        }
        fftw_export_wisdom_to_filename("/var/tmp/my_application_fftw.wisdom");
        //fftw_free stuff
    }
    

    Compile with g++ -O3 -lfftw -lfftw_omp -fopenmp (or -lfftw_threads if fftw_omp is not available)

    Notice how I'm reusing the same two plans 100,000 times. You actually need to make use of these plans for a considerable number of runs. And we cache the planning result across application calls. The final application needs to put more thought into where you want to put that file. For example on a cluster it should be in your home directory, not some compute node's temporary directory.

    The whole mutex stuff is not relevant to the test case you wrote. It is only required if you create plans from within multiple threads in parallel. It is possible that you do this in other parts of your code, but not here. It will not harm the test case, but it is not necessary either.

    Benchmark results

    If I run this with the 128x128 sized FFT and 100,000 iterations, 16 threads take 19 seconds on the first run (with planning), 7 seconds on all future runs. One thread takes 5.8 seconds initially, 4.6 seconds with cached wisdom.

    If I run this with a 1024x1024 sized FFT, 16 thread, 1000 iterations, I get 3 minutes runtime the first time, then 5 seconds the second time; using all threads. With only one thread, the first run takes 34 seconds, all others 9 seconds.

    In conclusion

    Your FFT size is probably too small to make efficient use of multithreading. You will get better results by using batches of independent, small FFTs using fftw_plan_many_dft_r2c or by calling many independent single-threaded FFTs from an outer pragma omp parallel section