
How to do 100000 times 2d fft in a faster way using python?


I have a 3d numpy array with a shape of (100000, 256, 256), and I'd like to run a 2D FFT on each 256x256 slice, i.e. 100000 separate FFTs.

I have tested the speed on a single slice and on a small stack with the minimal code below.

import numpy as np
a = np.random.random((256, 256))
b = np.random.random((10, 256, 256))

%timeit np.fft.fft2(a)

%timeit np.fft.fftn(b, axes=(1, 2,))

Which gives the following:

872 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

6.46 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

At roughly 0.65 ms per slice, 100000 FFTs will take more than a minute.

Is there any faster way to do multiple fft or ifft at the same time?

Update: After a bit of searching, I found cupy, which seems like it could help.
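Another CPU-side option worth checking (not used in the code above, so treat it as an assumption about your environment): scipy.fft, the successor to scipy.fftpack, accepts a `workers` argument that parallelizes batched transforms across cores. A minimal sketch:

```python
import numpy as np
from scipy import fft

b = np.random.random((100, 256, 256))

# workers=-1 uses all available CPU cores for the batch of 2D FFTs
bf = fft.fftn(b, axes=(1, 2), workers=-1)

# The result matches the single-threaded numpy version
assert np.allclose(bf, np.fft.fftn(b, axes=(1, 2)))
```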


Solution

  • pyfftw, wrapping the FFTW library, is likely faster than the FFTPACK library wrapped by np.fft and scipy.fftpack. After all, FFTW stands for Fastest Fourier Transform in the West.

    The minimal code is:

    import numpy as np
    import pyfftw
    import multiprocessing
    b = np.random.random((100, 256, 256))
    bb = pyfftw.empty_aligned((100, 256, 256), dtype='float64')
    bf = pyfftw.empty_aligned((100, 256, 129), dtype='complex128')
    fft_object_b = pyfftw.FFTW(bb, bf, axes=(1, 2), flags=('FFTW_MEASURE',),
                               direction='FFTW_FORWARD',
                               threads=multiprocessing.cpu_count())
    bb[:] = b  # copy into the aligned array; 'bb = b' would rebind the name
    fft_object_b(bb)
    

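The output shape (100, 256, 129) comes from the real-to-complex transform: for real input, the last FFT axis stores only the 256 // 2 + 1 = 129 non-redundant frequencies. This corresponds to numpy's rfftn, as this numpy-only check (not part of the original answer) illustrates:

```python
import numpy as np

b = np.random.random((10, 256, 256))

# Real-input FFT keeps only the non-redundant half of the last axis:
# 256 // 2 + 1 == 129 complex coefficients
bf = np.fft.rfftn(b, axes=(1, 2))
print(bf.shape)  # (10, 256, 129)

# The stored half agrees with the corresponding slice of the full complex FFT
full = np.fft.fftn(b, axes=(1, 2))
assert np.allclose(bf, full[:, :, :129])
```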
    Here is an extended code timing the execution of np.fft and pyfftw:

    import numpy as np
    from timeit import default_timer as timer
    import multiprocessing
    a = np.random.random((256, 256))
    b = np.random.random((100, 256, 256))
    
    start = timer()
    for i in range(10):
        np.fft.fft2(a)
    end = timer()
    print("np.fft.fft2, 1 slice", (end - start)/10)
    
    start = timer()
    for i in range(10):
        bf = np.fft.fftn(b, axes=(1, 2))
    end = timer()
    print("np.fft.fftn, 100 slices", (end - start)/10)
    print("bf[3,42,42]", bf[3, 42, 42])
    
    
    import pyfftw
    
    aa = pyfftw.empty_aligned((256, 256), dtype='float64')
    af = pyfftw.empty_aligned((256, 129), dtype='complex128')
    bb = pyfftw.empty_aligned((100, 256, 256), dtype='float64')
    bf = pyfftw.empty_aligned((100, 256, 129), dtype='complex128')
    print('number of threads:', multiprocessing.cpu_count())
    
    fft_object_a = pyfftw.FFTW(aa, af, axes=(0, 1), flags=('FFTW_MEASURE',),
                               direction='FFTW_FORWARD',
                               threads=multiprocessing.cpu_count())
    
    fft_object_b = pyfftw.FFTW(bb, bf, axes=(1, 2), flags=('FFTW_MEASURE',),
                               direction='FFTW_FORWARD',
                               threads=multiprocessing.cpu_count())
    
    aa[:] = a  # copy data into the aligned arrays instead of rebinding the names
    bb[:] = b
    start = timer()
    for i in range(10):
        fft_object_a(aa)
    end = timer()
    print("pyfftw, 1 slice", (end - start)/10)
    
    start = timer()
    for i in range(10):
        fft_object_b(bb)
    end = timer()
    print("pyfftw, 100 slices", (end - start)/10)
    print("bf[3,42,42]", bf[3, 42, 42])
    

    Finally, the outcome is a significant speedup: on my computer, using 2 threads, pyfftw proves about 10 times faster than np.fft.

    np.fft.fft2, 1 slice 0.00459032058716
    np.fft.fftn, 100 slices 0.478203487396
    bf[3,42,42] (-38.190256258791734+43.03902512127183j)
    number of threads: 2
    pyfftw, 1 slice 0.000421094894409
    pyfftw, 100 slices 0.0439268112183
    bf[3,42,42] (-38.19025625879178+43.03902512127183j)
    

    Your computer seems much better than mine!