Search code examples
pythonperformancefortranf2py

Optimize fortran code that is running slower than the its python version


I have a Fortran code that I'm compiling with f2py to run in python. I wrote this code to be a faster approach to a python code that already existed, but it's actually running slower than its python counterpart, which makes me think that it isn't optimized. (This is probably related to this question although the culprit in that case was something that I'm not using here, so it doesn't apply to me.)

The code gets a 4D matrix as input performs a 2D correlation using the dimensions 3 and 4 (as if they were x and y).

The code is this one:

SUBROUTINE correlate4D(Vars, nxc, nyc, nv, nt, vCorr)
! Correlates 4D array assuming that x, y are dims 3 and 4
! Returns a 4D array with shape nv, nt, 2*nxc+1, 2*nyc+1
IMPLICIT NONE
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(3)
ny=dims(4)
dummysize=nx*ny
allocate(rolled(nv, nt, nx, ny))
allocate(prerolled(nv, nt, nx, ny))
allocate(Mean3d(nv, nt, nx))
allocate(Mean(nv, nt))

Mean3d = sum(Vars, dim=4)/size(Vars, dim=4)
Mean = sum(Mean3d, dim=3)/size(Mean3d, dim=3)

! These replace np.arange()
ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

! Calculate the correlation
do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    ! cshift replaces np.roll()
    prerolled = cshift(Vars, ydel(iiy), dim=4)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=3)
        forall (it=1:nt)
            forall (iv=1:nv)
                vCorr(iv,it,iix,iiy) = (sum(Vars(iv,it,:,:) * rolled(iv,it,:,:))/dummysize) / (Mean(iv,it)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

Running this with a matrix of the size (3, 50, 100, 100) takes 251 seconds with this code compiled with f2py and takes only 103 seconds with pure python/numpy code. By the way, this is somewhat the average size of matrix this takes as input should be something like (3, 300, 100, 100), but not much larger than that.

Can anyone point out to me ways in which this code can be optimized?

EDIT

I'm compiling with f2py3.4 -c mwe.f90 -m mwe, and then it can be called with

In [1]: import mwe
In [2]: import numpy as np
In [3]: a=np.random.randn(3,15,100,100)
In [4]: mwe.correlate4d(a, 50, 50, 3, 15)

EDIT2

After reading the comments, I was able to improve it by changing the order of the indices. Now it is about 10% faster than Python, but it's still too slow. I'm sure this can be done faster.

SUBROUTINE correlate4D2(Vars, nxc, nyc, nt, nv, vCorr)
! Correlates 4D array assuming that x, y are dims 1 and 2
! Returns a 4D array with shape 2*nxc+1, 2*nyc+1, nt, nv
IMPLICIT NONE
INTEGER, PARAMETER  ::  dp = SELECTED_REAL_KIND (13)
real(kind=8), intent(in), dimension(:,:,:,:) :: Vars
real(kind=8) :: dummysize
integer, intent(in) :: nxc, nyc
integer :: ii, i, iix, iiy, iv, it, dims(4), nv, nt, nx, ny
integer, dimension(2*nxc+1) :: xdel
integer, dimension(2*nyc+1) :: ydel
!real(kind=8), intent(out) :: vCorr(nv, nt, 2*nxc+1, 2*nyc+1)
real(kind=8), intent(out) :: vCorr(2*nxc+1, 2*nyc+1, nt, nv)
real(kind=8), dimension(:,:,:,:), allocatable :: rolled, prerolled
real(kind=8), dimension(:,:), allocatable :: Mean
real(kind=8), dimension(:,:,:), allocatable :: Mean3d

dims = shape(Vars)
nx=dims(1)
ny=dims(1)
dummysize=nx*ny
allocate(rolled(nx, ny, nt, nv))
allocate(prerolled(nx, ny, nt, nv))
allocate(Mean3d(ny, nt, nv))
allocate(Mean(nt, nv))

Mean3d = sum(Vars, dim=1)/size(Vars, dim=1)
Mean = sum(Mean3d, dim=1)/size(Mean3d, dim=1)

ii=1
do i=-nxc,+nxc
    xdel(ii)=i
    ii=ii+1
enddo
ii=1
do i=-nyc,+nyc
    ydel(ii)=i
    ii=ii+1
enddo

do iiy=1,size(ydel)
    print*,'fortran loop:',iiy,' of', size(ydel)
    prerolled = cshift(Vars, ydel(iiy), dim=2)
    do iix=1,size(xdel)
        rolled = cshift(prerolled, xdel(iix), dim=1)
        forall (iv=1:nv)
            forall (it=1:nt)
                vCorr(iix,iiy,it,iv) = (sum(Vars(:,:,it,iv) * rolled(:,:,it,iv))/dummysize) / (Mean(it,iv)**2)
            endforall
        endforall
    enddo
enddo

END SUBROUTINE

Also, even though there's a dp parameter in the code now (which returns 8, as it should), if I declare variables with real(dp) f2py throws me this error: Parameter 'dp' at (1) has not been declared or is a variable, even though it is declared. So that's why I'm using 8 directly.


Solution

  • Note: A rather long and boring answer follows...

    Because it seems expensive to use cshift() repeatedly for large matrices, I have tried some modifications around cshift. To do so, I first created a minimal version of the OP's code:

    program main
        implicit none
        integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
        real(dp), allocatable, dimension(:,:,:) :: A, Ashift_y, Ashift, B
        integer :: sx, sy, i, t
    
        allocate( A( N, N, nt ), Ashift_y( N, N, nt ), Ashift( N, N, nt ), &
                  B( -N:N-1, -N:N-1, nt ) )
        call initA
    
        do sy = -N, N-1
            if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
    
            Ashift_y = cshift( A, sy, dim=2 )
    
            do sx = -N, N-1
                Ashift = cshift( Ashift_y, sx, dim=1 )
    
                do t = 1, nt
                    B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
                enddo
            enddo
        enddo
    
        print *, "sum(B) = ", sum(B)
        print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
    
    contains
        subroutine initA
            integer ix, iy
            forall( t = 1:nt, iy = 1:N, ix = 1:N )  &   ! (forall not recommended but for brevity)
                    A( ix, iy, t ) = 1.0_dp / ( mod(ix + iy + t, 100) + 1 )
        endsubroutine    
    endprogram
    

    which gives

    sum(B) =    53817771.021093562     
    sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575 
    
    Mac mini (2.3GHz,4-core), gfortran-6.3 -O3 : 50 sec
    Linux (2.6GHz,16-core),   gfortran-4.8 -O3 : 32 sec
    

    Next, because cshift(A,s,dim=1 (or 2)) is periodic with respect to the shift s (with periodicity N), the loop over sx and sy can be split into four parts and only the first quadrant is retained (i.e., sx and sy in [0,N-1]). The data of other quadrants are obtained by simply copying the data of the first quadrant. This gives a reduction of the CPU time by 4. (More simply, we could calculate sx and sy only in [-N/2,N/2] because B for other regions give no new information.)

        do sy = 0, N-1
            if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
            Ashift_y = cshift( A, sy, dim=2 )
    
            do sx = 0, N-1
                Ashift = cshift( Ashift_y, sx, dim=1 )
    
                do t = 1, nt
                    B( sx, sy, t )= sum( A( :, :, t ) * Ashift( :, :, t ) )
                enddo
            enddo
        enddo
    
        print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
    
        !! Make "full" B.
        B( -N :  -1,  0 : N-1, : ) = B( 0 : N-1, 0 : N-1, : )
        B(  0 : N-1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
        B( -N :  -1, -N :  -1, : ) = B( 0 : N-1, 0 : N-1, : )
        print *, "sum(B) = ", sum(B)
    

    The result agrees with the full calculation, as expected:

    sum(B) =    53817771.021093562     
    sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
    
    Mac   : 12.8 sec
    Linux :  8.3 sec
    

    The corresponding Python code may look like this:

    from __future__ import print_function, division
    import numpy as np
    
    N, nt = 100, 50
    
    A = np.zeros( (nt, N, N) )
    B = np.zeros( (nt, N, N) )
    
    for t in range(nt):
        for iy in range(N):
            for ix in range(N):
                A[ t, iy, ix ] = 1.0 / ( (ix + iy + t) % 100 + 1 )
    
    for sy in range( N ):
        if sy % (N // 10) == 0 : print( "sy = ", sy )
        Ashift_y = np.roll( A, -sy, axis=1 )
    
        for sx in range( N ):
            Ashift = np.roll( Ashift_y, -sx, axis=2 )
    
            for t in range( nt ):
                B[ t, sy, sx ] = np.sum( A[ t, :, : ] * Ashift[ t, :, : ] )
    
    print( "sum( B( :, 0:N-1, 0:N-1 ) ) = ",  np.sum( B ) )
    

    which runs with 22--24 sec on both Mac and Linux (python3.5).

    To further reduce the cost, we utilize the fact that cshift may be used in two equivalent ways:

    cshift( array, s ) == array( cshift( [1,2,...,n], s ) )   !! assuming that "array" is declared as a( n )
    

    Then we can rewrite the above code such that cshift() receives only ind = [1,2,...,N]:

        integer, dimension(N) :: ind, indx, indy
        ind = [( i, i=1,N )]
    
        do sy = 0, N-1
            if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
            indy = cshift( ind, sy )
    
            do sx = 0, N-1
                indx = cshift( ind, sx )
    
                do t = 1, nt
                    B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
                enddo
            enddo
        enddo
    

    which runs in ~5 sec on both Mac and Linux. Similar methods may be applicable to Python as well. (I also tried using mod() explicitly for indices to eliminate cshift completely, but a bit surprisingly, it was slower by more than two times than the above code...)

    Even with this reduction, the code becomes slow for large nt (say 300 as shown in the question). In that case, we may resort to the final weapon such that the loop over sy is parallelized:

    program main
        implicit none
        integer, parameter :: N = 100, nt = 50, dp = kind(0.0d0)
    !    integer, parameter :: N = 100, nt = 300, dp = kind(0.0d0)
        real(dp), allocatable, dimension(:,:,:) :: A, B
        integer, dimension(N) :: ind, indx, indy
        integer :: sx, sy, i, t
    
        allocate( A( N, N, nt ), B( -N:N-1, -N:N-1, nt ) )
        call initA
        ind = [( i, i=1,N )]
    
        !$omp parallel do private(sx,sy,t,indx,indy)
        do sy = 0, N-1
            if ( mod( sy, N/10 ) == 0 ) print *, "sy = ", sy
            indy = cshift( ind, sy )
    
            do sx = 0, N-1
                indx = cshift( ind, sx )
    
                do t = 1, nt
                    B( sx, sy, t )= sum( A( :, :, t ) * A( indx, indy, t ) )
                enddo
            enddo
        enddo
        !$omp end parallel do
    
        print *, "sum( B( 0:N-1, 0:N-1, : ) ) = ", sum( B( 0:N-1, 0:N-1, : ) )
    
        ! "contains subroutine initA ..." here
    
    endprogram
    

    The timing data are like this (with gfortran -O3 -fopenmp):

    N = 100, nt = 50
    sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
    Mac:
       serial : 5.3 sec
    2 threads : 2.6 sec
    4 threads : 1.4 sec
    
    N = 100, nt = 50
    sum( B( 0:N-1, 0:N-1, : ) ) =    13454442.755258575     
    Linux:
        serial : 4.8 sec
     2 threads : 2.7 sec
     4 threads : 1.3 sec
     8 threads : 0.7 sec
    16 threads : 0.4 sec
    32 threads : 0.4 sec
    
    N = 100, nt = 300   // heavy case
    sum( B( 0:N-1, 0:N-1, : ) ) =    80726656.531429410     
    Linux:
     2 threads: 16.5 sec
     4 threads:  8.4 sec
     8 threads:  4.4 sec
    16 threads:  2.5 sec
    

    So, if the above code does not have a bug (hopefully!), we can save much CPU time by (1) limiting sx and sy to [0,N-1] (or more simply [-N/2,N/2] without further copy), (2) applying cshift to an index array (rather than data arrays), and/or (3) parallelization over sy (which may be combined with f2py hopefully...)