How to reduce data movement to improve performance

I'm trying to improve the performance of a Fortran code using OpenACC, but the compiler output shows many transfers between the device and the host.

I tried using data regions to reduce memory movement.

This is the relevant part of the code. The complete subroutine solves a system of equations using LU factorization.

!EXCHANGING AND ELIMINATING COLUMNS

!$acc data copy(a(:,k:n))create(a(:,k:n))   
!$acc kernels

DO J = K + 1, N
    TQ = A(M, J)
    A(M, J) = A(K, J)
    A(K, J) = TQ

    IF (DABS(TQ) .GT. 0) THEN
        DO I = K + 1, N
            A(I, J) = A(I, J) + A(I, K)*TQ
        ENDDO
    END IF
ENDDO
!$acc end kernels
!$acc end data

I expected the computation time to go down. The code works without the data region, but it is very slow. With the data region, the program stops due to an impossible factoring.

Solution

OpenACC compute regions, i.e. "kernels" or "parallel", include an implicit data region. When the compute region is not within the same scope as a structured data region and has no data clauses of its own, the compiler will implicitly copy the data for you. Adding an explicit data region overrides this implicit data movement and gives you more control over when the data is transferred.

The error you're seeing with the data region is most likely due to putting "a" in both a "copy" and a "create" clause. Since a variable can only appear once, the rightmost clause (i.e. create) will be used, and hence your data isn't getting copied to or from the device. To fix this, remove the create clause. (Note that copy will do both a create and a copy.)
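As a minimal sketch of that fix, the loop from the question is shown below with "a" listed in a single "copy" clause. The program wrapper, the array size, and the values chosen for k and m are assumptions made only so the fragment compiles on its own; in the real code they come from the surrounding LU subroutine.

```fortran
program copy_only_sketch
  implicit none
  integer, parameter :: n = 4          ! assumed size, for illustration only
  double precision :: a(n, n), tq
  integer :: i, j, k, m

  ! Hypothetical inputs; the real k and m come from the LU routine.
  k = 1
  m = 2
  do j = 1, n
    do i = 1, n
      a(i, j) = dble(i + n*(j - 1))
    end do
  end do

  ! "a" appears in exactly one clause. copy allocates device memory,
  ! copies the section in at region entry and back out at region exit.
  !$acc data copy(a(:, k:n))
  !$acc kernels
  do j = k + 1, n
    tq = a(m, j)
    a(m, j) = a(k, j)
    a(k, j) = tq
    if (abs(tq) > 0.0d0) then
      do i = k + 1, n
        a(i, j) = a(i, j) + a(i, k)*tq
      end do
    end if
  end do
  !$acc end kernels
  !$acc end data

  print *, a
end program copy_only_sketch
```

Because the directives are comments to a compiler without OpenACC enabled, the same source can be built both ways and the printed results compared.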

However, since you have a data region directly around the compute region, your performance will remain unchanged. To help with this, I'd advise moving the data region to a spot much earlier in the code, such as directly after "a" is allocated or initialized. Then add a "present(a)" clause to the "kernels" directive to enforce that the data is present on the device.
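One way this could look is sketched below: the data region sits in the caller, right after "a" is initialized, and the compute region in the subroutine uses "present(a)". The module and subroutine names (elim_mod, eliminate_columns), the array size, and the k and m values are hypothetical and stand in for the real LU routine.

```fortran
module elim_mod
  implicit none
contains
  ! Hypothetical wrapper around the loop from the question.
  subroutine eliminate_columns(a, n, k, m)
    integer, intent(in) :: n, k, m
    double precision, intent(inout) :: a(n, n)
    double precision :: tq
    integer :: i, j

    ! present(a) asserts that the caller already placed "a" on the device,
    ! so no transfer happens at this compute region.
    !$acc kernels present(a)
    do j = k + 1, n
      tq = a(m, j)
      a(m, j) = a(k, j)
      a(k, j) = tq
      if (abs(tq) > 0.0d0) then
        do i = k + 1, n
          a(i, j) = a(i, j) + a(i, k)*tq
        end do
      end if
    end do
    !$acc end kernels
  end subroutine eliminate_columns
end module elim_mod

program hoisted_data_sketch
  use elim_mod
  implicit none
  integer, parameter :: n = 4          ! assumed size, for illustration only
  double precision, allocatable :: a(:, :)

  allocate(a(n, n))
  call random_number(a)

  ! Data region opened right after initialization: one copy in here,
  ! one copy out at "end data", regardless of how many calls sit between.
  !$acc data copy(a)
  call eliminate_columns(a, n, 1, 2)   ! hypothetical k = 1, m = 2
  !$acc end data

  print *, a
end program hoisted_data_sketch
```

If "a" is not actually on the device when the subroutine runs, "present" fails at run time instead of silently copying, which makes misplaced data regions easy to spot.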

To ensure data is synchronized between the host and device, use "update" directives before and after the compute region. The next step is to start moving the "update" directives outwards while offloading more compute to the device. Ideally, you would offload all code that works on the "A" array so that the data is copied to the device once at the start of the program and back to the host once, when you need to print the results.
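The sketch below illustrates the intermediate stage, where some work on "a" still runs on the host. It assumes a host-side pivot search over column k, which is a hypothetical stand-in for whatever the real subroutine does between compute regions; it is meant to show the data movement only and is not a complete LU factorization.

```fortran
program update_sketch
  implicit none
  integer, parameter :: n = 4          ! assumed size, for illustration only
  double precision :: a(n, n), tq
  integer :: i, j, k, m

  call random_number(a)

  ! The whole array moves to the device once and back once.
  !$acc data copy(a)
  do k = 1, n - 1
    ! Host step (hypothetical): choose pivot row m from column k.
    ! Only that column is refreshed on the host, not the full array.
    !$acc update self(a(:, k))
    m = k - 1 + maxloc(abs(a(k:n, k)), dim=1)

    ! If the host had modified part of "a" here, an
    ! "!$acc update device(...)" for that part would go before the kernels.

    ! Device step: the loop from the question, with no implicit copies.
    !$acc kernels present(a)
    do j = k + 1, n
      tq = a(m, j)
      a(m, j) = a(k, j)
      a(k, j) = tq
      if (abs(tq) > 0.0d0) then
        do i = k + 1, n
          a(i, j) = a(i, j) + a(i, k)*tq
        end do
      end if
    end do
    !$acc end kernels
  end do
  !$acc end data

  print *, a
end program update_sketch
```

Once the host step is itself offloaded, the "update self" line can be dropped and the only transfers left are the ones at the start and end of the data region.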