# Do most compilers optimize MATMUL(TRANSPOSE(A),B)?

Tags: fortran, gfortran, intel-fortran, micro-optimization


In a Fortran program, I need to compute several expressions like **M**ᵀ · *v*. I was wondering whether writing `MATMUL(TRANSPOSE(M),v)` would unfold at compile time into code as efficient as `MATMUL(N,v)`, where `N` is explicitly stored as `N=TRANSPOSE(M)`. I am specifically interested in the gnu and ifort compilers with "strong" optimization flags (`-O2`, `-O3`, or `-Ofast`, for instance).
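For concreteness, the two alternatives under discussion can be sketched as follows (a minimal, hypothetical example; the size `sz` and the matrix names are placeholders, with `N` playing the role of the explicitly stored transpose):

```fortran
program transpose_matvec
  implicit none
  integer, parameter :: sz = 3
  double precision :: M(sz,sz), N(sz,sz), v(sz), w1(sz), w2(sz)

  call random_number(M)
  call random_number(v)

  ! variant 1: let the compiler handle the transpose inline
  w1 = matmul(transpose(M), v)

  ! variant 2: store the transpose explicitly in a temporary
  N = transpose(M)
  w2 = matmul(N, v)

  ! both variants compute the same product M^T * v
  print *, maxval(abs(w1 - w2))   ! near zero
end program transpose_matvec
```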

Solution

Below you find execution times for the various methods.

system:

- Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
- cache size : 6144 KB
- RAM : 16 GB
- GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
- ifort (IFORT) 18.0.5 20180823
- BLAS : for the gnu compiler, the default (reference) BLAS was used

compilation:

```
[gnu]   $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90
```

execution:

```
[gnu]   $ ./a.out > matmul.gnu.txt
[intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
```

In order to make the results as neutral as possible, I rescaled the timings by the average time of an equivalent set of operations. Threading was ignored.

### matrix times vector

Six different implementations were compared:

**manual:**`w=0.0d0; do j=1,n; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do`

**matmul:**`matmul(P,v)`

**blas N:**`dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)`

**matmul-transpose:**`matmul(transpose(P),v)`

**matmul-transpose-tmp:**`Q=transpose(P); w=matmul(Q,v)`

**blas T:**`dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)`

In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall, the use of a temporary is not advised for either `gfortran` or `ifort`: both compilers optimize `MATMUL(TRANSPOSE(P),v)` much better. While gfortran's `MATMUL` implementation is faster than the default BLAS, `ifort` clearly shows that MKL BLAS is faster.

**figure 1:** Matrix-vector multiplication. Comparison of various implementations run with `gfortran`. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.

**figure 2:** Matrix-vector multiplication. Comparison of various implementations run with a single-threaded `ifort` compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
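As an aside (not part of the benchmark above): one reason the naive manual loop trails the intrinsic is loop order. With `j` outer and `k` inner it reads `P` row-wise, which is strided access in Fortran's column-major storage; swapping the loops gives stride-1 access. A minimal sketch of the cache-friendly ordering:

```fortran
program matvec_order
  implicit none
  integer, parameter :: n = 4
  integer :: j, k
  double precision :: P(n,n), v(n), w(n)

  call random_number(P)
  call random_number(v)

  ! k outer, j inner: the inner loop walks down a column of P,
  ! which is contiguous in column-major storage
  w = 0.0d0
  do k = 1, n
    do j = 1, n
      w(j) = w(j) + P(j,k) * v(k)
    end do
  end do

  ! sanity check against the intrinsic
  print *, maxval(abs(w - matmul(P, v)))   ! near zero
end program matvec_order
```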

### matrix times matrix

Six different implementations were compared:

**manual:**`Q=0.0d0; do l=1,n; do j=1,n; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do`

**matmul:**`matmul(P,P)`

**blas N:**`dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)`

**matmul-transpose:**`matmul(transpose(P),P)`

**matmul-transpose-tmp:**`Q=transpose(P); R=matmul(Q,P)`

**blas T:**`dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)`

In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the vector case, the use of a temporary is only advised for `gfortran`. While gfortran's `MATMUL` implementation is faster than the default BLAS, `ifort` clearly shows that MKL BLAS is faster. Remarkably, the manual implementation is comparable to MKL BLAS.
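The same column-major consideration applies to the triple loop. Placing `j` innermost makes both the read of `P(j,k)` and the write of `Q(j,l)` unit-stride; a sketch of that ordering (an illustrative variant, not the code timed above):

```fortran
program matmat_order
  implicit none
  integer, parameter :: n = 4
  integer :: j, k, l
  double precision :: P(n,n), Q(n,n)

  call random_number(P)

  ! l-k-j ordering: the innermost loop accesses P(:,k) and Q(:,l)
  ! with unit stride in column-major storage
  Q = 0.0d0
  do l = 1, n
    do k = 1, n
      do j = 1, n
        Q(j,l) = Q(j,l) + P(j,k) * P(k,l)
      end do
    end do
  end do

  ! sanity check against the intrinsic
  print *, maxval(abs(Q - matmul(P, P)))   ! near zero
end program matmat_order
```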

**figure 3:** Matrix-matrix multiplication. Comparison of various implementations run with `gfortran`. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

**figure 4:** Matrix-matrix multiplication. Comparison of various implementations run with a single-threaded `ifort` compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.

The code used:

```
program matmul_test
  implicit none
  double precision, dimension(:,:), allocatable :: P, Q, R
  double precision, dimension(:),   allocatable :: v, w
  integer :: n, i, j, k, l
  double precision, dimension(12) :: t1, t2

  do n = 1, 1000
    allocate(P(n,n), Q(n,n), R(n,n), v(n), w(n))
    call random_number(P)
    call random_number(v)
    i = 0

    ! manual matrix-vector product
    i = i + 1
    call cpu_time(t1(i))
    w = 0.0d0
    do j = 1, n; do k = 1, n; w(j) = w(j) + P(j,k)*v(k); end do; end do
    call cpu_time(t2(i))

    ! matmul
    i = i + 1
    call cpu_time(t1(i))
    w = matmul(P, v)
    call cpu_time(t2(i))

    ! blas N
    i = i + 1
    call cpu_time(t1(i))
    call dgemv('N', n, n, 1.0D0, P, n, v, 1, 0.0D0, w, 1)
    call cpu_time(t2(i))

    ! matmul-transpose
    i = i + 1
    call cpu_time(t1(i))
    w = matmul(transpose(P), v)
    call cpu_time(t2(i))

    ! matmul-transpose-tmp
    i = i + 1
    call cpu_time(t1(i))
    Q = transpose(P)
    w = matmul(Q, v)
    call cpu_time(t2(i))

    ! blas T
    i = i + 1
    call cpu_time(t1(i))
    call dgemv('T', n, n, 1.0D0, P, n, v, 1, 0.0D0, w, 1)
    call cpu_time(t2(i))

    ! manual matrix-matrix product
    i = i + 1
    call cpu_time(t1(i))
    Q = 0.0d0
    do l = 1, n; do j = 1, n; do k = 1, n
      Q(j,l) = Q(j,l) + P(j,k)*P(k,l)
    end do; end do; end do
    call cpu_time(t2(i))

    ! matmul
    i = i + 1
    call cpu_time(t1(i))
    Q = matmul(P, P)
    call cpu_time(t2(i))

    ! blas N
    i = i + 1
    call cpu_time(t1(i))
    call dgemm('N', 'N', n, n, n, 1.0D0, P, n, P, n, 0.0D0, R, n)
    call cpu_time(t2(i))

    ! matmul-transpose
    i = i + 1
    call cpu_time(t1(i))
    Q = matmul(transpose(P), P)
    call cpu_time(t2(i))

    ! matmul-transpose-tmp
    i = i + 1
    call cpu_time(t1(i))
    Q = transpose(P)
    R = matmul(Q, P)
    call cpu_time(t2(i))

    ! blas T
    i = i + 1
    call cpu_time(t1(i))
    call dgemm('T', 'N', n, n, n, 1.0D0, P, n, P, n, 0.0D0, R, n)
    call cpu_time(t2(i))

    write(*,'(I6,12D25.17)') n, t2 - t1
    deallocate(P, Q, R, v, w)
  end do
end program matmul_test
```
