I am trying to optimize the array multiplication of the following type in fortran:
do i = 1, n
do j = 1, n
do k = 1, n
do i1 = 1, n
do j1 = 1, n
do k1 = 1, n
B(k1,j1,i1) = B(k1,j1,i1) + &
A(k,j,i)*v1(k,k1)*v2(j,j1)*v3(i,i1)
enddo
enddo
enddo
enddo
enddo
enddo
Here A and B are of size (n,n,n) while v1, v2, and v3 are (n,n). It seems that there ought to be vectorization of some sort that would allow me to speed up this evaluation. I tried different ordering of the indices. Is there a faster way of calculating B?
EDIT:
Alright, one can use matmul to obtain about 10x speed up:
do i = 1, n
do i1 = 1, n
do j = 1, n
do j1 = 1, n
B(:,j1,i1) = B(:,j1,i1) + &
matmul(A(:,j,i),v1)*v2(j,j1)*v3(i,i1)
enddo
enddo
enddo
enddo
Can we do even better?
You can at the very least reduce two of the loop pairs to matrix multiplication, using either matmul
or BLAS/LAPACK, e.g.
do i1=1,n
do i=1,n
B(:,:,i1) = B(:,:,i1) &
& + matmul(matmul(transpose(v1), A(:,:,i)), v2) * v3(i,i1)
enddo
enddo
You can almost certainly speed this up further by caching the results of the double matmul
, e.g.
real :: intermediate(n,n,n)
do i=1,n
intermediate(:,:,i) = matmul(matmul(transpose(v1), A(:,:,i)), v2)
enddo
do i1=1,n
do i=1,n
B(:,:,i1) = B(:,:,i1) + intermediate(:,:,i) * v3(i,i1)
enddo
enddo
Which might in turn be sped up with more matmul
, e.g.
real :: intermediate(n,n,n)
do i=1,n
intermediate(:,:,i) = matmul(matmul(transpose(v1), A(:,:,i)), v2)
enddo
do j1=1,n
B(:,j1,:) = matmul(intermediate(:,j1,:), v3))
enddo