I would like to do addition and multiplication on top of array transposes.
Given A
is an array.
sum += p(A) * factor(p)
where p
is a permutation/transpose, factor(p)
is a list of prefactors. For example, A
is a 2 dimensional array, the target is
0.1*A + 0.1*transpose(A,(1,0))
I found in my Python
code, when the higher dimensional permutations are applied, the time for array addition is far larger than transpose. Perhaps numpy.transpose
uses pointer in C
. I want to know, is there any method to optimize the timing for the array addition part? numpy.add
does not help much. Should I somehow, only sum the part of transpose which affects the array, the rest part use multiplication? For example, a permutation (0,1,3,2)
, the (0,1)
part multiplies a common factor on top of the previous array. Or use cpython
to improve the performance?
Here is my Python
code
import numpy as np
import time
import itertools as it
ref_list = [0, 1, 2, 3, 4]
p = it.permutations(ref_list)
transpose_list = tuple(p)
print(type(transpose_list),type(transpose_list[0]),transpose_list[0])
n_loop = 2
na = nb = nc = nd = ne = 20
A = np.random.random((na,nb,nc,nd,ne))
sum_A = np.zeros((na,nb,nc,nd,ne))
factor_list = [i*0.1 for i in range(120)]
time_transpose = 0
time_add = 0
time_multiply = 0
for n in range(n_loop):
for m, t in enumerate(transpose_list):
start = time.time()
B = np.transpose(A, transpose_list[m] )
finish = time.time()
time_transpose += finish - start
start = time.time()
B_p = B * factor_list[m]
finish = time.time()
time_multiply += finish - start
start = time.time()
sum_A += B_p
finish = time.time()
time_add += finish - start
print(time_transpose, time_multiply, time_add, time_multiply/time_transpose, time_add/time_transpose)
The output is
0.004961967468261719 1.1218750476837158 3.7830252647399902 226.09480107630213 762.404285988852
The addition time is about ~700 folds larger than the transpose.
I tried to use numba's transpose in How to avoid huge overhead of single-threaded NumPy's transpose?
by adding
import numba as nb
@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
blockSize, tileSize = 256, 32 # To be tuned
n, m = mat.shape
assert blockSize % tileSize == 0
for tmp in nb.prange((m+blockSize-1)//blockSize):
i = tmp * blockSize
for j in range(0, n, blockSize):
tiMin, tiMax = i, min(i+blockSize, m)
tjMin, tjMax = j, min(j+blockSize, n)
for ti in range(tiMin, tiMax, tileSize):
for tj in range(tjMin, tjMax, tileSize):
out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
and use
B = transpose(A, transpose_list[m] )
received
Traceback (most recent call last):
File "transpose_test_v2.py", line 46, in <module>
B = transpose(A, transpose_list[m] )
File "/home/.../lib/python3.8/site-packages/numba/core/dispatcher.py", line 717, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 6d, C), UniTuple(int64 x 6)
or using
B = nb.transpose(A, transpose_list[m] )
and got
B = nb.transpose(A, transpose_list[m] )
AttributeError: 'int' object has no attribute 'transpose'
np.transpose
does not physically transpose the array in memory. It change the strides of the array so that the resulting view is (virtually) transposed. The thing is the memory layout of the view is very inefficient after the transposition so that operations on it are not efficient. You can perform a copy of the transposed view or call np.ascontiguousarray
so to perform the transposition eagerly. Be aware that the current implementation of the Numpy copy of transposed arrays is quite inefficient.
Here are results on my machine:
Initial code:
0.0030066967010498047 1.2316536903381348 1.971841812133789 409.6368249940528 655.8166679882642
With a copy of the transposed array:
2.483657121658325 1.221949577331543 0.6028566360473633 0.4919960837893974 0.2427294133277298
Besides this, preallocating the arrays and using the parameter out
of np.add
and np.multiply
should help to make the multiply and the add faster. For faster performance, you can use a parallel Numba code with loops. Optimizing the transposition in you case is pretty hard due to the high number of dimension.
Update:
Here is a faster Numba implementation that mostly removes the cost of the add/multiply and reduce the cost of the transposition using multiple threads:
# Same setup than in the question
@numba.njit('(float64[:,:,:,:,::1], float64[:,:,:,:,::1], float64, int_, int_, int_, int_, int_)', parallel=True, fastmath=True)
def computeRound(A, sum_A, factor, i0, i1, i2, i3, i4):
size = 20
B = np.transpose(A, (i0, i1, i2, i3, i4))
for j0 in numba.prange(size):
for j1 in range(size):
for j2 in range(size):
for j3 in range(size):
for j4 in range(size):
sum_A[j0, j1, j2, j3, j4] += B[j0, j1, j2, j3, j4] * factor
for n in range(n_loop):
for m, t in enumerate(transpose_list):
i0, i1, i2, i3, i4 = transpose_list[m]
computeRound(A, sum_A, factor_list[m], i0, i1, i2, i3, i4)
This code is 4 times faster than the initial one. It is almost fully bound by the bad access pattern of the transposition. However, this is hard to optimize the transposition. One solution is to apply the transposition in an order that maximize cache locality from the previous transposition (fast but especially hard to implement). A simpler (but slower) solution is simply to manually work on the (i0, i1, i2, i4, i3)
case at the same time than the (i0, i1, i2, i3, i4)
case using the previously-provided optimized Numba transposition code.