I previously asked about counting the higher elements in a row/column and got a really good answer. However, that approach does not let me modify the current element before counting.
My input dataframe is something like this:
array([[-1, 7, -2, 1, 4],
[ 6, 3, -3, 5, 1]])
Basically, I would like an output matrix that shows, for each element, how many values are higher in the given row and column, like this:
array([[3, 0, 4, 2, 1],
[0, 2, 4, 1, 3]], dtype=int64)
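To make the target concrete, the row-wise count can be written directly with NumPy broadcasting. This is only a sketch for small inputs: the helper name `count_higher_in_row` is made up for illustration, and it builds a full pairwise comparison per row, so it would not scale to a very large matrix.

```python
import numpy as np

def count_higher_in_row(x):
    """For each element, count the values in the same row that are strictly larger."""
    # x[:, None, :] has shape (rows, 1, cols); x[:, :, None] has shape (rows, cols, 1).
    # Broadcasting gives, for every (row, j, k), whether x[row, k] > x[row, j].
    return (x[:, None, :] > x[:, :, None]).sum(axis=2)

corr = np.array([[-1, 7, -2, 1, 4],
                 [ 6, 3, -3, 5, 1]])
print(count_higher_in_row(corr))
# [[3 0 4 2 1]
#  [0 2 4 1 3]]
```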
The scipy rankdata function works really well here. (Thanks to @Tom)
The tricky part is this: since the matrix is a correlation matrix and the scores are between -1 and 1, I would like to add one intermediate step (a normalization factor) before counting the higher values:
If the element is negative, add 3 to that element and then count how many values in the row are higher.
If the element is positive, subtract 3 from that element and then count how many values in the row are higher.
e.g.:
The first element of the row is negative, so we add 3 and the row becomes
2 7 -2 1 4
-> the number of values higher than that element is 2
The second element of the row is positive, so we subtract 3 and the row becomes
-1 4 -2 1 4
-> the number of values higher than that element is 0
...
So we apply this normalization to one element at a time, and the desired row-wise output would be:
2 0 2 3 1
1 3 4 2 3
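The rule above can be written down literally as a slow reference implementation, useful for checking a faster version on small inputs. This is a sketch, not a solution for the full matrix: the function name `adjusted_count_reference` is made up here, and leaving zeros unchanged is an assumption, since the rule only covers negative and positive elements.

```python
import numpy as np

def adjusted_count_reference(a):
    """Slow reference: shift one element at a time, then count higher values in its row."""
    out = np.zeros(a.shape, dtype=np.int64)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            row = a[i].copy()          # work on a copy so the input stays untouched
            if row[j] < 0:
                row[j] += 3            # negative element: add 3
            elif row[j] > 0:
                row[j] -= 3            # positive element: subtract 3
            # assumption: an element equal to 0 is left as it is
            out[i, j] = np.count_nonzero(row > row[j])
    return out

corr = np.array([[-1, 7, -2, 1, 4],
                 [ 6, 3, -3, 5, 1]])
print(adjusted_count_reference(corr))
# [[2 0 2 3 1]
#  [1 3 4 2 3]]
```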
I don't want to use a loop for that because the matrix is 11k x 12k, so it takes too much time.
If I use rankdata with a lambda, then instead of adjusting one element at a time, it adds and subtracts across all the row values at the same time, which is not what I want.
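import numpy as np
from scipy.stats import rankdata
corr = np.array([[-1, 7, -2, 1, 4],
                 [ 6, 3, -3, 5, 1]])
def element_wise_bigger_than(x, axis):
    # number of elements along `axis` that are strictly greater than each element
    return x.shape[axis] - rankdata(x, method='max', axis=axis)
# shifts every element of the matrix at once (the unwanted behaviour)
ld = lambda t: t + 3 if t < 0 else t - 3
f = np.vectorize(ld)
element_wise_bigger_than(f(corr), 1)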
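To see why this does not give the desired result: `f(corr)` shifts every element first, so each element ends up compared against already shifted neighbours instead of the original row. The sketch below reproduces that with NumPy only, using `np.where` in place of the lambda and a broadcast comparison in place of `rankdata` (both are substitutions made here to keep the example short).

```python
import numpy as np

corr = np.array([[-1, 7, -2, 1, 4],
                 [ 6, 3, -3, 5, 1]])

# Same mapping as the lambda: +3 for negatives, -3 otherwise, applied to ALL elements at once.
shifted_all = np.where(corr < 0, corr + 3, corr - 3)
print(shifted_all)
# [[ 2  4  1 -2  1]
#  [ 3  0  0  2 -2]]

# Count strictly larger values within each row of the fully shifted matrix.
counts = (shifted_all[:, None, :] > shifted_all[:, :, None]).sum(axis=2)
print(counts)
# [[1 0 2 4 2]
#  [0 2 2 1 4]]   <- differs from the desired [[2 0 2 3 1], [1 3 4 2 3]]
```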
A possible solution, based on numba and numba's prange to parallelize the for loop:
from numba import njit, prange, set_num_threads
import numpy as np

@njit(parallel=True)
def get_horizontal(a):
    z = np.zeros((a.shape[0], a.shape[1]), dtype=np.int32)
    for i in prange(a.shape[0]):  # rows are processed in parallel
        for j in range(a.shape[1]):
            aux = a[i, j]  # remember the original value
            # shift only the current element (zeros are left unchanged)
            if a[i, j] < 0:
                a[i, j] += 3
            elif a[i, j] > 0:
                a[i, j] -= 3
            # count the row values strictly higher than the shifted element
            z[i, j] = (a[i, j] < a[i, :]).sum()
            a[i, j] = aux  # restore the original value
    return z

a = np.array([[-1, 7, -2, 1, 4],
              [ 6, 3, -3, 5, 1]])
set_num_threads(6) # to use only 6 threads
get_horizontal(a)
Runtime:
By using the following array,
a = np.random.randint(-10, 10, size=(11000, 12000))
the runtime, on my computer, is less than 1 minute.
Output:
array([[2, 0, 2, 3, 1],
[1, 3, 4, 2, 3]], dtype=int32)
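As a possible alternative without numba, the same per-element rule can be sketched with a sort and `np.searchsorted` per row. This is a different technique from the loop above, not a drop-in replacement, and it has not been timed on the large array. The function name `get_horizontal_sorted` is made up here. The idea: count the original row values above the shifted value, then remove the element's own original value from that count whenever it was included.

```python
import numpy as np

def get_horizontal_sorted(a):
    """Per element: count the row values higher than that element after its +/-3 shift."""
    n_rows, n_cols = a.shape
    # Shifted value of each element: +3 if negative, -3 if positive, unchanged if zero.
    shifted = a - 3 * np.sign(a)
    out = np.empty(a.shape, dtype=np.int64)
    for i in range(n_rows):
        row_sorted = np.sort(a[i])
        # Number of ORIGINAL row values strictly greater than each shifted value.
        greater = n_cols - np.searchsorted(row_sorted, shifted[i], side="right")
        # The element's own original value has been replaced by the shifted one,
        # so drop it from the count in the cases where it was counted above.
        out[i] = greater - (a[i] > shifted[i])
    return out

a = np.array([[-1, 7, -2, 1, 4],
              [ 6, 3, -3, 5, 1]])
print(get_horizontal_sorted(a))
# [[2 0 2 3 1]
#  [1 3 4 2 3]]
```

This still loops over rows in Python, but each row costs one sort plus one binary search per element rather than a full comparison per element.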