python pandas numpy matrix scipy

Manipulate an element before counting the higher elements in its row


I previously asked about counting the higher elements in a row/column and got a really good answer. However, that approach does not let me modify the current element before counting.

My input dataframe is something like this:

array([[-1,  7, -2,  1,  4],
       [ 6,  3,  -3,  5,  1]])

Basically, I would like an output matrix that shows, for each element, how many values in the same row (or column) are higher. Row-wise, that looks like this:

array([[3, 0, 4, 2, 1],
       [0, 2, 4, 1, 3]], dtype=int64)

SciPy's rankdata function works really well here (thanks to @Tom).

Here is the tricky part: this matrix is a correlation matrix with scores between -1 and 1,
so I would like to add one intermediate step (a normalization factor) before counting the higher values:

If the element is negative, add 3 to that element and then count how many values in the row are higher.
If the element is positive, subtract 3 from that element and then count how many values in the row are higher.

e.g.:

The first element of the first row is negative, so we add 3 to it and the row becomes
2 7 -2 1 4 -> the number of values higher than that element is 2
The second element of the row is positive, so we subtract 3 from it and the row becomes
-1 4 -2 1 4 -> the number of values higher than that element is 0

...

We apply this normalization to one element at a time, leaving the rest of the row unchanged, so the desired row-wise output would be:

2 0 2 3 1 
1 3 4 2 3

I don't want to use a loop for this because the matrix is 11k x 12k and it takes far too long. If I use rankdata with a lambda, then instead of modifying one element at a time, it adds and subtracts on all the values of the row at the same time, which is not what I want.

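import numpy as np
from scipy.stats import rankdata

corr = np.array([[-1,  7, -2,  1,  4],
                 [ 6,  3, -3,  5,  1]])


def element_wise_bigger_than(x, axis):
    # Number of elements along `axis` that are strictly greater than each element.
    return x.shape[axis] - rankdata(x, method='max', axis=axis)


# This shifts every element of the matrix at once, not one element at a time.
ld = lambda t: t + 3 if t < 0 else t - 3
f = np.vectorize(ld)


element_wise_bigger_than(f(corr), 1)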
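To make the mismatch concrete, here is a NumPy-only sketch of what that attempt computes: every element is shifted first, and only then are the higher values counted. The variable names are made up for illustration, and the broadcasting trick builds an (n_rows, n_cols, n_cols) array, so it is only suitable for a toy input like this one.

```python
import numpy as np

corr = np.array([[-1, 7, -2, 1, 4],
                 [6, 3, -3, 5, 1]])

# Shift every element at once, mirroring the vectorized lambda above.
shifted_all = np.where(corr < 0, corr + 3, corr - 3)

# For each element j of a row, count the elements k of the same (fully shifted)
# row that are strictly higher.
counts_all_shifted = (shifted_all[:, None, :] > shifted_all[:, :, None]).sum(axis=2)
print(counts_all_shifted)
# [[1 0 2 4 2]
#  [0 2 2 1 4]]
```

This differs from the desired output because each shifted element is compared against other shifted elements rather than against the original row.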

Solution

  • A possible solution, based on Numba, using prange to parallelize the outer loop over rows:

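    from numba import njit, prange, set_num_threads
    import numpy as np
    
    @njit(parallel=True)
    def get_horizontal(a):
        z = np.zeros((a.shape[0], a.shape[1]), dtype=np.int32)
        
        # Rows are processed in parallel; each row is handled by a single thread.
        for i in prange(a.shape[0]):
            for j in range(a.shape[1]):
                aux = a[i, j]  # remember the original value
                
                # Temporarily shift the current element in place (zeros stay as they are).
                if a[i, j] < 0:
                    a[i, j] += 3
                elif a[i, j] > 0:
                    a[i, j] -= 3
                
                # Count the values in the row that are strictly higher than the shifted element.
                z[i, j] = (a[i, j] < a[i, :]).sum()
                a[i, j] = aux  # restore the original value
        return z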
    
    a = np.array([[-1,  7, -2,  1,  4],
           [ 6,  3,  -3,  5,  1]])
    
    set_num_threads(6) # to use only 6 threads
    
    get_horizontal(a) 
    

    Runtime:

    By using the following array,

    a = np.random.randint(-10, 10, size=(11000, 12000))
    

    the runtime, on my computer, is less than 1 minute.


    Output:

    array([[2, 0, 2, 3, 1],
           [1, 3, 4, 2, 3]], dtype=int32)