Search code examples
pythonpandascountmaxconsecutive-months

How to get the maximum number of consecutive columns with values above zero for each row in a pandas dataframe, vectorized?


In a dataframe with 12 columns and 4 million rows, I need to add a column that gets the maximum number of consecutive columns with values above zero for each row.

Here's a sample

df = pd.DataFrame(np.array([[284.77, 234.37, 243.8, 84.36, 0., 0., 0., 55.04, 228.2, 181.97, 0., 0.],
                            [13.78, 0., 38.58, 33.16, 0., 38.04, 74.02, 45.74, 27.2, 9.19, 0., 0.],
                            [88.66, 255.72, 323.19, 7.24, 0., 73.38, 45.73, 0., 0., 77.39, 26.57, 279.34],
                            [0., 0., 34.42, 9.16, 0., 43.4, 42.17, 123.69, 60.5, 25.47, 72.32, 7.29],
                            [320.6, 1445.56, 856.23, 371.21, 0., 244.22, 134.58, 631.59, 561.82, 1172.44, 895.68, 186.28],
                            [0., 0., 32.29, 1000.91, 0., 680., 585.46, 466.6, 0., 493.48, 157.1, 125.31]]),
                  columns=[1,2,3,4,5,6,7,8,9,10,11,12])

And here's an example of my goal:

df['MAX_CONSECUTIVE_COL'] = pd.Series([4,5,4,7,7,3])

Due to the size of dataframe, performance is a must have for the solution.

I've tried to mask the data with boolean values and do a cumulative sum to identify each group of consecutive columns with values == 0 or != 0

((df\>0) != (df\>0).shift(axis=1)).cumsum(axis=1)

Then, I've got the results of one row:

((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0]

Applied a value_counts and transformed the result in a dataframe:

pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts())

applied a sort_values:

pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts()).sort_values('count', ascending=False)

and, finally, got the first value (the max number of consecutive columns with values !=0 or == 0):

pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts()).sort_values('count', ascending=False).iloc[0,0]

Now, I've got a problem: I don't know how to filter only the consecutive columns with values != 0.

But let's consider that this method worked and we have now the number of consecutive columns with values !=0 for the first row. The only solution I was capable to develop to get the results for the other rows is iterating each one.

Something like this:

df['MAX_CONSECUTIVE_COL'] = 0

for n in range(0,df.shape[0]-1):
    df.loc[df.index[n], 'MAX_CONSECUTIVE_COL'] = pd.DataFrame(((df>0) != df>0).shift(axis=1)).cumsum(axis=1).iloc[n].value_counts()).sort_values('count',ascending=False).iloc[0,0]

But remember we have 4 million rows, so this iteration would take a looooong time to be completed, and that's the second problem I have.


Solution

  • If performance is concern I'd consider to use :

    from numba import njit, prange
    
    
    @njit(parallel=True)
    def get_max(matrix, out):
        n, m = matrix.shape
    
        for row in prange(n):
            mx, cnt = 0, 0
            for col in range(m - 1):  # -1 because last column is OUT
                if matrix[row, col] > 0:
                    cnt += 1
                    mx = max(mx, cnt)
                else:
                    cnt = 0
            out[row] = mx
    
    
    df["OUT"] = 0
    get_max(df.values, df["OUT"].values)
    print(df)
    

    Prints:

            1        2       3        4    5       6       7       8       9       10      11      12  OUT
    0  284.77   234.37  243.80    84.36  0.0    0.00    0.00   55.04  228.20   181.97    0.00    0.00    4
    1   13.78     0.00   38.58    33.16  0.0   38.04   74.02   45.74   27.20     9.19    0.00    0.00    5
    2   88.66   255.72  323.19     7.24  0.0   73.38   45.73    0.00    0.00    77.39   26.57  279.34    4
    3    0.00     0.00   34.42     9.16  0.0   43.40   42.17  123.69   60.50    25.47   72.32    7.29    7
    4  320.60  1445.56  856.23   371.21  0.0  244.22  134.58  631.59  561.82  1172.44  895.68  186.28    7
    5    0.00     0.00   32.29  1000.91  0.0  680.00  585.46  466.60    0.00   493.48  157.10  125.31    3
    

    Quick benchmark:

    from time import monotonic
    
    # >4 million rows
    df = pd.concat([df] * 800_000, ignore_index=True)
    
    start_time = monotonic()
    
    df["OUT"] = 0
    get_max(df.values, df["OUT"].values)
    
    print(monotonic() - start_time)
    

    Prints on my computer (AMD 5700x):

    0.21921994583681226