Search code examples
pythonpandasdataframevectorization

Count number of consecutive rows that are greater than current row value but less than the value from other column


Say I have the following sample dataframe (there are about 25k rows in the real dataframe)

df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
   A  B
0  0  9
1  3  8
2  2  3
3  9  5
4  1  5
5  0  5
6  4  5
7  7  8
8  3  0
9  2  4

For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.

So my expected output is :

A   B next count  previous count
0   9     2          0
3   8     0          0
2   3     0          1
9   5     0          0
1   5     0          0
0   5     2          1
4   5     1          0
7   8     0          0
3   0     0          2
2   4     0          0

Explanation :

  • First row is calculated as : since 3 and 2 are greater than 0 but less than corresponding B value 8 and 3
  • Second row is calculated as : since next value 2 is not greater than 3
  • Third row is calculated as : since 9 is greater than 2 but not greater than its corresponding B value

Similarly, previous count is calculated

Note : I know how to solve this problem by looping using list comprehension or using the pandas apply method but still I won't mind a clear and concise apply approach. I was looking for a more pandaic approach.

My Solution

Here is the apply solution, which I think is inefficient. Also, as people said that there might be no vector solution for the question. So as mentioned, a more efficient apply solution will be accepted for this question.

This is what I have tried.

This function gets the number of previous/next rows that satisfy the condition.

def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:,['A', 'B']]
    prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())

Generating output :

df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")

Output :

This gives us the expected output

df
   A  B  next count  previous count
0  0  9           2               0
1  3  8           0               0
2  2  3           0               1
3  9  5           0               0
4  1  5           0               0
5  0  5           2               1
6  4  5           1               0
7  7  8           0               0
8  3  0           0               2
9  2  4           0               0

Solution

  • I made some optimizations:

    1. You don't need to reset_index() you can access the index with .name
    2. If you only pass df[['A']] instead of the whole frame, that may help.
    3. prev_nrow.empty is the same as (prev_nrow.size == 0)
    4. Applied different logic to get the desired value via first_false, this speeds things up significantly.
    def first_false(val1, val2, A):
        i = 0
        for x, y in zip(val1, val2):
            if A < x < y:
                i += 1
            else:
                break
        return i
    
    def get_prev_next_count(row):
        A = row['A']
        next_nrow = df.loc[row.name+1:,['A', 'B']]
        prev_nrow = df2.loc[row.name-1:,['A', 'B']]
        if next_nrow.empty:
            return 0, first_false(prev_nrow.A, prev_nrow.B, A)
        if prev_nrow.empty:
            return first_false(next_nrow.A, next_nrow.B, A), 0
        return (first_false(next_nrow.A, next_nrow.B, A),
                first_false(prev_nrow.A, prev_nrow.B, A))
    
    df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
    
    df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
    print(df)
    

    Output:

       A  B  next count  previous count
    0  0  9           2               0
    1  3  8           0               0
    2  2  3           0               1
    3  9  5           0               0
    4  1  5           0               0
    5  0  5           2               1
    6  4  5           1               0
    7  7  8           0               0
    8  3  0           0               2
    9  2  4           0               0
    

    Timing

    Expanding the data:

    df = pd.concat([df]*(10000//4), ignore_index=True)
    # df.shape == (25000, 2)
    

    Original Method:

    • Gave up at 15 minutes.

    New Method:

    • 1m 20sec

    Throw pandarallel at it:

    from pandarallel import pandarallel
    pandarallel.initialize()
    
    df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
    
    • 26sec