Search code examples
pythonpandasdataframe

How can I compare a value in one column to all values that are BEFORE it in another column to find the number of unique values that are less than?


This is my DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'a': [100, 100, 105, 106, 106, 107, 108, 109],
        'b': [99, 100, 110, 107, 100, 110, 120, 106],
    }
)

Expected output is creating column x:

     a    b    x
0  100   99    0
1  100  100    1
2  105  110    2
3  106  107    3
4  106  100    1
5  107  110    4
6  108  120    5
7  109  106    3

Logic:

This is somehow an extension to this answer. I explain the logic by examples and I start from row 1:

For row 1, the b column value is 100. Then in order to get x for row 1 this value should be compared with all UNIQUE values in a that are on the same row or before it to find out how many values in a are less than or equal to it. The only unique value that is on the same row or before it is 100, so 1 is chosen for x.

For row 2, there are two unique values in a that are less than or equal to 110 which are 100, 105.

The logic is the same for the rest of rows.

This is my attempt based on the linked answer but it does not work:

t = df.a.unique()
m1 = np.arange(len(t))[:,None] >= np.arange(len(t))

h = df['b'].to_numpy()
m2 = t <= h[:, None]

Solution

  • IIUC, use broadcasting, tril, and a mask to only keep the unique values:

    a = df['a'].mask(df['a'].duplicated()).to_numpy()
    b = df['b'].to_numpy()
    df['x'] = np.tril(b[:, None]>=a).sum(axis=1)
    

    As a one-liner:

    df['x'] = np.tril(df[['b']].to_numpy() >=
                       df['a'].mask(df['a'].duplicated()).to_numpy()).sum(axis=1)
    

    Output:

         a    b  x
    0  100   99  0
    1  100  100  1
    2  105  110  2
    3  106  107  3
    4  106  100  1
    5  107  110  4
    6  108  120  5
    7  109  106  3
    

    Intermediates:

    # a
    array([100.,  nan, 105., 106.,  nan, 107., 108., 109.])
    
    # b
    array([ 99, 100, 110, 107, 100, 110, 120, 106])
    
    # np.tril(b[:, None]>=a)
    array([[False, False, False, False, False, False, False, False],
           [ True, False, False, False, False, False, False, False],
           [ True, False,  True, False, False, False, False, False],
           [ True, False,  True,  True, False, False, False, False],
           [ True, False, False, False, False, False, False, False],
           [ True, False,  True,  True, False,  True, False, False],
           [ True, False,  True,  True, False,  True,  True, False],
           [ True, False,  True,  True, False, False, False, False]])