Search code examples
pythonpandasnumpysortingisin

populating binary column based on historic values within group Python


I am working on populating a binary column based on status values a certain product ID traveled through during its lifetime. Hence, I first sorted my frame by 'id' and 'date'. Now, I would need to code something like:

"If ID currently in status 4 or 6 AND ever been before in status 2 or 3, then frame_ordered['binary'] = 1, else frame_ordered['binary'] = 0"

Accordingly, ID 4556 would be assigned the value 1, and the rest gets 0. I tried working this out with .isin(), but this is rather simplified version, and doesn't yield the correct result. It lacks the time "traveling part": checking the latest status, and all the values beforehand within a given ID. Does anyone have a better idea how to solve this? Would be much appreciated.

today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556],'status': [1,2,4,5,6, 3, 4], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
frame_ordered['binary'] = np.where(mask & frame_ordered['id'].isin([4,6]) & frame_ordered['id'].isin([2, 3]), 1, 0)
frame_ordered

Solution

  • The "ever been before in status 2 or 3" condition can be coded using cumsum counting all previous (and the current) occurrences of 2 or 3 in the group.

    import pandas as pd
    
    frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556, 2345],'status': [1,2,4,5,6, 3, 4, 2], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01', '2022-05-10']})
    
    frame_ordered = frame.sort_values(['id','date'])
    frame_ordered['binary'] = frame_ordered.groupby('id').status.transform(lambda x: (x.isin([4, 6]) & x.isin([2,3]).cumsum().gt(0)).astype(int))
    

    Result:

         id  status        date  binary
    0  1245       1  2022-07-01       0
    4  1248       6  2022-01-03       0
    2  2345       4  2022-04-20       0
    7  2345       2  2022-05-10       0
    3  4556       5  2022-02-02       0
    1  4556       2  2022-03-12       0
    5  4556       3  2022-05-03       0
    6  4556       4  2022-07-01       1
    

    (please note that I added another row for id 2345 where the 4 is followed (instead for preceded) by a 2 and hence should not be marked with a 1)