Search code examples
pythonpandastransactionsiteration

Flagging a row according to the various amount of neighbour values in a pandas dataframe


I got a transactional operation that produces a feed like below:


df = pd.DataFrame({'action':['transacted','transacted','transacted','transacted','undo','transacted','transacted','transacted','transacted','transacted','undo','undo','undo','transacted'],
                  'transaction_count':10,20,35,60,60,60,80,90,100,10,10,100,90,90]})
action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 undo 60
5 transacted 60
6 transacted 80
7 transacted 90
8 transacted 100
9 transacted 10
10 undo 10
11 undo 100
12 undo 90
13 transacted 90

The counts are in a pattern but not in a linear way. (10-20-35-60-80-90-100-10-20...)

undo states which transaction count is cancelled.

There can be multiple undo's for multiple cancellations.

# This is an initial apply, to set it up
df['is_undone']=df.apply(lambda x: 1 if x['action']=='undo' else 0, axis=1).shift(-1)
df=df.fillna(0)  # For shift

df=df.loc[df['is_undone']==0]
df=df.fillna(0)
df=df.loc[df['action']!='undo']
df.reset_index(drop=True,inplace=True)

Unfortunately, it only works for single undo but not for multiple in a row. Apply does not let accessing neighbour row values and I can't think of any else solution. It should also need to calculate 300k rows, so, performance is also an issue.

Expected result is:

action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 transacted 80
5 transacted 90

Thanks in advance!


Solution

  • A slight modification of this excellent answer might give you what you want:

    Solution

    def undo(frame):
        d = {"transacted": 0, "undo": 1}
        condition = frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
        return frame[condition==0].reset_index(drop=True)
                   
    result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
    >>> result
           action  transaction_count
    0  transacted                 10
    1  transacted                 20
    2  transacted                 35
    3  transacted                 60
    4  transacted                 80
    5  transacted                 90
    

    Explanation

    groupby is being used to deal with each transaction_count separately. As, an example, consider the case where transaction_count is 10.

    frame = df[df["transaction_count"]==10]
    >>> frame
            action  transaction_count
    0   transacted                 10
    9   transacted                 10
    10        undo                 10
    

    In the undo function, we first map the action column to a number:

    >>> frame["action"].map(d)
    0     0
    9     0
    10    1
    

    Recognize that we want to remove rows where 0 (transacted) is immediately followed by 1 (undo). In the above, this corresponds to rows with indices 9 and 10.

    To do this, we work with 2 rows of the above frame at a time using pd.rolling and lambda:

    >>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
    0     NaN
    9     0.0
    10    1.0
    

    Now, mask 0s to np.nan, bfill (back-fill) exactly once, and fillna with 0.

    >>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
    0     0.0
    9     1.0
    10    1.0
    Name: action, dtype: float64
    

    From the above, we need all rows that are not equal to 1. This is what is returned in the undo function.