I got a transactional operation that produces a feed like below:
df = pd.DataFrame({'action':['transacted','transacted','transacted','transacted','undo','transacted','transacted','transacted','transacted','transacted','undo','undo','undo','transacted'],
'transaction_count':10,20,35,60,60,60,80,90,100,10,10,100,90,90]})
action | transaction_count | |
---|---|---|
0 | transacted | 10 |
1 | transacted | 20 |
2 | transacted | 35 |
3 | transacted | 60 |
4 | undo | 60 |
5 | transacted | 60 |
6 | transacted | 80 |
7 | transacted | 90 |
8 | transacted | 100 |
9 | transacted | 10 |
10 | undo | 10 |
11 | undo | 100 |
12 | undo | 90 |
13 | transacted | 90 |
The counts are in a pattern but not in a linear way. (10-20-35-60-80-90-100-10-20...)
undo states which transaction count is cancelled.
There can be multiple undo's for multiple cancellations.
# This is an initial apply, to set it up
df['is_undone']=df.apply(lambda x: 1 if x['action']=='undo' else 0, axis=1).shift(-1)
df=df.fillna(0) # For shift
df=df.loc[df['is_undone']==0]
df=df.fillna(0)
df=df.loc[df['action']!='undo']
df.reset_index(drop=True,inplace=True)
Unfortunately, it only works for single undo but not for multiple in a row. Apply does not let accessing neighbour row values and I can't think of any else solution. It should also need to calculate 300k rows, so, performance is also an issue.
Expected result is:
action | transaction_count | |
---|---|---|
0 | transacted | 10 |
1 | transacted | 20 |
2 | transacted | 35 |
3 | transacted | 60 |
4 | transacted | 80 |
5 | transacted | 90 |
Thanks in advance!
A slight modification of this excellent answer might give you what you want:
def undo(frame):
d = {"transacted": 0, "undo": 1}
condition = frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
return frame[condition==0].reset_index(drop=True)
result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
>>> result
action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 transacted 80
5 transacted 90
groupby
is being used to deal with each transaction_count
separately.
As, an example, consider the case where transaction_count
is 10.
frame = df[df["transaction_count"]==10]
>>> frame
action transaction_count
0 transacted 10
9 transacted 10
10 undo 10
In the undo
function, we first map
the action
column to a number:
>>> frame["action"].map(d)
0 0
9 0
10 1
Recognize that we want to remove rows where 0
(transacted) is immediately followed by 1
(undo). In the above, this corresponds to rows with indices 9
and 10
.
To do this, we work with 2 rows of the above frame at a time using pd.rolling
and lambda
:
>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
0 NaN
9 0.0
10 1.0
Now, mask
0s to np.nan
, bfill
(back-fill) exactly once, and fillna
with 0
.
>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
0 0.0
9 1.0
10 1.0
Name: action, dtype: float64
From the above, we need all rows that are not equal to 1
.
This is what is returned in the undo
function.