I have a DataFrame df
and I am trying to calculate a cumulative count based on the condition that the date in the column at
is bigger or equal to the dates in the column recovery_date
.
Here is the original df
:
at recovery_date
0 2020-02-01 2020-03-02
1 2020-03-01 2020-03-31
2 2020-04-01 2020-05-01
3 2020-05-01 2020-05-31
4 2020-06-01 2020-07-01
Here is the desired outcome:
at recovery_date result
0 2020-02-01 2020-03-02 0
1 2020-03-01 2020-03-31 0
2 2020-04-01 2020-05-01 2
3 2020-05-01 2020-05-31 3
4 2020-06-01 2020-07-01 4
The interpretation is that for each at
there are x amount of recovery_date
s preceding it or on the same day.
I am trying to avoid using a for loop as I am implementing this for a time-sensitive application.
This is a solution I was able to find, however I am looking for something more performant:
def how_many(at: pd.Timestamp, recoveries: pd.Series) -> int:
return (at >= recoveries).sum()
df["result"] = [how_many(row["at"], df["recovery_date"][:idx]) for idx, row in df.iterrows()]
Thanks a lot!!
You're looking for something like this:
df['result'] = df['at'].apply(lambda at: (at >= df['recovery_date']).sum())
What this does is: For each value in the at
column, check if there are any recovery_date
s that are bigger or equal (at this point we have an array of True (=1) and False (=0) values) then sum them.
This yields your desired output
at recovery_date count result
0 2020-02-01 2020-03-02 1 0
1 2020-03-01 2020-03-31 1 0
2 2020-04-01 2020-05-01 1 2
3 2020-05-01 2020-05-31 1 3
4 2020-06-01 2020-07-01 1 4