Search code examples
pythonpandasconditional-statementscountifcumsum

Pandas cumulative countif (based on condition)


I have a DataFrame df and I am trying to calculate a cumulative count based on the condition that the date in the column at is bigger or equal to the dates in the column recovery_date.

Here is the original df:

    at  recovery_date   
0   2020-02-01  2020-03-02
1   2020-03-01  2020-03-31
2   2020-04-01  2020-05-01
3   2020-05-01  2020-05-31
4   2020-06-01  2020-07-01

Here is the desired outcome:

    at  recovery_date   result
0   2020-02-01  2020-03-02  0
1   2020-03-01  2020-03-31  0
2   2020-04-01  2020-05-01  2
3   2020-05-01  2020-05-31  3
4   2020-06-01  2020-07-01  4

The interpretation is that for each at there are x amount of recovery_dates preceding it or on the same day.

I am trying to avoid using a for loop as I am implementing this for a time-sensitive application.

This is a solution I was able to find, however I am looking for something more performant:

def how_many(at: pd.Timestamp, recoveries: pd.Series) -> int:
    return (at >= recoveries).sum()
df["result"] = [how_many(row["at"], df["recovery_date"][:idx]) for idx, row in df.iterrows()]

Thanks a lot!!


Solution

  • You're looking for something like this:

    df['result'] = df['at'].apply(lambda at: (at >= df['recovery_date']).sum())
    

    What this does is: For each value in the at column, check if there are any recovery_dates that are bigger or equal (at this point we have an array of True (=1) and False (=0) values) then sum them.

    This yields your desired output

              at recovery_date  count  result
    0 2020-02-01    2020-03-02      1       0
    1 2020-03-01    2020-03-31      1       0
    2 2020-04-01    2020-05-01      1       2
    3 2020-05-01    2020-05-31      1       3
    4 2020-06-01    2020-07-01      1       4