Search code examples
pythonpandascumsum

Cumcount in new column but only starting from a condition


I want to create a new column in my Pandas dataframe that counts how many days have passed since the value in another column is a particular one. The goal in this example is to count how many days have passed since a player has scored a goal. I show you how would be the structure of the table that I would like to generate, in which a new column "Days since goal" would be created and it would start counting since the value "Yes" is included in the column "Goal".

      Player      Date            Goal     Days since goal
0     John        11-7-2022       No       0
1     Charles     11-7-2022       Yes      1
2     John        12-7-2022       Yes      1
3     Charles     12-7-2022       No       2
4     John        13-7-2022       No       2
5     Charles     13-7-2022       No       3
6     John        14-7-2022       No       3
7     Charles     14-7-2022       No       4
8     Peter       15-7-2022       No       0
9     John        15-7-2022       Yes      1
10    Charles     16-7-2022       No       5
11    Peter       16-7-2022       Yes      1
12    John        16-7-2022       No       2

I'm very closed to get it with this code:

blocks = df['Goal'].ge("Yes").groupby(df['Player']).cumsum()
df['Days since goal'] = df.groupby([df['Player'],blocks]).cumcount()

But the result it's not good at all because with the first "Yes" in the column "Days since goal" shows 0 instead of 1. This is the result I get with the code but it isn't what I need:

      Player      Date            Goal     Days since goal
0     John        11-7-2022       No       0
1     Charles     11-7-2022       Yes      0
2     John        12-7-2022       Yes      0
3     Charles     12-7-2022       No       1
4     John        13-7-2022       No       1
5     Charles     13-7-2022       No       2
6     John        14-7-2022       No       2
7     Charles     14-7-2022       No       3
8     Peter       15-7-2022       No       0
9     John        15-7-2022       Yes      0
10    Charles     16-7-2022       No       4
11    Peter       16-7-2022       Yes      0
12    John        16-7-2022       No       1

I need 0 for the player until the moment Goal is "Yes", and in this case, if it's "Yes", automatically is 1. Any suggestion? Thanks in advance


Solution

  • With your initial dataframe:

    df = pd.DataFrame({'Player': ['John', 'Charles', 'John', 'Charles', 'John', 'Charles', 'John', 'Charles', 'Peter', 'John', 'Charles', 'Peter', 'John'], 'Date': ['11-7-2022', '11-7-2022', '12-7-2022', '12-7-2022', '13-7-2022', '13-7-2022', '14-7-2022', '14-7-2022', '15-7-2022', '15-7-2022', '16-7-2022', '16-7-2022', '16-7-2022'], 'Goal': ['No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No']}
    

    Here is one way to do it:

    def count(goals):
        """Helper function.
        
        Args:
            goals: list of Yes/No values.
        
        Returns:
            Cumcount of goals.
        
        """
        counts = []
        for i, goal in enumerate(goals):
            if i == 0 and goal == "No":
                counts.append(0)
                continue
            if goal == "":
                continue
            for i in range(2, 2 + int(len(goal) / 2)):
                counts.append(i)
        return counts
    
    df = df.sort_values(by=["Player", "Date"])
    
    no_counts = []
    for player in df["Player"].unique():
        goals = "".join(df.loc[df["Player"] == player, "Goal"].tolist()).split("Yes")
        no_counts += count(goals)
    
    df.loc[df["Goal"] == "No", "Days since goal"] = no_counts
    
    df = df.fillna(1).sort_index().astype({"Days since goal": int})
    
    print(df)
    # Output
         Player       Date Goal  Days since goal
    0      John  11-7-2022   No                0
    1   Charles  11-7-2022  Yes                1
    2      John  12-7-2022  Yes                1
    3   Charles  12-7-2022   No                2
    4      John  13-7-2022   No                2
    5   Charles  13-7-2022   No                3
    6      John  14-7-2022   No                3
    7   Charles  14-7-2022   No                4
    8     Peter  15-7-2022   No                0
    9      John  15-7-2022  Yes                1
    10  Charles  16-7-2022   No                5
    11    Peter  16-7-2022  Yes                1
    12     John  16-7-2022   No                2