Tags: python, pandas, dataframe, data-science, data-manipulation

Pandas groupby: aggregate rows within the same ID up to the next row where a column is True


I am trying to solve the following problem. I have a dataframe with a "serial_number" column that acts as an ID, an "update" column that is either True or False, and several numerical columns. For each serial_number, I need to aggregate (sum or average) the numerical columns over every run of rows where "update" = False together with the next row where "update" = True (including that row).

For additional context, the entries are used to train a machine learning model, but I don't have the target variable for the rows where "update" = False. Therefore, I need to sum or average their values into the next "update" = True row.

Thanks in advance!

As an example, this would be the input table:

serial_number  model       numerical_mean  numerical_1_sum  numerical_2_sum  update
a              2023-01-01  5               10               20               False
a              2023-01-02  10              15               10               False
a              2023-01-03  15              15               10               True
b              2023-01-01  10              15               10               False
b              2023-01-02  15              15               10               True
b              2023-01-03  15              15               10               False
b              2023-01-04  15              15               10               True
b              2023-01-05  15              15               10               False
c              2023-01-04  15              15               10               True
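
For anyone who wants to reproduce this, the table above could be rebuilt with something like the snippet below. It is only a sketch: the variable name df and keeping the dates as plain strings are my assumptions, not requirements.

```python
import pandas as pd

# Sample input copied from the table above. The "model" column holds
# dates, kept here as ISO-formatted strings (an assumption about the dtype).
df = pd.DataFrame({
    "serial_number": list("aaabbbbbc"),
    "model": [
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-04",
    ],
    "numerical_mean": [5, 10, 15, 10, 15, 15, 15, 15, 15],
    "numerical_1_sum": [10, 15, 15, 15, 15, 15, 15, 15, 15],
    "numerical_2_sum": [20, 10, 10, 10, 10, 10, 10, 10, 10],
    "update": [False, False, True, False, True, False, True, False, True],
})

print(df)
```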

The resulting output should look like this:

serial_number  model       numerical_mean  numerical_1_sum  numerical_2_sum  update
a              2023-01-03  10              40               40               True
b              2023-01-02  12.5            30               20               True
b              2023-01-04  15              30               20               True
c              2023-01-04  15              15               10               True

The output table has as many rows as the input table has rows where "update" = True. In other words, within each serial_number I want to take every run of "update" = False rows plus the "update" = True row that ends it, and either sum the values or take their average. Trailing "update" = False rows with no later "update" = True row, such as (b, 2023-01-05), are dropped.


Solution

  • Code

    # filter the columns that you would like to aggregate
    c1 = df.filter(like='_sum')
    c2 = df.filter(like='_mean')
    
    # create an aggregation dictionary that maps column names
    # to the corresponding aggregation functions
    agg_dict = {
        'model': 'last', 
        'update': 'any',
        **dict.fromkeys(c1, 'sum'), 
        **dict.fromkeys(c2, 'mean'),
    }
    
    # grouper that labels each block: a reversed cumulative sum of 'update'
    # gives every True row, and the False rows leading up to it, the same id
    b = df[::-1]['update'].cumsum()
    
    # group the dataframe by serial_number and blocks and aggregate
    result = df.groupby(['serial_number', b]).agg(agg_dict)
    
    # drop the block level, then keep only the blocks that contain an
    # 'update' = True row; this removes trailing False rows that have
    # no subsequent True row, for example (b, 2023-01-05)
    result = result.droplevel(1).query('update').reset_index()
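
To make the grouper less opaque, here is a small sketch of what the reversed cumulative sum produces. It assumes the 'update' flags from the question's sample table and uses a standalone Series rather than the full dataframe.

```python
import pandas as pd

# 'update' flags from the question's sample table, in row order
update = pd.Series(
    [False, False, True, False, True, False, True, False, True],
    name="update",
)

# Reverse the rows, keep a running count of True values, then
# restore the original row order so the labels are easy to read
b = update[::-1].cumsum().sort_index()

print(b.tolist())  # [4, 4, 4, 3, 3, 2, 2, 1, 1]
```

Note that the ids are assigned across the whole column, not per serial number: the last two rows, (b, 2023-01-05) and (c, 2023-01-04), both get id 1. That is why the groupby uses 'serial_number' together with the block id instead of the block id alone.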
    

  • Result

      serial_number       model  update  numerical_1_sum  numerical_2_sum  numerical_mean
    0             a  2023-01-03    True               40               40            10.0
    1             b  2023-01-04    True               30               20            15.0
    2             b  2023-01-02    True               30               20            12.5
    3             c  2023-01-04    True               15               10            15.0
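
The two rows for serial number b come out in reverse date order because groupby sorts by the block id, and later blocks have smaller ids. If chronological order matters, one option is to sort afterwards. The sketch below rebuilds the result table by hand so that it runs on its own, and assumes the dates are ISO-formatted strings; the name ordered is only illustrative.

```python
import pandas as pd

# Aggregated rows copied from the Result table above
result = pd.DataFrame({
    "serial_number": ["a", "b", "b", "c"],
    "model": ["2023-01-03", "2023-01-04", "2023-01-02", "2023-01-04"],
    "update": [True, True, True, True],
    "numerical_1_sum": [40, 30, 30, 15],
    "numerical_2_sum": [40, 20, 20, 10],
    "numerical_mean": [10.0, 15.0, 12.5, 15.0],
})

# ISO-formatted date strings sort chronologically as plain text;
# ignore_index=True renumbers the rows 0..n-1 after sorting
ordered = result.sort_values(["serial_number", "model"], ignore_index=True)

print(ordered)
```

With real datetime values in 'model', the same sort_values call should work unchanged.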