I am trying to solve the following problem. I have a dataframe with a "serial_number" column that acts as an ID, an "update" column that is either True or False, and several numerical columns that I need to sum or average as follows: for each run of rows where "update"=False, aggregate the numerical columns into the next row where "update"=True (including that "update"=True row itself).
To give some additional context: the rows are used to train a machine learning model, but for the rows where "update"=False I don't have the target variable. Therefore, I need to sum or average their values into the next "update"=True row.
Thanks in advance!
As an example, this would be the input table:
serial_number | model | numerical_mean | numerical_1_sum | numerical_2_sum | update |
---|---|---|---|---|---|
a | 2023-01-01 | 5 | 10 | 20 | False |
a | 2023-01-02 | 10 | 15 | 10 | False |
a | 2023-01-03 | 15 | 15 | 10 | True |
b | 2023-01-01 | 10 | 15 | 10 | False |
b | 2023-01-02 | 15 | 15 | 10 | True |
b | 2023-01-03 | 15 | 15 | 10 | False |
b | 2023-01-04 | 15 | 15 | 10 | True |
b | 2023-01-05 | 15 | 15 | 10 | False |
c | 2023-01-04 | 15 | 15 | 10 | True |
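For anyone who wants to reproduce this, the input table above can be built roughly as follows. This is a minimal sketch that assumes "update" is a real boolean column and that the dates in "model" are stored as ISO-formatted strings; the actual dtypes in my data may differ.

```python
import pandas as pd

# Reproduction of the input table; dtypes are an assumption
# (dates kept as ISO strings, 'update' as real booleans).
df = pd.DataFrame({
    "serial_number": ["a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "model": [
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-04",
    ],
    "numerical_mean": [5, 10, 15, 10, 15, 15, 15, 15, 15],
    "numerical_1_sum": [10, 15, 15, 15, 15, 15, 15, 15, 15],
    "numerical_2_sum": [20, 10, 10, 10, 10, 10, 10, 10, 10],
    "update": [False, False, True, False, True, False, True, False, True],
})

print(df)
```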
The resulting output should look like this:
serial_number | model | numerical_mean | numerical_1_sum | numerical_2_sum | update |
---|---|---|---|---|---|
a | 2023-01-03 | 10 | 40 | 40 | True |
b | 2023-01-02 | 12.5 | 30 | 20 | True |
b | 2023-01-04 | 15 | 30 | 20 | True |
c | 2023-01-04 | 15 | 15 | 10 | True |
The output table has as many rows as the input table has rows where "update"=True. So basically, within each serial_number, I want to take every run of "update"=False rows together with the "update"=True row that ends it, and either sum or average the values. Rows with no later "update"=True row, such as (b, 2023-01-05), are dropped.
# filter the columns that you would like to aggregate
c1 = df.filter(like='_sum')
c2 = df.filter(like='_mean')
# create an agg dictionary which maps column names
# to the corresponding aggregation functions
agg_dict = {
'model': 'last',
'update': 'any',
**dict.fromkeys(c1, 'sum'),
**dict.fromkeys(c2, 'mean'),
}
# grouper: the cumulative sum of 'update' over the reversed frame gives
# every row the same label as the next True row at or after it
b = df[::-1]['update'].cumsum()
# group the dataframe by serial_number and blocks and aggregate
result = df.groupby(['serial_number', b]).agg(agg_dict)
# Keep only the groups that end in an 'update'=True row; this removes
# trailing rows that have no subsequent True row,
# for example, (b, 2023-01-05).
result = result.droplevel(1).query('update').reset_index()
This gives:
  serial_number       model  update  numerical_1_sum  numerical_2_sum  numerical_mean
0             a  2023-01-03    True               40               40            10.0
1             b  2023-01-04    True               30               20            15.0
2             b  2023-01-02    True               30               20            12.5
3             c  2023-01-04    True               15               10            15.0
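Note that the two "b" rows come out in reverse date order, because the groups are sorted by the block label rather than by date. If the order of the expected output matters, the same idea can be wrapped in a function and sorted at the end. The following is a sketch under a few assumptions: "update" is a real boolean column, the rows are already in chronological order within each serial_number, and the dates in "model" are ISO strings (so that sorting them as text is correct). The function name `aggregate_to_next_update` is made up for illustration.

```python
import pandas as pd


def aggregate_to_next_update(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse each run of rows that ends in update=True into one row."""
    sum_cols = df.filter(like="_sum").columns
    mean_cols = df.filter(like="_mean").columns
    agg_dict = {
        "model": "last",
        "update": "any",
        **dict.fromkeys(sum_cols, "sum"),
        **dict.fromkeys(mean_cols, "mean"),
    }
    # Count True values from the bottom up, then flip back to the original
    # order: each row shares a label with the next True row at or after it.
    block = df["update"][::-1].cumsum()[::-1].rename("block")
    out = (
        df.groupby(["serial_number", block], sort=False)
        .agg(agg_dict)
        .droplevel("block")
        .loc[lambda d: d["update"]]  # drop groups with no closing True row
        .reset_index()
    )
    # Assumes ISO date strings, so text sorting equals chronological sorting.
    return out.sort_values(["serial_number", "model"], ignore_index=True)


# Example input from the question (dtypes are an assumption).
df = pd.DataFrame({
    "serial_number": ["a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "model": [
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-04",
    ],
    "numerical_mean": [5, 10, 15, 10, 15, 15, 15, 15, 15],
    "numerical_1_sum": [10, 15, 15, 15, 15, 15, 15, 15, 15],
    "numerical_2_sum": [20, 10, 10, 10, 10, 10, 10, 10, 10],
    "update": [False, False, True, False, True, False, True, False, True],
})

result = aggregate_to_next_update(df)
print(result)
```

If "update" is stored as strings such as "True"/"False" instead of booleans, it would need to be converted first, since both the cumulative sum and the final filter rely on boolean values.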