Search code examples
pythonpandasdataframepivot-table

How do I add a filter to a pivot table grouping?


Follow up to: How do I sum a column based on separate category types, and preserve the zeros?

I used 'Id' as a stand in for a bunch of columns. One of them is 'Date'. Here is my dataframe after applying the pivot table method.

Id Date A_Hours B_Hours
1 2/2/2022 4 4
2 2/4/2023 2 6
3 6/2/2023 3 0

But because 'Date' was one of the columns I added to the index with 'Id', here is the dataframe for a user with multiple hours entered for multiple dates.

Id Date A_Hours B_Hours
1 2/2/2022 4 4
2 2/3/2023 3 3
2 2/4/2023 2 6
3 6/2/2023 3 0

How do I get the following dataframe:

Id 2/3/2023_A_Hours 2/4/2023_B_Hours 6/2/2023_A_Hours 6/2/2023_B_Hours
1 0 0 0 0
2 3 3 2 6
3 0 0 3 0

Note that there is an additional filter placed on top of this - Id 1 has all 0s because in our result dataframe, not only are we combining rows into multiple columns for a single id, we only want to calculate hours entered in 2023. (and exclude all dates < 1/1/2023 in the count)

Also note that this time, I don't want to create a new column name '2/4/2023_B_Hours'. I would actually prefer a multi-level index with "A_hours" and "B_hours" under "2/4/2023". The table in StackOverflow doesn't allow me to format it like that.


Solution

  • You should provide a reproducible example for clarity.

    Assuming this one, we want to consider dates after 2023-01-15 (which are absent in Id 3):

       Id        Date Category  Hours
    0   1  2023-01-01        A      1   # this should be filtered out
    1   1  2023-02-01        A      3
    2   1  2023-02-01        B      4
    3   2  2023-03-01        A      2
    4   2  2023-03-01        B      6
    5   3  2023-01-01        A      3   # this should be filtered out
    

    You can pre-filter the rows (for example using query), then reindex after pivoting to add the missing Ids:

    out = (df.query('Date > "2023-01-15"')
             .pivot_table(index='Id', columns='Category', values='Hours',
                          aggfunc='sum', fill_value=0,
                          margins=True, margins_name='Total')
             .add_suffix('_Hours')
             .drop('Total')
             .reindex(df['Id'].unique(), fill_value=0)
             .reset_index().rename_axis(columns=None)
          )
    

    Output:

       Id  A_Hours  B_Hours  Total_Hours
    0   1        3        4            7
    1   2        2        6            8
    2   3        0        0            0