Search code examples
pythonpandasdataframe

Python dataframes: issue when attempting to group by multiple columns


I am attempting to group by multiple columns and return the sum of select columns in my dataframe. I was able to do this by only grouping by one column

df_sum = df.iloc[:, 27:].groupby(df['id']).sum().reset_index()

I am successfully grouping by id and summing the values from column 27 to the end of my dataframe. However, when I try to group by multiple columns using

df_sum = df.iloc[:, 27:].groupby(df['id', 'year']).sum().reset_index()

I get an error

KeyError: ('id', 'year')

I have been able to group by multiple columns before by using something like

df.groupby(['id', 'year'])['some col name'].nunique().reset_index()

I tried using this type of format

df.groupby(['id', 'year'])[df.iloc[:,27:]].sum().reset_index()

But it also does not work as I get another error

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

I'm currently a bit confused on how I can get this to work. I feel like I must be overlooking something relatively simple since I am able to get it to work for grouping by one column. I am just struggling to find a way to get it to work when grouping by multiple columns.


Solution

  • Check whether id column is within the subset "from 27-th column to the end".

    I suppose that this column is at some earlier position in your df.

    Another solution to try

    Generate the target list of columns:

    cols = df.columns[27:]
    

    Then use it in your instruction:

    df.groupby(['id', 'year'])[cols].sum().reset_index()
    

    Note that cols itself is a list, but in the instruction above it must be "enveloped" with another pair of square brackets.