Search code examples
pythonpandasdataframegroup-byaggregate

Why does summing data grouped by df.iloc[:, 0] also sum up the column names?


I have a DataFrame with a species column and four arbitrary data columns. I want to group it by species and sum up the four data columns for each one. I've tried to do this in two ways: once by grouping by df.columns[0] and once by grouping by df.iloc[:, 0].

data = {
    'species': ['a', 'b', 'c', 'd', 'e', 'rt', 'gh', 'ed', 'e', 'd', 'd', 'q', 'ws', 'f', 'fg', 'a', 'a', 'a', 'a', 'a'],
    's1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    's2': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
    's3': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    's4': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(data)

grouped_df1 = df.groupby(df.columns[0], as_index=False).sum()
grouped_df2 = df.groupby(df.iloc[:, 0], as_index=False).sum()

Both methods correctly sum the data in the four rightmost columns. But for some reason, the second method also sums up the names of the species, concatenating them into one long, repeating string.

Here's the result from the first method, which is what I'm looking for:

print(grouped_df1)
   species  s1  s2  s3  s4
0        a  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3        d  25  27  28  30
4        e  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

And here's the result from the df.iloc method, which incorrectly sums up the species data:

print(grouped_df2)
   species  s1  s2  s3  s4
0   aaaaaa  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3      ddd  25  27  28  30
4       ee  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

Why is the second method summing up the species names as well as the numerical data?


Solution

  • In groupby - column name is treated as an intrinsic grouping key, while a Series is treated as an external key.

    Reference - https://pandas.pydata.org/docs/reference/groupby.html

    When using df.iloc[:, 0]:

    Pandas considers the string values in the species column as a separate grouping key independent of the DataFrame structure.

    When using df.columns[0]:

    Pandas directly uses the column 'species' within the DataFrame as the grouping key. This allows Pandas to manage the grouping and summation correctly.

    Code COrrection

    You should always reference the column name explicitly

    grouped_df1 = df.groupby('species', as_index=False).sum()
    

    Or this also works

    grouped_df1 = df.groupby(df[df.columns[0]], as_index=False).sum()