
Semantics of DataFrame groupby method


I find the behavior of the groupby method on a DataFrame object unexpected.

Let me explain with an example.

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']

data1
# Out[14]: 
# 0    1.989430
# 1   -0.250694
# 2   -0.448550
# 3    0.776318
# 4   -1.843558
# Name: data1, dtype: float64

data1 is a Series, so it no longer carries the 'key1' column. I would therefore expect an error from the following operation:

grouped = data1.groupby(df['key1'])

But I don't, and I can further apply the mean method on grouped to get the expected result.

grouped.mean()
# Out[13]: 
# key1
# a   -0.034941
# b    0.163884
# Name: data1, dtype: float64

So the operation above does group by the 'key1' column of df, even though data1 no longer contains it.

How can this happen? Does the interpreter store information about the originating DataFrame (df in this case) with the derived DataFrame/Series (data1 in this case)?

Thank you.


Solution

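  • The grouping keys usually come from the same DataFrame or Series, but they don't have to.

    Nothing about df is stored in data1. The expression df['key1'] is evaluated first, and the resulting Series is passed to groupby as the key. Because its index matches that of data1, the call data1.groupby(df['key1']) groups the same rows as data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups: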

    >>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
    {'a': [0, 1, 4], 'b': [2, 3]}
    

    This means that your groupby on data1 produces a group 'a' made of rows 0, 1, and 4 of data1 and a group 'b' made of rows 2 and 3.
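
The point above can be checked with a minimal sketch, assuming a recent pandas. The fixed values in data1 and the names reversed_key, by_series, by_list, by_reversed_series and by_reversed_list are illustrative and not from the question. It shows that the key is just an argument handed to groupby: a Series key is matched to the rows by index label, while a plain list is matched by position.

```python
import pandas as pd

# Small deterministic stand-in for the question's df (values are illustrative).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
data1 = df['data1']

# The key is passed in explicitly; data1 itself knows nothing about df.
by_series = data1.groupby(df['key1']).mean()

# A plain list of the same length is matched to the rows by position.
by_list = data1.groupby(['a', 'a', 'b', 'b', 'a']).mean()

# A Series key is matched by index label, so reordering it changes nothing...
reversed_key = df['key1'].iloc[::-1]
by_reversed_series = data1.groupby(reversed_key).mean()

# ...whereas the same values as a list are positional and give other groups.
by_reversed_list = data1.groupby(list(reversed_key)).mean()

print(by_series.to_dict())
print(by_reversed_series.to_dict())
print(by_reversed_list.to_dict())
```

Here by_series, by_list and by_reversed_series should all agree, while by_reversed_list differs because its labels are assigned to rows in reversed order.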