Search code examples
pythonpandaspandas-groupbyaggregatedata-wrangling

Pandas aggregation: return only value in group, raise error if there are several


When wrangling dataframes, I often want to aggregate data while preserving some high-level variables. I don't want to group by these higher level variables, because this is not what I mean.

Is there an existing aggregation function that returns the unique value of a series (in each group), and throws an error if there are multiple different values?

Example

Consider this table:

df = (
    pd.DataFrame({"plant_id": [14] * 2 + [15, 16] * 3,
                  "plant": ["cactus"] * 2 + ["clover", "rose"] * 3,
                  "score": [1, 2, 6, 4, 7, 9, 2, 4],
                  "spikes": [False, True] + [False, True] * 3})
    .sort_values('plant_id')
)

#    plant_id   plant  score  spikes
# 0        14  cactus      1   False
# 1        14  cactus      2    True
# 2        15  clover      6   False
# 4        15  clover      7   False
# 6        15  clover      2   False
# 3        16    rose      4    True
# 5        16    rose      9    True
# 7        16    rose      4    True

I would like to get the average score of each plant, keeping both the plant_id and the plant name. One way to do it is to group by both variables.

df.group_by(['plant_id', 'plant']).agg({'score': "mean"})

This could be fine in this simple example, this isn't exactly what I mean. What I actually want is to group by plant_id, and then to keep the plant name, knowing that there is always only one plant name per plant id. Is there an existing function or method that would do something like unique_value here?

df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                            'score': "mean"})

#            plant     score
# plant_id                  
# 14        cactus  1.500000
# 15        clover  5.000000
# 16          rose  5.666667

Ideally, this function would raise an error if I try to apply it to a series that has several values:

df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                            'spikes': lambda x: unique_value(x),
                            'score': "mean"})

# ValueError: Not unique: some cactuses have spikes, and some don't!

Solution

  • I found a solution with existing methods only: x.unique().item().

    "unique item" says what it does, and item throws an error if applied to an object that is not of length 1.

    df.groupby('plant_id').agg({'plant': lambda x: x.unique().item(),
                                'score': "mean"})
    
    #           plant     score
    # plant_id      
    # 14        cactus    1.500000
    # 15        clover    5.000000
    # 16        rose      5.666667
    
    df.groupby('plant_id').agg({'plant': lambda x: x.unique().item(),
                                'spikes': lambda x: x.unique().item(),
                                'score': "mean"})
    
    # ValueError: can only convert an array of size 1 to a Python scalar