When wrangling dataframes, I often want to aggregate data while preserving some high-level variables. I don't want to group by these higher level variables, because this is not what I mean.
Is there an existing aggregation function that returns the unique value of a series (in each group), and throws an error if there are multiple different values?
Consider this table:
df = (
pd.DataFrame({"plant_id": [14] * 2 + [15, 16] * 3,
"plant": ["cactus"] * 2 + ["clover", "rose"] * 3,
"score": [1, 2, 6, 4, 7, 9, 2, 4],
"spikes": [False, True] + [False, True] * 3})
.sort_values('plant_id')
)
# plant_id plant score spikes
# 0 14 cactus 1 False
# 1 14 cactus 2 True
# 2 15 clover 6 False
# 4 15 clover 7 False
# 6 15 clover 2 False
# 3 16 rose 4 True
# 5 16 rose 9 True
# 7 16 rose 4 True
I would like to get the average score of each plant, keeping both the plant_id
and the plant
name. One way to do it is to group by both variables.
df.group_by(['plant_id', 'plant']).agg({'score': "mean"})
This could be fine in this simple example, this isn't exactly what I mean.
What I actually want is to group by plant_id
, and then to keep the plant
name, knowing that there is always only one plant name per plant id.
Is there an existing function or method that would do something like
unique_value
here?
df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'score': "mean"})
# plant score
# plant_id
# 14 cactus 1.500000
# 15 clover 5.000000
# 16 rose 5.666667
Ideally, this function would raise an error if I try to apply it to a series that has several values:
df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'spikes': lambda x: unique_value(x),
'score': "mean"})
# ValueError: Not unique: some cactuses have spikes, and some don't!
I found a solution with existing methods only: x.unique().item()
.
"unique item" says what it does, and item
throws an error if applied to an object that is not of length 1.
df.groupby('plant_id').agg({'plant': lambda x: x.unique().item(),
'score': "mean"})
# plant score
# plant_id
# 14 cactus 1.500000
# 15 clover 5.000000
# 16 rose 5.666667
df.groupby('plant_id').agg({'plant': lambda x: x.unique().item(),
'spikes': lambda x: x.unique().item(),
'score': "mean"})
# ValueError: can only convert an array of size 1 to a Python scalar