Good morning all
I am trying to process a lot of data, and I need to group data, look at the group, then set a value based on the other entries in the group, but I want to set the value in a column in the full dataset. What I can't figure out is how I can use the group to write back to the main dataframe.
So as an example, I created this data frame
import pandas as pd
data = [{
"class": "cat",
"name": "Fluffy",
"age": 3,
"child": "Whiskers",
"parents_in_group": ""
}, {
"class": "dog",
"name": "Spot",
"age": 5
}, {
"class": "cat",
"name": "Whiskers",
"age": 7
}, {
"class": "dog",
"name": "Rover",
"age": 2,
"child": "Spot"
}]
df = pd.DataFrame(data)
df
So as an example, lets say that I want to set the parrents_in_group to a list of all the parrents in the group, easy to do
for name, group in group_by_class:
mask = group["child"].notna()
print("This is the parrent in group")
print(group[mask])
parent_name = group[mask]["name"].values[0]
print(f"This is the parent name: {parent_name}")
group["parents_in_group"] = parent_name
print("And now we have the name set in group")
print(group)
That updates the group, but not the actual data frame. So how would I go about writing this information back to the main data frame
Using the name and search
This works, but seems a bit untidy
for name, group in group_by_class:
mask = group["child"].notna()
parent_name = group[mask]["name"].values[0]
df.loc[df['class'] == name, 'parents_in_group'] = parent_name
df
Using group
How would I go about using group to set the values, rather than searching for the name that the group was created by. Or are there better ways to going about it.
The real challenge I'm having is that I need to get the group, find some specific values in the group, then set some fields based on the data found.
Any help of course welcome.
A loop-less approach would be to compute a groupby.first
after dropna
, then to map
the output:
df['parents_in_group'] = df['class'].map(
df.dropna(subset='child').groupby('class')['name'].first()
)
# variant
df['parents_in_group'] = df['class'].map(
df['name'].where(df['child'].notna()).groupby(df['class']).first()
)
Or, with drop_duplicates
in place of groupby
(for efficiency):
df['parents_in_group'] = df['class'].map(
df.dropna(subset='child')
.drop_duplicates(subset='class')
.set_index('class')['name']
)
Output:
class name age child parents_in_group
0 cat Fluffy 3 Whiskers Fluffy
1 dog Spot 5 NaN Rover
2 cat Whiskers 7 NaN Fluffy
3 dog Rover 2 Spot Rover
Or, if efficiency doesn't really matter with a groupy.apply
:
out = (df.groupby('class', sort=False, group_keys=False)
.apply(lambda x: x.assign(parents_in_group=x.loc[x['child'].notna(),
'name']
.iloc[:1].squeeze()),
include_groups=False)
)