Search code examples
pythonpandas

Writing back to a panda groupby group


Good morning all

I am trying to process a lot of data, and I need to group data, look at the group, then set a value based on the other entries in the group, but I want to set the value in a column in the full dataset. What I can't figure out is how I can use the group to write back to the main dataframe.

So as an example, I created this data frame

import pandas as pd
data = [{
    "class": "cat",
    "name": "Fluffy",
    "age": 3,
    "child": "Whiskers",
    "parents_in_group": ""
}, {
    "class": "dog",
    "name": "Spot",
    "age": 5
}, {
    "class": "cat",
    "name": "Whiskers",
    "age": 7
}, {
    "class": "dog",
    "name": "Rover",
    "age": 2,
    "child": "Spot"
}]
df = pd.DataFrame(data)
df

So as an example, lets say that I want to set the parrents_in_group to a list of all the parrents in the group, easy to do

for name, group in group_by_class:
  mask = group["child"].notna()
  print("This is the parrent in group")
  print(group[mask])
  parent_name = group[mask]["name"].values[0]
  print(f"This is the parent name: {parent_name}")
  group["parents_in_group"] = parent_name
  print("And now we have the name set in group")
  print(group)

That updates the group, but not the actual data frame. So how would I go about writing this information back to the main data frame

Using the name and search

This works, but seems a bit untidy

for name, group in group_by_class:
    mask = group["child"].notna()
    parent_name = group[mask]["name"].values[0]
    df.loc[df['class'] == name, 'parents_in_group'] = parent_name
    
df

Using group

How would I go about using group to set the values, rather than searching for the name that the group was created by. Or are there better ways to going about it.

The real challenge I'm having is that I need to get the group, find some specific values in the group, then set some fields based on the data found.

Any help of course welcome.


Solution

  • A loop-less approach would be to compute a groupby.first after dropna, then to map the output:

    df['parents_in_group'] = df['class'].map(
        df.dropna(subset='child').groupby('class')['name'].first()
    )
    
    # variant
    df['parents_in_group'] = df['class'].map(
        df['name'].where(df['child'].notna()).groupby(df['class']).first()
    )
    

    Or, with drop_duplicates in place of groupby (for efficiency):

    df['parents_in_group'] = df['class'].map(
        df.dropna(subset='child')
          .drop_duplicates(subset='class')
          .set_index('class')['name']
    )
    

    Output:

      class      name  age     child parents_in_group
    0   cat    Fluffy    3  Whiskers           Fluffy
    1   dog      Spot    5       NaN            Rover
    2   cat  Whiskers    7       NaN           Fluffy
    3   dog     Rover    2      Spot            Rover
    

    Or, if efficiency doesn't really matter with a groupy.apply:

    out = (df.groupby('class', sort=False, group_keys=False)
             .apply(lambda x: x.assign(parents_in_group=x.loc[x['child'].notna(),
                                                              'name']
                                                         .iloc[:1].squeeze()),
                   include_groups=False)
          )