I have a data frame grouped by multiple columns, but in this example it is grouped only by Year.
   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2002    Duck
4  2010     Dog         Cat
5  2010     Cat
6  2010    Lion
7  2010   Mouse
For each group, among the rows where Animal2 is empty, I would like to filter out the rows whose Animal1 value does not appear anywhere in that group's Animal2 column.
The expected output would be:
   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse
2  2002    Lion
3  2010     Dog         Cat
4  2010     Cat
Rows 0 & 3 stayed since their Animal2 is not empty.
Rows 1 & 2 stayed since Mouse & Lion appear in Animal2 for the first group.
Row 4 stayed since Cat appears in Animal2 for the second group.
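For reference, the sample frame above can be rebuilt like this (assuming the empty Animal2 cells are missing values, i.e. NaN/None, rather than empty strings):

```python
import pandas as pd

# sample data from the question; empty Animal2 cells assumed to be None/NaN
df = pd.DataFrame({
    'Year':    [2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Duck', 'Dog', 'Cat', 'Lion', 'Mouse'],
    'Animal2': ['Mouse,Lion', None, None, None, 'Cat', None, None, None],
})
```

If the empty cells are actually `''`, convert them first with `df['Animal2'] = df['Animal2'].replace('', None)` so that `notna()` behaves as expected below.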
EDIT: I get an error for a similar input data frame:
   Year Animal1 Animal2
0  2002     Dog   Mouse
1  2002   Mouse
2  2002    Lion
3  2010     Dog
4  2010     Cat
The expected output would be:
   Year Animal1 Animal2
0  2002     Dog   Mouse
1  2002   Mouse
The error is triggered in the .apply(lambda g: g.isin(sets[g.name]))
part of the code.
if not any(isinstance(k, slice) for k in key):
if len(key) == self.nlevels and self.is_unique:
# Complete key in unique index -> standard get_loc
try:
return (self._engine.get_loc(key), None)
except KeyError as err:
raise KeyError(key) from err
KeyError: (2010, 'Dog')
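The root cause can be seen by building the sets lookup for the EDIT's frame: the 2010 group has no non-empty Animal2 at all, so `sets` has no entry for 2010 and `sets[g.name]` fails for that group. A minimal reproduction (frame built from the EDIT's table, empty cells assumed to be None):

```python
import pandas as pd

# second input frame from the EDIT: the 2010 group has no non-empty Animal2
df2 = pd.DataFrame({
    'Year':    [2002, 2002, 2002, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Dog', 'Cat'],
    'Animal2': ['Mouse', None, None, None, None],
})

m1 = df2['Animal2'].notna()
sets = (df2.loc[m1, 'Animal2'].str.split(',')
           .groupby(df2['Year'])
           .agg(lambda x: set().union(*x)))

# 2010 is absent from the index, so sets[2010] raises a KeyError
print(2010 in sets.index)   # prints False
```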
You can use masks and regexes:
# non-empty Animal2
m1 = df['Animal2'].notna()

# build one regex per Year from the non-empty Animal2 values
patterns = (df[m1].groupby('Year')['Animal2']
              .agg('|'.join)
              .str.replace(',', '|', regex=False))

# for each Year, select the Animal1 values matching that Year's regex;
# .get falls back to an empty pattern for Years with no non-empty Animal2,
# which avoids the KeyError reported in the EDIT
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns.get(g.name, ''))))

out = df.loc[m1 | m2]
Or sets:
m1 = df['Animal2'].notna()

# one set of allowed Animal1 values per Year
sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x)))

# .get falls back to an empty set for Years with no non-empty Animal2,
# which avoids the KeyError reported in the EDIT
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets.get(g.name, set()))))

out = df.loc[m1 | m2]
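Running the set-based approach against the EDIT's error-triggering frame shows that the `Series.get` fallback handles a Year with no non-empty Animal2 gracefully (frame rebuilt from the EDIT's table, empty cells assumed to be None):

```python
import pandas as pd

# the EDIT's frame: the 2010 group has no non-empty Animal2
df = pd.DataFrame({
    'Year':    [2002, 2002, 2002, 2010, 2010],
    'Animal1': ['Dog', 'Mouse', 'Lion', 'Dog', 'Cat'],
    'Animal2': ['Mouse', None, None, None, None],
})

m1 = df['Animal2'].notna()
sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x)))

# sets.get returns an empty set for 2010 instead of raising a KeyError,
# so every empty-Animal2 row of that group is simply dropped
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets.get(g.name, set()))))

out = df.loc[m1 | m2]
print(out['Animal1'].tolist())   # ['Dog', 'Mouse']
```

This matches the expected output for the second frame: only rows 0 and 1 survive.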
Output:
   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None