Say I have the following dataframe covering hundreds of different names:
df = pd.DataFrame({"NAME":["A", "A", "A" ,"A",
"B", "B", "B",
"C", "C", "C", "C",
"D", "D",],
"CLASS":["LI","BO","EQ","AI",
"LI", "EQ", "AI",
"LI","BO","EQ","AI",
"EQ", "AI",
]})
Using the groupby-function of Pandas, how am I supposed to filter out only those names that contain both EQ AND AI but NOT BO? Ideally, I should obtain the following:
NAME CLASS
B LI
B EQ
B AI
D EQ
D AI
Any suggestions?
Use groupby.transform
with set operations:
out = df[df.groupby('NAME')['CLASS']
.transform(lambda x: ((S:=set(x))>={'EQ', 'AI'}) and ('BO' not in S))]
# or
# out = df[df.groupby('NAME')['CLASS']
# .transform(lambda x: ((S:=set(x))>={'EQ', 'AI'})
# and not S.intersection({'BO'}))]
Variant with groupby.agg
and isin
:
g = df.groupby('NAME')['CLASS'].agg(set)
out = df[df['NAME'].isin(g.index[(g >= {'EQ', 'AI'}) & ~(g >= {'BO'})])]
Output:
NAME CLASS
4 B LI
5 B EQ
6 B AI
11 D EQ
12 D AI
with an extra group for completeness
# transform approach
NAME CLASS g >= {'EQ', 'AI'} ~(g >= {'BO'}) &
0 A LI True False False
1 A BO True False False
2 A EQ True False False
3 A AI True False False
4 B LI True True True
5 B EQ True True True
6 B AI True True True
7 C LI True False False
8 C BO True False False
9 C EQ True False False
10 C AI True False False
11 D EQ True True True
12 D AI True True True
13 E AI False True False
# agg approach
g (g >= {'EQ', 'AI'}) ~(g >= {'BO'}) &
NAME
A {LI, EQ, BO, AI} True False False
B {LI, EQ, AI} True True True
C {LI, EQ, BO, AI} True False False
D {EQ, AI} True True True
E {AI} False True False