My data looks like this:
A B
MEDUI5945189 GATE IN
MEDUI5945189 RAIL LOAD
MEDUI5945189 GATE OUT
EBKG04830245 LOADED ON VESSEL
EBKG04830245 GATE OUT
COAU7242812270 VESSEL DEPARTURE
COAU7242812270 GATE IN
COAU7242812270 CHANGE IN SHIPMENT ETA
COAU7242812270 GATE OUT FULL
EBKG04830245 CHANGE IN SHIPMENT ETA
EBKG04830245 RAIL UNLOAD
EBKG04830245 VESSEL DEPARTURE
I want to group by column A and return the entire group if at least one of the rows within this group contains the word 'RAIL' in column B. The expected outcome would be:
A B
MEDUI5945189 GATE IN
MEDUI5945189 RAIL LOAD
MEDUI5945189 GATE OUT
EBKG04830245 LOADED ON VESSEL
EBKG04830245 GATE OUT
EBKG04830245 CHANGE IN SHIPMENT ETA
EBKG04830245 RAIL UNLOAD
EBKG04830245 VESSEL DEPARTURE
I understand that I need to df_sel = df.groupby('A')['B']
but struggling with setting the condition.
You can filter all A
groups if exist RAIL
by Series.str.contains
and filter original column A
in Series.isin
in boolean indexing
:
df1 = df[df.A.isin(df.loc[df['B'].str.contains('RAIL'), 'A'])]
Or use GroupBy.transform
with GroupBy.any
for mask:
df1 = df[df['B'].str.contains('RAIL').groupby(df.A).transform('any')]
print (df1)
A B
0 MEDUI5945189 GATE IN
1 MEDUI5945189 RAIL LOAD
2 MEDUI5945189 GATE OUT
3 EBKG04830245 LOADED ON VESSEL
4 EBKG04830245 GATE OUT
9 EBKG04830245 CHANGE IN SHIPMENT ETA
10 EBKG04830245 RAIL UNLOAD
11 EBKG04830245 VESSEL DEPARTURE