Search code examples
pandasselectgroup-by

Pandas: How to group by a certain column and select groups containing certain values?


My data looks like this:

A                B
MEDUI5945189     GATE IN
MEDUI5945189     RAIL LOAD
MEDUI5945189     GATE OUT
EBKG04830245     LOADED ON VESSEL
EBKG04830245     GATE OUT
COAU7242812270   VESSEL DEPARTURE
COAU7242812270   GATE IN
COAU7242812270   CHANGE IN SHIPMENT ETA
COAU7242812270   GATE OUT FULL
EBKG04830245     CHANGE IN SHIPMENT ETA
EBKG04830245     RAIL UNLOAD
EBKG04830245     VESSEL DEPARTURE

I want to group by column A and return the entire group if at least one of the rows within this group contains the word 'RAIL' in column B. The expected outcome would be:

A                B
MEDUI5945189     GATE IN
MEDUI5945189     RAIL LOAD
MEDUI5945189     GATE OUT
EBKG04830245     LOADED ON VESSEL
EBKG04830245     GATE OUT
EBKG04830245     CHANGE IN SHIPMENT ETA
EBKG04830245     RAIL UNLOAD
EBKG04830245     VESSEL DEPARTURE

I understand that I need to df_sel = df.groupby('A')['B'] but struggling with setting the condition.


Solution

  • You can filter all A groups if exist RAIL by Series.str.contains and filter original column A in Series.isin in boolean indexing:

    df1 = df[df.A.isin(df.loc[df['B'].str.contains('RAIL'), 'A'])]
    

    Or use GroupBy.transform with GroupBy.any for mask:

    df1 = df[df['B'].str.contains('RAIL').groupby(df.A).transform('any')]
    print (df1)
                   A                       B
    0   MEDUI5945189                 GATE IN
    1   MEDUI5945189               RAIL LOAD
    2   MEDUI5945189                GATE OUT
    3   EBKG04830245        LOADED ON VESSEL
    4   EBKG04830245                GATE OUT
    9   EBKG04830245  CHANGE IN SHIPMENT ETA
    10  EBKG04830245             RAIL UNLOAD
    11  EBKG04830245        VESSEL DEPARTURE