python pandas dataframe group-by data-analysis

Group by and return all index values where a substring of text exists in a column


I have a DataFrame `df` with the following structure:

   vid  sid  pid   url
1  A    A1   page  ABCDEF
2  A    A1   page  DEF123
3  A    A1   page  GHI345
4  A    A1   page  JKL345
5  B    B1   page  AB12345EF
6  B    B2   page  IJK
7  B    B2   page  XYZ
8  C    C1   page  ABCEF

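import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the builtin;
# column names and values match the table above
data = {'vid':{1:'A',2:'A',3:'A',4:'A',5:'B',6:'B',7:'B',8:'C'},
        'sid':{1:'A1',2:'A1',3:'A1',4:'A1',5:'B1',6:'B2',7:'B2',8:'C1'},
        'pid':{1:'page',2:'page',3:'page',4:'page',5:'page',6:'page',7:'page',8:'page'},
        'url':{1:'ABCDEF',2:'DEF123',3:'GHI345',4:'JKL345',5:'AB12345EF',6:'IJK',7:'XYZ',8:'ABCEF'}
}
df = pd.DataFrame(data)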

I also have a list of substrings:

lst = ['AB','EF']

Essentially, I want to group by `sid` and check every row in `url`. If at least one row in a group contains all the elements of the list as substrings, keep that `sid`. If not, filter the `sid` out of the df. The substrings don't have to be adjacent to each other inside `url`.

Pseudo-code

group by sid
if any row in url contains all the substrings in lst
       keep the `sid`
if no row in url contains all substrings in lst
       remove the `sid` from the df
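
The pseudo-code above maps fairly directly onto `groupby(...).filter(...)`. The following is only a minimal sketch of that reading, not necessarily the most efficient approach; the helper name `has_full_match` is made up for illustration, and the frame is rebuilt from the table in the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data rebuilt from the table in the question (index 1..8)
df = pd.DataFrame(
    {
        'vid': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
        'sid': ['A1', 'A1', 'A1', 'A1', 'B1', 'B2', 'B2', 'C1'],
        'pid': ['page'] * 8,
        'url': ['ABCDEF', 'DEF123', 'GHI345', 'JKL345',
                'AB12345EF', 'IJK', 'XYZ', 'ABCEF'],
    },
    index=range(1, 9),
)
lst = ['AB', 'EF']


def has_full_match(group):
    # Hypothetical helper: True if at least one url in this sid group
    # contains every substring in lst
    return any(all(sub in url for sub in lst) for url in group['url'])


# filter() keeps all rows of the groups for which the function returns True
kept = df.groupby('sid').filter(has_full_match)
print(kept)
```

Because `filter` calls a Python function once per group, this may be slow on frames with many groups.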

Result from applying the logic above to the df using lst

   vid  sid  pid   url
1  A    A1   page  ABCDEF
2  A    A1   page  DEF123
3  A    A1   page  GHI345
4  A    A1   page  JKL345
5  B    B1   page  AB12345EF
8  C    C1   page  ABCEF

Solution

  • Build a boolean mask marking the rows whose url contains every substring in lst, then group that mask by sid and use it to filter df:

    # True for rows whose url contains every substring in lst (here `AB` and `EF`)
    mask = [all(sub in url for sub in lst) for url in df.url]
    mask = pd.Series(mask, index=df.index)

    # Group the mask by `sid`; a group is kept if any of its rows matched
    df.loc[mask.groupby(df.sid).transform('any')]
    
      vid sid   pid        url
    1   A  A1  page     ABCDEF
    2   A  A1  page     DEF123
    3   A  A1  page     GHI345
    4   A  A1  page     JKL345
    5   B  B1  page  AB12345EF
    8   C  C1  page      ABCEF
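
For reference, here is a self-contained variant of the same idea that builds the mask with `Series.str.contains` instead of a list comprehension. It is a sketch under the assumption that the entries of `lst` are literal substrings (hence `regex=False`) and that `url` has no missing values; the sample frame is rebuilt from the table in the question.

```python
from functools import reduce
import operator

import pandas as pd

# Sample data rebuilt from the table in the question (index 1..8)
df = pd.DataFrame(
    {
        'vid': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
        'sid': ['A1', 'A1', 'A1', 'A1', 'B1', 'B2', 'B2', 'C1'],
        'pid': ['page'] * 8,
        'url': ['ABCDEF', 'DEF123', 'GHI345', 'JKL345',
                'AB12345EF', 'IJK', 'XYZ', 'ABCEF'],
    },
    index=range(1, 9),
)
lst = ['AB', 'EF']

# One boolean Series per substring, combined with elementwise AND:
# a row is True only if its url contains every entry of lst
mask = reduce(
    operator.and_,
    (df['url'].str.contains(sub, regex=False) for sub in lst),
)

# Broadcast "any row in this sid matched" back to every row of the group
result = df.loc[mask.groupby(df['sid']).transform('any')]
print(result)
```

Setting `regex=False` matters if a substring could contain characters such as `.` or `+`, which would otherwise be interpreted as regular-expression syntax.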