pythonpandassplitextract

Substring of a specific pattern in different positions


I have a column in my dataframe where all the equipment used for a test are in codes and grouped together.

data = [{'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,P-EE047777,P-PP04757,P-IR07086,P-EE2222,P-ER20000,tEE3-ERE:CC046846'},
        {'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,tEE3-ERE:CC04647, P-CC07777,P-PP047557,P-IR07086,P-EE2223,P-ER20000'},
        {'Equipment': 'tEE3-ERE:CC04648, C-FDO:ER7809999,M-GEP:IO09765,P-PP07777,P-PP04757,P-IR07086,P-EE2224,P-ER20000'},
        {'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,P-PP07777, P-PP04757, tEE3-ERE:CC04649, P-ER01465,P-EE22250,P-ER201212000'}]

I need to get the 4th occurrence of the "P-" equipment and all the occurrences of the "tEE3-ERE" equipment per line, but I can't do this by position slicing, since they are shuffled.

I tried to split by comma and some regex through extract, but it didn't work as I have more than one "P-". So I don't need any other occurrence of P- (but 4th) nor other equipment like "C-FDO" and "M-GEP".

Expected result:

P-EE2222  tEE3-ERE:CC046846
P-EE2223  tEE3-ERE:CC04647
P-EE2224  tEE3-ERE:CC04648
P-EE22250 tEE3-ERE:CC04649

P-      tEE3-ERE
EE2222  CC046846
EE2223  CC04647
EE2224  CC04648
EE22250 CC04649

Solution

  • You can indeed use str.extractall, filtering on the match number (3 for the 4th), and join with the extracted values for tEE3-ERE:

    out = (df['Equipment'].str.extractall('P-([^,]+)')
           .query('match == 3').droplevel('match')
           .rename(columns={0: ('P-')})
           .join(df['Equipment'].str.extractall('tEE3-ERE:([^,]+)')[0]
                 .droplevel('match').rename('tEE3-ERE'))
          )
    

    Output:

            P-  tEE3-ERE
    0   EE2222  CC046846
    1   EE2223   CC04647
    2   EE2224   CC04648
    3  EE22250   CC04649
    

    Alternative output:

    out = (df['Equipment'].str.extractall('(P-[^,]+)')
           .query('match == 3').droplevel('match')
           .join(df['Equipment'].str.extractall('(tEE3-ERE:[^,]+)')[0]
                 .droplevel('match').rename(1))
          )
    
               0                  1
    0   P-EE2222  tEE3-ERE:CC046846
    1   P-EE2223   tEE3-ERE:CC04647
    2   P-EE2224   tEE3-ERE:CC04648
    3  P-EE22250   tEE3-ERE:CC04649