I have a column in my dataframe where all the equipment used for a test are in codes and grouped together.
data = [{'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,P-EE047777,P-PP04757,P-IR07086,P-EE2222,P-ER20000,tEE3-ERE:CC046846'},
{'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,tEE3-ERE:CC04647, P-CC07777,P-PP047557,P-IR07086,P-EE2223,P-ER20000'},
{'Equipment': 'tEE3-ERE:CC04648, C-FDO:ER7809999,M-GEP:IO09765,P-PP07777,P-PP04757,P-IR07086,P-EE2224,P-ER20000'},
{'Equipment': 'C-FDO:ER09999,M-GEP:IO09765,P-PP07777, P-PP04757, tEE3-ERE:CC04649, P-ER01465,P-EE22250,P-ER201212000'}]
I need to get the 4th occurrence of the "P-" equipment and all the occurrences of the "tEE3-ERE" equipment per line, but I can't do this by position slicing, since they are shuffled.
I tried to split by comma and some regex through extract, but it didn't work as I have more than one "P-". So I don't need any other occurrence of P- (but 4th) nor other equipment like "C-FDO" and "M-GEP".
Expected result:
P-EE2222 tEE3-ERE:CC046846
P-EE2223 tEE3-ERE:CC04647
P-EE2224 tEE3-ERE:CC04648
P-EE22250 tEE3-ERE:CC04649
P- tEE3-ERE
EE2222 CC046846
EE2223 CC04647
EE2224 CC04648
EE22250 CC04649
You can indeed use str.extractall
, filtering on the match number (3 for the 4th), and join
with the extracted values for tEE3-ERE
:
out = (df['Equipment'].str.extractall('P-([^,]+)')
.query('match == 3').droplevel('match')
.rename(columns={0: ('P-')})
.join(df['Equipment'].str.extractall('tEE3-ERE:([^,]+)')[0]
.droplevel('match').rename('tEE3-ERE'))
)
Output:
P- tEE3-ERE
0 EE2222 CC046846
1 EE2223 CC04647
2 EE2224 CC04648
3 EE22250 CC04649
Alternative output:
out = (df['Equipment'].str.extractall('(P-[^,]+)')
.query('match == 3').droplevel('match')
.join(df['Equipment'].str.extractall('(tEE3-ERE:[^,]+)')[0]
.droplevel('match').rename(1))
)
0 1
0 P-EE2222 tEE3-ERE:CC046846
1 P-EE2223 tEE3-ERE:CC04647
2 P-EE2224 tEE3-ERE:CC04648
3 P-EE22250 tEE3-ERE:CC04649