python-3.x regex pandas dataframe regex-lookarounds

Pandas regex to extract everything after and before two different symbols

I have a dataframe as shown below

df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject  hypo > 4 test value here','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})

From the above dataframe, I would like to extract everything that

a) comes after DOSE: and before the 1st hyphen -

b) comes after the 2nd hyphen -.

I tried something like this

df['text'].str.extract('(\d+[A-Za-ZS]*(\-))',expand=True)  # doesn't work 
s1 = df['text'].str.split() # I tried using `str.split`
s1[0][1]  # it goes on like for loop which is not elegant.

Can you help me with the above?

I expect my output to be as shown below

Solution

You can use

import pandas as pd
df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})
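# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df['text'].str.replace(r'^DOSE:\s*([^-]*)-[^-]*-\s*(.*)', r'\1\2', regex=True)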
# 0    667 mg Inject
# 1      667 mg Tube
# 2        667 mg PO
# Name: text, dtype: object
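
If the two parts are wanted as separate columns rather than as one joined string, `Series.str.extract` with named groups is one possible alternative. This is only a sketch: the column names `dose` and `route` are hypothetical labels chosen for illustration, and the lazy `[^-]*?` followed by `\s*` is a small variation on the pattern above that trims the space before the first hyphen.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [11, 11, 11],
    'text': [
        'DOSE: 667 mg - TDS with food - Inject  hypo > 4 test value here',
        'DOSE: 667 mg - TDS with food - Tube',
        'DOSE: 667 mg - TDS with food - PO',
    ],
})

# 'dose' and 'route' are illustrative group names, not names from the question.
pattern = r'^DOSE:\s*(?P<dose>[^-]*?)\s*-[^-]*-\s*(?P<route>.*)'

# str.extract returns one column per capturing group, named after the group.
parts = df['text'].str.extract(pattern)
print(parts)
```

The resulting frame can be attached to the original with `df.join(parts)` if both pieces should live next to `person_id`.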

The regex is

^DOSE:\s*([^-]*)-[^-]*-\s*(.*)

Details:

^DOSE: - DOSE: at the start of string
\s* - 0+ whitespaces
([^-]*) - Group 1 (\1 refers to this group value from the replacement pattern): any 0 or more chars other than a -
- - a hyphen
[^-]* - 0+ chars other than -
- - a -
\s* - 0+ whitespaces
(.*) - Capturing group 2 (\2 refers to this group value from the replacement pattern): any zero or more chars other than line break chars, as many as possible.
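
To see what each group captures, the same pattern can be tried on a single string with the standard `re` module. This is a small illustration, using the first `text` value from the question as the sample input; the variable names are arbitrary.

```python
import re

pattern = re.compile(r'^DOSE:\s*([^-]*)-[^-]*-\s*(.*)')
sample = 'DOSE: 667 mg - TDS with food - Inject  hypo > 4 test value here'

m = pattern.match(sample)
group1 = m.group(1)  # text between "DOSE:" and the 1st hyphen (keeps the trailing space)
group2 = m.group(2)  # text after the 2nd hyphen, leading whitespace skipped

# The replacement r'\1\2' simply concatenates the two groups.
combined = pattern.sub(r'\1\2', sample)

print(repr(group1))
print(repr(group2))
print(repr(combined))
```

Group 1 keeps the space before the first hyphen, which is why the joined result reads naturally without adding a separator in the replacement.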