I have a dataframe as shown below
df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject hypo > 4 test value here','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})
From the above data frame, I would like to extract
a) everything that comes after DOSE: and before the 1st hyphen (-)
b) everything that comes after the 2nd hyphen (-)
I was trying something like below
df['text'].str.extract('(\d+[A-Za-ZS]*(\-))',expand=True) # doesn't work
s1 = df['text'].str.split() # I tried using `str.split`
s1[0][1] # it goes on like for loop which is not elegant.
Can you help me with the above?
I expect my output to be as shown below
You can use
import pandas as pd
df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})
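# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['text'].str.replace(r'^DOSE:\s*([^-]*)-[^-]*-\s*(.*)', r'\1\2', regex=True)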
# 0 667 mg Inject
# 1 667 mg Tube
# 2 667 mg PO
# Name: text, dtype: object
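If the two parts are wanted as separate columns rather than one joined string, a variant with str.extract may be easier to work with. This is only a sketch under assumptions: the column names dose and route are hypothetical, and the pattern is slightly modified (a lazy first group followed by \s*) so that the space before the first hyphen is not captured.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [11, 11, 11],
    'text': ['DOSE: 667 mg - TDS with food - Inject',
             'DOSE: 667 mg - TDS with food - Tube',
             'DOSE: 667 mg - TDS with food - PO'],
    'Summary': ['Test', 'Test1', 'Test2'],
})

# Same idea as the replace call, but each capture group becomes its own column.
# The lazy group plus \s* keeps the trailing space out of the first group.
pattern = r'^DOSE:\s*([^-]*?)\s*-[^-]*-\s*(.*)'
parts = df['text'].str.extract(pattern, expand=True)
parts.columns = ['dose', 'route']  # hypothetical column names
print(parts)
```

Rows that do not match the pattern come back as NaN in both columns, so this also makes non-conforming text easy to spot.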
The regex is
^DOSE:\s*([^-]*)-[^-]*-\s*(.*)
See the regex demo. Details:
`^DOSE:` - DOSE: at the start of string
`\s*` - 0+ whitespaces
`([^-]*)` - Group 1 (\1 refers to this group value from the replacement pattern): any 0 or more chars other than a -
`-` - a hyphen
`[^-]*` - 0+ chars other than -
`-` - a -
`\s*` - 0+ whitespaces
`(.*)` - Capturing group 2 (\2 refers to this group value from the replacement pattern): any zero or more chars other than line break chars, as many as possible.
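To check what each group captures, the pattern can be tried on a single string with Python's standard re module. This is a minimal sketch outside pandas; the variable names are illustrative only.

```python
import re

# The pattern from the answer, applied to one sample string.
pattern = re.compile(r'^DOSE:\s*([^-]*)-[^-]*-\s*(.*)')
sample = 'DOSE: 667 mg - TDS with food - Inject'

match = pattern.match(sample)
dose, route = match.group(1), match.group(2)
# Group 1 keeps the space before the first hyphen; group 2 starts after the \s*.
print(repr(dose))   # '667 mg '
print(repr(route))  # 'Inject'

# The replacement r'\1\2' joins the two groups back together.
result = pattern.sub(r'\1\2', sample)
print(result)  # 667 mg Inject
```

The trailing space in group 1 is what separates the two parts in the joined result, which is why the replacement needs no explicit space between \1 and \2.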