Pandas regex to extract everything after and before two different symbols


I have a dataframe like the one shown below:

df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject  hypo > 4 test value here','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})

From the above dataframe, I would like to extract:

a) everything that comes after DOSE: and before the 1st hyphen -

b) everything that comes after the 2nd hyphen -

I tried something like the following:

df['text'].str.extract('(\d+[A-Za-ZS]*(\-))',expand=True)  # doesn't work 
s1 = df['text'].str.split() # I tried using `str.split`
s1[0][1]  # it goes on like for loop which is not elegant.

Can you help me with the above?

I expect my output to look like the image below.

[Image: expected output]


Solution

  • You can use

    import pandas as pd
    df = pd.DataFrame({'person_id': [11,11,11],'text':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - PO'],'Summary':['Test','Test1','Test2']})
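    # regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
    df['text'].str.replace(r'^DOSE:\s*([^-]*)-[^-]*-\s*(.*)', r'\1\2', regex=True)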
    # 0    667 mg Inject
    # 1      667 mg Tube
    # 2        667 mg PO
    # Name: text, dtype: object
    

    The regex is

    ^DOSE:\s*([^-]*)-[^-]*-\s*(.*)
    

    Details:

    • ^DOSE: - DOSE: at the start of string
    • \s* - 0+ whitespaces
    • ([^-]*) - Group 1 (\1 refers to this group value from the replacement pattern): any 0 or more chars other than a -
    • - - a hyphen
    • [^-]* - 0+ chars other than -
    • - - a -
    • \s* - 0+ whitespaces
    • (.*) - Capturing group 2 (\2 refers to this group value from the replacement pattern): any zero or more chars other than line break chars, as many as possible.
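
Since the question asks to extract the two parts rather than to rewrite the string, the same pattern can also be passed to `Series.str.extract`, which returns one column per capturing group. The following is a minimal sketch, not part of the original answer: the group names `dose` and `route` are hypothetical labels chosen for illustration, and Group 1 is made lazy with `\s*` after it so that the space before the first hyphen is not captured.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [11, 11, 11],
    'text': ['DOSE: 667 mg - TDS with food - Inject',
             'DOSE: 667 mg - TDS with food - Tube',
             'DOSE: 667 mg - TDS with food - PO'],
    'Summary': ['Test', 'Test1', 'Test2'],
})

# Named groups become the column names of the frame returned by str.extract.
# "dose" and "route" are hypothetical names, not taken from the question.
pattern = r'^DOSE:\s*(?P<dose>[^-]*?)\s*-[^-]*-\s*(?P<route>.*)'
parts = df['text'].str.extract(pattern)

# Attach the extracted columns to the original frame.
result = df.join(parts)
print(result[['person_id', 'dose', 'route']])
```

Rows that do not match the pattern get `NaN` in both extracted columns, so unexpected input stays visible instead of being passed through unchanged as it would be with `str.replace`.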