python pandas dataframe data-analysis data-preprocessing

Extracting a string from between two strings in a DataFrame


I'm trying to extract a value from my DataFrame. I have a column, ['DESC'], that contains strings in the following format:

_000it_ZZZ$$$-

_0780it_ZBZT$$$-

_011it_BB$$$-

_000it_CCCC$$$-

I want to extract the string between 'it_' and '$$$'

I have tried this code, but it does not seem to work:

# initializing substrings
sub1 = "it_"
sub2 = "$$$"
 
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
 
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]

I would appreciate your help.


Solution

  • You can use str.extract to get the desired output in your new column.

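    import pandas as pd
    
    df = pd.DataFrame({
        'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
    })
    
    # The lookbehind starts the match right after "it_";
    # the lookahead stops it just before "$$$-".
    pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
    # expand=False returns a Series for the single capture group
    df['results'] = df['DESC'].str.extract(pat, expand=False)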
    print(df)
    
                   DESC results
    0    _000it_ZZZ$$$-     ZZZ
    1  _0780it_ZBZT$$$-    ZBZT
    2     _011it_BB$$$-      BB
    3   _000it_CCCC$$$-    CCCC
    4    _000it_123$$$-     123
    

    You can see the regex pattern on Regex101 for more details.
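
As for why the code in the question fails: a pandas Series has no `find` method, so `df['DESC'].find(sub1)` raises an `AttributeError`, and slicing a Series with `[a:b]` selects rows rather than characters. The extra `+ 1` would also skip the first character of the result. If you would rather keep the `find`-based logic than use a regex, one option is to apply it to each string individually. This is only a minimal sketch: the helper name `between` is made up for illustration, and it assumes every value in the column is a string.

```python
import pandas as pd

df = pd.DataFrame({
    "DESC": ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
})

def between(text, start="it_", end="$$$"):
    """Return the part of text between start and end, or None if either is missing.

    Hypothetical helper for illustration; assumes text is a str.
    """
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)          # move past the start marker (no extra + 1)
    j = text.find(end, i)    # look for the end marker only after the start marker
    if j == -1:
        return None
    return text[i:j]

# Apply the helper to each string in the column
df["results"] = df["DESC"].apply(between)
print(df)
```

For large columns the `str.extract` version is likely the better fit, since it stays within pandas' string methods and handles missing values for you.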