Search code examples
pythonpython-3.xpandassubstringfindall

Unable to find the first occurrence of substring using regex for set of values in pandas


I have a dataframe as below, i need to find only the first occurrence in a string for set of values.

I'm unable to use "find" function along with regex and dictionary. And if i use "findall" function, it is ofcourse finding all occurrence which is not what i need.

Text

51000/1-PLASTIC 150 Prange
51034/2-RUBBER KL 100 AA
51556/3-PAPER BD+CM 1 BOXT2
52345/1-FLOW IJ 10place 500 plastic
54975/1-DIVIDER PQR 100 BC
54975/1-SCALE DEF 555 AB Apple 
54975/1-PLASTIC ABC 4.6 BB plastic  

Code:

import re

L = ['PLASTIC','RUBBER','PAPER','FLOW']
pat = '|'.join(r"\b{}\b".format(x) for x in L)

df['Result'] = df['Text'].str.find(pat, flags=re.I).str.join(' ')
print(df)

df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.replace(np.nan, "Not known", regex=True)
#df['Result'] = df['Result'].str.lower()

Expected Result:

Text                                                   Result

51000/1-PLASTIC 150 Prange                            Plastic
51034/2-RUBBER KL 100 AA                              Rubber
51556/3-PAPER BD+CM 1 BOXT2                           Paper
52345/1-FLOW IJ 10place 500 plastic                   Flow
54975/1-DIVIDER PQR 100 BC                            Not known
54975/1-SCALE DEF 555 AB Apple                        Not KNown 
54975/1-PLASTIC ABC 4.6 BB plastic                    Plastic

Error:

TypeError: find() got an unexpected keyword argument 'flags'


Solution

  • Use Series.str.findall instead find with select first value of lists returned of findall by indexing str[0]:

    import re
    
    L = ['PLASTIC','RUBBER','PAPER','FLOW']
    pat = '|'.join(r"\b{}\b".format(x) for x in L)
    
    df['Result'] = df['Text'].str.findall(pat, flags=re.I).str[0]
    

    Or use Series.str.extract:

    df['Result'] = df['Text'].str.extract('(' + pat + ')', flags=re.I)
    

    Then convert missing values to Not known:

    df['Result'] = df['Result'].fillna("Not known")
    

    Last if necessary use Series.str.capitalize:

    df['Result'] = df['Result'].str.capitalize()
    print (df)
                                       Text     Result
    0            51000/1-PLASTIC 150 Prange    Plastic
    1              51034/2-RUBBER KL 100 AA     Rubber
    2           51556/3-PAPER BD+CM 1 BOXT2      Paper
    3   52345/1-FLOW IJ 10place 500 plastic       Flow
    4            54975/1-DIVIDER PQR 100 BC  Not known
    5        54975/1-SCALE DEF 555 AB Apple  Not known
    6  54975/1-PLASTIC ABC 4.6 BB plastic      Plastic