I have a data frame with a column that contains string and digit,
Prod_nbr| prod_name
5 Natural chip companyseasalt175g
66 cC Nacho cheese 172g
61 Smiths Crinkle cut chips chicken135g
My desired output is
Prod_nbr|pack|prod_name
5 175g Natural chip....
66 172g cC Nacho cheese..
61 135g Smiths Crinkle...
I tried the code below but I didn't get my desired output, I got
df['pack'] = df['prod_name'].str.extract(r'\d+\s*(\w{,5})\b').fillna('')[0]
I would make a custom function to solve the parsing of the field, then apply it by row to the whole DataFrame. I prefer this way because most of the time you will find some unexpected string in the data, and using a function helps you with tweaking the output when needed.
Here is a quick example.
def parse(row):
s = row.prod_name
matches = re.findall('\d+g', s)
if matches:
if len(matches) == 1:
return matches[0] #if you have a single match
else:
return 'parsing error' #if you have multiple unexpected matches
return np.nan #no matches
df['pack'] = df.apply(parse, axis=1)