Search code examples
pythonpandasdataframestrsplit

How can I split a column in pandas


I have a data frame with a column that contains string and digit,

Prod_nbr| prod_name
5   Natural chip companyseasalt175g
66  cC Nacho cheese 172g
61  Smiths Crinkle cut chips chicken135g

My desired output is

Prod_nbr|pack|prod_name
5          175g  Natural chip....
66         172g  cC Nacho cheese..
61         135g   Smiths Crinkle...

I tried the code below but I didn't get my desired output, I got

Output

df['pack'] = df['prod_name'].str.extract(r'\d+\s*(\w{,5})\b').fillna('')[0]

Solution

  • I would make a custom function to solve the parsing of the field, then apply it by row to the whole DataFrame. I prefer this way because most of the time you will find some unexpected string in the data, and using a function helps you with tweaking the output when needed.

    Here is a quick example.

    def parse(row):
        s = row.prod_name
        matches = re.findall('\d+g', s)
        if matches:
            if len(matches) == 1:
                return matches[0] #if you have a single match
            else:
                return 'parsing error' #if you have multiple unexpected matches
        return np.nan #no matches
    
    
    df['pack'] = df.apply(parse, axis=1)