Tags: python, regex, pandas, group-by, series

Replacing value after groupby


I have a data frame of a grocery store record:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Tom', 'apple1'], ['Tom', 'banana35'], ['Jeff', 'pear0']]),
                  columns=['customer', 'product'])

| customer | product  |
| -------- | -------- |
| Tom      | apple1   |
| Tom      | banana35 |
| Jeff     | pear0    |

I want to get all the products that a customer ever bought, so I used

product_by_customer = df.groupby('customer')['product'].unique()
product_by_customer
customer
Jeff [pear0]
Tom [apple1, banana35]

I want to get rid of the numbers after the product name. I tried

product_by_customer.str.replace('[0-9]', '')

but it replaced every value with NaN.

My desired output is

| customer | product       |
| -------- | ------------- |
| Jeff     | pear          |
| Tom      | apple, banana |

Any help is appreciated!


Solution

  • After groupby(...).unique(), each value in the product column is a NumPy ndarray rather than a string, so .str.replace has no string to work on and returns NaN. Try the following code.

    import re

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.array([['Tom', 'apple1'], ['Tom', 'banana35'], ['Jeff', 'pear0']]),
                   columns=['customer', 'product'])
    df1 = df.groupby(["customer"])["product"].unique().reset_index()
    # Strip the digits from every product name in each customer's array
    df1["product"] = df1["product"].apply(lambda x: [re.sub(r"\d", "", v) for v in x])
    
    
    df1
    Out[52]: 
      customer          product
    0     Jeff           [pear]
    1      Tom  [apple, banana]
    

    The lambda visits each customer's array, and re.sub removes the digits from every element in it.
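
If you want a plain string per customer, as in the desired output, one possible alternative is to strip the digits before grouping, while the values are still ordinary strings, and then join each customer's unique products. This is only a sketch under the assumption that every product name ends in digits you want to drop; the names `grouped`, `cleaned` and the `", "` separator are my own choices, not part of the question.

```python
import pandas as pd

df = pd.DataFrame(
    [["Tom", "apple1"], ["Tom", "banana35"], ["Jeff", "pear0"]],
    columns=["customer", "product"],
)

# After groupby(...).unique(), each element is an array of products,
# not a string, so a later .str.replace has no string to act on.
grouped = df.groupby("customer")["product"].unique()

# Alternative: remove the digits first, on the original string column,
# then group and join each customer's unique products into one string.
cleaned = df.assign(product=df["product"].str.replace(r"\d+", "", regex=True))
product_by_customer = cleaned.groupby("customer")["product"].agg(
    lambda s: ", ".join(s.unique())
)

print(product_by_customer)
```

Passing regex=True explicitly matters here: in recent pandas versions str.replace treats the pattern as a literal string by default.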