Search code examples
pythonpandasgroup-bypandas-apply

advanced logic with groupby, apply and transform - compare row value with previous value and create new column


I have the following pandas dataframe:

d= {'Time': [0,1,2,0,1,2,2,3,4], 'Price': ['Auction', 'Auction','800','900','By Negotiation','700','250','250','Make Offer'],'Item': ['Picasso', 'Picasso', 'Picasso', 'DaVinci', 'DaVinci', 'DaVinci', 'Dali', 'Dali', 'Dali']}
df = pd.DataFrame(data=d)

I would like to create a fourth column 'Listing-history' which would specify the following:

  • 'first seen' if the listing is seen for the first time (this is not necessarily Time == 0)
  • 'ongoing listing' if there is no change to the price field from one timepoint to the next
  • 'Price->Auction' if the price field changes from a numeric value (which is actually encoded as a string in my dataframe) to the 'Auction' string, and vice versa if the price changes from a numeric value to the 'Auction' string.
  • I would like the code to be agnostic to the exact price field string, for example: 'Price->By Negotiation' if the price field changes from a numeric value to 'By Negotiation'

I want to group by Item, and then apply the above logic.

Finding whether a listing is 'first seen' is pretty straight forward using something like the following:

df['Price_coerced_to_numeric'] = pd.to_numeric(df['Price'], errors='coerce')
df['Price_diff'] = df.groupby(['Item'])['Price_coerced_to_numeric'].diff(1)

I suspect there is a way of using pandas apply and transform but I haven't been able to work it out. Any tips much appreciated.


Solution

  • You could use groupby.shift and numpy.select:

    # replace numbers by "Price"
    price = df['Price'].mask(pd.to_numeric(df['Price'], errors='coerce')
                               .notna(), 'Price')
    # get previous price
    prev_price = price.groupby(df['Item']).shift()
    
    # identify first row per Item
    m1 = ~df['Item'].duplicated()
    # identify change in price
    m2 = price.ne(prev_price)
    
    # combine conditions
    df['Listing-history'] = np.select([m1, m2],
                                      ['first seen', prev_price+'->'+price],
                                      'ongoing listing')
    

    Output:

       Time           Price     Item        Listing-history
    0     0         Auction  Picasso             first seen
    1     1         Auction  Picasso        ongoing listing
    2     2             800  Picasso         Auction->Price
    3     0             900  DaVinci             first seen
    4     1  By Negotiation  DaVinci  Price->By Negotiation
    5     2             700  DaVinci  By Negotiation->Price
    6     2             250     Dali             first seen
    7     3             250     Dali        ongoing listing
    8     4      Make Offer     Dali      Price->Make Offer
    

    If you really want to use groupby.transform you code refactor the code a bit:

    def history(col):
        price = col.mask(pd.to_numeric(col, errors='coerce').notna(), 'Price')
        prev_price = price.shift()
        return ((prev_price+'->'+price)
                .where(price.ne(prev_price), 'ongoing listing')
                .fillna('first seen') 
               )
    
    df['Listing-history'] = df.groupby('Item')['Price'].transform(history)
    

    variant if you can have NaNs in the original column:

    def history(col):
        price = col.mask(pd.to_numeric(col, errors='coerce').notna(), 'Price')
        prev_price = price.shift()
        out = (prev_price+'->'+price).where(price.ne(prev_price), 'ongoing listing')
        out.iat[0] = 'first seen'
        return out
    
    df['Listing-history'] = df.groupby('Item')['Price'].transform(history)