Search code examples
pythondataframegroup-by

Take the last occurrence of a string value based on the latest date


There are 3 columns, say: Item, Item type and Sales. Now item type keeps changing frequently. Sometimes an item (say, Apple) might have item type as one thing on a particular date ("Healthy" on Jan 1) but might have something else on another date ("Vegan" on Jan 5). There are more than 500 such items.

Eg dataframe:

Item Item type Date Sales
Apple Healthy Jan 1 10
Apple Vegan Jan 5 5

The output needs to be:

Item Item type Sales
Apple Vegan 15

(assuming Vegan is the last entry in the data) and the cumulative total sales is 15.

Now if on Jan 7 another entry is made (Apple, Fruit, 2) for Item, item type, sales, the output changes to: Eg dataframe:

Item Item type Date Sales
Apple Healthy Jan 1 10
Apple Vegan Jan 5 5
Apple Fruit Jan 7 2

The required output:

Item Item type Sales
Apple Fruit 17

I tried using the df.groupby .last function but it's only applicable on numerical data.


Solution

  • The groupby last function should work in this use case, as long as you can confirm that your data is being entered chronologically you can just do

    import pandas
    
    df = pandas.DataFrame(columns=['item', 'type', 'date', 'sales'])
    df.loc[0] = ["Apple", "unripe", "Jan 1", 10]
    df.loc[1] = ["Apple", "ripe", "Jan 10", 5]
    df.loc[2] = ["Orange", "ripe", "Jan 12", 6]
    df.groupby(["item"]).agg({'type': 'last', 'date': 'last', 'sales': sum})
    
    

    That groupby ensures that the latest entries of date and type are used while sales is the sum of all the grouped entries.

    The output looks like

            type    date    sales
    item            
    Apple   ripe    Jan 10  15
    Orange  ripe    Jan 12  6