There are 3 columns, say: Item, Item type and Sales. Now item type keeps changing frequently. Sometimes an item (say, Apple) might have item type as one thing on a particular date ("Healthy" on Jan 1) but might have something else on another date ("Vegan" on Jan 5). There are more than 500 such items.
Eg dataframe:
Item | Item type | Date | Sales |
---|---|---|---|
Apple | Healthy | Jan 1 | 10 |
Apple | Vegan | Jan 5 | 5 |
The output needs to be:
Item | Item type | Sales |
---|---|---|
Apple | Vegan | 15 |
(assuming Vegan is the last entry in the data) and the cumulative total sales is 15.
Now if on Jan 7 another entry is made (Apple, Fruit, 2) for Item, item type, sales, the output changes to: Eg dataframe:
Item | Item type | Date | Sales |
---|---|---|---|
Apple | Healthy | Jan 1 | 10 |
Apple | Vegan | Jan 5 | 5 |
Apple | Fruit | Jan 7 | 2 |
The required output:
Item | Item type | Sales |
---|---|---|
Apple | Fruit | 17 |
I tried using the df.groupby
.last
function but it's only applicable on numerical data.
The groupby last function should work in this use case, as long as you can confirm that your data is being entered chronologically you can just do
import pandas
df = pandas.DataFrame(columns=['item', 'type', 'date', 'sales'])
df.loc[0] = ["Apple", "unripe", "Jan 1", 10]
df.loc[1] = ["Apple", "ripe", "Jan 10", 5]
df.loc[2] = ["Orange", "ripe", "Jan 12", 6]
df.groupby(["item"]).agg({'type': 'last', 'date': 'last', 'sales': sum})
That groupby ensures that the latest entries of date and type are used while sales is the sum of all the grouped entries.
The output looks like
type date sales
item
Apple ripe Jan 10 15
Orange ripe Jan 12 6