I am supposed to aggregate a table of sales by year and id, then also add an age column to the table. Unfortunately when I add the last column, I keep getting NaN values for that column only. Like so:
|year|id|sales|age|
__________________
2022 |1 | 200| NaN|
|2 | 342| NaN|
2021 |34| 500| NaN|
|10| 20 | NaN|
|7 | 4200| Nan|
I have an "age" column in my df, so each id has an age assigned to it. Here is my code:
df['sales'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year
def table(df):
test = order.groupby(['year','id'])\
.agg(sales = ('sales', 'sum'))\
.assign(age = df['age'])
return table
table(df)
Could someone please let me know why it does not show the age the way it is supposed to?
It gives NaN
values as it does not understand what you want. You are assigning a column to a smaller dataframe as you are grouping by year
and id
. So the question is, what do you want to happen to the column age
?
With my dummy dataset:
df = pd.DataFrame({'year': [2021, 2021, 2021, 2022],
'id': [1, 1, 2, 1],
'sales': [20, 30, 40, 50],
'age': [3, 4, 5, 6]})
If you want to sum or take the average of the column age, do a similar thing as you did with sales
:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'mean'))
Output:
sales age
year id
2021 1 50 3.5
2 40 5.0
2022 1 50 6.0
Do you want to get a list of all values of age, use the argument unique
instead of mean
:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'unique'))
Output:
sales age
year id
2021 1 50 [3, 4]
2 40 [5]
2022 1 50 [6]
If the age is the same within a year/id group, you can take the minimum (or maximum, it does not matter as they are the same). Note that in my dummy dataset they are not the same:
df.groupby(['year', 'id']).agg(sales = ('sales', 'sum'), age = ('age', 'min'))
Output:
sales age
year id
2021 1 50 3
2 40 5
2022 1 50 6