Search code examples
pythonpandasdataframedata-mining

Panda : Group by and Aggregate


I am a bit new to pandas dataframe and currently trying to get the highest total by aggregated sales by States for my dataframe. The dataframe is about sales from store in the USA over several years. I want to know which State sold the most items and then which State sold the most but by values of items, so in dollars.

Here is a link to the dataframe : https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv

And here is my code :

# Load the data
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')

# Highest demand by number of items
df["State"].value_counts()[:50]

#Highest demand by aggregated sales
df.groupby(["Customer ID","State"])["Sales"].sum()
AggSales = df.groupby("State", as_index=False)["Sales"].count()
AggSales.sort_values(by=['Sales'], inplace=True, ascending=False)
AggSales.head(50)

My problem is that I get the same results for the highest sales by number of items and by aggregated values and I don't understand why. I have tried to group or aggregate in many different ways but always get the same results and I can't seem to seen where I am going wrong.


Solution

  • To calculate the total sales in each country/user you need after the groupby use sum() to add all the sales. count() is used to count the amount of rows that the sale appear, so it won't give us the output we desire.

    Code:

    # calculate the sales of user in each country
    sales_by_user_per_country = df.groupby(["Customer ID", "State"])["Sales"].sum().sort_values(
        ascending=False).reset_index()
    print(sales_by_user_per_country.head(50))
    
    # calculate the total sales in each country 
    sales_by_country = df.groupby(["State"])["Sales"].sum().sort_values(ascending=False).reset_index()
    print(sales_by_country.head(50))
    

    There's other option to do the same thing using pivot_table method:

    sales_by_country = df.pivot_table(index="State",values="Sales",aggfunc="sum").sort_values("Sales",ascending=False)
    print(sales_by_country.head(50))