python pandas dataframe matplotlib multi-index

Pandas / Matplotlib bar plot with multi index dataframe


I have a sorted multi-index pandas data frame that I need to plot as a bar chart. Here is my data frame (linked image).

I either haven't found the solution yet, or a simple one doesn't exist, but I need to plot a bar chart of this data with Content and Category on the x-axis and Installs as the bar height.

In simple terms, I need to show what each bar consists of, e.g. 20% of it from Everyone, 40% from Teen, etc. I'm not sure that is even possible, because taking a mean of means would be wrong when the sample sizes differ. That is why I added an Uploads column to calculate a proper mean, but I haven't gotten as far as plotting by mean.

I think plotting cumulative totals would give a misleading result, though.
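The mean-of-means concern can be illustrated with a small sketch: dividing summed Installs by summed Uploads gives a mean weighted by sample size, which can differ from the plain average of the per-group means. The `toy` frame and its numbers are invented for illustration and are not taken from the Kaggle data.

```python
import pandas as pd

# Invented per-(Category, Content) totals; NOT from the real data set.
toy = pd.DataFrame(
    {
        "Category": ["Games", "Games", "Tools"],
        "Content": ["Everyone", "Teen", "Everyone"],
        "Installs": [900, 100, 300],
        "Uploads": [3, 1, 6],
    }
).set_index(["Category", "Content"])

# Weighted mean per Category: total installs divided by total uploads.
totals = toy.groupby(level="Category")[["Installs", "Uploads"]].sum()
weighted_mean = totals["Installs"] / totals["Uploads"]

# Unweighted average of the per-group means, for comparison.
group_means = toy["Installs"] / toy["Uploads"]
mean_of_means = group_means.groupby(level="Category").mean()

# Fraction of each Category's installs contributed by each Content rating.
category_total = toy.groupby(level="Category")["Installs"].transform("sum")
share = toy["Installs"] / category_total

print(weighted_mean, mean_of_means, share, sep="\n\n")
```

In this toy case the two means disagree for Games (250 versus 200), which is why the Uploads column is useful. The `share` Series is one possible way to get the "20% Everyone, 40% Teen" breakdown.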

I need a bar chart whose x-ticks are the Category values (preferably just the first 10). Each x-tick should have one bar per Content value (not always 3; it could be just "Everyone" and "Teen"), and the height of each bar should be Installs.

Ideally, it should look like this: Bar Chart (linked image)

but with each Category's bar split into separate bars for the Content ratings that occur in that Category.
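One possible way to get side-by-side bars per Category is to move the Content level into columns and let pandas draw one bar per column. This is only a sketch on invented data; the `toy` frame is hypothetical and stands in for the real multi-index frame.

```python
import pandas as pd

# Invented stand-in for the real frame: (Category, Content) index, Installs column.
toy = pd.DataFrame(
    {"Installs": [500, 200, 50, 300, 120]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Games", "Everyone"),
            ("Games", "Teen"),
            ("Games", "Mature"),
            ("Tools", "Everyone"),
            ("Tools", "Teen"),
        ],
        names=["Category", "Content"],
    ),
)

# One row per Category, one column per Content rating.
# A Category that lacks a rating gets NaN there, so no bar is drawn for it.
wide = toy["Installs"].unstack("Content")

# Keep the first 10 categories and draw grouped (non-stacked) bars.
ax = wide.head(10).plot.bar(rot=45)
ax.set_ylabel("Installs")
```

In a plain script, `import matplotlib.pyplot as plt` followed by `plt.show()` would be needed to display the figure; in a notebook it normally renders on its own.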

I have tried flattening it with DataFrame.unstack(), but that ruins the sorting of the data frame, so I used Cat2 = Cat1.reset_index(level = [0,1]) instead. I still need help with plotting.
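If the lost ordering is the main obstacle, one workaround might be to record the Category order before unstacking and restore it afterwards with `reindex`. A minimal sketch on invented data (`sorted_df` is a hypothetical stand-in for the already sorted frame):

```python
import pandas as pd

# Invented frame whose Category order ("Tools" before "Games") is deliberate.
sorted_df = pd.DataFrame(
    {"Installs": [300, 120, 500, 200]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Tools", "Everyone"),
            ("Tools", "Teen"),
            ("Games", "Everyone"),
            ("Games", "Teen"),
        ],
        names=["Category", "Content"],
    ),
)

# Remember the Category order as it appears in the sorted frame.
order = sorted_df.index.get_level_values("Category").unique()

# unstack() may re-order the rows; reindex() puts them back in the saved order.
wide = sorted_df["Installs"].unstack("Content").reindex(order)
print(wide)
```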

So far I have:

Cat = Popular.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum"})
Uploads = Popular[["Category","Content"]].value_counts().rename_axis(["Category","Content"]).reset_index(name = "Uploads")
Cat = pd.merge(Cat, Uploads, on = ["Category","Content"])
Cat = Cat.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum", "Uploads": "sum"})

which gives this result (linked image).

Then I sort it like so

Cat1 = Cat.unstack() 
Cat1 = Cat1.sort_index(key = (Cat1["Installs"].sum(axis = 1)/Cat1["Uploads"].sum(axis = 1)).get, ascending = False).stack()

Thanks to one of those solutions

That's all I have.

The data set is from Kaggle and is over 600 MB. I don't expect anyone to download it, but I would appreciate at least a simple pointer towards a solution.

P.S. This should also help me split each dot in the scatter plot below in the same way, but if not, that's fine.

P.P.S. I don't have enough reputation to post pictures, so apologies for the links.


Solution

  • ChatGPT has answered my question

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # create a dictionary of data for the DataFrame
    data = {
        'app_name': ['Google Maps', 'Uber', 'Waze', 'Spotify', 'Pandora'],
        'category': ['Navigation', 'Transportation', 'Navigation', 'Music', 'Music'],
        'rating': [4.5, 4.0, 4.5, 4.5, 4.0],
        'reviews': [1000000, 50000, 100000, 500000, 250000]
    }
    
    # create the DataFrame
    df = pd.DataFrame(data)
    
    # set the 'app_name' and 'category' columns as the index
    df = df.set_index(['app_name', 'category'])
    
    # add a new column called "content_rating" to the DataFrame, and assign a content rating to each app
    df['content_rating'] = ['Everyone', 'Teen', 'Everyone', 'Everyone', 'Teen']
    
    # Grouping the Data by category and content_rating and getting the mean of reviews
    df_grouped = df.groupby(['category','content_rating']).agg({'reviews':'mean'})
    
    # Reset the index to make it easier to plot
    df_grouped = df_grouped.reset_index()
    
    # Plotting the stacked bar chart
    df_grouped.pivot(index='category', columns='content_rating', values='reviews').plot(kind='bar', stacked=True)
    
    # Display the figure (needed when running as a script)
    plt.show()
    

    The code above uses a sample data set.

    For my own data, I pivoted the frame, added a Sum column, and sorted by that sum:

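    # Flatten the multi-index, then pivot: one row per Category, one column per Content
    piv = qw1.reset_index()
    piv = piv.pivot_table(index='Category', columns='Content', values='per')
    # Row total across Content ratings, used only for sorting
    piv["Sum"] = piv.sum(axis=1)
    # Top 10 categories by total, keeping only the Content columns
    piv_10 = piv.sort_values(by = "Sum", ascending = False)[["Adult", "Everyone", "Mature", "Teen"]].head(10)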
    

    where qw1 is the multi-index data frame.

    Then all I had to do was plot it:

    piv_10.plot.bar(stacked = True, logy = False)
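The steps above can be tried end to end on a toy frame. The `per` column is not defined in the post, so this sketch assumes it means installs per upload; that definition, the toy `qw1` values, and the `top_n` name are all assumptions made for illustration.

```python
import pandas as pd

# Invented stand-in for qw1; values are NOT from the Kaggle data.
qw1 = pd.DataFrame(
    {"Installs": [900, 100, 300, 60, 40], "Uploads": [3, 1, 6, 2, 1]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Games", "Everyone"),
            ("Games", "Teen"),
            ("Tools", "Everyone"),
            ("Books", "Everyone"),
            ("Books", "Teen"),
        ],
        names=["Category", "Content"],
    ),
)

# ASSUMPTION: "per" is installs per upload; the original post does not define it.
qw1["per"] = qw1["Installs"] / qw1["Uploads"]

# One row per Category, one column per Content rating.
piv = qw1.reset_index().pivot_table(index="Category", columns="Content", values="per")

# Remember the Content columns so the helper Sum column can be dropped again.
content_cols = list(piv.columns)

top_n = 2  # the post uses 10; 2 keeps the toy example small
top = (
    piv.assign(Sum=piv.sum(axis=1))
    .sort_values("Sum", ascending=False)[content_cols]
    .head(top_n)
)

# Stacked bars: each Category's bar is split by Content rating.
ax = top.plot.bar(stacked=True)
```

Deriving `content_cols` from the pivot avoids hard-coding the rating names, which may help when some ratings are missing from the data.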