python loops csv bar-chart stacked

Stacked bar plot of large data in python


I would like to draw a stacked bar plot from a csv file in Python. I have three columns of data:

year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data. 

I need to go through all the data and draw a stacked bar plot categorized by year: the x axis is the word, the y axis is the frequency, and the legend colors should indicate the year. I want to see how each word evolved year by year. Some of the technology words are used in every year, so the stacked bar should add those values on top of each other. For example, the word gfh first plots 14 for 2017; then for 2014 I want gfh to plot (in a different color) a value of 12 on top of the 2017 segment. How do I do this?

So far I have read the csv file in my code, but I don't understand how to go over all the rows and stack the words appropriately, since some words repeat across years. The years are in random order in the csv, but I sorted them by year to make it easier.

I am just learning Python and trying to understand this plotting routine. I have 40 years of data and ~20 words, so I thought a stacked bar plot would be the best way to represent them. Any other visualisation method is also welcome. Any help is highly appreciated.


Solution

  • This can be done using pandas:

    import pandas as pd
    
    df = pd.read_csv("file.csv")
    
    # Aggregate data
    df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
    # Create list to sort by
    sorter = (
        df.groupby(["word"], as_index=False)
        .agg({"frequency": "sum"})
        .sort_values("frequency")["word"]
        .values
    )
    
    # Pivot, reindex, and plot
    df = df.pivot(index="word", columns="year", values="frequency")
    df = df.reindex(sorter)
    df.plot.bar(stacked=True)
    
    

    Which outputs:
    [Image: stacked bar chart]
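
For anyone wondering how the rows actually get "stacked", here is a rough sketch of the same idea done by hand, without pandas. It is not the code from the answer above. It assumes a comma-separated file with a header row `year,word,frequency`; the `SAMPLE` string and the helper names `build_stacks` and `plot_stacks` are made up for illustration. The key step is that each year's segment is drawn with `bottom=` set to the running total of the years already drawn.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mirroring the question's columns (assumed comma-separated
# with a header). For a real file, pass open("file.csv", newline="") instead.
SAMPLE = """year,word,frequency
2018,xyz,12
2017,gfh,14
2018,sdd,10
2014,sss,3
2014,gfh,12
2013,gfh,2
"""


def build_stacks(rows):
    """Sum frequency per (word, year) and return {word: {year: total}}."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row["word"]][int(row["year"])] += int(row["frequency"])
    return {word: dict(years) for word, years in table.items()}


def plot_stacks(stacks):
    """Draw one bar per word with one colored segment per year."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    # Order words by their total frequency, smallest first.
    words = sorted(stacks, key=lambda w: sum(stacks[w].values()))
    years = sorted({y for per_year in stacks.values() for y in per_year})

    bottoms = [0] * len(words)  # running height of each word's bar
    fig, ax = plt.subplots()
    for year in years:
        # A word that does not appear in this year contributes a 0-height segment.
        heights = [stacks[w].get(year, 0) for w in words]
        ax.bar(words, heights, bottom=bottoms, label=str(year))
        bottoms = [b + h for b, h in zip(bottoms, heights)]

    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    ax.legend(title="year")
    return ax


stacks = build_stacks(csv.DictReader(io.StringIO(SAMPLE)))
```

To see the chart, call `plot_stacks(stacks)` and then `matplotlib.pyplot.show()`. Because missing word/year pairs are treated as 0, the order of rows in the csv does not matter.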