Tags: pandas, matplotlib, bar-chart, grouping

Plot stacked (100%) bar chart for multiple categories on multiple dates


I have the following initial dataframe:

  Post ID Submission_Date             Flair
0    row1      01.12.2020               NaN
1    row2      03.12.2020        Discussion
2    row3      03.12.2020              News
3    row4      03.12.2020        Discussion
4    row5      06.12.2020     Due Diligence
5    row6      07.12.2020        Discussion
6    row7      31.12.2020        Discussion
7    row8      01.01.2021  Hedge Fund Tears

  • Multiple Dates with missing dates in between
  • Multiple categories on dates

I grouped the dataframe with:

import pandas as pd
import numpy as np  # for test data

data = {'Post ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7', 'row8'], 'Submission_Date': ['01.12.2020', '03.12.2020', '03.12.2020', '03.12.2020', '06.12.2020', '07.12.2020', '31.12.2020', '01.01.2021'], 'Flair': [np.nan, 'Discussion', 'News', 'Discussion', 'Due Diligence', 'Discussion', 'Discussion', 'Hedge Fund Tears']}
df = pd.DataFrame(data)

# the dates are given as day.month.year, so the parse format must match that
df['Submission_Date'] = pd.to_datetime(df['Submission_Date'], format="%d.%m.%Y").dt.strftime('%Y-%m-%d')

df = df.groupby('Submission_Date')['Flair'].value_counts(normalize=True).unstack()

The result is this:

[image: result of the groupby]
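
As a quick sanity check on the grouping step, the following minimal sketch rebuilds the test data (dates left as the original dd.mm.yyyy strings, `Post ID` omitted) and confirms that the ratios for each date add up to 1, which is what makes the stacked bars reach 100%. The names `shares` and `row_totals` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same test data as above, reduced to the two columns used for grouping.
df = pd.DataFrame({
    "Submission_Date": ["01.12.2020", "03.12.2020", "03.12.2020", "03.12.2020",
                        "06.12.2020", "07.12.2020", "31.12.2020", "01.01.2021"],
    "Flair": [np.nan, "Discussion", "News", "Discussion",
              "Due Diligence", "Discussion", "Discussion", "Hedge Fund Tears"],
})

# Share of each flair per date. value_counts() ignores missing flairs by
# default, so the post without a flair is not counted.
shares = (
    df.groupby("Submission_Date")["Flair"]
    .value_counts(normalize=True)
    .unstack()
)

# Summing across the flair columns (NaN is skipped) gives the bar height per date.
row_totals = shares.sum(axis=1)
print(row_totals)
```

Every date that has at least one flaired post therefore gets a full-height bar; dates without posts simply have no row at this point.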

I want to fill the missing dates with "empty" bars, so that every date in the range is plotted.

I have already tried this:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20,10))
df.plot(kind='bar', ax=ax, stacked=True, width=1)
plt.xlabel('Submission_Date', fontsize=16)
plt.ylabel('Ratio of Flairs used', fontsize=16)

But the dates on the x-axis are wrong, because the days without any posts are not displayed.


Solution

  • Assuming this input as df2 (the output of your groupby operation, with the dates still in their original dd.mm.yyyy string form):

    Flair            Discussion  Due Diligence  Hedge Fund Tears      News
    Submission_Date                                                       
    01.01.2021              NaN            NaN               1.0       NaN
    03.12.2020         0.666667            NaN               NaN  0.333333
    06.12.2020              NaN            1.0               NaN       NaN
    07.12.2020         1.000000            NaN               NaN       NaN
    31.12.2020         1.000000            NaN               NaN       NaN
    

    You can reindex from pd.date_range:

    df2.index = pd.to_datetime(df2.index, format='%d.%m.%Y')
    df2 = df2.reindex(pd.date_range(df2.index.min(), df2.index.max()))
    df2.index = df2.index.strftime('%Y-%m-%d') 
    
    Flair       Discussion  Due Diligence  Hedge Fund Tears      News
    2020-12-03    0.666667            NaN               NaN  0.333333
    2020-12-04         NaN            NaN               NaN       NaN
    2020-12-05         NaN            NaN               NaN       NaN
    2020-12-06         NaN            1.0               NaN       NaN
    2020-12-07    1.000000            NaN               NaN       NaN
    ...
    2020-12-30         NaN            NaN               NaN       NaN
    2020-12-31    1.000000            NaN               NaN       NaN
    2021-01-01         NaN            NaN               1.0       NaN
    

    Graphical outcome (small size):

    [image: missing dates]
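
Putting the steps together, a minimal end-to-end sketch could look like the one below. It is not the exact code behind the image above: it assumes the dates are parsed with `pd.to_datetime` before grouping (so the index is already a `DatetimeIndex`), uses the non-interactive `Agg` backend, and writes to a hypothetical file name `flair_ratio.png` in the temporary directory. The names `shares` and `full_range` are illustrative.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of opening a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = {
    "Post ID": ["row1", "row2", "row3", "row4", "row5", "row6", "row7", "row8"],
    "Submission_Date": ["01.12.2020", "03.12.2020", "03.12.2020", "03.12.2020",
                        "06.12.2020", "07.12.2020", "31.12.2020", "01.01.2021"],
    "Flair": [np.nan, "Discussion", "News", "Discussion",
              "Due Diligence", "Discussion", "Discussion", "Hedge Fund Tears"],
}
df = pd.DataFrame(data)

# Parse the day.month.year strings into real timestamps before grouping.
df["Submission_Date"] = pd.to_datetime(df["Submission_Date"], format="%d.%m.%Y")

# Share of each flair per date (posts without a flair are not counted).
shares = (
    df.groupby("Submission_Date")["Flair"]
    .value_counts(normalize=True)
    .unstack()
)

# Insert one all-NaN row for every calendar day that has no posts.
full_range = pd.date_range(shares.index.min(), shares.index.max(), freq="D")
shares = shares.reindex(full_range)

# Plain date strings make shorter tick labels than full timestamps.
shares.index = shares.index.strftime("%Y-%m-%d")

fig, ax = plt.subplots(figsize=(20, 10))
shares.plot(kind="bar", stacked=True, width=1, ax=ax)
ax.set_xlabel("Submission_Date", fontsize=16)
ax.set_ylabel("Ratio of Flairs used", fontsize=16)
fig.tight_layout()

# Hypothetical output location; adjust as needed.
out_path = os.path.join(tempfile.gettempdir(), "flair_ratio.png")
fig.savefig(out_path)
```

Because the bar plot draws one slot per row, the all-NaN rows show up as empty gaps, which keeps the spacing between dates proportional to the time between them.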