python matplotlib datetime line boxplot

How to plot a line plot and a box plot in the same graph with dates on the x-axis


I've been trying to make a plot from two dataframes that contain discharge values from a monthly time series. The first dataframe (df1) contains the columns date, year, month and discharge. The second dataframe (df2) contains the same columns, but with several different discharge values for each month. I want to plot both dataframes in the same figure. df1 should be a line plot with date on the x-axis and discharge on the y-axis. df2 should be a box plot with date on the x-axis and the discharge values grouped by month on the y-axis.

Here is the code I have tested:

from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2023-04-01', '2023-03-01', '2023-02-01', '2023-01-01', '2022-12-01'],
    'year': [2023,2023,2023,2023,2022],
    'month': [4,3,2,1,12],
    'discharge': [10, 20, 30, 15, 25]
})

# Define the start and end dates
start_date = datetime(2023, 5, 1)
end_date = datetime(2023, 9, 1)

# Generate the date range
dates = pd.date_range(start=start_date, end=end_date, freq='MS')

# Define the values for the DataFrame
# (repeat() keeps year and month aligned with the repeated dates)
df2 = pd.DataFrame({'date': dates.repeat(10),
                    'year': dates.year.repeat(10),
                    'month': dates.month.repeat(10),
                    'discharge': [i+1 for i in range(len(dates))] * 10})

# Plot the line
fig, ax = plt.subplots()
ax.plot(df1['date'], df1['discharge'], label='Discharge')

# Plot the box plot (this is the call that raises the error)
df2.boxplot(column='discharge', by='month', positions=[df2['date'][2]], widths=10, ax=ax)

I got this error:

ValueError: List of boxplot statistics and positions values must have same the length


Solution

  • The error occurs because positions has just one value, df2['date'][2], while the box plot draws one box per month and needs one position per box. But even if you passed the other values manually using unique(), it would not work. There are a couple of issues. One is that you need datetime values for both the line and the box plot, and df1['date'] holds strings. I am also assuming that you want the dates to increase along a common x-axis and that the y-axis should be the same for both plots.

    To do this, first convert the date of every point (for both plots) to an integer: the number of days from the earliest date in either plot. Each line point and each box is then plotted at that offset on an integer x-axis. After plotting, change the x-tick labels back to date format. Below is the code, with as many comments as possible. I hope this is what you are looking for.

    import datetime
    import pandas as pd
    
    df1 = pd.DataFrame({
        'date': ['2023-04-01', '2023-03-01', '2023-02-01', '2023-01-01', '2022-12-01'],
        'year': [2023, 2023, 2023, 2023, 2022],
        'month': [4, 3, 2, 1, 12],
        'discharge': [10, 20, 30, 15, 25]
    })
    # Convert the date strings to datetime so day offsets can be computed later
    df1['date'] = pd.to_datetime(df1['date'])
    
    # Define the start and end dates
    start_date = datetime.datetime(2023, 5, 1)
    end_date = datetime.datetime(2023, 9, 1)
    
    # Generate the date range
    dates = pd.date_range(start=start_date, end=end_date, freq='MS')
    
    # Define the values for the DataFrame
    # (repeat() keeps year and month aligned with the repeated dates)
    df2 = pd.DataFrame({
        'date': dates.repeat(10),
        'year': dates.year.repeat(10),
        'month': dates.month.repeat(10),
        'discharge': [i+1 for i in range(len(dates))] * 10
    })
    
    ## Get the earliest date in BOTH plots combined
    begin=pd.concat([df1.date, pd.Series(df2.date.unique())]).min()
    
    ## Add columns linepos and boxpos to the dataframes to show offset from earliest date
    df1['linepos']=(df1['date']-begin).dt.days
    df2['boxpos']=(df2['date']-begin).dt.days
    
    ## Plot plots - note I am using boxpos and linepos, not dates for x-axis
    ax=df2[['discharge', 'boxpos']].boxplot(by='boxpos', widths=4, positions=df2.boxpos.unique(), figsize=(20,7))
    ax.plot(df1['linepos'], df1['discharge'], label='Discharge')
    
    ## Set x-lim to include both line and boxes
    ax.set_xlim( [ min(df2.boxpos.min(), df1.linepos.min())-10, max(df2.boxpos.max(), df1.linepos.max()) + 10 ] )
    
    ## To change the x-axis ticks, get the list of all x-entries and sort
    locs=(list(df2.boxpos.unique())+list(df1.linepos.unique()))
    locs.sort()
    ax.set_xticks(locs)
    
    ## To add labels get unique dates, sort them, convert to format you like and plot
    ax.set_xticklabels(pd.concat([df1.date, pd.Series(df2.date.unique())]).sort_values().reset_index(drop=True).dt.strftime('%Y-%m-%d'), rotation=45 )
    
    ## Set x and y labels
    ax.set_xlabel('Dates')
    ax.set_ylabel('Discharge')
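
As a possible alternative to the day-offset columns, here is a minimal sketch that keeps a real date axis by converting the dates to Matplotlib date numbers (floats counted in days) with matplotlib.dates.date2num and passing them as the box positions. This is not the method used above, just another way to get both plots onto one numeric axis. The names grouped, positions and line_x are my own, the data is the same sample data trimmed to the columns the plot needs, and manage_ticks assumes Matplotlib 3.1 or newer.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as above, trimmed to the columns the plot needs
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2023-04-01', '2023-03-01', '2023-02-01',
                            '2023-01-01', '2022-12-01']),
    'discharge': [10, 20, 30, 15, 25],
}).sort_values('date')

dates = pd.date_range(start='2023-05-01', end='2023-09-01', freq='MS')
df2 = pd.DataFrame({
    'date': dates.repeat(10),
    'discharge': [i + 1 for i in range(len(dates))] * 10,
})

# One list of discharge values per date, in date order (hypothetical name)
grouped = df2.groupby('date')['discharge'].apply(list)

# Convert the dates to Matplotlib date numbers so both plots share one numeric axis
positions = mdates.date2num(grouped.index.to_numpy())
line_x = mdates.date2num(df1['date'].to_numpy())

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(line_x, df1['discharge'].to_numpy(), label='Discharge')

# widths is in days here; manage_ticks=False stops boxplot from replacing the ticks
ax.boxplot(grouped.tolist(), positions=positions, widths=10, manage_ticks=False)

# Show the numeric axis as dates, one tick per month
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.autofmt_xdate(rotation=45)

ax.set_xlabel('Dates')
ax.set_ylabel('Discharge')
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to see the result. Because the positions are in days, widths=10 makes each box about ten days wide, and the tick locations come from the month locator instead of being set by hand.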
    
