Search code examples
pythonpandastime-seriesplotlyline-plot

Create Specific 365 Day Years to Compare Data


I have what seems like it should be a trivial exercise but I can't seem to crack it. I have the following dataframe:

import pandas as pd
import numpy as np

np.random.seed(5)
size = 1000
d = {'Farmer':np.random.choice( ['Bob','Suyzn','John'], size),
        'Commodity': np.random.choice( ['Corn','Wheat','Soy'], size),
        'Date':np.random.choice( pd.date_range('1/1/2018','12/12/2022', freq='D'), size),
        'Bushels': np.random.randint(20,100, size=(size))
        }
df = pd.DataFrame(d)

I use pandas to look at the different farmers & the amount of the different commodities they deliver. I want the ability to look at custom years where I can provide the start_month & start_day to create a new column called 'Interval' that shows the data based on what I choose. This is the function that works for that:

def calculate_intervals(df, start_month, start_day):
    """Adds column called Interval to df. It 

    Args:
        df (dataframe): Dataframe to pass that's been filtered to what you want.
        start_month (int): Month to start.
        start_day (int): Day to start.

    Returns:
        dataframe: Dataframe with 'Interval' added
    """    
    # Filter the DataFrame
    filtered_df = df
    filtered_df['Year'] = filtered_df['Date'].dt.year
    
    # Define the date ranges for each interval
    intervals = []
    for year in filtered_df['Year'].unique():
        start_date = pd.to_datetime(f"{year}-{start_month}-{start_day}")
        end_date = pd.to_datetime(f"{year+1}-{start_month}-{start_day}") - pd.DateOffset(days=1)
        intervals.append((start_date, end_date))
    
    # Create a function to assign the interval label to each row
    def assign_interval(row):
        for c, (start_date, end_date) in enumerate(intervals):
            if start_date <= row['Date'] <= end_date:
                return f"{start_date.date()} - {end_date.date()}"
        return None  # Return None if the row's date is not within any interval
    
    # Apply the interval assignment function to each row
    filtered_df['Interval'] = filtered_df.apply(assign_interval, axis=1)
    
    return filtered_df

calculate_intervals(df, start_month=7, start_day=6)

My goal is to take it one step further. I want to create a graph that takes the start_month & start_day & uses that as position 0 on the y-axis. The x-axis would show the next 365 days all the way out to the end. In this example, the x-axis would start at July 6 & go through July 5th of the following year. The y-axis would be the cumulative sum of bushels sold starting on July 6 through July 5th of the next year. Each custom year would be it's own line on the graph.

I have done similar exercises to graph cumulative sum starting on Jan 1 - Dec 31 of the same year, but can't figure out how to do it with a custom start date. My goal is to get a graph that looks something like this: enter image description here

Any suggestions? Thanks in advance!


Solution

  • The first thing would be to order your DataFrame by Interval and Date, then calculate the cumulative units sold using groupby and cumsum.

    However, the main challenge is that in your desired plot, the tickmarks on the axes are cyclical, and cyclical xaxis coordinates aren't easily interpreted by Plotly. In the sample data you've provided, we want the axis to start at week 26, then loop back around to week 0 after week 52. If we convert the format of the tickmarks to strings, we may end up with multiple tickmarks with the same string for a particular week, and this will cause problems.

    The best solution I could come up with is create a new column in your data that "adjusts" the day of the year by mapping it from cyclical days of the year to linear days of the year.

    For example, if you begin your cumulative counts of units sold on day 187 of the year, then the x values of your plot in terms of the day of the year would look like [187, 188, 189, ... 365, 0, 1, 2, ... 186], and we want to map this to [0, 1, 2, ... 365-187, 365-187+0, 365-187+1, ... 186].

    Then if we plot cumulative units sold against adjusted day of the year, we will get a monotonically increasing plot as desired – the only issue is that the tickmarks are in days, and that they don't reflect the actual day of the year.

    enter image description here

    To get around this, we use plotly's ticktext argument that allows you to overlay text for certain tickmarks, and since you want a tickmark for every 4 weeks, we can use floor division to convert the actual days of the year [187, 188, 189, ... 365, 0, 1, 2, ... 186, 187] to actual weeks of the year [26, 26, ... 52, 52, 0, 0, ... 26, 26], and then use a list comprehension to convert that to an array that contains empty strings except for every 28 values (4 weeks).

    enter image description here

    df_intervals = calculate_intervals(df, start_month=7, start_day=6)
    df_intervals = df_intervals.sort_values(by=['Interval','Date']).reset_index(drop=True)
    df_intervals['Cumulative_Units_Sold'] = df_intervals.groupby('Interval')['Bushels'].cumsum()
    
    df_intervals['Day'] = df_intervals['Date'].dt.dayofyear
    
    ## the starting date for each interval is slightly different in your sample data
    ## because you chose dates randomly, some intervals have a start date on exactly 07-06
    ## while other intervals have a start date on another nearby date such as 07-07
    ## to mitigate this, i will take the week of the year that is most common among all intervals
    day_of_year_start = df_intervals.groupby('Interval').first()['Day'].min()
    
    df_intervals['Adjusted_Day'] = df_intervals['Day'].apply(
        lambda x: x-day_of_year_start if x>=day_of_year_start else 365+x-day_of_year_start
    )
    
    fig = px.line(
        df_intervals,
        x='Adjusted_Day',
        y='Cumulative_Units_Sold',
        color='Interval'
    )
    
    adjusted_day_of_year = np.arange(0,365+1)
    
    actual_day_of_year = np.concatenate([np.arange(day_of_year_start, 365+1),np.arange(0,day_of_year_start)])
    actual_week_of_year = actual_day_of_year // 7
    # week_zero_location = np.where(actual_week_of_year == 0)[0][0] % 28
    
    # ## i % 28 ensures we only place a tick once every 4 weeks
    tick_week = [w if i % 28 == 0 else '' for i,w in enumerate(actual_week_of_year)]
    
    fig.update_layout(
        xaxis=dict(
            tickmode='array', 
            tickvals=adjusted_day_of_year,
            ticktext=tick_week,
            title='Week of the Year'
        ),
        yaxis=dict(
            title='Units Sold'
        )
    )
    
    fig.show()