Search code examples
pythonpandastime-series

Pandas Time Series: Count weekdays with min value in annual data


this is my first question on Stackoverflow and I hope I describe my problem detailed enough. I'm starting to learn data analysis with Pandas and I've created a time series with daily data for gas prices of a certain station. I've already grouped the hourly data into daily data.

I've been successfull with a simple scatter plot over the year with plotly but in the next step I would like to analyze which weekday is the cheapest or most expensive in every week, count the daynames and then look if there is a pattern over the whole year.

            count      mean       std    min    25%    50%     75%    max  \
2022-01-01   35.0  1.685000  0.029124  1.649  1.659  1.689  1.6990  1.749   
2022-01-02   27.0  1.673444  0.024547  1.649  1.649  1.669  1.6890  1.729   
2022-01-03   28.0  1.664000  0.040597  1.599  1.639  1.654  1.6890  1.789   
2022-01-04   31.0  1.635129  0.045069  1.599  1.599  1.619  1.6490  1.779   
2022-01-05   33.0  1.658697  0.048637  1.599  1.619  1.649  1.6990  1.769   
2022-01-06   35.0  1.658429  0.050756  1.599  1.619  1.639  1.6940  1.779   
2022-01-07   30.0  1.637333  0.039136  1.599  1.609  1.629  1.6565  1.759   
2022-01-08   41.0  1.655829  0.041740  1.619  1.619  1.639  1.6790  1.769   
2022-01-09   35.0  1.647857  0.031602  1.619  1.619  1.639  1.6590  1.769   
2022-01-10   31.0  1.634806  0.041374  1.599  1.609  1.619  1.6490  1.769 
...  

            week    weekday  
2022-01-01    52   Saturday  
2022-01-02    52     Sunday  
2022-01-03     1     Monday  
2022-01-04     1    Tuesday  
2022-01-05     1  Wednesday  
2022-01-06     1   Thursday  
2022-01-07     1     Friday  
2022-01-08     1   Saturday  
2022-01-09     1     Sunday  
2022-01-10     2     Monday
...

I tried with grouping and resampling but unfortunately I didn't get the result I was hoping for.

Can someone suggest a way how to deal with this problem? Thanks!


Solution

  • Here's a way to do what I believe your question asks:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({
    'count':[35,27,28,31,33,35,30,41,35,31]*40,
    'mean':
    [1.685,1.673444,1.664,1.635129,1.658697,1.658429,1.637333,1.655829,1.647857,1.634806]*40
    },
    index=pd.Series(pd.to_datetime(pd.date_range("2022-01-01", periods=400, freq="D"))))
    print( '','input df:',df,sep='\n' )
    
    df_date = df.reset_index()['index']
    df['weekday'] = list(df_date.dt.day_name())
    df['year'] = df_date.dt.year.to_numpy()
    df['week'] = df_date.dt.isocalendar().week.to_numpy()
    df['year_week_started'] = df.year - np.where((df.week>=52)&(df.week.shift(-7)==1),1,0)
    print( '','input df with intermediate columns:',df,sep='\n' )
    
    cols = ['year_week_started', 'week']
    dfCheap = df.loc[df.groupby(cols)['mean'].idxmin(),:].set_index(cols)
    dfCheap = ( dfCheap.groupby(['year_week_started', 'weekday'])['mean'].count()
        .rename('freq').to_frame().set_index('freq', append=True)
        .reset_index(level='weekday').sort_index(ascending=[True,False]) )
    print( '','dfCheap:',dfCheap,sep='\n' )
    
    dfExpensive = df.loc[df.groupby(cols)['mean'].idxmax(),:].set_index(cols)
    dfExpensive = ( dfExpensive.groupby(['year_week_started', 'weekday'])['mean'].count()
        .rename('freq').to_frame().set_index('freq', append=True)
        .reset_index(level='weekday').sort_index(ascending=[True,False]) )
    print( '','dfExpensive:',dfExpensive,sep='\n' )
    

    Sample input:

    input df:
                count      mean
    2022-01-01     35  1.685000
    2022-01-02     27  1.673444
    2022-01-03     28  1.664000
    2022-01-04     31  1.635129
    2022-01-05     33  1.658697
    ...           ...       ...
    2023-01-31     35  1.658429
    2023-02-01     30  1.637333
    2023-02-02     41  1.655829
    2023-02-03     35  1.647857
    2023-02-04     31  1.634806
    
    [400 rows x 2 columns]
    
    input df with intermediate columns:
                count      mean    weekday  year week  year_week_started
    2022-01-01     35  1.685000   Saturday  2022   52               2021
    2022-01-02     27  1.673444     Sunday  2022   52               2021
    2022-01-03     28  1.664000     Monday  2022    1               2022
    2022-01-04     31  1.635129    Tuesday  2022    1               2022
    2022-01-05     33  1.658697  Wednesday  2022    1               2022
    ...           ...       ...        ...   ...  ...                ...
    2023-01-31     35  1.658429    Tuesday  2023    5               2023
    2023-02-01     30  1.637333  Wednesday  2023    5               2023
    2023-02-02     41  1.655829   Thursday  2023    5               2023
    2023-02-03     35  1.647857     Friday  2023    5               2023
    2023-02-04     31  1.634806   Saturday  2023    5               2023
    
    [400 rows x 6 columns]
    

    Sample output:

    dfCheap:
                              weekday
    year_week_started freq
    2021              1        Monday
    2022              11      Tuesday
                      10     Thursday
                      10    Wednesday
                      6        Sunday
                      5        Friday
                      5        Monday
                      5      Saturday
    2023              2      Thursday
                      1      Saturday
                      1        Sunday
                      1     Wednesday
    
    dfExpensive:
                              weekday
    year_week_started freq
    2021              1      Saturday
    2022              16       Monday
                      10      Tuesday
                      6        Sunday
                      5        Friday
                      5      Saturday
                      5      Thursday
                      5     Wednesday
    2023              2        Monday
                      1        Friday
                      1      Thursday
                      1       Tuesday