Tags: python, pandas, data-science

How can I split my dataframe by year or month?


I have a dataframe containing a time series with hourly data from 2015 to 2020. I want to create a new dataframe with one column holding the time-series values for each year (or for each month of each year) so I can analyze them separately. Since the range contains one leap year, I want the columns to share an index but hold NaN at the 29 Feb positions in the non-leap years. I tried a merge after creating two new columns, month and day_of_month, but the index blows up to millions of entries instead of the ~40,000 it should have; in the end it takes more than 20 GB of RAM and crashes:

import pandas as pd

# hourly skeleton over a leap year, so 29 Feb has a slot
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day

# data_md is the original frame shown below
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, left_on=['month', 'day_of_month'],
                        right_on=['month', 'day_of_month'])


RESULT:

    month   day_of_month    value
0   1   1   0
1   1   1   6
2   1   1   2
3   1   1   0
4   1   1   1
...     ...     ...     ...
210259  12  31  6
210260  12  31  2
210261  12  31  4
210262  12  31  5
210263  12  31  1

How can I construct the new frame so that it has one value column per year, or per month of each year?

Here is the original frame from which I want to create the new one; the only column needed for now is value:

    value    month  day_of_month    week    day_name    year    hour    season  dailyp  day_of_week     ...     hourly_no_noise     daily_trend     daily_seasonal  daily_residuals     daily_no_noise  daily_trend_h   daily_seasonal_h    daily_residuals_h   daily_no_noise_h    Total
date                                                                                    
2015-01-01 00:00:00     0   1   1   1   Thursday    2015    0   Invierno    165.0   3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -3.456929   NaN     NaN     6436996.0
2015-01-01 01:00:00     6   1   1   1   Thursday    2015    1   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -4.879983   NaN     NaN     NaN
2015-01-01 02:00:00     2   1   1   1   Thursday    2015    2   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -5.895367   NaN     NaN     NaN
2015-01-01 03:00:00     0   1   1   1   Thursday    2015    3   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -6.468616   NaN     NaN     NaN
2015-01-01 04:00:00     1   1   1   1   Thursday    2015    4   Invierno    NaN     3   ...     NaN     NaN     -9.053524   NaN     NaN     NaN     -6.441830   NaN     NaN     NaN
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2019-12-31 19:00:00     6   12  31  1   Tuesday     2019    19  Invierno    NaN     1   ...     11.529465   230.571429  -4.997480   -11.299166  237.299166  9.613095    2.805720    1.176491    17.823509   NaN
2019-12-31 20:00:00     3   12  31  1   Tuesday     2019    20  Invierno    NaN     1   ...     11.314857   230.571429  -4.997480   -11.299166  237.299166  9.613095    2.928751    1.176491    17.823509   NaN
2019-12-31 21:00:00     3   12  31  1   Tuesday     2019    21  Invierno    NaN     1   ...     10.141139   230.571429  -4.997480   -11.299166  237.299166  9.613095    1.774848    1.176491    17.823509   NaN
2019-12-31 22:00:00     3   12  31  1   Tuesday     2019    22  Invierno    NaN     1   ...     8.823152    230.571429  -4.997480   -11.299166  237.299166  9.613095    0.663344    1.176491    17.823509   NaN
2019-12-31 23:00:00     6   12  31  1   Tuesday     2019    23  Invierno    NaN     1   ...     6.884636    230.571429  -4.997480   -11.299166  237.299166  9.613095    -1.624980   1.176491    17.823509   NaN

I would like to end up with a dataframe like this:

                         2015      2016      2017      2018      2019
2016-01-01 00:00:00  0.074053  0.218161  0.606810  0.687365  0.352672
2016-01-01 01:00:00  0.465167  0.210297  0.722825  0.683341  0.885175
2016-01-01 02:00:00  0.175964  0.610560  0.722479  0.016842  0.205916
2016-01-01 03:00:00  0.945955  0.807490  0.627525  0.187677  0.535116
2016-01-01 04:00:00  0.757608  0.797835  0.639215  0.455989  0.042285
...                       ...       ...       ...       ...       ...
2016-12-30 20:00:00  0.046138  0.139100  0.397547  0.738687  0.335306
2016-12-30 21:00:00  0.672800  0.802090  0.617625  0.787601  0.007535
2016-12-30 22:00:00  0.698141  0.776686  0.423712  0.667808  0.298338
2016-12-30 23:00:00  0.198089  0.642073  0.586527  0.106567  0.514569
2016-12-31 00:00:00  0.367572  0.390791  0.105193  0.592167  0.007365

where 29 Feb is NaN on non-leap years:

df['2016-02']
                         2015      2016      2017      2018      2019
2016-02-01 00:00:00  0.656703  0.348784  0.383639  0.208786  0.183642
2016-02-01 01:00:00  0.488729  0.909498  0.873642  0.122028  0.547563
2016-02-01 02:00:00  0.210427  0.912393  0.505873  0.085149  0.358841
2016-02-01 03:00:00  0.281107  0.534750  0.622473  0.643611  0.258437
2016-02-01 04:00:00  0.187434  0.327459  0.701008  0.887041  0.385816
...                       ...       ...       ...       ...       ...
2016-02-29 19:00:00       NaN  0.742402       NaN       NaN       NaN
2016-02-29 20:00:00       NaN  0.013419       NaN       NaN       NaN
2016-02-29 21:00:00       NaN  0.517194       NaN       NaN       NaN
2016-02-29 22:00:00       NaN  0.003136       NaN       NaN       NaN
2016-02-29 23:00:00       NaN  0.128406       NaN       NaN       NaN

Solution

  • IIUC, you just need the original DataFrame:

    origin = 2016  # or whatever year of your choosing (a leap year, so 29 Feb exists)
    # re-stamp every timestamp onto the origin year, keeping month/day/time
    newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
    newdf = (
        df[['value']]
        .assign(year=df.index.year)      # remember each row's real year
        .set_axis(newidx, axis=0)        # share the re-stamped index
        .pivot(columns='year', values='value')  # one column per year
    )
    

    Using the small sample data you provided for that "original frame" df, we get:

    >>> newdf
    year                 2015  2019
    date                           
    2016-01-01 00:00:00   0.0   NaN
    2016-01-01 01:00:00   6.0   NaN
    2016-01-01 02:00:00   2.0   NaN
    ...                   ...   ...
    2016-12-31 21:00:00   NaN   3.0
    2016-12-31 22:00:00   NaN   3.0
    2016-12-31 23:00:00   NaN   6.0
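
    One caveat: pivot requires each re-stamped timestamp to be unique within a year. If your index ever contains duplicate timestamps (a DST fall-back hour, say), pivot raises ValueError; a minimal sketch of a fallback, assuming it is acceptable to average duplicates, is pivot_table:

    tmp = df[['value']].assign(year=df.index.year).set_axis(newidx, axis=0)
    newdf = tmp.pivot_table(index=tmp.index, columns='year',
                            values='value', aggfunc='mean')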
    

    On a larger (made-up) DataFrame:

    import numpy as np
    import pandas as pd

    np.random.seed(0)
    
    ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
    df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
    
    # (code above)
    
    >>> newdf
    year                 2015  2016  2017  2018  2019
    2016-01-01 00:00:00  44.0  82.0  96.0  68.0  71.0
    2016-01-01 01:00:00  47.0  99.0  54.0  44.0  71.0
    2016-01-01 02:00:00  64.0  28.0  11.0  10.0  55.0
    ...                   ...   ...   ...   ...   ...
    2016-12-31 21:00:00   0.0  30.0  28.0  53.0  14.0
    2016-12-31 22:00:00  47.0  82.0  19.0   6.0  64.0
    2016-12-31 23:00:00  22.0  75.0  13.0  37.0  35.0
    

    and, as expected, only 2016 has values for 02/29:

    >>> newdf[:'2016-02-29 02:00:00'].tail()
    year                 2015  2016  2017  2018  2019
    2016-02-28 22:00:00  74.0  54.0  22.0  17.0  39.0
    2016-02-28 23:00:00  37.0  61.0  31.0   8.0  62.0
    2016-02-29 00:00:00   NaN  34.0   NaN   NaN   NaN
    2016-02-29 01:00:00   NaN  82.0   NaN   NaN   NaN
    2016-02-29 02:00:00   NaN  67.0   NaN   NaN   NaN
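
    As an aside, this re-stamping also sidesteps why the merge in the question exploded: month/day_of_month pairs are not unique keys, so every hour of a given day on the left matches every hour of that day on the right, and each merge in the loop is a many-to-many join that multiplies the row count. A tiny sketch of the effect, on hypothetical toy frames:

    left = pd.DataFrame({'month': 1, 'day_of_month': 1}, index=range(24))  # 24 hourly rows for 1 Jan
    right = left.assign(value=range(24))
    print(len(left.merge(right, on=['month', 'day_of_month'])))            # 576 rows, not 24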
    

    Addendum: by months

    The code above can easily be adapted for month columns:

    Either using MultiIndex columns:

    origin = 2016
    # hard-coded month '01': every day maps into January of the origin year,
    # so all months share the same day/hour index positions
    newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
    newdf = (
        df[['value']]
        .assign(year=df.index.year, month=df.index.month)
        .set_axis(newidx, axis=0)
        .pivot(columns=['year', 'month'], values='value')
    )
    >>> newdf
    year                 2015                                                        ...  2019                                                      
    month                  1     2     3     4     5     6     7     8     9     10  ...    3     4     5     6     7     8     9     10    11    12
    2016-01-01 00:00:00  44.0  49.0  40.0  60.0  71.0  67.0  63.0  16.0  71.0  78.0  ...  32.0  35.0  51.0  35.0  68.0  43.0   4.0  23.0  65.0  19.0
    2016-01-01 01:00:00  47.0  71.0  27.0  88.0  68.0  58.0  74.0  67.0  98.0  49.0  ...  85.0  27.0  70.0   8.0   9.0  29.0  78.0  29.0  21.0  68.0
    2016-01-01 02:00:00  64.0  90.0   4.0  61.0  95.0   3.0  57.0  41.0  28.0  24.0  ...   7.0  93.0  21.0  10.0  72.0  79.0  46.0  45.0  25.0  99.0
    ...                   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
    2016-01-31 21:00:00  48.0   NaN  24.0   NaN  79.0   NaN  55.0  47.0   NaN  20.0  ...  87.0   NaN  19.0   NaN  56.0  76.0   NaN  91.0   NaN  14.0
    2016-01-31 22:00:00  82.0   NaN   6.0   NaN  46.0   NaN   9.0  57.0   NaN  21.0  ...  69.0   NaN  67.0   NaN  85.0  38.0   NaN  34.0   NaN  64.0
    2016-01-31 23:00:00  51.0   NaN  97.0   NaN  45.0   NaN  55.0  41.0   NaN  87.0  ...  94.0   NaN  80.0   NaN  37.0  81.0   NaN  98.0   NaN  35.0
    

    or a simple string column made of %Y-%m to indicate year/month:

    origin = 2016
    newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
    newdf = (
        df[['value']]
        .assign(ym=df.index.strftime('%Y-%m'))   # plain format string, no f-string needed
        .set_axis(newidx, axis=0)
        .pivot(columns='ym', values='value')
    )
    >>> newdf
    ym                   2015-01  2015-02  2015-03  2015-04  2015-05  2015-06  2015-07  2015-08  2015-09  2015-10  ...  2019-03  2019-04  2019-05  2019-06  2019-07  2019-08  2019-09  \
    2016-01-01 00:00:00     44.0     49.0     40.0     60.0     71.0     67.0     63.0     16.0     71.0     78.0  ...     32.0     35.0     51.0     35.0     68.0     43.0      4.0   
    2016-01-01 01:00:00     47.0     71.0     27.0     88.0     68.0     58.0     74.0     67.0     98.0     49.0  ...     85.0     27.0     70.0      8.0      9.0     29.0     78.0   
    2016-01-01 02:00:00     64.0     90.0      4.0     61.0     95.0      3.0     57.0     41.0     28.0     24.0  ...      7.0     93.0     21.0     10.0     72.0     79.0     46.0   
    ...                      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...  ...      ...      ...      ...      ...      ...      ...      ...   
    2016-01-31 21:00:00     48.0      NaN     24.0      NaN     79.0      NaN     55.0     47.0      NaN     20.0  ...     87.0      NaN     19.0      NaN     56.0     76.0      NaN   
    2016-01-31 22:00:00     82.0      NaN      6.0      NaN     46.0      NaN      9.0     57.0      NaN     21.0  ...     69.0      NaN     67.0      NaN     85.0     38.0      NaN   
    2016-01-31 23:00:00     51.0      NaN     97.0      NaN     45.0      NaN     55.0     41.0      NaN     87.0  ...     94.0      NaN     80.0      NaN     37.0     81.0      NaN   
    
    ym                   2019-10  2019-11  2019-12  
    2016-01-01 00:00:00     23.0     65.0     19.0  
    2016-01-01 01:00:00     29.0     21.0     68.0  
    2016-01-01 02:00:00     45.0     25.0     99.0  
    ...                      ...      ...      ...  
    2016-01-31 21:00:00     91.0      NaN     14.0  
    2016-01-31 22:00:00     34.0      NaN     64.0  
    2016-01-31 23:00:00     98.0      NaN     35.0  
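
    With the string columns, a quick way to grab "all February months" is filter on the labels; since the labels are exactly %Y-%m, like='-02' matches only the five February columns:

    >>> newdf.filter(like='-02')   # columns 2015-02, 2016-02, ..., 2019-02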
    

    The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":

    >>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
    year                 2015  2016  2017  2018  2019
    month                   2     2     2     2     2
    2016-01-28 22:00:00  74.0  54.0  22.0  17.0  39.0
    2016-01-28 23:00:00  37.0  61.0  31.0   8.0  62.0
    2016-01-29 00:00:00   NaN  34.0   NaN   NaN   NaN
    2016-01-29 01:00:00   NaN  82.0   NaN   NaN   NaN
    2016-01-29 02:00:00   NaN  67.0   NaN   NaN   NaN
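
    Conversely, a whole year is a single top-level selection on the MultiIndex columns, and a (year, month) pair chains naturally:

    >>> newdf[2015]      # all twelve month columns of 2015
    >>> newdf[2015][2]   # just February 2015, as a Series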