I have a dataframe that contains a time series with hourly data form 2015 to 2020. I want to create a new dataframe that has a column with the values of the time series for each year or for each month of each year to perform a separate analysis. As I have 1 leap year, I want them to share index but have a NaN value at that position (29 Feb) on the years that are not leap. I tried using merge creating two new columns called month and day_of_month but index gets crazy and ends up having millions of entries instead of the ~40.000 it should have, and in the end it ends up with a more than 20GB file on RAM and breaks:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
How can I get the frame constructed having one value column for each single year or month?
Here I leave the original frame from which I want to create the new one, only needed column by now is value
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
IIUC, you just need the original DataFrame:
origin = 2016 # or whatever year of your chosing
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(year=df.index.year)
.set_axis(newidx, axis=0)
.pivot(columns='year', values='value')
)
Using the small sample data you provided for that "original frame" df
, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
On a larger (made-up) DataFrame:
np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
The code above can easily be adapted for month columns:
Either using MultiIndex columns:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(year=df.index.year, month=df.index.month)
.set_axis(newidx, axis=0)
.pivot(columns=['year', 'month'], values='value')
)
>>> newdf
year 2015 ... 2019
month 1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0 23.0 65.0 19.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0 29.0 21.0 68.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0 45.0 25.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN 91.0 NaN 14.0
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN 34.0 NaN 64.0
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN 98.0 NaN 35.0
or a simple string column made of %Y-%m
to indicate year/month:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(ym=df.index.strftime(f'%Y-%m'))
.set_axis(newidx, axis=0)
.pivot(columns='ym', values='value')
)
>>> newdf
ym 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 ... 2019-03 2019-04 2019-05 2019-06 2019-07 2019-08 2019-09 \
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN
ym 2019-10 2019-11 2019-12
2016-01-01 00:00:00 23.0 65.0 19.0
2016-01-01 01:00:00 29.0 21.0 68.0
2016-01-01 02:00:00 45.0 25.0 99.0
... ... ... ...
2016-01-31 21:00:00 91.0 NaN 14.0
2016-01-31 22:00:00 34.0 NaN 64.0
2016-01-31 23:00:00 98.0 NaN 35.0
The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":
>>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
year 2015 2016 2017 2018 2019
month 2 2 2 2 2
2016-01-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-01-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-01-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-01-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-01-29 02:00:00 NaN 67.0 NaN NaN NaN