I have some hierarchical data from 2003 to 2011 which bottoms out into time series data which looks something like this:
polar_temp
Station_Number Date Value
417 CA002100805 20030101 -296
423 CA002202570 20030101 -269
425 CA002203058 20030101 -268
427 CA002300551 20030101 -23
428 CA002300902 20030101 -200
I set a multi index on Station_Number and Date:
polar_temp['Date'] = pd.to_datetime(polar_temp['Date'],
format='%Y%m%d')#.dt.strftime("%Y-%m-%d")
polar_temp = polar_temp.set_index(['Station_Number', "Date"])
Value
Station_Number Date
CA002100805 2003-01-01 -296
CA002202570 2003-01-01 -269
CA002203058 2003-01-01 -268
CA002300551 2003-01-01 -23
CA002300902 2003-01-01 -200
Now I would like to perform a resampling of the data by calculating the mean of Value for every 8 days by using:
polar_temp8d = polar_temp.groupby([pd.Grouper(level='Station_Number'),
pd.Grouper(level='Date', freq='8D')]).mean()
Value
Station_Number Date
CA002100805 2003-01-01 -300.285714
2003-01-09 -328.750000
2003-01-17 -325.500000
2003-01-25 -385.833333
2003-02-02 -194.428571
... ...
USW00027515 2005-06-23 76.625000
2005-07-01 42.375000
2005-07-09 94.500000
2005-07-17 66.500000
2005-07-25 56.285714
So the problem here is that pandas only resamples the years from 2003 until 2005, so the years from 2006 to 2011 are completely left out. Now my question is: Did I use the Grouper function to analyse time series data correctly or is the anything else I have missed?
Edit 1:
By running:
print(polar_temp.loc['CA002300902'].sort_index(ascending=False))
Value
Date
2011-12-31 -288
2011-12-30 -299
2011-12-29 -347
2011-12-28 -310
2011-12-27 -239
One can see that the stations before the resampling have data until 2011.
I have created synthetic data to test your approach and it worked fine. I then arbitrarily removed data points to see if the aggregation would fail with missing dates and it skips missing values from the time series, as displayed on the output immediately below. Therefore, I still don't understand why your output stops in 2005.
Output without resampling and interpolation:
Value
Station_Number Date
CA002100805 2003-01-02 -195.545455
2003-01-10 -144.963636
2003-01-18 -158.045455
2003-01-26 -151.533333
2003-02-03 -196.300000
2003-04-08 -159.963636
2003-04-16 -157.115385
2003-04-24 -150.191489
2003-05-02 -146.113924
2003-05-10 -133.367347
Notice how it skips data points in March 2003 altogether.
You might sort your problem by:
1. Adding missing dates to the DataFrame
2. Filling NAs with interpolate()
import pandas as pd
import numpy as np
# Sets random seed
np.random.seed(42)
# Sample size
size=10**5
station_numbers = ['CA002100805', 'CA002202570', 'CA002203058', 'CA002300551',
'CA002300902']
stations = [station_numbers[i] for i in
np.random.randint(low=0, high=len(station_numbers), size=size)]
values = np.random.randint(low=-400, high=100, size=size)
dates_list = pd.date_range(start='2003-01-01', end='2011-12-31')
###################################
#### TESTS with missing dates #####
###################################
# Removes dates from dates_list to test
percent_to_remove = 1/3
items_to_remove = len(dates_list) * percent_to_remove
# Index of items to remove
rem_idx = set()
while len(rem_idx) < items_to_remove:
# Thanks to Jon Kiparsky's answer on this thread
# https://stackoverflow.com/questions/28037158/how-to-not-repeat-randint-value
rem_idx.add(np.random.randint(0, len(dates_list)))
dates_list = dates_list.delete(list(rem_idx))
# Arbitratily removes dates in sequence to test
dates_list = dates_list.delete(range(20, 60))
###################################
###################################
dates = [dates_list[i] for i in
np.random.randint(low=0, high=len(dates_list), size=size)]
# Creates DataFrame
data = (pd.DataFrame({'Station_Number': stations,
'Date': dates,
'Value': values})
.set_index('Date')
.sort_index())
# Creates one row per day
data = data.groupby('Station_Number').resample('D').mean()
# Fills NAs with standard interpolation strategy
data = data.interpolate()
# Calculates 8-day mean value
eight_day_mean = data.groupby([pd.Grouper(level='Station_Number'),
pd.Grouper(level='Date', freq='8D')]).mean()
Output with resampling and interpolation:
Value
Station_Number Date
CA002100805 2003-01-02 -178.138024
2003-01-10 -135.644524
2003-01-18 -147.253977
2003-01-26 -147.694712
2003-02-03 -200.642180
2003-02-11 -203.057708
2003-02-19 -192.821042
2003-02-27 -182.584375
2003-03-07 -172.347708
2003-03-15 -162.111042
2003-03-23 -151.874375
2003-03-31 -141.637708
2003-04-08 -154.028469
2003-04-16 -151.099405
2003-04-24 -156.152083
Now notice how it contains data points for March 2003 that are somewhere in between the values from February and April 2003, due to the interpolation strategy adopted.