I have the following dataframe with timeseries data:
import pandas as pd
import numpy as np
test_dict = {'time': ['2018-01-01 00:00:00+00:00', '2018-01-01 00:15:00+00:00', '2018-01-01 00:30:00+00:00', '2018-01-01 00:45:00+00:00',
'2018-01-02 00:00:00+00:00', '2018-01-02 00:15:00+00:00', '2018-01-02 00:30:00+00:00', '2018-01-02 00:45:00+00:00',
'2018-01-01 00:00:00+00:00', '2018-01-01 00:15:00+00:00', '2018-01-01 00:30:00+00:00', '2018-01-01 00:45:00+00:00'],
'id_day': [1,1,1,1,2,2,2,2,1,1,1,1],
'name':['sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor2', 'sensor2', 'sensor2', 'sensor2'],
'id_unique': [1,1,1,1,2,2,2,2,3,3,3,3],
'class': ['type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type2', 'type2', 'type2', 'type2'],
'value': np.random.randint(5,25, size = 12)}
testframe = pd.DataFrame(test_dict)
testframe['time'] = pd.to_datetime(testframe['time'])
As you can see, there are several time series underneath each other. Each time series has a time with 15 min sapling rate, an id_day, the name of the sensor, a id_unique (is unique to each time series to bee able to separate them), a class and some values for each timestamp. In the real dataset, each time series has a length of 24h = 96*15 minutes. I hope the situation is clear, if not just ask for it and I will provide further explanations.
Now I want to change the sampling rate to, lets say, 10 minutes, 1h, etc. and use mean values for the column value
. The goal is to "delete" all the "obsolete" rows after calculating the mean, so I end up with the same table (but shorter). I have a hard time explaining this better right now, so let me know if anything remains unclear.
EDIT: Expected result:
result_dict = {'time': ['2018-01-01 00:15:00+00:00',
'2018-01-02 00:15:00+00:00',
'2018-01-01 00:15:00+00:00'],
'id_day': [1,2,1],
'name':['sensor1', 'sensor1', 'sensor2'],
'id_unique': [1,2,3],
'class': ['type1', 'type1', 'type2'],
'value': np.random.randint(5,25, size = 3)}
result = pd.DataFrame(result_dict)
result['time'] = pd.to_datetime(result['time'])
Where the value
is the mean value. In this case, I resampled the testframe from 5 minutes to 15 minutes.
EDIT_END
What I tried was:
testframe.set_index('time').groupby(pd.Grouper(freq = '5Min')).mean()
and some other groupby attempts, but they all lead to result, that ignores columns with strings in them.
What would be the correct approach/code to groupby "time", calculate mean of "value" and adjust the rest of the columns acordingly?
Thanks in advance for any hint or ideally a code!
Finally, I did it the hard way and looped through the data. This approach takes a while, but i couldn't find a better solution:
df_resampled_dict = {'time' : list(), 'id_day' : list(), 'name' : list(), 'id_unique' : list(), 'classes' : list(), 'value' : list()}
for id_uni in df['id_unique'].unique():
df_temp = df.loc[df['id_unique'] == id_uni]
name = df_temp['name'].unique()
klasse = df_temp['classes'].unique()
id_day = df_temp['id_day'].unique()
df_temp = df_temp.groupby(pd.Grouper(freq = sampling_rate)).mean()
df_temp = df_temp.reset_index()
df_resampled_dict['value'].extend(df_temp['value'])
df_resampled_dict['classes'].extend([klasse[0]] * len(df_temp['value']))
df_resampled_dict['time'].extend(df_temp['time'])
df_resampled_dict['id_day'].extend([id_day[0]] * len(df_temp['value']))
df_resampled_dict['name'].extend([name[0]] * len(df_temp['value']))
df_resampled_dict['id_unique'].extend([id_uni] * len(df_temp['value']))
df_resampled = pd.DataFrame(df_resampled_dict)
I someone stumbles across this and can offer a faster solution, I am happy to learn!