Tags: pandas, datetime, time-series, pandas-groupby, resampling

Resampling a collection of time series with numbers and strings


I have the following DataFrame with time series data:

import pandas as pd
import numpy as np
test_dict = {'time': ['2018-01-01 00:00:00+00:00', '2018-01-01 00:15:00+00:00', '2018-01-01 00:30:00+00:00', '2018-01-01 00:45:00+00:00',
                     '2018-01-02 00:00:00+00:00', '2018-01-02 00:15:00+00:00', '2018-01-02 00:30:00+00:00', '2018-01-02 00:45:00+00:00',
                     '2018-01-01 00:00:00+00:00', '2018-01-01 00:15:00+00:00', '2018-01-01 00:30:00+00:00', '2018-01-01 00:45:00+00:00'],
             'id_day': [1,1,1,1,2,2,2,2,1,1,1,1], 
             'name':['sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor1', 'sensor2', 'sensor2', 'sensor2', 'sensor2'], 
             'id_unique': [1,1,1,1,2,2,2,2,3,3,3,3],
             'class': ['type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type1', 'type2', 'type2', 'type2', 'type2'],
             'value': np.random.randint(5,25, size = 12)}
testframe = pd.DataFrame(test_dict)
testframe['time'] = pd.to_datetime(testframe['time'])

As you can see, there are several time series stacked underneath each other. Each time series has a time column with a 15-minute sampling rate, an id_day, the name of the sensor, an id_unique (unique to each time series, to be able to tell them apart), a class, and a value for each timestamp. In the real dataset, each time series has a length of 24 h = 96 * 15 minutes. I hope the situation is clear; if not, just ask and I will provide further explanation.

Now I want to change the sampling rate to, let's say, 10 minutes, 1 hour, etc., and use mean values for the value column. The goal is to "delete" all the "obsolete" rows after calculating the mean, so I end up with the same table, just shorter. I have a hard time explaining this better right now, so let me know if anything remains unclear.
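
For a single series, pandas' resample already does this kind of downsampling with means. A minimal sketch, reusing testframe from above; the target rate '1h' and the choice of id_unique == 1 are just for illustration:

# Sketch: downsample one 15-minute series to hourly means.
single = testframe[testframe['id_unique'] == 1].set_index('time')
hourly_mean = single['value'].resample('1h').mean()
# hourly_mean has one row per hour instead of four 15-minute rows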

EDIT: Expected result:

result_dict = {'time': ['2018-01-01 00:00:00+00:00',
                        '2018-01-02 00:00:00+00:00',
                        '2018-01-01 00:00:00+00:00'],
             'id_day': [1,2,1], 
             'name':['sensor1', 'sensor1', 'sensor2'], 
             'id_unique': [1,2,3],
             'class': ['type1', 'type1', 'type2'],
             'value': np.random.randint(5,25, size = 3)}
result = pd.DataFrame(result_dict)
result['time'] = pd.to_datetime(result['time'])

Where value is the mean value. In this case, I resampled the testframe from 15 minutes to 1 hour, so each series collapses to a single row. EDIT_END

What I tried was:

testframe.set_index('time').groupby(pd.Grouper(freq = '5Min')).mean()

and some other groupby attempts, but they all lead to a result that ignores the columns containing strings.
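
Presumably this is because groupby(...).mean() only aggregates the numeric columns (depending on the pandas version, the string columns are silently dropped or a TypeError is raised). Passing an explicit aggregation dict keeps them, but grouping by time alone still mixes the different sensors into one bucket, so this alone is not the answer (a sketch, with an assumed hourly rate):

# Sketch: an aggregation dict keeps the string columns,
# but grouping on time alone still mixes sensor1 and sensor2.
mixed = (testframe.set_index('time')
         .groupby(pd.Grouper(freq='1h'))
         .agg({'value': 'mean', 'name': 'first', 'class': 'first'}))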

What would be the correct approach/code to group by time, calculate the mean of value, and adjust the rest of the columns accordingly?

Thanks in advance for any hint or, ideally, some code!


Solution

  • Finally, I did it the hard way and looped through the data. This approach takes a while, but I couldn't find a better solution:

    sampling_rate = '1h'              # target sampling rate, e.g. hourly
    df = testframe.set_index('time')  # pd.Grouper(freq=...) needs a DatetimeIndex

    df_resampled_dict = {'time': list(), 'id_day': list(), 'name': list(),
                         'id_unique': list(), 'class': list(), 'value': list()}
    for id_uni in df['id_unique'].unique():

        df_temp = df.loc[df['id_unique'] == id_uni]
        name = df_temp['name'].unique()
        klasse = df_temp['class'].unique()
        id_day = df_temp['id_day'].unique()
        # take the mean only over the numeric 'value' column;
        # the constant string columns are re-attached below
        df_temp = df_temp.groupby(pd.Grouper(freq=sampling_rate))[['value']].mean()
        df_temp = df_temp.reset_index()

        df_resampled_dict['value'].extend(df_temp['value'])
        df_resampled_dict['class'].extend([klasse[0]] * len(df_temp['value']))
        df_resampled_dict['time'].extend(df_temp['time'])
        df_resampled_dict['id_day'].extend([id_day[0]] * len(df_temp['value']))
        df_resampled_dict['name'].extend([name[0]] * len(df_temp['value']))
        df_resampled_dict['id_unique'].extend([id_uni] * len(df_temp['value']))

    df_resampled = pd.DataFrame(df_resampled_dict)
    

    If someone stumbles across this and can offer a faster solution, I am happy to learn!
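
    For reference, a loop-free sketch that should be faster, assuming (as in the example data) that id_day, name, and class are constant within each id_unique: put the per-series keys and a time Grouper into a single groupby, so pandas resamples every series at once.

    # Sketch: resample all series in one groupby call.
    # Assumes id_day, name and class are constant within each id_unique.
    df_fast = (testframe
               .set_index('time')
               .groupby(['id_unique', 'id_day', 'name', 'class',
                         pd.Grouper(freq='1h')])['value']
               .mean()
               .reset_index())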