python pandas statistics time-series standard-deviation

Standard deviation of time series data on two columns


I have a data frame with two columns of data for one day, with a time-series index. The data is sampled at 1-minute intervals, and I want to create a 5-minute data frame in which a 5-minute interval is flagged False when the standard deviation of its 5 samples exceeds 5% of the mean of those 5 samples. This needs to be done for every 5-minute interval in the day and for each column. For example, for column X of DF1 we calculate the mean and standard deviation of the 5 samples from 16:01 to 16:05 and compute %(Std/Mean); the same is done for the next 5 samples, and for column Y. DF2 is then populated so that if %(Std/Mean) > 5%, that particular 5-minute interval is False.



Solution

  • You can use the resample method of a pandas DataFrame; for that, the DataFrame must be indexed by timestamps. Here is an example:

    import pandas as pd
    import numpy as np

    # Example data: 30 daily samples in two columns
    dates = pd.date_range('1/1/2020', periods=30)
    df = pd.DataFrame(np.random.randn(30, 2), index=dates, columns=['X', 'Y'])

    lbl = 'right'  # label each window with its right edge
    w = '3D'       # window size; use '5min' for 5-minute windows
    threshold = 1  # here goes your threshold for flagging the ratio of standard deviation and mean

    # Resample once, then compare std/mean against the threshold for each column
    r = df.resample(w, label=lbl)
    DF2 = r.std() / r.mean() > threshold  # True where std/mean exceeds the threshold
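Adapted to the setup in the question, the same idea might look like the sketch below. The sample values, the names `df1` and `df2`, and the choice of `closed="right"` (so that 16:01 to 16:05 falls in the window labelled 16:05) are assumptions for illustration, not taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one day of 1-minute readings (assumed values)
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01 00:01", periods=1440, freq="min")
df1 = pd.DataFrame(
    {"X": 100 + rng.normal(0, 3, len(idx)), "Y": 50 + rng.normal(0, 3, len(idx))},
    index=idx,
)

# Windows (t-5min, t], labelled by their right edge, e.g. 16:01..16:05 -> 16:05
grouped = df1.resample("5min", closed="right", label="right")
ratio = grouped.std() / grouped.mean()  # std/mean per column and window

# False where the std exceeds 5% of the mean, True otherwise
df2 = ratio <= 0.05
```

Note that `std()` here is the sample standard deviation (`ddof=1`, the pandas default), and that a window whose ratio is NaN (for example, a window with a single sample) would also come out False with this comparison.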