Search code examples
pythondatetimeresampling

How do I conditionally resample data into hourly values only when enough data is present?


My first time trying something more tricky and coming on here for advice. I am trying to resample my data that is at times in 10 minute increments in such a way that I obtain hourly values only if half or more of the data is an actual value instead of a NaN value. How do I still return a NaN value if more than 50% is NaN? So that I still have a continuous time series (at least when it comes to timestamps).

I am working in Python 3 via a jupyterlab notebook (not sure if that makes a difference).

Here's a small example of what I am trying to do:

Let's say that for an hour the rain gauge I am using needs to be repaired and therefore doesnt record any data:

Datetime          Precip
2014-07-01 00:00  0.2
2014-07-01 00:10  0.3
2014-07-01 00:20  0.1
2014-07-01 00:30  0.0
2014-07-01 00:40  NaN
2014-07-01 00:50  NaN
2014-07-01 01:00  NaN
2014-07-01 01:10  NaN
2014-07-01 01:20  NaN
2014-07-01 01:30  NaN
2014-07-01 01:40  0.0
2014-07-01 01:50  0.0

I would want the resample to return the following output where the first hour is resampled (since it has more than 50% data) and for the second hour to be returned as a NaN.

Datetime          Precip
2014-07-01 00:00  0.15
2014-07-01 00:10  NaN

So far I have just resampled as such:

df['Precip'].resample(rule = 'h').mean()

However that obviously resamples everything.


Solution

  • I have found a solution to my issue with the following function:

    In the function, the perc_of_NaN is the threshhold of NaN values above which you want to exclude any averaging. eg. You don't want resampling to be done if more than 60% of datapoints are NaN, then perc_of_NaN would be 60.

    Additionally, Averaging_number is the amount of rows (could be timesteps/whatever your data is) that you want to resample together.

    def NaN_Resample(df,perc_of_NaN,Averaging_number):
        rows = int(len(df)/Averaging_number)
        columns = int(len(df.columns))
        Output = pd.DataFrame(np.zeros((rows,columns)))
        Headers = list(df.columns)
        Output = Output.set_axis(Headers, axis=1)
        dec_of_NaN = perc_of_NaN/100
        for x in range (columns):
            for y in range (rows):
                if df.iloc[(y*Averaging_number):((y*Averaging_number)+(Averaging_number)),x].isna().sum()>(Averaging_number*dec_of_NaN): 
                    Output.iloc[y,x] == 'NaN'
                Output.iloc[y,x] = df.iloc[(y*Averaging_number):((y*Averaging_number)+(Averaging_number)),x].mean()
        Output.index = pd.to_datetime(Output[Headers[0]])      
        Output = Output.resample(rule = 'd').mean()
        return Output
    

    Feel free to use the function/ask any questions about it if it is helpful to you :)