Correct imputation for a zooreg object?


My objective is to impute NAs in a zooreg time series object. The pattern of the time series is cyclic. My code is:

#load libraries required
library("zoo")

# create sequence every 15 minutes from 1st Jan to 20th Jan, 2018
timeStamp <- seq.POSIXt(from=as.POSIXct('2018-01-01 00:00:00', tz="UTC"), to=as.POSIXct('2018-01-20 23:45:00', tz="UTC"), by = "15 min")
# data which increases from 12am to 12pm, then decreases till 12 am of next day, for 20 days
readings <- rep(c(seq(1,48,1), seq(48,1,-1)), 20)
dF <- data.frame(timeStamp=timeStamp, readings=readings)

# create a regular zooreg object; frequency is 96 observations per day (4 readings per hour * 24 hours)
readingsZooReg <- zooreg(dF$readings, order.by  = dF$timeStamp, frequency = 4*24)
plot(readingsZooReg)

# force some data to be NAs
window(readingsZooReg, start = as.POSIXct("2018-01-14 00:00:00", tz="UTC"), end = as.POSIXct("2018-01-16 23:45:00", tz="UTC")) <- NA
plot(readingsZooReg)


# plot imputed values
plot(na.approx(readingsZooReg))

The plots are: Full time series, NAs added, Imputed time series

I'm purposely using zoo here, since the time series I work on are irregular (e.g. solar, oil wells, etc.).

1) Is my usage of "zooreg" correct? Or would a "zoo" object suffice?
2) Is my frequency variable right?
3) Why won't na.approx work? I've also tried na.StructTS, but the R script hangs.
4) Is there a solution using any other package (xts, ts, etc.)?


Solution

  • Your current example time series is a regular time series. (An irregular time series would have different time distances between observations.)

    E.g.:

    • 10:00:10, 10:00:20, 10:00:30, 10:00:40, 10:00:50 (regular spaced)
    • 10:00:10, 10:00:17, 10:00:33, 10:00:37, 10:00:50 (irregular spaced)

    If you really need to handle irregularly spaced time series, zoo is your go-to package. Otherwise you can also use other time series classes such as xts and ts.
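    To check which case you are in, you can look at the gaps between consecutive time stamps. The following is a minimal sketch that rebuilds the two example sequences above as zoo objects; the names `t0`, `regular_z` and `irregular_z` are made up for illustration.

```r
library(zoo)

t0 <- as.POSIXct("2018-01-01 10:00:00", tz = "UTC")

# Same offsets (in seconds) as the two example sequences above
regular_z   <- zoo(1:5, t0 + c(10, 20, 30, 40, 50))
irregular_z <- zoo(1:5, t0 + c(10, 17, 33, 37, 50))

# Gaps between consecutive time stamps, in seconds
regular_steps   <- diff(as.numeric(index(regular_z)))
irregular_steps <- diff(as.numeric(index(irregular_z)))

# A series is strictly regular when all gaps are identical
regular_ok   <- length(unique(regular_steps)) == 1
irregular_ok <- length(unique(irregular_steps)) == 1

# zoo's own check; strict = TRUE demands identical spacing
is.regular(regular_z, strict = TRUE)
is.regular(irregular_z, strict = TRUE)
```

    Note that `strict = TRUE` matters here: without it, zoo also accepts series whose gaps are multiples of a common step.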

    About the frequency:

    You usually set the frequency of a time series to the number of observations after which you expect the pattern to repeat (in your example, 96). In real life the cycle is often 1 day, 1 week, 1 month, and so on, but it can also be something else, such as 1.5 days. For example, with a daily pattern and 1-minute observations you would set the frequency to 1440.
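    The arithmetic behind those numbers can be sketched as follows. The helper `obs_per_cycle` is a hypothetical name introduced only for this illustration; the readings are the ones from the question.

```r
# Observations per cycle = cycle length / sampling interval (same units)
obs_per_cycle <- function(cycle_minutes, step_minutes) {
  cycle_minutes / step_minutes
}

freq_15min <- obs_per_cycle(24 * 60, 15)  # daily cycle, 15-minute data
freq_1min  <- obs_per_cycle(24 * 60, 1)   # daily cycle, 1-minute data

# The question's readings: 20 daily cycles of 96 values each
readings <- rep(c(seq(1, 48, 1), seq(48, 1, -1)), 20)
readings_ts <- ts(readings, frequency = freq_15min)

frequency(readings_ts)          # observations per cycle
length(readings) / freq_15min   # number of complete cycles
```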

    na.approx from zoo works exactly as intended. Your gap starts and ends at the bottom of a cycle, so linear interpolation between the last value before the gap (1) and the first value after the gap (1) gives a straight line at 1. That is probably not the result you expected, because linear interpolation does not account for seasonality. That is why G. Grothendieck suggested na.StructTS, which is usually better at accounting for seasonality.
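    The effect is easy to reproduce on a small scale. This is a toy sketch with a made-up cycle of length 8 instead of 96, where one whole cycle is set to NA, as in the question.

```r
library(zoo)

# Toy cyclic series: rises 1..4, falls 4..1, repeated 3 times
x <- rep(c(1:4, 4:1), 3)

# Blank out the whole middle cycle
x_na <- x
x_na[9:16] <- NA

# Linear interpolation only looks at the two neighbours of the gap
# (both are 1 here), so the cycle shape is not recovered
filled <- na.approx(x_na)
filled[9:16]
```

    Every value in the gap comes out as 1, which is the flat line seen in the question's third plot.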

    If you are not bound to zoo, the best choice in this specific case would be na_seadec from the imputeTS package (a package dedicated solely to time series imputation).
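    For intuition about why a seasonal method helps, here is a deliberately simplified sketch: estimate the seasonal component as the mean per position in the cycle, interpolate only the remainder, then add the seasonal component back. This is a stand-in written for illustration, not the algorithm imputeTS actually uses, and the names `period`, `pos` and `seasonal` are assumptions of this example.

```r
library(zoo)

period <- 8                        # toy cycle length (96 in the question)
x <- rep(c(1:4, 4:1), 4)           # four identical cycles

x_na <- x
x_na[9:16] <- NA                   # remove the whole second cycle

# Position of each observation within its cycle: 1..period, repeated
pos <- rep_len(seq_len(period), length(x_na))

# Crude seasonal estimate: mean of the observed values at each position
seasonal <- ave(x_na, pos, FUN = function(v) mean(v, na.rm = TRUE))

# Interpolate the deseasonalized remainder, then add the season back
remainder <- x_na - seasonal
filled <- na.approx(remainder, na.rm = FALSE) + seasonal
```

    On this perfectly periodic toy series the gap is filled with the original cycle shape rather than a flat line.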

    Here is an example using the imputeTS package, including its plots:

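    library(zoo)       # provides coredata()
    library(imputeTS)

    # convert to a plain ts with 96 observations per daily cycle
    yourTS <- ts(coredata(readingsZooReg), frequency = 96)
    ggplot_na_distribution(yourTS)            # show where the NAs are
    imputedTS <- na_seadec(yourTS)            # seasonally decomposed imputation
    ggplot_na_imputations(yourTS, imputedTS)  # plot imputed values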
    

    Usually imputeTS also works fine with zoo time series as input. I only converted to ts because something about your zoo object seems odd, which is also why na.StructTS from zoo itself breaks. Maybe somebody with better knowledge can help out here.

    Beware: if you really do have irregular time series, do not use packages or imputation functions other than those from zoo, because they all assume the data to be regularly spaced and will give results accordingly.
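    The difference is visible even on three points. In this small sketch with made-up data, na.approx on a zoo object interpolates along the time index, while the same values without their index are treated as equally spaced.

```r
library(zoo)

# Three observations at times 0, 1 and 10; the middle one is missing
z <- zoo(c(0, NA, 10), order.by = c(0, 1, 10))

# Interpolation along the time index: time 1 is 1/10 of the way
by_time <- coredata(na.approx(z))

# Same values without the index: the NA is treated as the midpoint
by_position <- na.approx(coredata(z))

by_time
by_position
```

    Here `by_time` fills the gap with 1 and `by_position` fills it with 5, which is the kind of distortion you get when an irregular series is handed to a function that assumes regular spacing.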