Search code examples
rtime-serieserror-correction

Correcting consecutive errors in time series


I have very large continuous data sets (>1M rows) with frequent "breaks" or "jumps" due to sensor failure or other external factors. These breaks correspond to a constant value added or removed and last only for a limited amount of time. I am trying to realign these sequences with the rest of the data.

par(mfrow=c(2,1))

#simulating perfect dataset
dfe<-data.frame(
  date=seq(as.Date('2015-07-12'),as.Date('2015-07-12')+49, by = '1 day'),
  valueideal=round(sin(seq(1,50,1))+20)
)

#introducing artifacts
dfe$valuer<-dfe$valueideal
dfe$valuer[10:20]<-dfe$valueideal[10:20]+10
dfe$valuer[30:35]<-dfe$valueideal[30:35]-10


#plotting ideal vs real data
plot(dfe$date, dfe$valuer, main="real data", ylim=c(8,32))
plot(dfe$date, dfe$valueideal, main="ideal data", ylim=c(8,32))

So my data look like the "real data" and I would like them to realign them to be like the "ideal data". Real vs Ideal corrected data

So far I have made one for loop that mostly works except for the first data point of each artifact, and it slightly affects the rest of the data. I am not sure why or how to fix it:

#trying to solve it with a loop
dfe$valuel<-dfe$valuer
for (i in seq(2,length(dfe$valuel)-1,1)){
  future<-diff(c(dfe$valuel[i],dfe$valuel[i+1]))
  past<-diff(c(dfe$valuel[i-1],dfe$valuel[i]))

  if (abs(future)>2*abs(past)){
    dfe$valuel[i:length(dfe$valuel)]<-dfe$valuel[i:length(dfe$valuel)]-future

  }
}
plot(dfe$date, dfe$valuel, main="loop corrected data", ylim=c(8,32))

Real data vs Loop corrected

I am also worried to use this method on my very large dataset, I am not sure how long this will take. So I have also tried using this R function to subtract the difference between consecutive values in vector from subsequent values in vector method, but that didn't go well, possibly because it is hard to pick a delta_max value that is relevant:

#trying to solve it with a vectorised function
remove_artifacts <- function(weights, delta_max) {
  # calculate deltas, and set first delta to zero
  dw <- c(0, diff(x))
  # create vector of zeros and abs(observations) > delta_max
  # dw * (logical vector) results in either:
  # dw * 0 (if FALSE)
  # dw * 1 (if TRUE)
  dm <- dw * (abs(dw) > delta_max)
  # subtract the cumulative sum of observations > delta_max
  return(weights - cumsum(dm))
}
dfe$valuedm<-remove_artifacts(dfe$valuer, 10)
plot(dfe$date, dfe$valuedm, main="remove artifacts function", ylim=c(8,32))

real vs de-artifact function

So my question is, how can I efficiently correct these consecutive data breaks?


Solution

  • Here's another solution that is fast as it uses and, therefore, modify in place. First, I setup the problem.

    #simulating perfect dataset
    dfe<-data.frame(
      date=seq(as.Date('2015-07-12'),as.Date('2015-07-12')+49, by = '1 day'),
      valueideal=round(sin(seq(1,50,1))+20)
    )
    
    #introducing artifacts
    dfe$valuer<-dfe$valueideal
    dfe$valuer[10:20]<-dfe$valueideal[10:20]+10
    dfe$valuer[30:35]<-dfe$valueideal[30:35]-10
    

    Next, I load and convert the data frame into a data table.

    # Load data.table
    library(data.table)
    
    # Convert data frame into data.table
    setDT(dfe)
    

    I calculate the difference in consecutive values using a vectorised approach, rather than the loop in the question.

    # Calculate changes
    dfe[, delta := c(abs(diff(valuer)), 0)]
    

    These differences are used to break the time series into intervals:

    # Labels intervals
    dfe[, int := cut(.I, 
                     breaks = c(0, which(delta > 3 * sd(delta)/mean(delta)), nrow(dfe)), 
                     include.lowest = TRUE)]
    

    I centre all intervals on zero.

    # Mean of zero
    dfe[, value_new := valuer - mean(valuer), by = int]
    

    Then, I add an offset taken as the mean of the first group.

    # Correct offset
    dfe[, value_new := value_new + dfe[, mean(valuer), by = int][, first(V1)]]
    

    Finally, I plot the result.

    # Plot result
    with(dfe, plot(date, value_new, main="real data", ylim=c(8,32)))
    

    Created on 2019-12-11 by the reprex package (v0.3.0)