Search code examples
rdata-quality

R - estimating missing values


Let's assume I have a table as such:

Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000

Here is what the data in the table looks like

The values in the table are estimated by an R program which I have no access to (it's a black box right now). Now there are a few days with 0 values which tend to creep in due to issues in our ingestion/ETL process. I need to estimate the values for the dates with 0 data.

Our approach is to:

  • Draw a line from the date prior to the missing data to the date right after the missing data
  • Estimate the value for the missing date from the line

Now if there's only one day with missing data between two good days, a straightforward mean would work. If there are two or more consecutive days with missing data, the mean would not work, so I'm trying to formulate a way to go about estimating values for multiple data points.

The intersection of the green and red lines would give the required values

Would this approach work in R? I'm a total n00b at R so I'm not really sure if this is even feasible.


Solution

  • You can fill in the values with linear interpolation using the function approxfun.

    ## Your data
    df = read.table(text="Date        Sales
    09/01/2017  9000
    09/02/2017  12000
    09/03/2017  0
    09/04/2017  11000
    09/05/2017  14400
    09/06/2017  0
    09/07/2017  0
    09/08/2017  21000
    09/09/2017  15000
    09/10/2017  23100
    09/11/2017  0
    09/12/2017  32000
    09/13/2017  8000",
    header=TRUE, stringsAsFactors=FALSE)
    df$Date = as.Date(df$Date, format="%m/%d/%Y")
    
    
    ## Create function for linear interpolation
    Interp = approxfun(df[df$Sales > 0, ])
    
    ## Use function to fill in interpolated values
    Vals = Interp(df$Date[df$Sales == 0])
    df$Sales[df$Sales == 0] = Vals
    plot(df, type="l")
    grid()
    

    Interpolated values