Let's assume I have a table as such:
Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 32000
09/13/2017 8000
The values in the table are estimated by an R program which I have no access to (it's a black box right now). Now there are a few days with 0 values which tend to creep in due to issues in our ingestion/ETL process. I need to estimate the values for the dates with 0 data.
Our approach is to:
Now if there's only one day with missing data between two good days, a straightforward mean would work. If there are two or more consecutive days with missing data, the mean would not work, so I'm trying to formulate a way to go about estimating values for multiple data points.
Would this approach work in R? I'm a total n00b at R so I'm not really sure if this is even feasible.
You can fill in the values with linear interpolation using the function approxfun
.
## Your data
df = read.table(text="Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 32000
09/13/2017 8000",
header=TRUE, stringsAsFactors=FALSE)
df$Date = as.Date(df$Date, format="%m/%d/%Y")
## Create function for linear interpolation
Interp = approxfun(df[df$Sales > 0, ])
## Use function to fill in interpolated values
Vals = Interp(df$Date[df$Sales == 0])
df$Sales[df$Sales == 0] = Vals
plot(df, type="l")
grid()