r, regression, linear-regression, forecasting

(In)correct use of a linear time trend variable, and most efficient fix?


I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:

> head(d_exp_0014)
  Year Month Day    Amount Count     myDate
1 2000     7   6  792078.6     9 2000-07-06
2 2000     7   7  140065.5     9 2000-07-07
3 2000     7  11  190553.2     9 2000-07-11
4 2000     7  12  119208.6     9 2000-07-12
5 2000     7  16 1068156.3     9 2000-07-16
6 2000     7  17       0.0     9 2000-07-17

I would like to include a linear time trend variable,

t <- 1:3133

in a linear model explaining the variation in the Amount of the expenditure.

fit_t <- lm(Amount ~ t + Count, d_exp_0014)

However, this is clearly wrong: t increases by 1 with every row, while the gaps between consecutive dates vary:

> head(exp)
  Year Month Day    Amount Count       Date t
1 2000     7   6  792078.6     9 2000-07-06 1
2 2000     7   7  140065.5     9 2000-07-07 2
3 2000     7  11  190553.2     9 2000-07-11 3
4 2000     7  12  119208.6     9 2000-07-12 4
5 2000     7  16 1068156.3     9 2000-07-16 5
6 2000     7  17       0.0     9 2000-07-17 6

To me, that is the exact opposite of a linear trend.

What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like

CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "day"))

yield different results?

I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.


Solution

  • I think transforming the data to a regular (continuous) time series is a good idea. You can use xts for this; it is handy because xts objects can be used in other packages like regular ts objects.

    Filling the gaps

    library(xts)

    # convert myDate to POSIXct if necessary
    # create xts from data frame x
    ts1 <- xts(data.frame(a = x$Amount, c = x$Count), order.by = x$myDate)
    ts1

    # create an empty (zero-width) daily series covering the same range;
    # merge() needs an xts object here, not a bare vector of dates
    ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))

    # merge the empty series with the data and fill the gaps with 0
    ts2 <- merge(ts1, ts_empty, fill = 0)

    # or leave the gaps as NA and fill them from neighbouring values, for example:
    ts2 <- merge(ts1, ts_empty, fill = NA)
    ts2 <- na.locf(ts2)

    # zoo/xts gap-filling functions:
    #     na.locf   - carry the last observation forward
    #     na.approx - linear interpolation
    #     na.spline - cubic spline interpolation
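
For the regression itself, the trend regressor can also be derived from the dates rather than from the row number, so that it advances by the number of elapsed days. A minimal sketch in base R, assuming myDate is already of class Date; the data frame `d` is a hypothetical stand-in built from the six sample rows in the question:

```r
# Sample rows copied from the question; `d` is a hypothetical stand-in for d_exp_0014
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-07", "2000-07-11",
                     "2000-07-12", "2000-07-16", "2000-07-17")),
  Amount = c(792078.6, 140065.5, 190553.2, 119208.6, 1068156.3, 0.0),
  Count  = c(9, 9, 9, 9, 9, 9)
)

# Trend variable measured in elapsed calendar days, not in row position
d$t <- as.numeric(d$myDate - min(d$myDate))
d$t  # 0 1 5 6 10 11 for these rows

# Count is constant in this tiny sample, so it is left out of the formula here;
# on the full data the model from the question would be Amount ~ t + Count
fit_t <- lm(Amount ~ t, data = d)
coef(fit_t)
```

With this definition, a four-day gap between payments moves t by 4 instead of 1, which is what a linear trend in calendar time requires.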
    

    Deduplicate dates

    In your sample there is no sign of duplicated dates, but based on your newer question they are very likely. I think you want to aggregate the values for each date with the sum function:

    # colSums sums each column per day; plain sum would add Amount and Count together
    ts1 <- period.apply(ts1, endpoints(ts1, 'days'), colSums)
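
Since the question asks for something that can be rerun on every weekly update, the two steps (collapse duplicate dates, then fill the calendar gaps) can be wrapped in one function. The following is a rough sketch in base R rather than xts, so it is an alternative to the code above and not the same method. The function name `regularize_payments` is hypothetical, the column names myDate, Amount and Count are taken from the question, and treating days without payments as 0 for both Amount and Count is an assumption:

```r
# Hypothetical helper: collapse duplicate dates, then expand to a daily calendar.
# Assumes columns myDate (class Date), Amount and Count, as in the question.
regularize_payments <- function(d) {
  # Sum Amount and Count over rows that share a date
  daily <- aggregate(d[c("Amount", "Count")],
                     by = list(myDate = d$myDate), FUN = sum)

  # One row per calendar day between the first and the last payment
  calendar <- data.frame(
    myDate = seq(min(daily$myDate), max(daily$myDate), by = "day")
  )

  # Left join onto the calendar; days without payments come back as NA
  out <- merge(calendar, daily, by = "myDate", all.x = TRUE)

  # Assumption: no payment on a day means Amount = 0 and Count = 0
  out$Amount[is.na(out$Amount)] <- 0
  out$Count[is.na(out$Count)] <- 0

  # Trend variable in elapsed days since the first payment
  out$t <- as.numeric(out$myDate - min(out$myDate))
  out
}

# Toy input with one duplicated date (values are made up for illustration)
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-06", "2000-07-09")),
  Amount = c(100, 50, 25),
  Count  = c(1, 2, 3)
)
res <- regularize_payments(d)
res
```

For the weekly append, one option is to rbind() the new rows onto the raw data and call the function again on the combined data frame, so that t is always recomputed from the earliest date.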