Tags: r, time-series, correlation, statistics-bootstrap

Time series lags and correlations (autocorrelations)


To learn some time series analysis, I'm trying to calculate the following for a data set, and then block bootstrap the standard errors across individuals:

[Image: the formula to be calculated (not reproduced here)]

Here's the data set:

https://www.dropbox.com/s/z066lnxetz9uaf6/health.csv?dl=0

And here is the code I've written for the correlations:

#Remove duplicate rows
health.d <- health.d[!duplicated(health.d),]

health.d$lnincome <- log(health.d$Income + 1)

health.d <- health.d[(health.d$sex == 1 & health.d$married == 0),]

#First Difference for each individual  ( %>% , group_by and mutate are functions in dplyr package)
health.d <- health.d %>%    
  group_by(ID) %>%                
  mutate(Dy = lnincome - lag(lnincome, 1)) 

#Remove NA from Dy
health.d <- health.d[!is.na(health.d$Dy),]

#Autocorrelation

health.d <- arrange(health.d, ID, year)
health.d <- transform(health.d, time = as.numeric(interaction(ID, drop=TRUE)))

health.d$lag1DY <- health.d$lnincome - lag(health.d$lnincome, 1)
health.d$lagDY_s1 <- lag(health.d$lnincome,1) - lag(health.d$lnincome, 2)
health.d$lagDY_s2 <- lag(health.d$lnincome,2) - lag(health.d$lnincome, 3)
health.d$lagDY_s3 <- lag(health.d$lnincome,3) - lag(health.d$lnincome, 4)
health.d$lagDY_s4 <- lag(health.d$lnincome,4) - lag(health.d$lnincome, 5)

#Remove NA from lag
health.d <- health.d[!is.na(health.d$lag1DY),]
health.d <- health.d[!is.na(health.d$lagDY_s1),]
health.d <- health.d[!is.na(health.d$lagDY_s2),]
health.d <- health.d[!is.na(health.d$lagDY_s3),]
health.d <- health.d[!is.na(health.d$lagDY_s4),]

cor(health.d$lag1DY, health.d$lagDY_s1)
cor(health.d$lag1DY, health.d$lagDY_s2)
cor(health.d$lag1DY, health.d$lagDY_s3)
cor(health.d$lag1DY, health.d$lagDY_s4)

Results:

    > cor(health.d$lag1DY, health.d$lagDY_s1)
    [1] -0.05593212
    > cor(health.d$lag1DY, health.d$lagDY_s2)
    [1] -0.1033625
    > cor(health.d$lag1DY, health.d$lagDY_s3)
    [1] -0.0804236
    > cor(health.d$lag1DY, health.d$lagDY_s4)
    [1] -0.1235624

These seem wrong as there should be high correlation between the time periods due to the income, but I can't figure out what I have done wrong.

Edit: I've updated my code to include the current results I've reached. These don't appear to be correct, but (1) I don't know the correct numbers, and (2) I don't know where my code is wrong. I'm posting my current results in hope someone can correct me :)

Any help with a block bootstrap on the standard errors?

Thanks in advance.


Solution

  • All you probably need is the acf function in the stats package. It computes the autocorrelations for as many lags as you like.

    library(dplyr) # for %>%, group_by, mutate and lag
    # acf() comes from the stats package, which R attaches by default
    health.d <- health.d[!duplicated(health.d),]
    health.d$lnincome <- log(health.d$Income + 1)
    health.d <- health.d[(health.d$sex == 1 & health.d$married == 0),]
    # First difference for each individual (assumes rows are already ordered by year within each ID)
    health.d <- health.d %>%
      group_by(ID) %>%
      mutate(Dy = lnincome - dplyr::lag(lnincome, 1))
    # plot = FALSE here because the plot is drawn explicitly on the next line
    acf.results <- acf(health.d$Dy, lag.max = 5, type = "correlation", plot = FALSE, na.action = na.pass)
    plot(acf.results, main = "Auto-correlation")
    

    This gives the following plot of the autocorrelations up to lag 5, as set by lag.max in the acf call:

    [Image: ACF plot of health.d$Dy for lags 0 to 5]

    If you want to access the acf results you can use:

    print(acf.results)
    

    and you will get the following

    Autocorrelations of series ‘health.d$Dy’, by lag
    
         0      1      2      3      4      5 
     1.000 -0.225  0.016 -0.030 -0.002  0.002
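
One caveat: acf() treats health.d$Dy as a single continuous series, so when several individuals are stacked in one column, some lagged pairs straddle the boundary between two IDs. The question also asks for block-bootstrapped standard errors, which acf() does not provide. The sketch below is one possible way to handle both, under stated assumptions: it pairs Dy with its lag only within each ID, and it resamples whole individuals with replacement (each person's series is one block), which is only one reading of "block bootstrap for individuals". The toy panel, the helper names lag_pairs, cor_from_pairs and boot_se, and the choice of B are all hypothetical and are not taken from the original post; the code also assumes every individual has consecutive years with no gaps.

```r
# Toy panel standing in for health.d (simulated, NOT the asker's data):
# 200 individuals, 8 years each, Dy = first difference of log income.
set.seed(1)
n_id <- 200
n_year <- 8
panel <- data.frame(
  ID   = rep(seq_len(n_id), each = n_year),
  year = rep(seq_len(n_year), times = n_id)
)
# MA(1) differences, Dy_t = e_t - 0.5 * e_{t-1}; the theoretical
# lag-1 autocorrelation of this process is -0.5 / (1 + 0.25) = -0.4.
panel$Dy <- unlist(lapply(seq_len(n_id), function(i) {
  e <- rnorm(n_year + 1)
  e[-1] - 0.5 * e[-(n_year + 1)]
}))

# For each ID, build the (Dy_t, Dy_{t-k}) pairs using that ID's rows only,
# so no pair mixes two different individuals.
# Assumption: years are consecutive within each ID.
lag_pairs <- function(d, k) {
  d <- d[order(d$ID, d$year), ]
  lapply(split(d$Dy, d$ID), function(x) {
    n <- length(x)
    if (n <= k) return(NULL)  # series too short for this lag
    cbind(now = x[(k + 1):n], lagged = x[1:(n - k)])
  })
}

# Pool the pairs from all individuals and correlate the two columns.
cor_from_pairs <- function(pairs) {
  m <- do.call(rbind, unname(pairs))
  cor(m[, "now"], m[, "lagged"], use = "complete.obs")
}

# Bootstrap standard error: resample individuals with replacement,
# keeping each individual's pairs together as one block.
boot_se <- function(pairs, B = 500) {
  n <- length(pairs)
  reps <- replicate(B, cor_from_pairs(pairs[sample.int(n, n, replace = TRUE)]))
  sd(reps)
}

for (k in 1:4) {
  p <- lag_pairs(panel, k)
  cat(sprintf("lag %d: cor = %.3f, bootstrap SE = %.3f\n",
              k, cor_from_pairs(p), boot_se(p)))
}
```

To try this on the real data, panel would be replaced by health.d after the Dy column has been created. Building the pairs once and resampling the list of per-individual blocks avoids rebuilding the data frame in every bootstrap replicate.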