Tags: r, regression, linear-regression, lm, least-squares

Replicate least squares regression to check consistency of estimation and prediction with truth


Here is the problem: Five observations on Y are to be taken when X = 4, 8, 12, 16, 20, respectively. The true regression function is E(Y) = 20 + 4X, and the ei are independent N(0, 25).

  1. Generate five normal random numbers with mean 0 and variance 25. Consider these random numbers as the error terms for the five Y observations at X = 4, 8, 12, 16, 20 and calculate Y1, Y2, Y3, Y4, and Y5. Obtain the least squares estimates b0 and b1 when fitting a straight line to the five cases. Also calculate Yh when Xh = 10 and obtain a 95 percent confidence interval for E(Yh) when Xh = 10. I did part 1, but I need help repeating it 200 times.

  2. Repeat part (1) 200 times, generating new random numbers each time.

  3. Make a frequency distribution of the 200 estimates b1. Calculate the mean and standard deviation of the 200 estimates b1. Are the results consistent with theoretical expectations?

  4. What proportion of the 200 confidence intervals for E(Yh) when Xh = 10 include E(Yh)? Is this result consistent with theoretical expectations?

Here's my code so far; I am stumped on how to repeat part 1 200 times:

X <- matrix(c(4, 8, 12, 16, 20), nrow = 5, ncol = 1)
e <- matrix(c(rnorm(5,0,sqrt(5))), nrow = 5, ncol = 1)
Y <- 20 + 4 * X + e
mydata <- data.frame(cbind(Y=Y, X=X, e=e))
names(mydata) <- c("Y","X","e")
reg<-lm(Y ~ X, data = mydata)
predict(reg, newdata = data.frame(X=10), interval="confidence")

Solution

  • There is a mistake in your code. You want independent N(0, 25) errors, but you passed sqrt(5) as the standard deviation to rnorm(). Since the variance is 25, the standard deviation should be 5.
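
    A quick empirical check makes the difference visible; the third argument of rnorm() is the standard deviation, not the variance (a minimal sketch):

    var(rnorm(1e5, 0, 5))        # close to 25, as required
    var(rnorm(1e5, 0, sqrt(5)))  # close to 5 -- the variance is too small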

    We first wrap your code into a function. It takes no input, runs the experiment once, and returns the regression coefficients b0 and b1 together with the prediction fit, lwr, upr in a named vector.

    sim <- function () {
      x <- c(4, 8, 12, 16, 20)
      ## true regression line plus independent N(0, 25) errors (sd = 5)
      y <- 20 + 4 * x + rnorm(5, 0, 5)
      fit <- lm(y ~ x)
      ## 95% confidence interval for E(Yh) at Xh = 10
      pred <- predict(fit, data.frame(x = 10), interval = "confidence")
      ## flatten the 1-row matrix into a named vector: fit, lwr, upr
      pred <- setNames(c(pred), dimnames(pred)[[2]])
      ## return simulation result: b0, b1, fit, lwr, upr
      c(coef(fit), pred)
    }
    

    For example, let's try

    set.seed(2016)
    sim()
    #(Intercept)           x         fit         lwr         upr 
    #  24.222348    3.442742   58.649773   47.522309   69.777236 
    

    Now we use replicate to repeat the experiment 200 times, generating new random numbers each time.

    set.seed(0)
    ## each column of replicate() is one run, so transpose:
    ## each row of z is one simulation
    z <- t(replicate(200, sim()))
    
    head(z)
    #     (Intercept)        x      fit      lwr      upr
    #[1,]   24.100535 3.987755 63.97808 57.61262 70.34354
    #[2,]    6.417639 5.101501 57.43265 52.44263 62.42267
    #[3,]   20.652355 3.797991 58.63227 52.74861 64.51593
    #[4,]   20.349829 3.816426 58.51409 52.59115 64.43702
    #[5,]   19.891873 4.095140 60.84327 57.49911 64.18742
    #[6,]   24.586749 3.589483 60.48158 53.64574 67.31743
    

    There will be 200 rows, one for each of the 200 simulations. The second column contains the 200 estimates of b1; we compute their mean and standard deviation:

    mean(z[,2])
    # [1] 3.976249
    
    sd(z[,2])
    # [1] 0.4263377
    

    We know that the true value of b1 is 4 and that its theoretical standard deviation is sigma / sqrt(sum((Xi - Xbar)^2)) = 5 / sqrt(160), about 0.395, so both the mean and the standard deviation of the 200 estimates are consistent with theoretical expectations.
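
    For part 3's frequency distribution, here is a minimal sketch using the z matrix above (lm() names the slope column "x"):

    ## frequency distribution of the 200 slope estimates (part 3)
    hist(z[, "x"], breaks = 20, xlab = "b1",
         main = "200 least squares estimates of b1")

    ## theoretical standard deviation of b1: sigma / sqrt(sum((x - mean(x))^2))
    x <- c(4, 8, 12, 16, 20)
    5 / sqrt(sum((x - mean(x))^2))
    # [1] 0.3952847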

    Finally, let's check the 95% confidence intervals for E(Yh) at Xh = 10. The true value is E(Yh) = 20 + 4 * 10 = 60, so the proportion of confidence intervals that cover this true value is

    mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
    ## 0.95
    

    which is exactly the nominal 0.95.
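
    As a rough sanity check: with 200 independent intervals, each covering the truth with probability 0.95, the simulated proportion has standard deviation sqrt(0.95 * 0.05 / 200), about 0.015, so any proportion near 0.95 is consistent with theory. A minimal sketch:

    ## approximate two-standard-deviation band for the simulated coverage
    0.95 + c(-2, 2) * sqrt(0.95 * 0.05 / 200)
    # roughly 0.919 to 0.981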