Here is the problem: Five observations on Y are to be taken when X = 4, 8, 12, 16, 20, respectively. The true regression function is E(Y) = 20 + 4X, and the ei are independent N(0, 25).
Generate five normal random numbers with mean 0 and variance 25. Consider these random numbers as the error terms for the five Y observations at X = 4, 8, 12, 16, 20 and calculate Y1, Y2, Y3, Y4, and Y5. Obtain the least squares estimates b0 and b1 when fitting a straight line to the five cases. Also calculate Yh when Xh = 10 and obtain a 95 percent confidence interval for E(Yh) when Xh = 10. I did part 1, but I need help repeating it 200 times.
Repeat part (1) 200 times, generating new random numbers each time. Make a frequency distribution of the 200 estimates b1. Calculate the mean and standard deviation of the 200 estimates b1. Are the results consistent with theoretical expectations? What proportion of the 200 confidence intervals for E(Yh) when Xh = 10 include E(Yh)? Is this result consistent with theoretical expectations?
Here's my code so far; I am stumped on how to repeat part 1 200 times:
X <- matrix(c(4, 8, 12, 16, 20), nrow = 5, ncol = 1)
e <- matrix(c(rnorm(5,0,sqrt(5))), nrow = 5, ncol = 1)
Y <- 20 + 4 * X + e
mydata <- data.frame(cbind(Y=Y, X=X, e=e))
names(mydata) <- c("Y","X","e")
reg<-lm(Y ~ X, data = mydata)
predict(reg, newdata = data.frame(X=10), interval="confidence")
There is a mistake in your code: you want independent N(0, 25) errors, but you passed sqrt(5) as the standard deviation to rnorm(). Since the variance is 25, the standard deviation should be 5.
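As a quick sanity check (a sketch, not part of the assignment): the third argument of rnorm() is the standard deviation, not the variance, so variance 25 requires sd = 5.

```r
set.seed(1)
## rnorm() takes the standard deviation, not the variance
e_wrong <- rnorm(1e5, 0, sqrt(5))  # sample variance near 5, not 25
e_right <- rnorm(1e5, 0, 5)        # sample variance near 25
var(e_wrong)
var(e_right)
```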
We first wrap your code into a function. This function takes no input; it runs the experiment once and returns the regression coefficients b0 and b1 and the prediction fit, lwr, and upr in a named vector.
sim <- function () {
  x <- c(4, 8, 12, 16, 20)
  ## errors are N(0, 25), i.e. standard deviation 5
  y <- 20 + 4 * x + rnorm(5, 0, 5)
  fit <- lm(y ~ x)
  ## 95% confidence interval for E(Y) at x = 10
  pred <- predict(fit, data.frame(x = 10), interval = "confidence")
  ## flatten the 1-row matrix into a named vector
  pred <- setNames(c(pred), colnames(pred))
  ## return simulation result: b0, b1, fit, lwr, upr
  c(coef(fit), pred)
}
For example, let's try
set.seed(2016)
sim()
#(Intercept) x fit lwr upr
# 24.222348 3.442742 58.649773 47.522309 69.777236
Now we use replicate to repeat this experiment 200 times.
set.seed(0)
z <- t(replicate(200, sim()))
head(z)
# (Intercept) x fit lwr upr
#[1,] 24.100535 3.987755 63.97808 57.61262 70.34354
#[2,] 6.417639 5.101501 57.43265 52.44263 62.42267
#[3,] 20.652355 3.797991 58.63227 52.74861 64.51593
#[4,] 20.349829 3.816426 58.51409 52.59115 64.43702
#[5,] 19.891873 4.095140 60.84327 57.49911 64.18742
#[6,] 24.586749 3.589483 60.48158 53.64574 67.31743
There will be 200 rows, one for each of the 200 simulations.
The second column contains the 200 estimates of b1; we compute their mean and standard deviation:
mean(z[,2])
# [1] 3.976249
sd(z[,2])
# [1] 0.4263377
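The question also asks for a frequency distribution of the 200 estimates. A minimal sketch, reusing z from the replicate call above (hist() with plot = FALSE tabulates the bin counts without drawing; drop that argument to draw the histogram):

```r
## frequency distribution of the 200 b1 estimates
h <- hist(z[, 2], breaks = 10, plot = FALSE)
cbind(mid = h$mids, count = h$counts)
## or draw it directly:
## hist(z[, 2], main = "200 estimates of b1", xlab = "b1")
```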
We know that the true value is 4, so the estimates are consistent with theoretical expectations.
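To make "theoretical expectations" concrete, a short sketch using the standard least squares result Var(b1) = sigma^2 / sum((Xi - Xbar)^2):

```r
x <- c(4, 8, 12, 16, 20)
## theoretical standard deviation of b1: sigma / sqrt(Sxx)
sqrt(25 / sum((x - mean(x))^2))
# [1] 0.3952847
```

The simulated standard deviation 0.4263 is within sampling error of this value.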
Finally, let's check the 95% confidence intervals for E(Yh) at Xh = 10. The true value is 20 + 4 * 10 = 60, so the proportion of confidence intervals that cover this true value is
mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
## 0.95
which is exactly the nominal 0.95.
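For reference, a sketch of the binomial sampling error (not asked for by the original exercise): with 200 independent intervals each covering with probability 0.95, the observed proportion has standard error sqrt(0.95 * 0.05 / 200), so coverage within roughly two standard errors of 0.95 is still consistent with theory.

```r
## standard error of the observed coverage proportion over 200 runs
se <- sqrt(0.95 * 0.05 / 200)
se
# [1] 0.01541104
## approximate range of proportions consistent with 0.95 coverage
round(0.95 + c(-2, 2) * se, 3)
# [1] 0.919 0.981
```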