My teacher gave me the result of a regression and the exercise is to reverse engineer the data set that lead to that regression. Then we need to do a regression on it and find the exact same results.
I managed to get the coefficient to -19.93 but I don't get the same SE at all. I don't know how to do it. I don't know if I should use some formulas linking the SE of the estimator and the standard error of the regression (I have some but I don't really see the way to implement them in R)...Thanks in advance for your help!
My R output:
## Given values
n <- 1592
se_β1 <- 1.47
β1hat <- -19.93
## Create a dummy variable for control vs treatment condition
set.seed(123)
Low_anchor <- rbinom(n,1,0.5)
## Formula of standard error of beta 1 (assuming homoskedasticity)
calculate_standard_error <- function(u, Low_anchor) {
sqrt((1/(n - 2))*sum(u^2)/(n*sd(Low_anchor)^2))
}
## Define initial values of u
u <- rnorm(n)
## Tolerance for convergence
tolerance <- 0.1
## Iteratively adjust u until the standard error matches the target
while (abs(calculate_standard_error(u, Low_anchor) - se_β1) > tolerance) {
## Generate new set of values for u from a normal distribution
u <- rnorm(n)
}
print(u)
## regression
Yc <- -19.93*Low_anchor + u
model1 <- lm(Yc ~ Low_anchor - 1)
## Print the summary of the model
summary(model1)
It looks like Nc
is not defined. I think you are calling the independent variable. Here I will use x
. Note that this question seems to ask you to use the properties of the standard error and the coefficient in the case of (multiple) regression in the form of y = bx + u. You have to know that
With this, you can write a simple look that first adjusts u and then adjusts x. First we define some primaries:
n<- 1592
se_b1 <- 1.47
b1hat <- -19.93
set.seed(2)
x <- rnorm(n)
y_mean <- -19.93*x
# we are going to create a random variable to be the residuals
u <- rnorm(n,0,1)
error <- 1
tol <- 0.01
Note that the distributions of those are irrelevant to the answer. You may check that the mean of u is far from anything we might want. You may also check what happens once you run y <- y_mean + u summary(lm(y ~ x)) The coefficient and the standard error will be different from what you want. How to fix that? Using the two properties we mentioned above.
error <- 1
tol <- 0.01
while (error > tol) {
# remember that the se_b1 is constructed as the rood of the diagonal of sigma^2 * (X'X)^-1
# determine the matrix of X (assuming there is an intercept here)
X <- matrix(c(rep(1, n), x), ncol = 2)
XX_minus_one <- solve(t(X) %*% X)
# so far, we would get a standard deviation os
present_se <- sqrt(var(u) * XX_minus_one[2,2])
# this is different. Let's adjust the residuals to have the desired variance
u_fitting <- u * se_b1 / present_se
y <- y_mean + u
reg <- lm(y ~ x)
estimated_b1 <- reg$coefficients[2]
estimated_se_b1 <- summary(reg)$coefficients[2,2]
error <- max(abs(estimated_se_b1 - se_b1), abs(estimated_b1 - b1hat))
# but now we need to refit the x
x <- x * estimated_b1/b1hat
u <- u_fitting
}
You may check it is working:
summary(lm(y ~ x))