Search code examples
rregressionlinear-regressionstandard-error

How to reverse engineer a dataset?


My teacher gave me the result of a regression and the exercise is to reverse engineer the data set that lead to that regression. Then we need to do a regression on it and find the exact same results.

Regression result

I managed to get the coefficient to -19.93 but I don't get the same SE at all. I don't know how to do it. I don't know if I should use some formulas linking the SE of the estimator and the standard error of the regression (I have some but I don't really see the way to implement them in R)...Thanks in advance for your help!

My R output:

## Given values
n <- 1592
se_β1 <- 1.47
β1hat <- -19.93

## Create a dummy variable for control vs treatment condition
set.seed(123)
Low_anchor <- rbinom(n,1,0.5)

## Formula of standard error of beta 1 (assuming homoskedasticity)
calculate_standard_error <- function(u, Low_anchor) {
  sqrt((1/(n - 2))*sum(u^2)/(n*sd(Low_anchor)^2))
}

## Define initial values of u
u <- rnorm(n)

## Tolerance for convergence
tolerance <- 0.1

## Iteratively adjust u until the standard error matches the target
while (abs(calculate_standard_error(u, Low_anchor) - se_β1) > tolerance) {
  ## Generate new set of values for u from a normal distribution
  u <- rnorm(n)
}

print(u)

## regression
Yc <- -19.93*Low_anchor + u
model1 <- lm(Yc ~ Low_anchor - 1)

## Print the summary of the model
summary(model1)

Solution

  • It looks like Nc is not defined. I think you are calling the independent variable. Here I will use x. Note that this question seems to ask you to use the properties of the standard error and the coefficient in the case of (multiple) regression in the form of y = bx + u. You have to know that

    1. If you multiply x by some number a, the estimated coefficient will be b/a.
    2. If you multiply u by some number a, the estimated standard error will be a*s.

    With this, you can write a simple look that first adjusts u and then adjusts x. First we define some primaries:

    n<- 1592
    se_b1 <- 1.47
    b1hat <- -19.93
    
    set.seed(2)
    x <- rnorm(n)
    y_mean <- -19.93*x
    
    # we are going to create a random variable to be the residuals
    u <- rnorm(n,0,1)
    
    error <- 1
    tol <- 0.01
    

    Note that the distributions of those are irrelevant to the answer. You may check that the mean of u is far from anything we might want. You may also check what happens once you run y <- y_mean + u summary(lm(y ~ x)) The coefficient and the standard error will be different from what you want. How to fix that? Using the two properties we mentioned above.

    error <- 1
    tol <- 0.01
    
    while (error > tol) {
      # remember that the se_b1 is constructed as the rood of the diagonal of sigma^2 * (X'X)^-1
      
      # determine the matrix of X (assuming there is an intercept here)
      X <- matrix(c(rep(1, n), x), ncol = 2)
      XX_minus_one <- solve(t(X) %*% X)
      
      # so far, we would get a standard deviation os
      present_se <- sqrt(var(u) * XX_minus_one[2,2])
      # this is different.  Let's adjust the residuals to have the desired variance
      u_fitting <- u * se_b1 / present_se
      
      y <- y_mean + u
      reg <- lm(y ~ x)
      
      estimated_b1 <- reg$coefficients[2]
      estimated_se_b1 <- summary(reg)$coefficients[2,2]
      
      error <- max(abs(estimated_se_b1 - se_b1), abs(estimated_b1 - b1hat))
      
      # but now we need to refit the x 
      x <- x *  estimated_b1/b1hat
      u <- u_fitting
    }
    

    You may check it is working:

    summary(lm(y ~ x))