I am trying to run a regression on a bootstrapped sample in R.
The original sample is a data frame (referred to as df) with hundreds of entries; the first few rows are shown below. y is the outcome variable, and treat is 0 or 1.
y treat
3 0
5 1
2 0
4 1
I have sampled with replacement to generate 900 observations from df$y.
set.seed(5)
b1 <- sample(df$y, 900, replace = TRUE, prob = NULL)
I have then run the following regression.
lm(b1 ~ treat, df)
When using the sample b1 as the outcome in the regression, does this automatically match up the correct value of b1 with the treat value from the original dataframe? If I want the outcome values in b1 to correspond to the correct treat value from the original dataframe, do I need to do something differently? How can I check that this is the regression I am trying to run?
We could sample on the sequence of rows instead of a single column. The OP's code samples only 'y', so b1 has 900 elements while 'treat' still has only as many elements as df has rows (4 in the example data below). When we apply the formula method, this results in an error because the variables have different lengths.
lm(b1 ~ treat, df)
Error in model.frame.default(formula = b1 ~ treat, data = df, drop.unused.levels = TRUE) : variable lengths differ (found for 'treat')
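Note that even if the lengths happened to match, lm() would pair b1 with treat purely by position, so sampling 'y' on its own would not keep each outcome with its own 'treat' value. A minimal sketch of this, assuming the 4-row example data and using the hypothetical names paired and original_treat for illustration (the match() lookup only works here because the example 'y' values are unique):

```r
# Example data (4 rows, as in the question)
df <- data.frame(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L))

set.seed(5)
# Sample y alone, with the same length as df so that lm() would not error
b1 <- sample(df$y, nrow(df), replace = TRUE)

# lm(b1 ~ treat, df) would pair b1[i] with df$treat[i] by position only
paired <- data.frame(b1 = b1, treat = df$treat)
print(paired)

# Look up the treat value each sampled y had in the original data
# (assumption: y values are unique, which holds only for this toy example)
original_treat <- df$treat[match(b1, df$y)]

# TRUE wherever the positional pairing disagrees with the original pairing
mismatch <- original_treat != df$treat
print(mismatch)
```

Any TRUE in mismatch marks a sampled outcome that has been attached to the wrong 'treat' value.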
Instead, we sample on the sequence of rows, so that each 'y' stays with its own 'treat' value:
set.seed(5)
df1 <- df[sample(seq_len(nrow(df)), 900, replace = TRUE),]
lm(y ~ treat, df1)
# Example data used above
df <- structure(list(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L
)), class = "data.frame", row.names = c(NA, -4L))
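To check that this is the intended regression, one option is to keep the sampled row indices and compare the resampled rows against the original ones. The following is only a sketch on the example data; idx, pairs_ok, tab and fit are names introduced here for illustration:

```r
# Example data (4 rows, as above)
df <- data.frame(y = c(3L, 5L, 2L, 4L), treat = c(0L, 1L, 0L, 1L))

set.seed(5)
# Keep the sampled row indices so they can be inspected afterwards
idx <- sample(seq_len(nrow(df)), 900, replace = TRUE)
df1 <- df[idx, ]

# Every resampled row should equal the original row it was drawn from
pairs_ok <- all(df1$y == df$y[idx]) && all(df1$treat == df$treat[idx])
print(pairs_ok)

# Cross-tabulate y against treat: only combinations present in df
# should have non-zero counts
tab <- table(y = df1$y, treat = df1$treat)
print(tab)

# Fit the regression on the resampled rows
fit <- lm(y ~ treat, data = df1)
print(coef(fit))
```

If pairs_ok is TRUE and the table shows only (y, treat) combinations that exist in df, the outcome values are still matched with their original 'treat' values.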