I have generated a random df with the continuous variables V1, V2, and V3:
set.seed(123)
sigma <- rbind(c(1,-0.4,-0.6), c(-0.4,1,0.5), c(-0.6,0.5,1))
mu <- c(5, 3, 2)
df <- as.data.frame(MASS::mvrnorm(n=1000, mu=mu, Sigma=sigma))
Now I would like to generate an interaction effect between two variables in predicting a third one. For example, there should be an interaction between V1 and V2 in predicting V3. Is there a way to adjust the data so that I get a pre-defined estimate for the interaction term V1:V2 when computing summary(lm(V3 ~ V1*V2, data=df))?
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.419951   0.376848   9.075  < 2e-16 ***
V1          -0.472042   0.069703  -6.772 2.16e-11 ***
V2           0.328102   0.112497   2.917  0.00362 **
V1:V2       -0.003878   0.021811  -0.178  0.85892
Note: I want to use the generated df for a tutorial.
This is a linear regression model, so just write a linear equation with the coefficients you want and add a small amount of noise:
set.seed(123)
sigma <- rbind(c(1,-0.4,-0.6), c(-0.4,1,0.5), c(-0.6,0.5,1))
mu <- c(5, 3, 2)
df <- as.data.frame(MASS::mvrnorm(n=10000, mu=mu, Sigma=sigma))
df$V4 <- 1 * df$V1 - 2 * df$V2 + pi * df$V1 * df$V2 + rnorm(nrow(df))
mod <- lm(V4 ~ V1 * V2, df)
summary(mod)
#
# Call:
# lm(formula = V4 ~ V1 * V2, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -4.1343 -0.6755  0.0073  0.6741  3.9650
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.183117   0.160119   1.144    0.253
# V1           0.953835   0.029612  32.211   <2e-16 ***
# V2          -2.041455   0.047475 -43.000   <2e-16 ***
# V1:V2        3.153014   0.009226 341.749   <2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.999 on 9996 degrees of freedom
# Multiple R-squared: 0.9949, Adjusted R-squared: 0.9949
# F-statistic: 6.515e+05 on 3 and 9996 DF, p-value: < 2.2e-16
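With random noise the estimates only land close to the coefficients you chose. If the tutorial needs the estimates to match exactly, one possible variation is to make the noise orthogonal to the predictors before adding it, by taking the residuals of the noise regressed on the same model terms. The sketch below assumes that approach; the values in target and the names target, X, noise_orth and fit are made up for illustration and are not part of the code above.

```r
set.seed(123)
sigma <- rbind(c(1, -0.4, -0.6), c(-0.4, 1, 0.5), c(-0.6, 0.5, 1))
mu <- c(5, 3, 2)
df <- as.data.frame(MASS::mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# Hypothetical target coefficients, in the order: intercept, V1, V2, V1:V2
target <- c(0, 1, -2, 0.5)

# Design matrix with the same columns that lm(V4 ~ V1 * V2) will use
X <- model.matrix(~ V1 * V2, data = df)

# Keep only the part of the noise that is orthogonal to the design matrix,
# so that it cannot shift any of the least-squares estimates
noise <- rnorm(nrow(df))
noise_orth <- resid(lm(noise ~ V1 * V2, data = df))

# Build the outcome from the target coefficients plus the orthogonal noise
df$V4 <- as.vector(X %*% target) + noise_orth

fit <- lm(V4 ~ V1 * V2, data = df)
coef(fit)
```

Because least-squares residuals are orthogonal to every column of the design matrix, the fitted coefficients should reproduce target up to floating-point error, while the standard errors still reflect the spread of the noise you added.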