R: Generate dataset for pre-defined interaction effects


I have generated a random df with the continuous variables V1, V2, and V3:

set.seed(123)
sigma <- rbind(c(1,-0.4,-0.6), c(-0.4,1,0.5), c(-0.6,0.5,1))
mu <- c(5, 3, 2)
df <- as.data.frame(MASS::mvrnorm(n=1000, mu=mu, Sigma=sigma))

Now I would like to build an interaction effect between two of the variables into the prediction of the third. For example, V1 and V2 should interact in predicting V3. Is there a way to adjust the data so that the V1:V2 interaction term has a pre-defined estimate when computing summary(lm(V3 ~ V1*V2, data=df))?

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.419951   0.376848   9.075  < 2e-16 ***
V1          -0.472042   0.069703  -6.772 2.16e-11 ***
V2           0.328102   0.112497   2.917  0.00362 ** 
V1:V2       -0.003878   0.021811  -0.178  0.85892    

Note: I want to use the generated df for a tutorial.


Solution

  • This is a linear regression model, so just write a linear equation with the coefficients you want and add a small amount of noise (the response is stored here in a new column, V4):

    set.seed(123)
    sigma <- rbind(c(1,-0.4,-0.6), c(-0.4,1,0.5), c(-0.6,0.5,1))
    mu <- c(5, 3, 2)
    df <- as.data.frame(MASS::mvrnorm(n=10000, mu=mu, Sigma=sigma))
    df$V4 <- 1 * df$V1 - 2 * df$V2 + pi * df$V1 * df$V2 + rnorm(nrow(df))
    
    mod <- lm(V4 ~ V1 * V2, df)
    summary(mod)
    # 
    # Call:
    # lm(formula = V4 ~ V1 * V2, data = df)
    # 
    # Residuals:
    #     Min      1Q  Median      3Q     Max 
    # -4.1343 -0.6755  0.0073  0.6741  3.9650 
    # 
    # Coefficients:
    #              Estimate Std. Error t value Pr(>|t|)    
    # (Intercept)  0.183117   0.160119   1.144    0.253    
    # V1           0.953835   0.029612  32.211   <2e-16 ***
    # V2          -2.041455   0.047475 -43.000   <2e-16 ***
    # V1:V2        3.153014   0.009226 341.749   <2e-16 ***
    # ---
    # Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # 
    # Residual standard error: 0.999 on 9996 degrees of freedom
    # Multiple R-squared:  0.9949,  Adjusted R-squared:  0.9949 
    # F-statistic: 6.515e+05 on 3 and 9996 DF,  p-value: < 2.2e-16
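
The estimates above are close to the chosen coefficients (1, -2, pi) but not identical, because the added noise is only approximately uncorrelated with the predictors in any finite sample. If the tutorial needs the estimates to match the targets exactly, one option is to replace the raw noise with its residuals from a regression on the same design matrix, so that the noise is exactly orthogonal to the predictors. The following is a minimal sketch of that idea, not part of the original answer: the target vector `beta` holds hypothetical values chosen for illustration, and V3 is overwritten, so its correlations with V1 and V2 no longer follow the original `sigma`.

```r
set.seed(123)
sigma <- rbind(c(1, -0.4, -0.6), c(-0.4, 1, 0.5), c(-0.6, 0.5, 1))
mu <- c(5, 3, 2)
df <- as.data.frame(MASS::mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# Hypothetical target coefficients (assumed for illustration), in the order
# lm() reports them: (Intercept), V1, V2, V1:V2
beta <- c(3.4, -0.5, 0.3, 0.25)

# Design matrix with the same columns that lm(V3 ~ V1 * V2) will use
X <- model.matrix(~ V1 * V2, data = df)

# Draw raw noise, then keep only the part orthogonal to the design matrix
noise <- rnorm(nrow(df))
e <- lm.fit(X, noise)$residuals

# Overwrite V3 so that it follows the chosen linear equation plus that noise
df$V3 <- as.vector(X %*% beta) + e

fit <- lm(V3 ~ V1 * V2, data = df)
coef(fit)  # should match beta up to floating-point error
```

Because `e` is orthogonal to every column of `X`, ordinary least squares recovers `beta` exactly; the standard errors and p-values still depend on the size of the noise, so scale `noise` up or down to make the interaction more or less clearly significant.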