
Is there a way to change the way R labels the interaction parameter in model output?


I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the generated parameter "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CIs where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?

Here is a sample of my code:

wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist,
                    family = binomial(logexp(wemedata$expos)), data = wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist = mean(sitedist), roaddist = mean(roaddist), type = factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))

This produces a table of predicted values for the variables at the specified levels, but not for the interaction.


Solution

  • For your newdata data frame, you shouldn't include columns for the interactions. The product of the interacting variables will be calculated for you (and multiplied by the estimated coefficient) when you call predict.

    For example:

    1. Create some dummy data:

      set.seed(1)
      n <- 10000
      X <- data.frame(x1=runif(n), x2=runif(n))
      X$x1x2 <- X$x1 * X$x2  # pre-computed product: used to simulate y, and in the alternative approach below
      
      head(X)
      #          x1         x2        x1x2
      # 1 0.2655087 0.06471249 0.017181728
      # 2 0.3721239 0.67661240 0.251783646
      # 3 0.5728534 0.73537169 0.421260147
      # 4 0.9082078 0.11129967 0.101083225
      # 5 0.2016819 0.04665462 0.009409393
      # 6 0.8983897 0.13091031 0.117608474
      
      b <- runif(4)  # true coefficients: intercept, x1, x2, x1:x2
      y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)  # simulate the response
      
    2. Fit the model and compare the estimated vs. true coefficients:

      M <- lm(y ~ x1 * x2, X)
      summary(M)
      
      # Call:
      # lm(formula = y ~ x1 * x2, data = X)
      # 
      # Residuals:
      #      Min       1Q   Median       3Q      Max 
      # -0.43208 -0.06743 -0.00170  0.06601  0.37197 
      # 
      # Coefficients:
      #             Estimate Std. Error t value Pr(>|t|)    
      # (Intercept) 0.202040   0.003906   51.72   <2e-16 ***
      # x1          0.128237   0.006809   18.83   <2e-16 ***
      # x2          0.156942   0.006763   23.21   <2e-16 ***
      # x1:x2       0.292582   0.011773   24.85   <2e-16 ***
      # ---
      # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      # 
      # Residual standard error: 0.09906 on 9996 degrees of freedom
      # Multiple R-squared:  0.5997,  Adjusted R-squared:  0.5996 
      # F-statistic:  4992 on 3 and 9996 DF,  p-value: < 2.2e-16
      
      b
      # [1] 0.2106027 0.1147864 0.1453641 0.3099322
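
    To line the estimated and true values up by name (a small convenience sketch; coef(M) and b are in the same order, so rbind aligns them):

      rbind(estimated = coef(M), true = b)
      # column names (Intercept), x1, x2, x1:x2 are taken from coef(M)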
      
    3. Create example data to predict to, and carry out the prediction. Note that we only create x1 and x2; we do not create x1:x2:

      X.predict <- data.frame(x1=runif(10), x2=runif(10))
      
      head(X.predict)
      #           x1        x2
      # 1 0.26037592 0.7652155
      # 2 0.73988333 0.3352932
      # 3 0.02650689 0.9788743
      # 4 0.84083874 0.1446228
      # 5 0.85052685 0.7674547
      # 6 0.13568509 0.9612156
      
      predict(M, newdata=X.predict)
      
      #         1         2         3         4         5         6         7 
      # 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018 
      #         8         9        10 
      # 0.7092744 0.3401867 0.2320834 
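
    To confirm that predict really did compute the product term from x1 and x2, you can rebuild the predictions by hand (a quick sanity check, reusing M and X.predict from above):

      cf <- coef(M)
      manual <- cf["(Intercept)"] + cf["x1"] * X.predict$x1 +
        cf["x2"] * X.predict$x2 + cf["x1:x2"] * X.predict$x1 * X.predict$x2
      all.equal(unname(manual), unname(predict(M, newdata = X.predict)))
      # should be TRUE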
      

    Or...

    An alternative approach is to include the interaction in your model-fitting data by calculating the product of the interacting terms yourself, and then including this product in your new data as well. We did the first step in point 1 above, where we created a column called x1x2.

    Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)

    And predict to the following data. Note that x1x2 must be the product of x1 and x2, not an independent random draw:

    X.predict <- data.frame(x1=runif(10), x2=runif(10))
    X.predict$x1x2 <- X.predict$x1 * X.predict$x2
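
    Putting the alternative together (a short sketch reusing y, X, and M from above; the two parameterisations describe the same model, so their predictions should agree):

    M2 <- lm(y ~ x1 + x2 + x1x2, X)  # same model as y ~ x1 * x2, different term names
    all.equal(unname(predict(M2, newdata = X.predict)),
              unname(predict(M, newdata = X.predict)))
    # should be TRUE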
    

    If you have categorical variables involved in interactions...

    When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance, if we have one continuous predictor (x1) and one categorical predictor (x2, with levels a, b, and c), then the model y ~ x1 * x2 will estimate six coefficients (illustrated in the sketch after the list below), describing:

    1. the intercept (i.e. the predicted y when x1 is zero and the observation belongs to the reference level of x2);

    2. the effect of varying x1 when the observation belongs to the reference level of x2 (i.e. the slope, for the reference level of x2);

    3. the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);

    4. the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);

    5. the change in the effect of x1 (i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and

    6. the change in the effect of x1 (i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
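
    Here is a minimal, self-contained sketch (with made-up data) showing those six coefficients by name:

    set.seed(2)
    ex <- data.frame(x1 = runif(30), x2 = factor(rep(c("a", "b", "c"), 10)))
    ex$y <- rnorm(30)
    names(coef(lm(y ~ x1 * x2, ex)))
    # [1] "(Intercept)" "x1"          "x2b"         "x2c"         "x1:x2b"      "x1:x2c"
    # e.g. the fitted slope of x1 for level b is coef(...)["x1"] + coef(...)["x1:x2b"]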

    If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1; x2b (binary, indicating whether the observation belongs to level b); x2c (binary, indicating whether the observation belongs to level c); x1x2b (the product of x1 and x2b); and x1x2c (the product of x1 and x2c).

    A quick way to do this is with model.matrix:

    set.seed(1)
    n <- 1000
    d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
    
    head(d)
    #          x1 x2
    # 1 0.2655087  b
    # 2 0.3721239  c
    # 3 0.5728534  b
    # 4 0.9082078  c
    # 5 0.2016819  a
    # 6 0.8983897  a
    
    X <- model.matrix(~x1*x2, d)
    
    head(X)
    #   (Intercept)        x1 x2b x2c    x1:x2b    x1:x2c
    # 1           1 0.2655087   1   0 0.2655087 0.0000000
    # 2           1 0.3721239   0   1 0.0000000 0.3721239
    # 3           1 0.5728534   1   0 0.5728534 0.0000000
    # 4           1 0.9082078   0   1 0.0000000 0.9082078
    # 5           1 0.2016819   0   0 0.0000000 0.0000000
    # 6           1 0.8983897   0   0 0.0000000 0.0000000
    
    b <- rnorm(6) # coefficients
    y <- c(X %*% b) + rnorm(n, sd=0.1)  # c() keeps y a plain vector rather than an n x 1 matrix
    

    You can rename the columns of X to whatever you want, as long as you use consistent naming when predicting the model to new data later.
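
    For example, to avoid the ":" character that caused trouble in the original question, you could use syntactic names (a hypothetical renaming, done on a copy here so the default names are kept in the output below):

    X.renamed <- X
    colnames(X.renamed) <- c("intercept", "x1", "x2b", "x2c", "x1_x2b", "x1_x2c")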

    Now fit the model. Here I tell lm not to calculate an intercept (with -1), since the variable (Intercept) already exists in X and will have a coefficient calculated for it. Alternatively, we could have dropped the (Intercept) column and let lm estimate its own intercept, by fitting to as.data.frame(X[, -1]):

    (M <- lm(y ~ . - 1, as.data.frame(X)))
    
    # Call:
    # lm(formula = y ~ . - 1, data = as.data.frame(X))
    # 
    # Coefficients:
    # `(Intercept)`          x1         x2b         x2c    `x1:x2b`    `x1:x2c`  
    #       1.14389     1.09168    -0.88879     0.20405     0.09085    -1.63769  
    

    Create some new data to predict to, and carry out the prediction:

    d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
    X.predict <- model.matrix(~x1*x2, d.predict)
    y.predict <- predict(M, as.data.frame(X.predict))
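
    As a final sanity check (a short sketch), the same predictions should come from the plain formula interface, since X and X.predict were built from the same formula:

    M.formula <- lm(y ~ x1 * x2, d)
    all.equal(unname(y.predict), unname(predict(M.formula, newdata = d.predict)))
    # should be TRUE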