r regression interaction

Factors and dummies in R regressions


I have two problems, possibly related, with dummies and factors. I will use an example that closely mirrors my database. I have 20 columns named after the presidents of one country (e.g., "George W.", "Bill C.", etc.) and 25 columns of strategies (e.g. "str_1", "str2", etc.). They are all in the same database, say "dat", together with other variables such as y and x.

example

============================
y  x  presidents  strategies
============================
20 2   Bill.C      3_A
10 1   George.W    2_B
10 1   Tom_C       3_C
3  2   Tom_C       2_D
4  4   John.C      3_A
4  3   Bill.C      2_A

I would like to regress y ~ x + dummies for presidents + dummies for strategies + interactions between presidents and strategies.

I have already created dummies for each of the 20 presidents and the 25 strategies, but I don't know how to create the interactions between each president and each strategy (that's the first part of my problem). And even if I could create them easily, is there a way to specify the regression without writing out all 20*25 interactions one by one (I know Stata has a command for this)?

Maybe those are separate questions, but I am not sure.

Thanks in advance.


Solution

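  • lm and glm automatically convert factor (or character) variables into their corresponding dummies, leaving one level out as the reference category, so there is no need to build the dummies or the interactions by hand. In a formula, presidents:strategies adds the interaction terms, and presidents*strategies is shorthand for presidents + strategies + presidents:strategies. So it's sufficient to do the following: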

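    # ':' adds only the interaction terms, so the main effects are listed explicitly
    mod1 = lm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
    # '*' expands to main effects plus interaction, i.e. the same model as mod1
    mod2 = lm(y ~ x + presidents*strategies, data = df1)
    # glm() defaults to family = gaussian, which reproduces the lm() fits
    mod3 = glm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
    mod4 = glm(y ~ x + presidents*strategies, data = df1)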
    
    summary(mod1)
    summary(mod2)
    summary(mod3)
    summary(mod4)
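
To see what "converts factor variables to their corresponding dummies" means in practice, the design matrix can be inspected directly. This is only a minimal sketch: it assumes the six example rows from the question are stored in a data frame called toy (a name made up here) and uses the base R functions model.matrix(), levels() and relevel().

```r
# The six example rows from the question (y is not needed to build the design matrix).
toy <- data.frame(
  x          = c(2, 1, 1, 2, 4, 3),
  presidents = factor(c("Bill.C", "George.W", "Tom_C", "Tom_C", "John.C", "Bill.C")),
  strategies = factor(c("3_A", "2_B", "3_C", "2_D", "3_A", "2_A"))
)

# model.matrix() returns the dummy-coded design matrix that lm()/glm() build internally.
mm <- model.matrix(~ x + presidents * strategies, data = toy)
colnames(mm)  # intercept, x, 3 president dummies, 4 strategy dummies, 12 interactions
ncol(mm)      # 21

# The first factor level (alphabetical by default) is the omitted reference category.
levels(toy$presidents)[1]  # "Bill.C"

# relevel() chooses a different reference category before fitting.
toy$presidents <- relevel(toy$presidents, ref = "Tom_C")
levels(toy$presidents)[1]  # "Tom_C"
```

With 4 presidents and 5 strategies this gives (4 - 1) * (5 - 1) = 12 interaction columns, which matches the 12 interaction coefficients in the summaries.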
    

    Result:

    > summary(mod1)
    
    Call:
    lm(formula = y ~ x + presidents + strategies + presidents:strategies, 
        data = df1)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -17.3690  -6.1273  -0.1699   6.4295  17.4156 
    
    Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
    x                                 -0.1692     0.2141  -0.790    0.431    
    presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
    presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
    presidentsTom_C                    4.9604     3.6271   1.368    0.173    
    strategies2_B                      1.6203     3.5736   0.453    0.651    
    strategies2_D                     -1.7246     3.6550  -0.472    0.638    
    strategies3_A                      1.7663     3.2966   0.536    0.593    
    strategies3_C                     -0.5787     3.8440  -0.151    0.881    
    presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
    presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
    presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
    presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
    presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
    presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
    presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
    presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
    presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
    presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
    presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
    presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 8.364 on 179 degrees of freedom
    Multiple R-squared:  0.064, Adjusted R-squared:  -0.04058 
    F-statistic: 0.612 on 20 and 179 DF,  p-value: 0.9007
    
    > summary(mod2)
    
    Call:
    lm(formula = y ~ x + presidents * strategies, data = df1)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -17.3690  -6.1273  -0.1699   6.4295  17.4156 
    
    Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
    x                                 -0.1692     0.2141  -0.790    0.431    
    presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
    presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
    presidentsTom_C                    4.9604     3.6271   1.368    0.173    
    strategies2_B                      1.6203     3.5736   0.453    0.651    
    strategies2_D                     -1.7246     3.6550  -0.472    0.638    
    strategies3_A                      1.7663     3.2966   0.536    0.593    
    strategies3_C                     -0.5787     3.8440  -0.151    0.881    
    presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
    presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
    presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
    presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
    presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
    presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
    presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
    presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
    presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
    presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
    presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
    presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 8.364 on 179 degrees of freedom
    Multiple R-squared:  0.064, Adjusted R-squared:  -0.04058 
    F-statistic: 0.612 on 20 and 179 DF,  p-value: 0.9007
    
    > summary(mod3)
    
    Call:
    glm(formula = y ~ x + presidents + strategies + presidents:strategies, 
        data = df1)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -17.3690   -6.1273   -0.1699    6.4295   17.4156  
    
    Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
    x                                 -0.1692     0.2141  -0.790    0.431    
    presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
    presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
    presidentsTom_C                    4.9604     3.6271   1.368    0.173    
    strategies2_B                      1.6203     3.5736   0.453    0.651    
    strategies2_D                     -1.7246     3.6550  -0.472    0.638    
    strategies3_A                      1.7663     3.2966   0.536    0.593    
    strategies3_C                     -0.5787     3.8440  -0.151    0.881    
    presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
    presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
    presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
    presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
    presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
    presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
    presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
    presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
    presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
    presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
    presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
    presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for gaussian family taken to be 69.96038)
    
        Null deviance: 13379  on 199  degrees of freedom
    Residual deviance: 12523  on 179  degrees of freedom
    AIC: 1439
    
    Number of Fisher Scoring iterations: 2
    
    > summary(mod4)
    
    Call:
    glm(formula = y ~ x + presidents * strategies, data = df1)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -17.3690   -6.1273   -0.1699    6.4295   17.4156  
    
    Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
    x                                 -0.1692     0.2141  -0.790    0.431    
    presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
    presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
    presidentsTom_C                    4.9604     3.6271   1.368    0.173    
    strategies2_B                      1.6203     3.5736   0.453    0.651    
    strategies2_D                     -1.7246     3.6550  -0.472    0.638    
    strategies3_A                      1.7663     3.2966   0.536    0.593    
    strategies3_C                     -0.5787     3.8440  -0.151    0.881    
    presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
    presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
    presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
    presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
    presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
    presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
    presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
    presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
    presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
    presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
    presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
    presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for gaussian family taken to be 69.96038)
    
        Null deviance: 13379  on 199  degrees of freedom
    Residual deviance: 12523  on 179  degrees of freedom
    AIC: 1439
    
    Number of Fisher Scoring iterations: 2
    

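    As you can see, the estimates are exactly the same in all four models: the * formula expands to the same terms as the explicit one, and glm with its default gaussian family fits the same linear model as lm.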
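
The equality of the estimates can also be checked programmatically rather than by eye. The sketch below is an illustration on freshly simulated data, not the df1 used above; the data frame sim and the model names long, short and gauss are made up for this example.

```r
set.seed(1)  # arbitrary seed; the data are purely simulated
sim <- data.frame(
  y          = rnorm(200),
  x          = rnorm(200),
  presidents = sample(c("Bill.C", "George.W", "John.C", "Tom_C"), 200, replace = TRUE),
  strategies = sample(c("2_A", "2_B", "2_D", "3_A", "3_C"), 200, replace = TRUE)
)

# Explicit main effects plus interaction, and the '*' shorthand.
long  <- lm(y ~ x + presidents + strategies + presidents:strategies, data = sim)
short <- lm(y ~ x + presidents * strategies, data = sim)

# glm() with its default gaussian family (identity link).
gauss <- glm(y ~ x + presidents * strategies, data = sim)

# all.equal() compares names and values up to numerical tolerance.
all.equal(coef(long), coef(short))
all.equal(coef(short), coef(gauss))
```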

    Data:

    df = read.table(text = "y  x  presidents  strategies
                    20 2   Bill.C      3_A
                    10 1   George.W    2_B
                    10 1   Tom_C       3_C
                    3  2   Tom_C       2_D
                    4  4   John.C      3_A
                    4  3   Bill.C      2_A", header = TRUE)
    
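    # Simulate 200 rows by resampling the labels above; y and x are random draws,
    # so no real effects are expected in the fitted models
    set.seed(123)
    df1 = data.frame(y = sample(1:30, 200, replace = TRUE),
                     x = sample(1:10, 200, replace = TRUE),
                     presidents = sample(df$presidents, 200, replace = TRUE),
                     strategies = sample(df$strategies, 200, replace = TRUE))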
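
The formulas above need a single presidents column and a single strategies column. If only hand-made dummy columns are available, as the question describes, they can be collapsed back into one factor first. This is a rough sketch under the assumption that exactly one dummy per row equals 1; the data frame dummies and the pres_ column prefix are hypothetical names chosen for illustration.

```r
# Hypothetical one-hot columns for the six example rows; the names are made up.
dummies <- data.frame(
  pres_Bill.C   = c(1, 0, 0, 0, 0, 1),
  pres_George.W = c(0, 1, 0, 0, 0, 0),
  pres_Tom_C    = c(0, 0, 1, 1, 0, 0),
  pres_John.C   = c(0, 0, 0, 0, 1, 0)
)

# max.col() returns, for each row, the index of the column holding the largest value (the 1).
idx <- max.col(as.matrix(dummies), ties.method = "first")

# Strip the prefix from the matching column name and store the labels as a factor.
presidents <- factor(sub("^pres_", "", names(dummies)[idx]))
presidents
```

The same steps applied to the strategy dummies would give the second factor, after which the lm calls above can be used unchanged.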