Factors and dummies in R regressions

Ok, I have two problems with dummies and factors; maybe they are related. I will use an example that is very similar to my database. I have 20 columns with several names, say, presidents of one country (e.g., "George W.", "Bill C.", etc). I also have 25 columns of strategies (e.g. "str_1", "str2", etc). They are all in the same database, say, "dat", together with other variables like y and x.

Example:

=============================
y  x  presidents  strategies
=============================
20 2   Bill.C      3_A
10 1   George.W    2_B
10 1   Tom_C       3_C
3  2   Tom_C       2_D
4  4   John.C      3_A
4  3   Bill.C      2_A

I would like to regress y ~ x + dummies for presidents + dummies for strategies + interactions between presidents and strategies.

I already created dummies for each one of the 20 presidents and the 25 strategies, but I don't know how to create the interactions between each president and each strategy (that's the first part of my problem). Supposing that I could do this easily, is there any other way to specify my regression without having to write 20*25 interactions one by one (I know Stata has some command for this same problem)?

Maybe those are separate questions, but I am not sure.

Thanks in advance.

Solution

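lm and glm automatically convert factor variables (and character columns, which are treated as factors) into their corresponding dummies, leaving one level out as the reference category. You don't need to build the dummies or their interactions by hand. With the simulated data frame df1 defined under "Data" at the end, it's sufficient to do the following: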

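# ':' adds only the interaction terms; '*' adds both main effects plus their interaction
mod1 = lm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
mod2 = lm(y ~ x + presidents*strategies, data = df1)
# glm() without a family argument uses gaussian(), i.e. the same linear model as lm()
mod3 = glm(y ~ x + presidents + strategies + presidents:strategies, data = df1)
mod4 = glm(y ~ x + presidents*strategies, data = df1)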

summary(mod1)
summary(mod2)
summary(mod3)
summary(mod4)
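
If you want to see what that automatic conversion produces, here is a minimal sketch using model.matrix(), which returns the design matrix that lm builds internally. The data frame is just the six example rows from the question (the name toy is my own), and the relevel() call at the end is one way to choose a different reference category. Treat it as an illustration, not as part of the models fitted above.

```r
# The six example rows from the question, with both columns stored as factors.
toy <- data.frame(
  y = c(20, 10, 10, 3, 4, 4),
  x = c(2, 1, 1, 2, 4, 3),
  presidents = factor(c("Bill.C", "George.W", "Tom_C", "Tom_C", "John.C", "Bill.C")),
  strategies = factor(c("3_A", "2_B", "3_C", "2_D", "3_A", "2_A"))
)

# model.matrix() expands the formula into the numeric columns used for fitting.
X <- model.matrix(y ~ x + presidents * strategies, data = toy)

# 1 intercept + 1 for x + 3 president dummies + 4 strategy dummies + 12 interactions
dim(X)

# The first level of each factor (Bill.C, 2_A) gets no column of its own:
# it is the reference category absorbed into the intercept.
colnames(X)[1:6]

# To use another reference category, relevel the factor before fitting.
toy$presidents <- relevel(toy$presidents, ref = "Tom_C")
levels(toy$presidents)
```

Every interaction column is simply the product of one president dummy and one strategy dummy, which is why none of them has to be typed out. The output of the four summary() calls above comes next.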

Result:

> summary(mod1)

Call:
lm(formula = y ~ x + presidents + strategies + presidents:strategies, 
    data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3690  -6.1273  -0.1699   6.4295  17.4156 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
x                                 -0.1692     0.2141  -0.790    0.431    
presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
presidentsTom_C                    4.9604     3.6271   1.368    0.173    
strategies2_B                      1.6203     3.5736   0.453    0.651    
strategies2_D                     -1.7246     3.6550  -0.472    0.638    
strategies3_A                      1.7663     3.2966   0.536    0.593    
strategies3_C                     -0.5787     3.8440  -0.151    0.881    
presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.364 on 179 degrees of freedom
Multiple R-squared:  0.064, Adjusted R-squared:  -0.04058 
F-statistic: 0.612 on 20 and 179 DF,  p-value: 0.9007

> summary(mod2)

Call:
lm(formula = y ~ x + presidents * strategies, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3690  -6.1273  -0.1699   6.4295  17.4156 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
x                                 -0.1692     0.2141  -0.790    0.431    
presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
presidentsTom_C                    4.9604     3.6271   1.368    0.173    
strategies2_B                      1.6203     3.5736   0.453    0.651    
strategies2_D                     -1.7246     3.6550  -0.472    0.638    
strategies3_A                      1.7663     3.2966   0.536    0.593    
strategies3_C                     -0.5787     3.8440  -0.151    0.881    
presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.364 on 179 degrees of freedom
Multiple R-squared:  0.064, Adjusted R-squared:  -0.04058 
F-statistic: 0.612 on 20 and 179 DF,  p-value: 0.9007

> summary(mod3)

Call:
glm(formula = y ~ x + presidents + strategies + presidents:strategies, 
    data = df1)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-17.3690   -6.1273   -0.1699    6.4295   17.4156  

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
x                                 -0.1692     0.2141  -0.790    0.431    
presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
presidentsTom_C                    4.9604     3.6271   1.368    0.173    
strategies2_B                      1.6203     3.5736   0.453    0.651    
strategies2_D                     -1.7246     3.6550  -0.472    0.638    
strategies3_A                      1.7663     3.2966   0.536    0.593    
strategies3_C                     -0.5787     3.8440  -0.151    0.881    
presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 69.96038)

    Null deviance: 13379  on 199  degrees of freedom
Residual deviance: 12523  on 179  degrees of freedom
AIC: 1439

Number of Fisher Scoring iterations: 2

> summary(mod4)

Call:
glm(formula = y ~ x + presidents * strategies, data = df1)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-17.3690   -6.1273   -0.1699    6.4295   17.4156  

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       14.4782     3.0799   4.701 5.15e-06 ***
x                                 -0.1692     0.2141  -0.790    0.431    
presidentsGeorge.W                11.1984     8.8283   1.268    0.206    
presidentsJohn.C                   4.1281     4.2305   0.976    0.330    
presidentsTom_C                    4.9604     3.6271   1.368    0.173    
strategies2_B                      1.6203     3.5736   0.453    0.651    
strategies2_D                     -1.7246     3.6550  -0.472    0.638    
strategies3_A                      1.7663     3.2966   0.536    0.593    
strategies3_C                     -0.5787     3.8440  -0.151    0.881    
presidentsGeorge.W:strategies2_B  -9.9934    10.0125  -0.998    0.320    
presidentsJohn.C:strategies2_B    -1.5192     5.8696  -0.259    0.796    
presidentsTom_C:strategies2_B     -0.8962     5.0202  -0.179    0.859    
presidentsGeorge.W:strategies2_D  -7.5266     9.7414  -0.773    0.441    
presidentsJohn.C:strategies2_D     1.7179     6.4375   0.267    0.790    
presidentsTom_C:strategies2_D     -1.1020     5.0551  -0.218    0.828    
presidentsGeorge.W:strategies3_A -11.9783     9.3115  -1.286    0.200    
presidentsJohn.C:strategies3_A    -2.8849     5.0866  -0.567    0.571    
presidentsTom_C:strategies3_A     -5.0305     4.4068  -1.142    0.255    
presidentsGeorge.W:strategies3_C  -6.5116     9.7387  -0.669    0.505    
presidentsJohn.C:strategies3_C    -4.3792     6.0389  -0.725    0.469    
presidentsTom_C:strategies3_C     -1.3257     5.3821  -0.246    0.806    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 69.96038)

    Null deviance: 13379  on 199  degrees of freedom
Residual deviance: 12523  on 179  degrees of freedom
AIC: 1439

Number of Fisher Scoring iterations: 2

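As you can see, the estimates are exactly the same in all four models. presidents*strategies is formula shorthand for presidents + strategies + presidents:strategies, and glm with its default gaussian family fits the same least-squares model as lm.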
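
Rather than comparing the printed tables by eye, the equivalence can also be checked in code. The sketch below uses a small simulated data set of my own (the names sim, a, b, long, short, gauss and main_only are all made up for the illustration) and finishes with a joint F-test of all interaction terms via anova(), which is often more useful than reading the individual interaction rows one at a time.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
sim <- data.frame(
  y = rnorm(120),
  x = rnorm(120),
  a = factor(rep(c("p1", "p2", "p3"), times = 40)),  # stands in for presidents
  b = factor(rep(c("s1", "s2"), each = 60))          # stands in for strategies
)

# a * b expands to a + b + a:b
attr(terms(y ~ x + a * b), "term.labels")  # "x" "a" "b" "a:b"

long  <- lm(y ~ x + a + b + a:b, data = sim)
short <- lm(y ~ x + a * b, data = sim)
gauss <- glm(y ~ x + a * b, data = sim)  # family defaults to gaussian()

all.equal(coef(long), coef(short))   # TRUE
all.equal(coef(short), coef(gauss))  # TRUE

# Joint test: do the interaction terms improve on the main-effects-only model?
main_only <- lm(y ~ x + a + b, data = sim)
anova(main_only, short)
```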

Data:

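# The six example rows from the question
df = read.table(text = "y  x  presidents  strategies
                20 2   Bill.C      3_A
                10 1   George.W    2_B
                10 1   Tom_C       3_C
                3  2   Tom_C       2_D
                4  4   John.C      3_A
                4  3   Bill.C      2_A", header = TRUE)

# Simulate 200 rows by resampling the president and strategy labels at random.
# Note: sample() changed in R 3.6.0, so the exact numbers can differ by R version.
set.seed(123)
df1 = data.frame(y = sample(1:30, 200, replace = TRUE),
                 x = sample(1:10, 200, replace = TRUE),
                 presidents = sample(df$presidents, 200, replace = TRUE),
                 strategies = sample(df$strategies, 200, replace = TRUE))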
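
One caveat for real data: df1 samples presidents and strategies independently, so every president is paired with every strategy. With 20 presidents and 25 strategies in an actual database, some pairs may never occur, and lm then reports NA for those interaction coefficients. The sketch below, on a tiny made-up data set (cells, counts and fit are hypothetical names), shows how table() can flag such empty cells before fitting.

```r
# Made-up data in which president "B" never uses strategy "s2".
cells <- data.frame(
  y = c(1, 2, 3, 4, 5, 6, 7, 8),
  presidents = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  strategies = factor(c("s1", "s1", "s2", "s2", "s1", "s1", "s1", "s1"))
)

# Cross-tabulate the two factors; a zero marks a combination that is never observed.
counts <- table(cells$presidents, cells$strategies)
counts
sum(counts == 0)  # number of empty president-by-strategy cells

# The interaction for the empty cell cannot be estimated and comes back as NA.
fit <- lm(y ~ presidents * strategies, data = cells)
coef(fit)
```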