Regression with dummy variable multiplied by another


What is the difference between approach 1 and approach 2 below? I thought that 'I()' would let us multiply two variables without including an interaction term, but here it does not work as expected. Do I understand correctly that the second approach also takes into account the three zeros (the non-USA rows)? So the model is built on 6 points instead of 3. Can we somehow fix it?

df <- data.frame(
                 Salary=c(5, 1:2,4,1:2),
                 Variable1=c(500,490,501,460,490,505),
                 Variable2=c(5,10,0,3,17,40),
                 Country=c(rep("USA",3),rep("RPA",3)),
                 Dummy_USA=c(rep(1,3), rep(0,3))
)

# Approach 1 (needs dplyr for %>% and filter())
library(dplyr)
summary(lm(Salary~Variable1, df %>% filter(Country=="USA")))

# Approach 2
summary(lm(Salary~I(Variable1*Dummy_USA), df))

Solution

  • Yes, the second version simply regresses the vector c(5, 1, 2, 4, 1, 2) on the vector c(500, 490, 501, 0, 0, 0), so the three non-USA rows stay in the fit with a predictor value of 0. This is very different from the first version, which regresses the vector c(5, 1, 2) on the vector c(500, 490, 501).

    If you want to use the dummy variable, you could pass it either to the subset argument of lm (as a logical condition) or to the weights argument (rows with weight 0 contribute nothing to the fit).

    with(df, summary(lm(Salary ~ Variable1, subset = Dummy_USA == 1)))
    #> 
    #> Call:
    #> lm(formula = Salary ~ Variable1, subset = Dummy_USA == 1)
    #> 
    #> Residuals:
    #>       1       2       3 
    #>  1.6847 -0.1532 -1.5315 
    #> 
    #> Coefficients:
    #>              Estimate Std. Error t value Pr(>|t|)
    #> (Intercept) -104.7928   131.8453  -0.795    0.572
    #> Variable1      0.2162     0.2653   0.815    0.565
    #> 
    #> Residual standard error: 2.282 on 1 degrees of freedom
    #> Multiple R-squared:  0.3992, Adjusted R-squared:  -0.2017 
    #> F-statistic: 0.6644 on 1 and 1 DF,  p-value: 0.5646
    

    or

    with(df, summary(lm(Salary ~ Variable1, weights = Dummy_USA)))
    #> 
    #> Call:
    #> lm(formula = Salary ~ Variable1, weights = Dummy_USA)
    #> 
    #> Weighted Residuals:
    #>       1       2       3       4       5       6 
    #>  1.6847 -0.1532 -1.5315  0.0000  0.0000  0.0000 
    #> 
    #> Coefficients:
    #>              Estimate Std. Error t value Pr(>|t|)
    #> (Intercept) -104.7928   131.8453  -0.795    0.572
    #> Variable1      0.2162     0.2653   0.815    0.565
    #> 
    #> Residual standard error: 2.282 on 1 degrees of freedom
    #> Multiple R-squared:  0.3992, Adjusted R-squared:  -0.2017 
    #> F-statistic: 0.6644 on 1 and 1 DF,  p-value: 0.5646
    

    Created on 2023-03-20 with reprex v2.0.2
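
As a quick sanity check on the explanation above, the following sketch rebuilds the data frame from the question, shows the single regressor that approach 2 actually passes to lm, and compares how many rows each fit uses. The object names (x2, fit_approach2, fit_subset, fit_weights) are made up for illustration and do not appear in the original post. It uses only base R.

```r
# Rebuild the data frame from the question.
df <- data.frame(
  Salary    = c(5, 1:2, 4, 1:2),
  Variable1 = c(500, 490, 501, 460, 490, 505),
  Variable2 = c(5, 10, 0, 3, 17, 40),
  Country   = c(rep("USA", 3), rep("RPA", 3)),
  Dummy_USA = c(rep(1, 3), rep(0, 3))
)

# The regressor that I(Variable1 * Dummy_USA) produces: non-USA rows
# are set to 0 but are still rows in the data passed to lm().
x2 <- with(df, Variable1 * Dummy_USA)
print(x2)

# Approach 2 from the question, plus the two fixes from the answer.
fit_approach2 <- lm(Salary ~ I(Variable1 * Dummy_USA), data = df)
fit_subset    <- lm(Salary ~ Variable1, data = df, subset = Dummy_USA == 1)
fit_weights   <- lm(Salary ~ Variable1, data = df, weights = Dummy_USA)

# Number of observations each fit is based on.
print(nobs(fit_approach2))
print(nobs(fit_subset))
print(nobs(fit_weights))

# The subset fit and the 0/1-weights fit should agree on the coefficients.
print(all.equal(coef(fit_subset), coef(fit_weights)))
```

If the row counts differ between fit_approach2 and the other two fits, that confirms the zeros were treated as real data points rather than as excluded rows.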