Tags: r, matrix, hypothesis-test

R: Adding constant in linear combination, glht()


So I am trying to replicate a Stata function I saw in Principles of Econometrics by Hill, Griffiths and Lim. The function I want to replicate looks like this in Stata:

lincom _cons + b_1 * [arbitrary value] - c

This corresponds to the null hypothesis H0 : B0 + B1*X = C.

I am able to test a hypothesis without the constant, but I would like to include the constant when testing a linear combination of parameters. I browsed the package documentation for glht(), but its only example drops the intercept. I replicated that example keeping the intercept, yet I am unsure how to test a linear combination when you have both a matrix K and a constant. For reference, here is their example:

### multiple linear model, swiss data
lmod <- lm(Fertility ~ ., data = swiss)

### test of H_0: all regression coefficients are zero 
### (ignore intercept)

### define coefficients of linear function directly
K <- diag(length(coef(lmod)))[-1,]
rownames(K) <- names(coef(lmod))[-1]
K

### set up general linear hypothesis
glht(lmod, linfct = K)

I am not really good at creating pretend data sets, but here is my try.

library(multcomp)
test.data = data.frame(test.y = seq(200, 20000, length.out = 100),  # same length as test.x
                       test.x = seq(10, 1000, 10))
test.data$test.y = sort(test.data$test.y + rnorm(100, mean = 10000, sd = 100)) - 
  rnorm(100, mean = 5733, sd = 77)
test.lm = lm(test.y ~ test.x, data = test.data)

# to view the name of the coefficients
coef(test.lm)

# this produces an error. How can I add this intercept?
glht(test.lm, 
 linfct = c("(Intercept) + test.x = 20"))

It seems there are two ways to go about this, according to the documentation: I can either use diag() to construct a matrix to pass to the linfct = argument, or I can use a character string. The trouble is that I do not quite know how to use the diag() approach while also including the constant (the right-hand side of the equation), and with the character-string approach I am not sure how to add the intercept.
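
Here is roughly what I mean by the matrix approach. As far as I can tell this tests (Intercept) + test.x = 0, and I do not see where the 20 on the right-hand side would go:

K <- matrix(c(1, 1), nrow = 1)          # one row: 1*(Intercept) + 1*test.x
rownames(K) <- "(Intercept) + test.x"
glht(test.lm, linfct = K)               # tests the combination against 0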

Any and all help would be greatly appreciated.

Here is the data I am working with. This was originally in a .dta file, so I apologize for the horrible formatting. Per the book I mentioned above, this is the food.dta file.

structure(list(food_exp = structure(c(115.22, 135.98, 119.34, 
114.96, 187.05, 243.92, 267.43, 238.71, 295.94, 317.78, 216, 
240.35, 386.57, 261.53, 249.34, 309.87, 345.89, 165.54, 196.98, 
395.26, 406.34, 171.92, 303.23, 377.04, 194.35, 213.48, 293.87, 
259.61, 323.71, 275.02, 109.71, 359.19, 201.51, 460.36, 447.76, 
482.55, 438.29, 587.66, 257.95, 375.73), label = "household food expenditure per week", format.stata = "%10.0g"), 
income = structure(c(3.69, 4.39, 4.75, 6.03, 12.47, 12.98, 
14.2, 14.76, 15.32, 16.39, 17.35, 17.77, 17.93, 18.43, 18.55, 
18.8, 18.81, 19.04, 19.22, 19.93, 20.13, 20.33, 20.37, 20.43, 
21.45, 22.52, 22.55, 22.86, 24.2, 24.39, 24.42, 25.2, 25.5, 
26.61, 26.7, 27.14, 27.16, 28.62, 29.4, 33.4), label = "weekly  household income", format.stata = "%10.0g")), .Names = c("food_exp","income"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -40L))

Solution

    Let's load the data from your book and check our results as we go along, to make sure we're getting the same thing. I'm presenting the answer this way partly because it helped me understand exactly what you wanted, and partly to convince you of the equivalency here.

    Part of the confusion for me was the syntax of your lincom example. Your syntax may well have been correct, but from how it looked I thought you were doing something different, so referring to your book really helped.

    First, let's load the data and run the linear model that's on Page 115:

    install.packages("devtools")  # if not already installed
    library(devtools)
    install_git("https://github.com/ccolonescu/PoEdata")
    
    library(PoEdata)   # loads the package in memory
    library(multcomp)  # for hypo testing
    data(food)         # loads the data set of interest
    
    # EDA
    summary(food)
    
    # Model
    mod <- lm(food_exp ~ income, data = food)
    summary(mod) # Note: same results as PoE 4th ed. Pg 115 (other than rounding)
    
    Call:
    lm(formula = food_exp ~ income, data = food)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -223.025  -50.816   -6.324   67.879  212.044 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   83.416     43.410   1.922   0.0622 .  
    income        10.210      2.093   4.877 1.95e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 89.52 on 38 degrees of freedom
    Multiple R-squared:  0.385,   Adjusted R-squared:  0.3688 
    F-statistic: 23.79 on 1 and 38 DF,  p-value: 1.946e-05
    

    So far, so good. On Pg. 115 of the 4th edition it shows the same regression model other than some minor rounding differences.

    Next, the book calculates a point estimate of weekly food expenditure, conditional on household income of 20 (which is $2,000/wk):

    # Point estimate
    predict(mod, newdata = data.frame(income=20)) 
    
           1 
    287.6089
    
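    As a quick check, the same point estimate falls straight out of the coefficients (b1 + 20*b2):

    # manual check of the point estimate; should match the 287.6089 above
    # (R prints it with the intercept's name attached)
    coef(mod)[1] + 20 * coef(mod)[2]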

    Again, we get the exact same result. Incidentally, you can also see the same results in Stata in the nice free sample of the book Using Stata for Principles of Econometrics 4th ed. by Wiley.

    Finally, we're ready for the hypothesis testing. As mentioned, I want to make sure I can replicate exactly what Stata had. You kindly provided your code, but I was a little confused about your syntax.

    Fortunately, we've gotten lucky. While the preview of the 4th edition Stata guide only goes through Chapter 2, the Economics and Business faculty of a university in Holland were able to get parts of an old edition made available DRM free, so we can refer to that:

    [lincom example for this model from the older edition of the Stata guide]

    and finally see that we can replicate it in R like this:

    # Hypothesis Test 
    summary(glht(mod,  linfct = c("income = 15")))
    
       Simultaneous Tests for General Linear Hypotheses
    
    Fit: lm(formula = food_exp ~ income, data = food)
    
    Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)  
    income == 15   10.210      2.093  -2.288   0.0278 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Adjusted p values reported -- single-step method)
    

    Don't be fooled by the different output format. The estimate shown in the R output is simply the coefficient ("b2") on income from the regression model. It doesn't change depending on the hypothesis test, whereas the Stata output reports "b2 - 15" (which in R is mod$coefficients[2] - 15).

    What does change are the t (t value) and p (Pr(>|t|)) values. Notice that these test statistics from R match those from Stata.
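
    To see where that t value comes from, we can reproduce it by hand from the fitted model: b2 minus the hypothesized 15, divided by the standard error of b2.

    # manual version of the glht test statistic; should give -2.288 as above
    (coef(mod)[2] - 15) / sqrt(vcov(mod)[2, 2])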

    There was another example with an H0 of income = 7.5. Let's verify that the t value is 1.29 and the p value is .203 in both R and Stata:


    summary(glht(mod,  linfct = c("income = 7.5")))
    
       Simultaneous Tests for General Linear Hypotheses
    
    Fit: lm(formula = food_exp ~ income, data = food)
    
    Linear Hypotheses:
                  Estimate Std. Error t value Pr(>|t|)
    income == 7.5   10.210      2.093   1.294    0.203
    (Adjusted p values reported -- single-step method)
    

    You can also get the confidence intervals with confint().
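
    For example, a minimal sketch of a 95% confidence interval for the income coefficient (the estimated linear function):

    confint(glht(mod, linfct = "income = 0"))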

    Finally, I think you were looking at section 3.6.4 (pg. 117) of your book, where an executive wanted to test the hypothesis that, given an income of 20 ($2,000/wk), food_exp is greater than 250.


    We can calculate the t-value in R as:

    # t = (b1 + 20*b2 - 250) / se, with se^2 = var(b1) + 20^2*var(b2) + 2*20*cov(b1,b2) from vcov(mod)
    t = sum((mod$coefficients[1] + 20*mod$coefficients[2] - 250) /
              sqrt(vcov(mod)[1, 1] + 20^2 * vcov(mod)[2, 2] + 2 * 20 * vcov(mod)[1, 2]))
    t
    [1] 2.652613
    

    The formula is the same as the one on the preceding two pages of the book.
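
    For what it's worth, this is also where glht() can take the constant you were asking about: pass the linear combination as a one-row K matrix through linfct and put the right-hand-side constant in the rhs argument (with alternative = "greater" for the one-sided test). A minimal sketch, which should reproduce the t value of about 2.65 above:

    K <- matrix(c(1, 20), nrow = 1)   # 1*(Intercept) + 20*income
    rownames(K) <- "b1 + 20*b2"
    summary(glht(mod, linfct = K, rhs = 250, alternative = "greater"))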

    We can even make this into a custom function (works for simple linear regression, meaning 1 independent variable only):

    hypo_tester <- function(expenditure, income_per_week_hundreds, mod){
      # t = (b1 + income*b2 - expenditure) / se of that linear combination
      t = sum((mod$coefficients[1] + income_per_week_hundreds*mod$coefficients[2] - expenditure) /
                sqrt(vcov(mod)[1, 1] + income_per_week_hundreds^2 * vcov(mod)[2, 2] +
                       2 * income_per_week_hundreds * vcov(mod)[1, 2]))
      return(t)
    }
    
    hypo_tester(250, 20, mod)
    [1] 2.652613
    hypo_tester(200, 20, mod)
    [1] 6.179193
    hypo_tester(300, 20, mod)
    [1] -0.8739668
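
    The custom function only returns the t statistic; if you also want the one-sided p-value the book compares against, one way (assuming the usual t distribution with the model's residual degrees of freedom, 38 here) is:

    # p-value for H1: food_exp > 250 at income = 20; should come out to roughly 0.006
    pt(hypo_tester(250, 20, mod), df = df.residual(mod), lower.tail = FALSE)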