
How R calculates regression coefficients using the lm() function


I want to replicate R's estimate of the regression equation on the data below:

set.seed(1)
Vec = rnorm(1000, 100, 3)
DF = data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])

R reports the following coefficient estimates:

coef(lm(X1~X2, DF))  ### slope =  -0.03871511 

Then I compute the slope estimate manually:

(sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / (nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2)) ### -0.03871178

They are close but still do not match exactly.

Can you please help me understand what I am missing here?

Any pointers would be very helpful.


Solution

  • The problem is that X1 and X2 are switched in lm relative to the long formula.

    Background

    The formula for the slope in lm(y ~ x) is the following, where x and y each have length n, x and y are short for x[i] and y[i], and the summations are over i = 1, 2, ..., n.

    $$\text{slope} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$

    Source of the problem

    Thus the long formula in the question, also shown in (1) below, corresponds to lm(X2 ~ X1, DF) and not to lm(X1 ~ X2, DF). Either change the model formula in lm as in (1) below, or change the long formula from the question by replacing each occurrence of DF[, 1] in the denominator with DF[, 2] as in (2) below.

    # (1)
    
    coef(lm(X2 ~ X1, DF))[[2]]
    ## [1] -0.03871178
    
    (sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / 
      (nrow(DF) * sum(DF[,1]^2) - (sum(DF[,1])^2))  # as in question
    ## [1] -0.03871178
    
    # (2)
    
    coef(lm(X1 ~ X2, DF))[[2]]  # as in question
    ## [1] -0.03871511
    
    (sum(DF[,1]*DF[,2])*nrow(DF) - sum(DF[,1])*sum(DF[,2])) / 
      (nrow(DF) * sum(DF[,2]^2) - (sum(DF[,2])^2))
    ## [1] -0.03871511
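
To make the roles of predictor and response explicit, the long formula can be wrapped in a small function. This is only a sketch: slope_manual is a hypothetical helper name, not part of R or of the original post, and it assumes x and y are numeric vectors of equal length with no missing values. It also checks the result against cov(x, y) / var(x), which is the same slope written with centered sums.

```r
# Hypothetical helper (not from the original post): slope of lm(y ~ x)
# computed from raw sums. The denominator uses the predictor x only.
slope_manual <- function(x, y) {
  n <- length(x)
  (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
}

# Same data as in the question
set.seed(1)
Vec <- rnorm(1000, 100, 3)
DF <- data.frame(X1 = Vec[-1], X2 = Vec[-length(Vec)])

# Response X1, predictor X2: compare with lm(X1 ~ X2, DF)
s12 <- slope_manual(x = DF$X2, y = DF$X1)
all.equal(s12, coef(lm(X1 ~ X2, DF))[[2]])

# Response X2, predictor X1: compare with lm(X2 ~ X1, DF)
s21 <- slope_manual(x = DF$X1, y = DF$X2)
all.equal(s21, coef(lm(X2 ~ X1, DF))[[2]])

# Equivalent centered form: cov(x, y) / var(x)
all.equal(s12, cov(DF$X1, DF$X2) / var(DF$X2))
```

Naming the arguments x and y at the call site makes it harder to swap the two columns by accident. The centered form is also generally less prone to cancellation error than the raw-sum form when the values are far from zero.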