Search code examples
rexceldata-fitting

Why does the result of a linear fit differ in R and Excel(Gnumeric Spreadsheet and WPS) with the same small data?


I have encountered a strange problem when I do a simple linear fit for my data in R and Excel-like Spreadsheet software e.g. Gnumeric Spreadsheet and WPS.

the data below is 19 pairs of x and y

93.37262737 56200
101.406044  62850
89.27322677 56425
86.9458042  43325
70.54645355 42775
85.1936032  38375
72.10985    38376
73.54055944 22950
78.092  15225
71.30285    12850
70.03953023 18125
66.31068931 14200
93.39847716 13925
66.09695152 13225
70.6549 18125
76.43348868 14125
71.37531234 14875
85.7953977  19275
95.65012506 45375

and is saved in a file called 'data.csv'

I do a linear fit between x and y. R script is below :

data<-read.csv("data.csv",col.names=c("x","y"))

# plot data
plot(data$x,data$y)
#Fit
lmodelx<-lm(data$y~data$x)
abline(lmodelx)

summary(lmodelx)

which gives this result:

Call:
lm(formula = data$y ~ data$x)

Residuals:
   Min     1Q Median     3Q    Max 
-27855  -7151  -1314   6947  23014 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -48212.8    23691.0  -2.035  0.05876 . 
data$x         963.5      296.8   3.246  0.00506 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13240 on 16 degrees of freedom
Multiple R-squared:  0.3971,    Adjusted R-squared:  0.3594 
F-statistic: 10.54 on 1 and 16 DF,  p-value: 0.005061

You can see that the Multiple R-squared: 0.3971, Adjusted R-squared: 0.3594, and the Coefficients for the Intercept is -48212.8 and the slope is 963.5342. So the formula is y = 963.5342 * x + -48212.8 and Adjusted R-squared (R2) is 0.3594.

But if I do fit the model in Gnumeric Spreadsheet the result is :

Gnumeric Spreadsheet Fiting Result

Which says the formula should be y = 1052.88214782179 *x + -54588.8186095881 and R2 =

If I fit the model with WPS Spredsheet, the result is :

WPS

That also says the formula should be y = 1052.9x - 54589 and R2 = 0.44763

In Excel the result is :

Excel

The formula is also y = 1052.x - 54589 and R2 = 0.447.

So the question comes that how a well-defined process like such simple fitting can be different in R than other software?

Because in my opinion this least square fitting is already defined, So why does R do this differently?

And what should I use?


Solution

  • From your summary data I would say you lost a row in R somewhere since you only have 16 degrees of freedom. Running your data I get the answer below which matches the other spreadsheet results.

    Call:
    lm(formula = data$y ~ data$x)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -29824  -6660  -1028   8520  23087 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
    (Intercept) -54588.8    22859.7  -2.388  0.02882 * 
    data$x        1052.9      283.7   3.712  0.00173 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 13250 on 17 degrees of freedom
    Multiple R-squared:  0.4476,    Adjusted R-squared:  0.4151 
    F-statistic: 13.78 on 1 and 17 DF,  p-value: 0.001733