Search code examples
rplotcurvelm

Linear regression on a log-log plot - plot lm() coefficients manually as abline() is producing a bad fit


I apologise as I have asked a question along the same lines before but the answer was working well until now. I have produced six plots that looked good using this method, but now I've gotten two weird ones. You can see this "lack of fit" using this example:

x=c(9222,187720,42162,7005,3121,7534,21957,272901,109667,1394312,12230,69607471,79183,6389,64859,32479,3535,9414098,2464,67917,59178,2278,33064,357535,11876,21036,11018,12499632,5160,84574)
y=c(0,4,1,0,1,0,0,1,5,13,0,322,0,0,1,1,1,32,0,0,0,0,0,0,0,0,0,33,1,1)
lin=lm(y~x)
plot(x, y, log="xy")
abline(lin, col="blue", untf=TRUE)

This is a plot I have produced using real data (log-log on the left, normal on the right):

freaky slope

I wasn't too concerned about the missing 0 values as I assumed lin would still take these into account, however as you can see on the log plot the line does not start even near (1,1). From how it looks now I would expect to see points at around (1000,10).

Anyone know what's going on? Will manually plotting the coefficients of lin help? If so, can anyone explain to me how I would do this?


Solution

  • First let's look at the leverage plot of your linear model:

    plot(lin,which=5)
    

    leverage plot of linear model

    As you see points 12 (y=322) and 28 (y=33) are the most influential. Furthermore the scatter around the fitted line becomes larger with increasing x values. Thus, it seems appropriate to do weighted regression:

    lin2 <- lm(y~x,weights=1/x)
    summary(lin2)
    
    Call:
    lm(formula = y ~ x, weights = 1/x)
    
    Weighted Residuals:
          Min        1Q    Median        3Q       Max 
    -0.006699 -0.003383 -0.002407  0.002521  0.012733 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 3.099e-01  1.092e-01   2.838  0.00835 ** 
    x           4.317e-06  5.850e-07   7.381 4.89e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Residual standard error: 0.005674 on 28 degrees of freedom
    Multiple R-squared: 0.6605, Adjusted R-squared: 0.6484 
    F-statistic: 54.47 on 1 and 28 DF,  p-value: 4.888e-08 
    
    
    plot(lin2,which=5)
    

    leverage plot of weighted linear model

    This is already better.

    plot(x, y, log="xy",ylim=c(0.1,350))
    abline(lin, col="blue", untf=TRUE)
    abline(lin2, col="green", untf=TRUE)
    

    results (keep in mind, that 0 values are not plotted here)

    Depending on what your data actually describe, you might consider using a generalized linear model.