
Using lm(), nls() (and glm()?) to estimate population growth rate in Malthusian growth model


My question is about estimating the population growth rate in a Malthusian growth model. As a toy example, consider the dataset df:

df <- structure(list(x = c(0L, 24L, 48L, 72L, 96L, 120L, 144L, 168L
), y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469, 
123440.067934292, 231377.002294256, 433694.813090781, 812920.856596808
)), .Names = c("x", "y"), row.names = c(NA, -8L), class = "data.frame")

I am trying to fit this dataset with an exponential model:

y = 10000 * (e^(r * x))

and estimate r. When using nonlinear regression nls():

fit <- nls(y ~ (10000 * exp(r*x)), data=df)

I get the following error:

Error in getInitial.default(func, data, mCall = as.list(match.call(func,  : 
  no 'getInitial' method found for "function" objects

I also tried lm()

fit <- lm(log(y) ~ (10000 * exp(r*x)), data=df) 

but get

Error in terms.formula(formula, data = data) : 
  invalid model formula in ExtractVars

How can I solve this? How can I fit the data to the exponential model I have?

Also, are there other approaches I could consider for fitting population growth model? Is glm() reasonable?


Solution

  • Using lm()

    Please read ?formula for correct specification of a formula. Now I will proceed assuming you have read that.

    First, your model, after taking log transform on both LHS and RHS, becomes:

    log(y) = log(10000) + r * x
    

    The constant is a known value, not to be estimated. Such a constant is called an offset in lm().

    You should use lm() like this:

    # "-1" in the formula will drop intercept
    fit <- lm(log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))
    
    # Call:
    #  lm(formula = log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))
    
    #  Coefficients:
    #        x  
    #  0.02618  
    

    As you've spotted, fit is a list of length 13. See the "Value" section of ?lm to get a better idea of what its components are. Among those, the fitted values are $fitted.values (which $fitted reaches through partial matching), so you can draw your plot with:

    plot(df)
    lines(df$x, exp(fit$fitted), col = 2, lwd = 2)  ## red line
    


    Note the use of exp(fit$fitted): we fitted a model for log(y), so we have to transform back to the original scale.
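As a minimal sketch of the back-transformation (the names r_hat, x_grid and y_grid are only illustrative, and df is rebuilt here from the values in the question), the estimate of r can be pulled out with coef() and used to evaluate the fitted curve on a finer grid:

```r
# Toy data from the question, rebuilt so the block is self-contained.
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781, 812920.856596808)
)

# Same model as above: known intercept log(10000) supplied as an offset.
fit <- lm(log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))

# Estimated growth rate (hypothetical name r_hat).
r_hat <- unname(coef(fit)["x"])

# Fitted values live on the log scale; exp() takes them back to the scale of y.
y_back <- unname(exp(fitted(fit)))

# Evaluate the fitted curve by hand on a finer grid, e.g. for a smoother line.
x_grid <- seq(min(df$x), max(df$x), length.out = 100)
y_grid <- 10000 * exp(r_hat * x_grid)
```

Passing x_grid and y_grid to lines() would then draw a smooth curve instead of one that only joins the eight observed points.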

    Remark

    As @BenBolker said, a simpler specification is:

    fit <- lm(log(y/10000) ~ x - 1, data = df)
    

    or

    fit <- lm(log(y) - log(10000) ~ x - 1, data = df)
    

    But the response variable is now log(y/10000) rather than log(y), so when you make the plot, you need:

    lines(df$x, 10000 * exp(fit$fitted), col = 2, lwd = 2)
    

  • Using nls()

    The correct way to use nls() is:

    nls(y ~ 10000 * exp(r * x), data = df, start = list(r = 0.1))
    

    Because non-linear curve fitting requires iterations, a starting value is needed, and must be provided via argument start.

    Now, if you try this code, you will get:

    Error in nls(y ~ 10000 * exp(r * x), data = df, start = list(r = 0.1)) : 
      number of iterations exceeded maximum of 50
    

    The problem is that your data are exact, without noise. From ?nls:

    Warning:
    
         *Do not use ‘nls’ on artificial "zero-residual" data.*
    

    So, using nls() for your toy data set df does not work.
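For contrast, here is a hedged sketch of nls() on data that do contain noise. The data are simulated, not taken from the question: r_true, the 5% multiplicative noise level and the seed are all assumptions made for illustration. The starting value is taken from the log-linear lm() fit, which is one convenient choice rather than a requirement.

```r
set.seed(1)  # assumed seed, only for reproducibility of the sketch

# Simulated noisy data (hypothetical): exponential growth with
# multiplicative log-normal noise around a made-up rate r_true.
x <- seq(0, 168, by = 24)
r_true <- 0.026
noisy <- data.frame(
  x = x,
  y = 10000 * exp(r_true * x) * exp(rnorm(length(x), sd = 0.05))
)

# Use the log-linear fit to get a sensible starting value for r.
start_r <- unname(coef(lm(log(y / 10000) ~ x - 1, data = noisy)))

# With non-zero residuals, the default convergence test in nls() can be met.
fit_nls <- nls(y ~ 10000 * exp(r * x), data = noisy, start = list(r = start_r))
r_nls <- coef(fit_nls)[["r"]]
```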

    Let's go back to check the fitted model from lm():

    fit$residuals
    #            1             2             3             4             5 
    #-2.793991e-16 -1.145239e-16 -2.005405e-15 -5.498411e-16  3.094618e-15 
    #            6             7             8 
    # 1.410007e-15 -1.099682e-15 -1.007937e-15
    

    Residuals are basically 0 everywhere, and lm() does an exact fit in this case.


    Follow-up

    One last thing that I haven't been able to figure out is why the parameter r is not used in lm's formula specification.

    There is actually a difference between the formulas used by lm and nls. Perhaps you can think of it this way:

    • lm()'s formula is a model formula, which you can read about in ?formula. It is fundamental in R. Model fitting routines such as lm and glm use it, and many functions have a formula method, such as model.matrix, aggregate, boxplot, etc.
    • nls()'s formula is more like a function specification, and is not widely used. Many other functions that do non-linear iterations, such as optim, do not accept a formula but take a function directly. So, just treat nls() as a special case.
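To illustrate the function-based style, the sketch below minimises the residual sum of squares with optim() instead of nls(). The function name sse, the search interval [0, 0.1] and the choice of method = "Brent" (a one-dimensional method that needs finite bounds) are assumptions made for this example, not part of the original answer.

```r
# Toy data from the question, rebuilt so the block is self-contained.
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781, 812920.856596808)
)

# The model is written as an ordinary R function of the parameter r
# (hypothetical name sse), not as a formula.
sse <- function(r, data) sum((data$y - 10000 * exp(r * data$x))^2)

# One-dimensional minimisation over an assumed search interval.
opt <- optim(par = 0.01, fn = sse, data = df,
             method = "Brent", lower = 0, upper = 0.1)
r_optim <- opt$par

# For comparison: the estimate from the log-linear model.
r_lm <- unname(coef(lm(log(y / 10000) ~ x - 1, data = df)))
```

Here r appears explicitly as a function argument, which is the same role it plays in the nls() formula, whereas in lm() the coefficient of x is implied by the model formula.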

    So would it make sense to do it using the linear model? Simply what I am trying to model here is using Malthusian growth model.

    Strictly speaking, given real population data (certainly with noise), using nls() for curve fitting, or using glm(, family = poisson) for a Poisson-response GLM, is better grounded than fitting a linear model. The glm() call for your data would be:

    glm(y ~ x - 1, family = poisson(), data = df, offset = rep(log(10000), nrow(df)))
    

    (You may need to learn what a GLM is first.) But since your toy data have no noise (and y is not integer-valued, as Poisson counts should be), you will get warning messages when using it.

    However, in terms of computational cost, using a linear model after a log transform is a clear win. In statistical modelling, variable transformations are very common, so there is no compelling reason to reject a linear model for estimating the population growth rate.

    As a complete picture, I recommend you try all three approaches on real data (or noisy toy data). There will be some differences in estimation and prediction, but they are unlikely to be large.
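A rough sketch of such a comparison on noisy toy data follows. Everything about the simulation is an assumption made for illustration: the counts are drawn with rpois() around a made-up rate r_true, and the seed is arbitrary. The known log(10000) is stored as a column (hypothetical name off) so that the offset argument can find it in the data.

```r
set.seed(42)  # assumed seed, only for reproducibility of the sketch

# Simulated Poisson counts around exponential growth (hypothetical data).
x <- seq(0, 168, by = 24)
r_true <- 0.026
counts <- data.frame(x = x, y = rpois(length(x), lambda = 10000 * exp(r_true * x)))
counts$off <- log(10000)  # known log initial population, one value per row

# 1. Linear model on the log scale.
fit_lm <- lm(log(y) ~ x - 1, data = counts, offset = off)

# 2. Non-linear least squares on the original scale, started from the lm() estimate.
fit_nls <- nls(y ~ 10000 * exp(r * x), data = counts,
               start = list(r = unname(coef(fit_lm))))

# 3. Poisson GLM with log link and the same offset.
fit_glm <- glm(y ~ x - 1, family = poisson(), data = counts, offset = off)

# Collect the three estimates of r side by side.
estimates <- c(lm  = unname(coef(fit_lm)),
               nls = unname(coef(fit_nls)),
               glm = unname(coef(fit_glm)))
```

Printing estimates shows the three fits next to each other; how closely they agree on real data will depend on the size and structure of the noise.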

    "Follow-follow-up"

    Haha, thanks to @Ben again. For glm(), we can also try:

    glm(y ~ x - 1 + offset(rep(log(10000), nrow(df))), family = gaussian(link = "log"), data = df)
    

    To specify an offset, we can either use the offset argument of lm()/glm(), or the offset() function inside the formula, as Ben does.
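The two ways of specifying an offset can be checked against each other with a short sketch. The column name log_n0 is hypothetical, introduced only so that both calls can refer to the same per-row vector, and df is rebuilt from the values in the question.

```r
# Toy data from the question, rebuilt so the block is self-contained.
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781, 812920.856596808)
)
df$log_n0 <- log(10000)  # known intercept, repeated for every row

# Offset supplied through the `offset` argument.
fit_arg <- lm(log(y) ~ x - 1, data = df, offset = log_n0)

# Offset supplied through offset() inside the formula.
fit_fun <- lm(log(y) ~ x - 1 + offset(log_n0), data = df)

# Both specifications should give the same estimate of r.
same_r <- isTRUE(all.equal(coef(fit_arg), coef(fit_fun)))
```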