My question is about estimating the population growth rate in a Malthusian growth model. As a toy example, consider the dataset df:
df <- structure(list(x = c(0L, 24L, 48L, 72L, 96L, 120L, 144L, 168L
), y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
123440.067934292, 231377.002294256, 433694.813090781, 812920.856596808
)), .Names = c("x", "y"), row.names = c(NA, -8L), class = "data.frame")
I am trying to fit this dataset with an exponential model:
y = 10000 * (e^(r * x))
and estimate r. When using nonlinear regression with nls():
fit <- nls(y ~ (10000 * exp(r*x)), data=df)
I get the following error:
Error in getInitial.default(func, data, mCall = as.list(match.call(func, :
no 'getInitial' method found for "function" objects
I also tried lm():
fit <- lm(log(y) ~ (10000 * exp(r*x)), data=df)
but I get:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
How can I solve this? How can I fit the data to the exponential model I have?
Also, are there other approaches I could consider for fitting a population growth model? Is glm() reasonable?
Using lm()
Please read ?formula for the correct specification of a formula. I will proceed assuming you have read it.
First, your model, after taking the log transform of both the LHS and the RHS, becomes:
log(y) = log(10000) + r * x
The constant is a known value, not to be estimated. Such a constant is called an offset in lm.
You should use lm like this:
# "-1" in the formula will drop intercept
fit <- lm(log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))
# Call:
# lm(formula = log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))
# Coefficients:
# x
# 0.02618
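For what it's worth, this no-intercept fit has a closed form, which makes a handy sanity check. A minimal sketch (the names z, r_closed and r_lm are mine, not from the question): writing z = log(y) - log(10000), least squares through the origin gives r = sum(x * z) / sum(x^2), and that should agree with what lm() reports.

```r
# Same toy data as in the question, rebuilt with data.frame() for brevity
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781,
        812920.856596808)
)

# Response with the known constant log(10000) moved to the left-hand side
z <- log(df$y) - log(10000)

# Closed-form least-squares slope for a regression through the origin
r_closed <- sum(df$x * z) / sum(df$x^2)

# The lm() fit with the offset, as above
fit <- lm(log(y) ~ x - 1, data = df, offset = rep(log(10000), nrow(df)))
r_lm <- unname(coef(fit))

print(c(closed_form = r_closed, lm = r_lm))
```

Both numbers should match the 0.02618 printed above, up to rounding.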
As you've spotted, fit is a list of length 13. See the "Value" section of ?lm to get a better idea of what its components are. Among those, the fitted values are $fitted.values (which $fitted reaches by partial matching), so you can draw your plot with:
plot(df)
lines(df$x, exp(fit$fitted), col = 2, lwd = 2) ## red line
Note the use of exp(fit$fitted): we fitted a model for log(y), so we have to transform back to the original scale.
Remark
As @BenBolker said, a simpler specification is:
fit <- lm(log(y/10000) ~ x - 1, data = df)
or
fit <- lm(log(y) - log(10000) ~ x - 1, data = df)
But the response variable is now log(y/10000) rather than log(y), so when you make the plot, you need:
lines(df$x, 10000 * exp(fit$fitted), col = 2, lwd = 2)
Using nls()
The correct way to use nls() is:
nls(y ~ 10000 * exp(r * x), data = df, start = list(r = 0.1))
Because non-linear curve fitting is iterative, a starting value is needed, and it must be provided via the start argument.
Now, if you try this code, you will get:
Error in nls(y ~ 10000 * exp(r * x), data = df, start = list(r = 0.1)) :
number of iterations exceeded maximum of 50
The problem is that your data are exact, without noise. Have a read of ?nls:
Warning:
*Do not use ‘nls’ on artificial "zero-residual" data.*
So nls() does not work for your toy dataset df.
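To see nls() behave, you can add some noise to the toy data first. A minimal sketch, with my own assumptions clearly not from the question: multiplicative log-normal noise with sd = 0.05 on the log scale, an arbitrary seed, and the starting value r0 taken from the log-linear lm() fit rather than guessed.

```r
# Same toy data as in the question
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781,
        812920.856596808)
)

# Assumption: multiplicative noise, purely for illustration
set.seed(1)
noisy <- df
noisy$y <- df$y * exp(rnorm(nrow(df), sd = 0.05))

# Use the log-linear estimate as a starting value for the iterations
r0 <- unname(coef(lm(log(y / 10000) ~ x - 1, data = noisy)))

# Non-linear least squares on the original scale
fit_nls <- nls(y ~ 10000 * exp(r * x), data = noisy, start = list(r = r0))
r_nls <- coef(fit_nls)[["r"]]
print(r_nls)
```

Starting from the lm() estimate instead of an arbitrary value like 0.1 keeps exp(r * x) from blowing up at large x in the first iteration.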
Let's go back and check the fitted model from lm():
fit$residuals
# 1 2 3 4 5
#-2.793991e-16 -1.145239e-16 -2.005405e-15 -5.498411e-16 3.094618e-15
# 6 7 8
# 1.410007e-15 -1.099682e-15 -1.007937e-15
Residuals are basically 0 everywhere, and lm() does an exact fit in this case.
Follow-up
One last thing that I haven't been able to figure out is why the parameter r is not used in lm's formula specification.
There are actually some differences in the formula between lm and nls. Perhaps you can think of it this way:
lm()'s formula is called a model formula, which you can read about in ?formula. It is fundamental in R. Model fitting routines such as lm and glm use it, and many functions have a formula method, like model.matrix, aggregate, boxplot, etc. In a model formula the coefficients are implicit: log(y) ~ x - 1 means log(y) = r * x, where r is the coefficient of x that lm estimates.
nls()'s formula is more like a function specification, and is not widely used elsewhere. Many other functions that do non-linear iterations, like optim, will not accept a formula but take a function directly. So, just treat nls() as a special case.
So would it make sense to do it using the linear model? What I am trying to do here is simply fit a Malthusian growth model.
Strictly speaking, given real population data (certainly with noise), using nls() for curve fitting, or using glm(, family = poisson) for a Poisson response GLM, is better grounded than fitting a linear model. The glm() call for your data would be:
glm(y ~ x - 1, family = poisson(), data = df, offset = rep(log(10000), nrow(df)))
(You may need to learn what a GLM is first.) But you will get warning messages when using it on your toy data: the data have no noise, and the y values are not the integer counts that the Poisson family expects.
However, in terms of computational complexity, using a linear model after taking the log transform is a clear win. In statistical modelling, variable transformations are very common, so there is no compelling reason to reject a linear model for estimating the population growth rate.
As a complete picture, I recommend you try all three approaches for real data (or noisy toy data). There will be some difference in estimation and prediction, but unlikely to be very great.
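A rough sketch of such a comparison on noisy toy data. The assumptions here are mine: counts are simulated as Poisson draws around the noise-free curve, the seed is arbitrary, and the names counts, off and estimates are just for illustration.

```r
# Noise-free curve from the question
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781,
        812920.856596808)
)

# Assumption: integer counts with Poisson noise around the curve
set.seed(42)
counts <- data.frame(x = df$x, y = rpois(nrow(df), lambda = df$y))
off <- rep(log(10000), nrow(counts))   # known constant, one value per row

# 1. Linear model on the log scale
r_lm <- unname(coef(lm(log(y) ~ x - 1, data = counts, offset = off)))

# 2. Non-linear least squares, started at the lm() estimate
r_nls <- unname(coef(nls(y ~ 10000 * exp(r * x), data = counts,
                         start = list(r = r_lm))))

# 3. Poisson GLM with log link and the same offset
r_glm <- unname(coef(glm(y ~ x - 1, family = poisson(), data = counts,
                         offset = off)))

estimates <- c(lm = r_lm, nls = r_nls, glm = r_glm)
print(estimates)
```

The three methods weight the observations differently (lm on the log scale treats all points alike, nls is dominated by the largest counts), so small differences between the estimates are expected.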
"Follow-follow-up"
Haha, thanks to @Ben again. For glm(), we can also try:
glm(y ~ x - 1 + offset(rep(log(10000), nrow(df))), family = gaussian(link = "log"), data = df)
To specify an offset, we can either use the offset argument of lm/glm, or the offset() function as Ben does.
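A quick check that the two ways of specifying the offset agree, sketched with lm() on the toy data (fit_arg and fit_fun are names I made up). As far as I know, the offset has to supply one value per row in either form, hence the rep().

```r
# Same toy data as in the question
df <- data.frame(
  x = seq(0, 168, by = 24),
  y = c(10000, 18744.0760659189, 35134.0387564953, 65855.509495469,
        123440.067934292, 231377.002294256, 433694.813090781,
        812920.856596808)
)

# Offset passed through the 'offset' argument
fit_arg <- lm(log(y) ~ x - 1, data = df,
              offset = rep(log(10000), nrow(df)))

# Offset written inside the formula with offset()
fit_fun <- lm(log(y) ~ x - 1 + offset(rep(log(10000), nrow(df))),
              data = df)

# Both fits should estimate the same growth rate
same_coef <- isTRUE(all.equal(coef(fit_arg), coef(fit_fun)))
print(same_coef)
```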