Tags: r, confidence-interval, nonlinear-optimization, statistics-bootstrap

Bootstrap parameter estimate of non-linear optimization in R: Why is it different than the regular parameter estimate?


Here's the short version of my question. The code is below.

I calculated the parameters for the non-linear von Bertalanffy growth equation in R using optim(), and now I am trying to add 95% confidence intervals to the von B growth coefficient K by bootstrapping. For at least one year of data, when I summarize the bootstrapped values of the growth coefficient K, the mean and median from the bootstrap are quite different from the point estimate:

>summary(temp.store) # summary of bootstrap values
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.002449 0.005777 0.010290 0.011700 0.016970 0.056720

> est.K
[1] 0.01655956   # point estimate from the optimization

I suspect the discrepancy arises because some of the random bootstrap draws cause errors in the optimization that bias the result, even though I have used try() to stop the optimization from crashing when a combination of input values causes an error. So I would like to know how to fix that issue. I think I'm doing things correctly otherwise, because the fitted curve looks right.

Also, I have run this code for data from other years, and in at least one other year the bootstrap estimate and the regular estimate are very close.

Long-winded version:

The von Bertalanffy growth curve (VBGC) for length is given by: L(t) = L.inf * [1 - exp(-K*(t-t0))] (Eq. 3.1.0.1, from FAO)

where L(t) is the fish's length, L.inf is the asymptotic maximum length, K is the growth coefficient, t is the time step and t0 is when growth began. L(t) and t are the observed data. Usually time or age is measured in years, but here I am looking at juvenile fish data, so I have made t the day of the year ("doy"), starting with January 1 = 1.
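For concreteness, the growth curve written as a plain R function (just an illustrative helper; the fitting below works directly on the equation):

vonb.len <- function(t, l.inf, k, t0){
  # predicted length at time t from the von Bertalanffy growth curve
  l.inf * (1 - exp(-k * (t - t0)))
}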

To estimate the starting parameters for the optimization, I have used a linearization of the VBGC equation.
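Rearranging the VBGC gives -log(1 - L(t)/L.inf) = K*t - K*t0, so regressing -log(1 - L(t)/L.inf) on t yields K as the slope and -K*t0 as the intercept; that is what get.init() below does (with L.inf fixed by eye).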

doy <- c(156,205,228,276,319,380)
len <- c(36,56,60,68,68,71)
data06 <- data.frame(doy,len)

Function to get starting parameters for the optimization:

get.init <-function(dframe){ # linearization of the von B
  l.inf <- 80  # by eyeballing max juvenile fish
  # make a response variable and store it in the data frame:
  # Eqn. 3.3.3.1 in FAO document
  dframe$vonb.y <- - log(1 - (dframe$len)/l.inf  )
  lin.vonb <- lm(vonb.y ~ doy, data=dframe)
  icept <- lin.vonb$coef[1] #  0.01534013 # intercept is a
  slope <-  k.lin <- lin.vonb$coef[2] # slope is the K param
  t0 <- - icept/slope # get t0 from this relship: intercept= -K * t0
  pars <- c(l.inf,as.numeric(slope),as.numeric(t0))
  return(pars)  # starting values: c(L.inf, K, t0)
}

Sums of squares for von Bertalanffy growth equation

vbl.ssq <- function(theta, data){
  linf=theta[1]; k=theta[2]; t0=theta[3]
  # name variables for ease of use
  obs.length=data$len
  age=data$doy
  #von B equation
  pred.length=linf*(1-exp(-k*(age-t0)))
  #sums of squares
  ssq=sum((obs.length-pred.length)^2)
  return(ssq)  # value minimized by optim()
}

Estimate parameters

#Get starting parameter values
theta_init <- get.init(dframe=data06)
# optimize VBGC by minimizing sums of square differences
len.fit <- optim(par=theta_init, fn=vbl.ssq, method="BFGS", data=data06)

est.linf <- len.fit$par[1] # vonB len-infinite
est.K <- len.fit$par[2]      # vonB K
est.t0 <- len.fit$par[3]     # vonB t0

Bootstrapping

# set up for bootstrap loop
tmp.frame <- data.frame()
temp.store <- vector()

# bootstrap to get 95% conf ints on growth coef K
for (j in 1:1000){
  # choose indices at random, with replacement
  indices <- sample(1:length(data06[,1]),replace=T)
  # values from original data corresponding to those indices
  new.len <- data06$len[indices]
  new.doy <- data06$doy[indices]
  tmp.frame <- data.frame(new.doy,new.len)
  colnames(tmp.frame) <- c("doy","len")

  init.par <- get.init(tmp.frame)
  # now get the vonB params for the randomly selected samples
  # using try() to keep optimizing errors from crashing the program
  try(  len.fit.bs <- optim(par=init.par, fn=vbl.ssq, method="BFGS", data=tmp.frame))
  tmp.k <- len.fit.bs$par[2]
  temp.store[j] <- tmp.k
}

95% confidence interval for K parameter

k.ci <- quantile(temp.store,c(0.025,0.975))
#       2.5%       97.5% 
#0.004437702 0.019784178 

Here's the problem:

#>summary(temp.store)
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# 0.002449 0.005777 0.010290 0.011700 0.016970 0.056720
# 
# est.K [1] 0.01655956

Example of error:

Error in optim(par = init.par, fn = vbl.ssq, method = "BFGS", data = tmp.frame) : 
  non-finite finite-difference value [2]

I don't believe I am making any errors with the optimization because the VBGC fit looks reasonable. Here are the plots:

plot(x=data06$doy,y=data06$len,xlim=c(0,550),ylim=c(0,100))
legend(x="topleft",legend=paste("Length curve 2006"), bty="n")
curve(est.linf*(1-exp(-est.K*(x-est.t0))), add=T,type="l")

plot(x=2006,y=est.K, main="von B growth coefficient for length; 95% CIs",
    ylim=c(0,0.025))
arrows(x0=2006,y0=k.ci[1],x1=2006,y1=k.ci[2], code=3,
    angle=90,length=0.1)

[Plot: K parameter estimate with the bootstrapped confidence intervals]


Solution

  • First of all, you have a very small number of values, possibly too few to trust the bootstrap method. In addition, a high proportion of fits fail in the classic bootstrap, because resampling often leaves too few distinct x values (a quick check of this is sketched below).
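    As a minimal sketch of this point (it only counts distinct doy values in simulated resamples), you can check how often a resample of the six observations is close to degenerate:

    doy <- c(156, 205, 228, 276, 319, 380)
    set.seed(1)
    # proportion of resamples (with replacement) containing 3 or fewer distinct doy values
    mean(replicate(10000, length(unique(sample(doy, replace = TRUE))) <= 3))
    # comes out near 0.25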

    Here is an implementation using nls with a self-starting model and the boot package.

    doy <- c(156,205,228,276,319,380)
    len <- c(36,56,60,68,68,71)
    data06 <- data.frame(doy,len)
    
    plot(len ~ doy, data = data06)
    
    fit <- nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = data06)
    summary(fit)
    #profiling CI
    proCI <- confint(fit)
    #          2.5%      97.5%
    #Asym 68.290477  75.922174
    #lrc  -4.453895  -3.779994
    #c0   94.777335 126.112523
    
    curve(predict(fit, newdata = data.frame(doy = x)), add = TRUE)
    

    [Plot: data and nls predictions]
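    For reference, SSasympOff is parameterized as Asym * (1 - exp(-exp(lrc) * (doy - c0))), so it maps onto the VBGC as L.inf = Asym, K = exp(lrc) and t0 = c0; that is why the k intervals below are obtained by exponentiating the lrc results:

    # growth coefficient K and its profile CI on the original scale
    exp(coef(fit)["lrc"])
    exp(proCI["lrc", ])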

    #classic bootstrap
    library(boot)
    set.seed(42)
    boot1 <- boot(data06, function(DF, i) {
      tryCatch(coef(nls(len ~ SSasympOff(doy, Asym, lrc, c0), data = DF[i,])),
               error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
    }, R = 1e3)
    
    #proportion of unsuccessful fits
    mean(is.na(boot1$t[, 1]))
    #[1] 0.256
    
    #bootstrap CI
    boot1CI <- apply(boot1$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
    #          [,1]      [,2]      [,3]
    #2.5%  69.70360 -4.562608  67.60152
    #50%   71.56527 -4.100148 113.9287
    #97.5% 74.79921 -3.697461 151.03541
    
    
    #bootstrap of the residuals
    data06$res <- residuals(fit)
    data06$fit <- fitted(fit)
    
    set.seed(42)
    boot2 <- boot(data06, function(DF, i) {
      DF$lenboot <- DF$fit + DF[i, "res"]
      tryCatch(coef(nls(lenboot ~ SSasympOff(doy, Asym, lrc, c0), data = DF)),
               error = function(e) c(Asym = NA, lrc = NA, c0 = NA))
    }, R = 1e3)
    
    #proportion of unsuccessful fits
    mean(is.na(boot2$t[, 1]))
    #[1] 0
    
    #(residuals) bootstrap CI
    boot2CI <- apply(boot2$t, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
    #          [,1]      [,2]     [,3]
    #2.5%  70.19380 -4.255165 106.3136
    #50%   71.56527 -4.100148 113.9287
    #97.5% 73.37461 -3.969012 119.2380
    
    CIs_k <- data.frame(lwr = c(exp(proCI[2, 1]),
                                exp(boot1CI[1, 2]),
                                exp(boot2CI[1, 2])),
                        upr = c(exp(proCI[2, 2]),
                                exp(boot1CI[3, 2]),
                                exp(boot2CI[3, 2])),
                        med = c(NA,
                                exp(boot1CI[2, 2]),
                                exp(boot2CI[2, 2])),
                        estimate = exp(coef(fit)[2]),
                        method = c("profile", "boot", "boot res"))
    
    library(ggplot2)
    ggplot(CIs_k, aes(y = estimate, ymin = lwr, ymax = upr, x = method)) +
      geom_errorbar() +
      geom_point(aes(color = "estimate"), size = 5) +
      geom_point(aes(y = med, color = "boot median"), size = 5) +
      ylab("k") + xlab("") +
      scale_color_brewer(name = "", type = "qual", palette = 2) +
      theme_bw(base_size = 22)
    

    [Plot: point estimate of k with profile, bootstrap, and residual-bootstrap confidence intervals]

    As you can see, the classic bootstrap CI is wider than the profile CI, and bootstrapping the residuals gives a narrower CI. All of them are nearly symmetric, and the bootstrap medians are close to the point estimates.

    As a first step in investigating what goes wrong in your code, you should look at the proportion of failed fits in your procedure.
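    A minimal sketch of how to do that with your existing loop (reusing your get.init() and vbl.ssq(), and recording NA whenever optim() fails; with your try() approach, a failed iteration would otherwise reuse whatever len.fit.bs held from the previous successful fit):

    temp.store <- rep(NA_real_, 1000)
    for (j in 1:1000) {
      indices <- sample(seq_len(nrow(data06)), replace = TRUE)
      tmp.frame <- data.frame(doy = data06$doy[indices], len = data06$len[indices])
      fit.bs <- try(optim(par = get.init(tmp.frame), fn = vbl.ssq,
                          method = "BFGS", data = tmp.frame), silent = TRUE)
      # keep the K estimate only if the optimization succeeded
      if (!inherits(fit.bs, "try-error")) temp.store[j] <- fit.bs$par[2]
    }
    mean(is.na(temp.store))                               # proportion of failed fits
    quantile(temp.store, c(0.025, 0.975), na.rm = TRUE)   # CI from successful fits only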