Optimize a function using gradient descent

Growing degree days is a concept in plant phenology where a given crop needs to accumulate certain amount of thermal units every day in order to move from one stage to the other.

I have thermal units data available at daily resolution for a given site for 10 years as follows:

  set.seed(1)
  avg_temp <- data.frame(year_ref = rep(2001:2010, each = 365),
                         doy = rep(1:365, times = 10),
                         thermal.units = sample(0:40, 3650, replace=TRUE))

I also have a crop grown in this site that should take 110 days to mature if planted on day 152

  planting_date <- 152 
  observed_days_to_mature <- 110

I also have some initial random guess on how many thermal units this crop in general might accumulate in each stage starting from planting to reach full maturity. For e.g. in the below example, stage 1 needs to accumulate 50 thermal units since planting, stage2 needs 120 thermal units since planting, stage 3 needs 190 thermal units since planting and so on.

  gdd_data <- data.frame(stage_id = 1:4,
                         gdd_required = c(50, 120, 190, 250))

So given the gdd requirement, I can calculate for each year, how many days does this crop take to mature.

  library(dplyr)
  library(data.table)

  days_to_mature_func <- function(gdd_data_df, avg_temp_df, planting_date_d){
    
    gdd.vec <- gdd_data_df$gdd_required 
    
    year_vec <- sort(unique(avg_temp_df$year_ref))
    
    temp_ls <- list()
    
    for(y in seq_along(year_vec)){
      
      year_id <- year_vec[y]
      
      weather_sub <- avg_temp_df %>% 
                     dplyr::filter(year_ref == year_id & 
                                   doy >= planting_date_d)  
      
      stage_vec <- unlist(lapply(1:length(gdd.vec), function(x)  planting_date_d - 1 + which.max(cumsum(weather_sub$thermal.units) >= gdd.vec[x])))
      
      stage_vec[length(stage_vec)] <- ifelse(stage_vec[length(stage_vec)] <= stage_vec[length(stage_vec) - 1], NA, stage_vec[length(stage_vec)])
      
      gdd_doy <- as.data.frame(t(as.data.frame(stage_vec)))
      
      names(gdd_doy) <- paste0('stage_doy', 1:length(stage_vec))
      gdd_doy$year_ref <- year_id
      temp_ls[[y]] <- gdd_doy
  }
    days_to_mature_mod <- rbindlist(temp_ls)
    return(days_to_mature_mod)
  }

  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  days_to_mature_mod

  stage_doy1 stage_doy2 stage_doy3 stage_doy4 year_ref
  1:        154        160        164        167     2001
  2:        154        157        159        163     2002
  3:        154        157        160        162     2003
  4:        155        157        163        165     2004
  5:        154        156        160        164     2005
  6:        154        161        164        168     2006
  7:        154        156        159        161     2007
  8:        155        158        161        164     2008
  9:        154        156        160        163     2009
  10:       154        158        160        163     2010

Since the crop should be taking 110 days to mature, I define the error as:

  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2

My question is how do I optimise the gdd_required in the gdd_data to produce the minimal error.

One method I have implemented is to loop over a sequence of factors that reduces the gdd_required in each step and calculates the error. the factor with the lowest error is the final factor that I apply to the gdd_required data. I am reading about the gradient descent algorithm that might make this processquicker but unfortunately I don't have enough techincal expertise yet to achieve this.

From comment: I do have a condition that wasn't explicit - the x in the function that I am optimising are ordered i.e. x[1] < x[2] < x[3] < x[4] since they are cumulative.

Solution

Building on your example, you can define a function that takes arbitrary gdd_required and returns the fit:

optim_function <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = x)
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}

The function optim allows you to find the parameters that reach a minimum, starting from the initial set you used e.g.

optim(c(50, 120, 190, 250), optim_function)
#$par
#[1] 266.35738 199.59795 -28.35870  30.21135
# 
#$value
#[1] 1866.24
# 
#$counts
#function gradient 
#      91       NA 
#
#$convergence
#[1] 0
#
#$message
#NULL

So a best fit of around 1866 is found with parameters 266.35738, 199.59795, -28.35870, 30.21135.

The help page gives some pointers on doing constrained optimisation if it is important that they are in a specific range.

Given your comment that the parameters should be strictly increasing, you can transform arbitrary values into increasing ones with cumsum(exp()) so your code would become

optim_function_plus <- function(x){
  gdd_data <- data.frame(stage_id = 1:4, gdd_required = cumsum(exp(x)))
  days_to_mature_mod <- days_to_mature_func(gdd_data, avg_temp, planting_date)
  error_mod <- mean(days_to_mature_mod$stage_doy4 - observed_days_to_mature)^2
}

opt <- optim(log(c(50, 70, 70, 60)), optim_function_plus)
opt
# $par
# [1] 1.578174 2.057647 2.392850 3.241456
# 
# $value
# [1] 1953.64
# 
# $counts
# function gradient 
# 57       NA 
# 
# $convergence
# [1] 0
# 
# $message
# NULL

To get the parameters back on the scale you're interested, you'd need to do:

cumsum(exp(opt$par))
# [1]  4.846097 12.673626 23.618263 49.189184