I have something very unexpected happening with the R optimize
function. I have 2 very similar datasets; on one of them optimize
works as expected, on the other it doesn't.
Here's the function I'm trying to minimise and the 2 datasets:
library(dplyr)

my_beta_fc <- function(data, par){
  # data is a dataframe of the ndays distribution with 2 columns:
  # x = number of days ranging from 0 to x+
  # f_x = reach (absolute)
  a <- par
  data <- data[, 1:2]
  names(data) <- c("x", "f_x")
  n <- max(data$x)
  p0 <- data %>%
    mutate(reach_pc = f_x / sum(f_x)) %>%
    filter(x == 0) %>%
    pull(reach_pc)
  xbar <- data %>%
    summarise(mean = sum(f_x * x) / sum(f_x)) %>%
    pull(mean)
  b <- a * (n - xbar) / xbar
  # we want to minimise the difference between observed and calculated p0:
  result <- abs(p0 - (gamma(n + b) * gamma(a + b)) / (gamma(a + n + b) * gamma(b)))
  # if the evaluated function returns NaN, replace with a high penalty
  # to steer the optimization away from regions where the function returns NaN:
  if (is.nan(result) | is.na(result)) {
    return(1e10)
  }
  return(result)
}
t1 <- data.frame(
  n_days = 0:7,
  reach = c(40979971, 2110778, 1126387, 729457, 541512, 346607, 236263, 198262)
)
t2 <- data.frame(
  n_days = 0:7,
  reach = c(41233610, 2017354, 1063684, 694576, 518144, 330215, 223006, 188648)
)
Then, optimizing for t1:
param_solution <- optimize(
  f = function(param) my_beta_fc(data = t1, param),
  interval = c(0, 10),
  tol = 0.00001
)
gives:
> param_solution
$minimum
[1] 0.05495449

$objective
[1] 0.0000003038929
but running the same for t2:
param_solution <- optimize(
  f = function(param) my_beta_fc(data = t2, param),
  interval = c(0, 10),
  tol = 0.00001
)
gives:
> param_solution
$minimum
[1] 9.999995

$objective
[1] 10000000000
which is evidently not the solution, since we can get a better value of the objective function simply by plugging in the solution found for t1...
Any idea what might be the issue here?
The value returned is your penalty for stepping outside the allowable range. The way you have the penalty defined makes the function look perfectly flat there, so optimize has no idea which way to go, and eventually gives up.
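You can see the plateau by evaluating the objective on a grid (a quick diagnostic, assuming my_beta_fc and t2 exactly as defined in the question):

a_grid <- seq(0.01, 10, length.out = 200)
obj <- sapply(a_grid, function(a) my_beta_fc(t2, a))
table(obj == 1e10)
# a large stretch of the interval returns the constant 1e10 penalty,
# so the points optimize probes there all look identical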
Constrained optimization is hard. What I have found to work sometimes is the following: when the suggested parameters fall outside the allowable range, find values that fall within the range, and return the value from them, plus a penalty depending on how far you had to change them. Then at least optimize will have a hint which way to move to get things to work.
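A minimal sketch of that idea (generic, not specific to your model: f is any scalar objective and lo a lower bound on its parameter, both hypothetical names):

penalized <- function(f, lo) {
  function(par) {
    if (par < lo) {
      # evaluate at the nearest feasible point, and add a penalty
      # proportional to how far par had to move to get there:
      f(lo) + (lo - par)
    } else {
      f(par)
    }
  }
}

That way optimize sees a slope pointing back toward the feasible region instead of a flat wall.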
In your case this is hard, because the gamma() function returns NaN in a variety of different situations, in particular for whole-number arguments of 0 or less. Maybe you would get reasonable results by forcing n+b, a+b, a+n+b and b all to be bigger than some small positive number (e.g. 0.00001) by increasing b, and then adding a penalty for how much you needed to change it.
I just tried this, and it didn't work. The NaN is coming because the values are too big, not too small: gamma() overflows to Inf for large arguments, and dividing Inf by Inf gives NaN. You should work with lgamma() instead of gamma(), i.e. change your function as shown below:
my_beta_fc <- function(data, par){
  # data is a dataframe of the ndays distribution with 2 columns:
  # x = number of days ranging from 0 to x+
  # f_x = reach (absolute)
  a <- par
  data <- data[, 1:2]
  names(data) <- c("x", "f_x")
  n <- max(data$x)
  p0 <- data %>%
    mutate(reach_pc = f_x / sum(f_x)) %>%
    filter(x == 0) %>%
    pull(reach_pc)
  xbar <- data %>%
    summarise(mean = sum(f_x * x) / sum(f_x)) %>%
    pull(mean)
  b <- a * (n - xbar) / xbar
  # if any gamma argument would be non-positive, shift b to make them all
  # positive, and record how much we had to shift it as a penalty:
  bounds_violation <- min(n + b, a + b, a + n + b, b)
  if (bounds_violation <= 0) {
    bounds_violation <- abs(bounds_violation)
    b <- b + bounds_violation + 0.000001
  } else {
    bounds_violation <- 0
  }
  # we want to minimise the difference between observed and calculated p0;
  # working on the log scale keeps the large gamma values from overflowing:
  result <- abs(p0 - exp(lgamma(n + b) + lgamma(a + b) - lgamma(a + n + b) - lgamma(b)))
  # if the evaluated function still returns NaN, stop!
  if (is.nan(result) | is.na(result)) {
    stop("Still get NaN")
  }
  return(result + bounds_violation)
}
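To see why the log scale matters, note that gamma() overflows to Inf long before lgamma() has any trouble, and Inf/Inf is NaN (the arguments below are illustrative only, not taken from your data):

gamma(170) * gamma(100)                                    # Inf: the product overflows
(gamma(170) * gamma(100)) / (gamma(180) * gamma(90))       # NaN: Inf/Inf
exp(lgamma(170) + lgamma(100) - lgamma(180) - lgamma(90))  # finite, no overflow

With the rewritten function, the t2 call is unchanged, and should now find an interior minimum instead of sitting on the penalty plateau:

param_solution <- optimize(
  f = function(param) my_beta_fc(data = t2, param),
  interval = c(0, 10),
  tol = 0.00001
)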