Tags: r, mathematical-optimization, nonlinear-optimization

High-dimensional optimization alternatives for optim()


I have implemented a wrapper function around optim that minimizes a non-standard (piecewise linear) cost function. (Think lm, but with the least-squares cost function replaced by a custom one.)

This process works very well for low-dimensional mock data. However, my data (which unfortunately I cannot share) is relatively high-dimensional (~100 columns, no constant columns).

With this high-dimensional data, the optimized parameters differ from their initial values by only ~0.001 (more or less, depending on the method used), even when using the argument control = list(maxit = 5000) and trying out all of the implemented optimization methods:

optim(par = c(mean(b), rep(0, ncol(A) - 1)), fn = costFn, A = A, b = b,
      method = "CG", control = list(maxit = 5000))

Here A is the matrix of predictors, b is the dependent variable, and costFn is defined as

costFn <- function(A, par, b) {
  ## weighted sum of positive and negative residuals;
  ## wPlus and wMinus are taken from the enclosing environment
  sum(
    wPlus * pmax(A %*% par - b, 0) +
    wMinus * pmax(b - A %*% par, 0)
  )
}

with constants wPlus, wMinus > 0.
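
For illustration, here is a minimal mock setup with the same structure (random stand-in data and made-up weights, since I cannot share the real data):

## random stand-in data: ~100 columns, no constant columns
set.seed(1)
A <- matrix(rnorm(200 * 100), nrow = 200, ncol = 100)
b <- rnorm(200)
wPlus  <- 1
wMinus <- 2

## the call from above: start at mean(b) for the first coefficient, zero otherwise
fit <- optim(par = c(mean(b), rep(0, ncol(A) - 1)),
             fn = costFn, A = A, b = b,
             method = "CG", control = list(maxit = 5000))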

I feel like this process terminates way too fast for 5000 iterations (< 2 seconds).

The cost function, regarded as a function R->R of a single residual, is convex, and each residual is an affine function of the parameters, so the cost should also be convex regarded as a function R^n->R (a convex function composed with an affine map is convex, and a sum of convex terms is convex). Therefore, the methods should converge to the global minimum and not get stuck in local minima, as there are no others. Correct me if I am wrong here.
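
As a quick numerical sanity check (not a proof), the convexity inequality can be tested on random points, reusing the mock A, b, wPlus, wMinus from above:

## Jensen-type check: f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
set.seed(2)
x1 <- rnorm(ncol(A)); x2 <- rnorm(ncol(A)); lam <- runif(1)
costFn(A, lam * x1 + (1 - lam) * x2, b) <=
  lam * costFn(A, x1, b) + (1 - lam) * costFn(A, x2, b)   # should be TRUE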

My question is whether I have made an obvious mistake (which, I know, is hard to tell as I cannot provide a reprex) or whether there are better alternatives to optim.

Possible mistakes include:

  • The cost function is in fact not convex and so the algorithm does not converge to the global minimum.
  • As the data is high-dimensional, many iterations may be needed. Perhaps I have not used the maxit argument correctly.
  • The implemented methods are not suitable for high-dimensional data.

Solution

  • This is non-differentiable, so you can expect local, gradient-based NLP solvers to get into trouble. However, what you have is a slight generalization of L1-norm or LAD (least absolute deviations) regression:

     min sum(k, |r[k]|)
     r = Ax - b 
     r, x free variables 
    

    This is basically your problem with wPlus = wMinus = 1. This L1 problem can be stated as a Linear Programming (LP) model as follows:

    min sum(k, rPlus[k] + rMin[k])
    rPlus - rMin = Ax - b
    rPlus[k],rMin[k]>=0
    x free variable
    

    LP solvers for use under R are readily available (e.g., the lpSolve and Rglpk packages).

    The modeling technique used here is called variable splitting. In the optimal solution, only one of each pair (rPlus[k], rMin[k]) will be non-zero.

    We can use this formulation for your problem by just adjusting the objective a little bit:

    min sum(k, wPlus*rPlus[k] + wMin*rMin[k])
    rPlus - rMin = Ax - b
    rPlus[k],rMin[k]>=0
    x free variable
    

    LP problems with many variables and many constraints can be solved very efficiently using a good LP solver.
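
    As an illustration (a sketch, not the only option), the weighted formulation could be set up with the lpSolve package as below; solveWeightedL1 is just an illustrative name, and the weights are called wPlus and wMinus to match the question. One point to note: lp() treats every decision variable as non-negative, so the free vector x is itself split as x = xPlus - xMin, i.e. variable splitting is applied to x as well.

    library(lpSolve)

    ## min  sum(wPlus * rPlus + wMinus * rMin)
    ## s.t. A %*% (xPlus - xMin) - rPlus + rMin = b
    ##      xPlus, xMin, rPlus, rMin >= 0
    solveWeightedL1 <- function(A, b, wPlus, wMinus) {
      m <- nrow(A); n <- ncol(A)
      ## variable order: xPlus (n), xMin (n), rPlus (m), rMin (m)
      obj    <- c(rep(0, 2 * n), rep(wPlus, m), rep(wMinus, m))
      conMat <- cbind(A, -A, -diag(m), diag(m))
      sol <- lp(direction = "min", objective.in = obj,
                const.mat = conMat, const.dir = rep("=", m), const.rhs = b)
      if (sol$status != 0) stop("LP did not solve to optimality")
      list(coefficients = sol$solution[1:n] - sol$solution[(n + 1):(2 * n)],
           objective    = sol$objval)
    }

    ## e.g. with the mock data from the question:
    ## fit <- solveWeightedL1(A, b, wPlus = 1, wMinus = 2)

    The LP has 2*ncol(A) + 2*nrow(A) variables and nrow(A) equality constraints, which is small by LP standards for data with ~100 columns and a moderate number of rows.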