Tags: python-3.x, optimization, solver, minimization

Which scipy.optimize solver should I use for this data? (finding a modulus in a series)


I have some data: a sequence of integers (with repetition) multiplied by an unknown constant c, which I need to find. The data also contains noise:

import pandas as pd
import numpy as np

#data
mySize=[1000,1]

#Unknown constant c to find with a solver
c=np.random.uniform(0.5,10)

#example data: df=c*[integer list]+noise
df = pd.DataFrame(c* np.random.randint(-500,500,size=mySize) +np.random.uniform(-0.8,0.5,mySize))\
                 .sort_values(by=0).reset_index(drop=True)
##Export to excel and open
#if True:
#    import os
#    myPath=os.path.join(os.environ['temp'], 'myNumbers.xlsx')
#    df.to_excel(myPath,sheet_name="Python data",engine='xlsxwriter')
#    os.startfile(myPath)
myNumbers=df[0].tolist()

#auxiliary calculation: all pairwise absolute differences (len(myNumbers)**2 values),
#stored as a NumPy array so the loss below can be evaluated with vectorized operations
absx_y=np.array([ abs(x-y) for x in myNumbers for y in myNumbers])

[Screenshot: data and Δdata. In red: the delta (difference) between consecutive numbers; where the c*integers are equal, the difference is just the noise, which appears at the bottom as small values.]

My idea is that, because c is the modulus separating most of the data, mod(data, c) ≈ 0:

[Screenshot: data % c]
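As a quick toy illustration of that idea (hypothetical values, not the actual data): a few c*integers plus small noise, taken modulo c, land close to 0 or close to c:

#Toy check (hypothetical values): the remainders cluster near 0 or near c
toy = c * np.array([3, -7, 12]) + np.array([0.01, -0.02, 0.03])
print(np.mod(toy, c))   #≈ [0.01, c - 0.02, 0.03]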

So, I need to minimize this loss function:

def Loss(trial_c):
    #signed distance of each pairwise difference (in units of trial_c) to the nearest
    #integer, squared and summed; small when trial_c divides (almost) all the differences
    answer=np.sum( ((absx_y/trial_c-0.5) % 1 -0.5)**2 )
    #print("c="+str(c)+"; trial_c="+str(trial_c)+"; loss(trial_c)="+str(answer))
    return answer

It is minimal for the values of trial_c that turn the data, divided by trial_c, into (near-)integers:

[Screenshot: the loss function]

My intention is to use a solver, but to understand the problem first I tried a brute-force approach: generate many candidate values for c and compute the loss of each (this is awfully slow):

#generate 999 evenly spaced trial values for c, up to the largest gap in the sorted data
trialC=np.arange(1,1000,1)*(df[0].diff().max()/1000)

#lossOfTrialC =[Loss(xx) for xx in trialC]# <- This is horribly slow, so I use parallel calculation     
from joblib import delayed, Parallel
lossOfTrialC = Parallel(n_jobs=8)(delayed(Loss)(xx) for xx in trialC)
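
Alternatively, once absx_y is a NumPy array, the same sweep can be done with plain broadcasting instead of joblib. A sketch; the chunk size of 10 is an arbitrary memory/speed trade-off:

#Vectorized alternative (sketch): evaluate the loss for a chunk of trial values at once
chunked_losses = []
for start in range(0, len(trialC), 10):
    chunk = trialC[start:start + 10]
    frac = (absx_y[:, None] / chunk[None, :] - 0.5) % 1 - 0.5   #distance to the nearest integer
    chunked_losses.append(np.sum(frac**2, axis=0))
lossOfTrialC = np.concatenate(chunked_losses)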

When I plot it:

import matplotlib.pyplot as plt

def PlotearXY(X, Y, Title=""):
    fig = plt.figure()
    fig.subplots_adjust(bottom=0.2)
    ax = plt.gca()
    ax.plot(X, Y, marker='o')
    plt.title(Title)
    plt.show()
    plt.close()


PlotearXY(trialC,lossOfTrialC,"objective c="+str(c))

I get a clear minimum of the loss function at the correct trialC, but the loss is very noisy and full of local minima:

[Screenshot: losses over trialC]

I tried this method in Excel, and it works. Because (as far as I know) Excel's Solver uses SLSQP, I tried SciPy's SLSQP solver:

from scipy.optimize import minimize

#Constraint: trial_c should not exceed the largest data value
maxC=max(myNumbers)
def constraint1(trial_c):
    return maxC-trial_c   #'ineq' constraints must be non-negative, i.e. trial_c <= maxC

#Initial value for trial_c (the largest gap in the sorted data)
trial_c=[df[0].diff().max()]

#Bounds for trial_c
myBounds=[(0.0000001,df[0].diff().max())]

#inequalities for trial_c (not sure if necessary)
con1 = {'type': 'ineq', 'fun': constraint1} 
cons = ([con1])

solution = minimize(Loss,trial_c,method='SLSQP',\
                    bounds=myBounds,constraints=cons)

But it generally fails, getting stuck in a local minimum.

The question is, "which solver should I use?"

The documentation of scipy.optimize.minimize lists many different solvers, but I have no clue which one would be best for this problem.

Or is my entire approach wrong?


Solution

  • I haven’t looked too closely into your implementation; however, I have a couple of points and suggestions:

    • it seems that you are using the mod() function inside your objective. The mod() function is discontinuous, and the optimizer (especially a gradient-based one like SLSQP) may struggle to find a proper descent direction.
    • SLSQP is a local optimization algorithm, so it can only guarantee a local minimum. Also, I seem to remember that Excel uses GRG2, not SLSQP, but this is irrelevant to the question.
    • you may want to consider the global optimization algorithms in SciPy, in particular SHGO and dual annealing; a minimal sketch follows below. You should also give NLopt a try, as it implements many good global optimization algorithms (DIRECT, CRS2, etc.).
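
A minimal sketch of that last suggestion, assuming the Loss function and myBounds defined in the question are in scope (and that absx_y is a NumPy array, so each Loss call is vectorized):

from scipy.optimize import dual_annealing, shgo

#Global search over the same bounds that were passed to SLSQP
#(each Loss call sums ~10^6 terms, so this can take a while)
res = dual_annealing(Loss, bounds=myBounds)
print("dual_annealing estimate:", res.x[0], "true c:", c)

#SHGO uses the same interface
res2 = shgo(Loss, bounds=myBounds)
print("shgo estimate:", res2.x[0])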