I have data: a sequence of integers (with repetition) multiplied by an unknown constant c, which I need to find. The data also contains noise:
import pandas as pd
import numpy as np
#data
mySize=[1000,1]
#Unknown constant c to find with a solver
c=np.random.uniform(0.5,10)
#example data: df=c*[integer list]+noise
df = pd.DataFrame(c* np.random.randint(-500,500,size=mySize) +np.random.uniform(-0.8,0.5,mySize))\
.sort_values(by=0).reset_index(drop=True)
##Export to excel and open
#if True:
# import os
# myPath=os.path.join(os.environ['temp'], 'myNumbers.xlsx')
# df.to_excel(myPath,sheet_name="Python data",engine='xlsxwriter')
# os.startfile(myPath)
myNumbers=df[0].tolist()
#auxiliary calculation: all pairwise absolute differences (len(myNumbers)**2 values),
#stored as a NumPy array so the Loss function below can divide them by a scalar
absx_y=np.array([abs(x-y) for x in myNumbers for y in myNumbers])
Screenshot of data and Δdata
(in red is the delta (difference) between consecutive numbers; when the underlying integers are equal, the difference is just the noise, which appears at the bottom of the plot as small values)
My idea is that, because c is the modulus separating most of the data, mod(data, c) ≈ 0.
So, I need to minimize this loss function:
def Loss(trial_c):
    #distance of every pairwise difference, in units of trial_c, to the nearest integer
    answer=np.sum( ((absx_y/trial_c-0.5) % 1 -0.5)**2 )
    #print("c="+str(c)+"; trial_c="+str(trial_c)+"; loss(trial_c)="+str(answer))
    return answer
This is minimal for the values of trial_c that turn the data into (near-)integers.
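As a quick sanity check (illustrative only; 1.37 is just an arbitrary wrong factor), the loss at the true c should be far smaller than at an unrelated trial value:
#sanity check (illustrative): the loss at the true c should be much smaller
#than at an arbitrary wrong value such as 1.37*c
print("Loss(c)      =", Loss(c))
print("Loss(1.37*c) =", Loss(1.37*c))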
My intention is to use a solver, but to understand the problem I first tried a brute-force approach: generate a grid of candidate values for c and compute the loss of each (this is awfully slow):
#generate the grid of trial c values
trialC=np.arange(1,1000,1)*(df[0].diff().max()/1000)
#lossOfTrialC =[Loss(xx) for xx in trialC]# <- This is horribly slow, so I use parallel calculation
from joblib import delayed, Parallel
lossOfTrialC = Parallel(n_jobs=8)(delayed(Loss)(xx) for xx in trialC)
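As a side note, the same grid can also be evaluated with plain NumPy broadcasting in blocks (a sketch; LossGrid is just an illustrative helper, and it produces the same numbers as the parallel loop above):
def LossGrid(trial_values, block=20):
    #evaluate Loss for a whole block of trial values at once via broadcasting;
    #the block size keeps the temporary (block x len(absx_y)) array small
    out=np.empty(len(trial_values))
    for i in range(0, len(trial_values), block):
        chunk=np.asarray(trial_values[i:i+block])
        ratio=absx_y[None,:]/chunk[:,None]
        out[i:i+block]=np.sum(((ratio-0.5) % 1 -0.5)**2, axis=1)
    return out
#lossOfTrialC = LossGrid(trialC)  #alternative to the Parallel(...) line above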
When I plot it:
import matplotlib.pyplot as plt
def PlotearXY(X,Y,Title=""):
    #plt.ion()
    fig = plt.figure()
    fig.subplots_adjust(bottom=0.2)
    ax = plt.gca()
    #ax.scatter(X,Y,marker='o',s=1)#.abs()
    ax.plot(X,Y,marker='o')#.abs()
    #ax.set_yscale('log')
    plt.title(Title)
    #plt.draw()
    plt.show()
    plt.close()
PlotearXY(trialC,lossOfTrialC,"objective c="+str(c))
I get a clear minimum of the loss function at the right trialC, but the loss is very noisy, full of local minima.
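(For reference, one workaround I can see is to polish the best brute-force grid point with a bounded one-dimensional search; the sketch below assumes a bracket of one grid step on either side of that point:)
from scipy.optimize import minimize_scalar
#sketch: refine the best grid value with a bounded 1-D search,
#bracketed by one grid step on each side
step=trialC[1]-trialC[0]
best=trialC[np.argmin(lossOfTrialC)]
refined=minimize_scalar(Loss, bounds=(best-step, best+step), method='bounded')
print("true c =", c, "; refined estimate =", refined.x)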
I tried this method in Excel, and it works. Because Excel uses SLSQP, I tried SciPy's SLSQP solver:
from scipy.optimize import minimize
#Constraints
maxC=max(myNumbers)
def constraint1(trial_c):
    #'ineq' means fun(x)>=0, so this keeps trial_c below the largest data value
    return maxC-trial_c
#Initial value for trialC
trial_c=[df[0].diff().max()]
#Bounds for trial_c
myBounds=[(0.0000001,df[0].diff().max())]
#inequalities for trial_c (not sure if necessary)
con1 = {'type': 'ineq', 'fun': constraint1}
cons = ([con1])
solution = minimize(Loss,trial_c,method='SLSQP',\
bounds=myBounds,constraints=cons)
But it generally fails, getting stuck in a local minimum.
The question is, "which solver should I use?"
The documentation of scipy.optimize.minimize lists many different solvers, but I have no clue which one would be best for this problem.
Or is my entire approach wrong?
I haven't looked too much into your implementation; however, I have a couple of points and suggestions: