python, dataframe, scipy, coefficients

Python - Find coefficients minimizing error in csv data


I've recently run into a problem. I have data that looks like this:

Value 1 Value 2 Target
1345 4590 2.45
1278 3567 2.48
1378 4890 2.46
1589 4987 2.50
... ... ...

The data goes on for a few thousand lines.

I need to find two values (A and B) that minimize the error when the data is plugged into this equation:

Value 1 * A + Value 2 * B = Target

I've looked into scipy.optimize.curve_fit, but I can't seem to understand how it would work, because the function changes at every iteration of the data (since Value 1 and Value 2 are not the same over every row).

Any help is greatly appreciated. Thanks in advance!


Solution

  • Unfortunately, you have not provided any test data, so I generated my own:

    import pandas as pd
    import numpy as np
    from scipy.optimize import minimize
    import matplotlib.pyplot as plt
    
    def f(V1,V2,A,B): #Target function
        return V1*A+V2*B
    
    # Generate Test-Data
    def generateData(A,B): 
        np.random.seed(0)
        V1=np.random.uniform(low=1000, high=1500, size=(100,))
        V2=np.random.uniform(low=3500, high=5000, size=(100,))
        Target=f(V1,V2,A,B) +np.random.normal(0,1,100)
        return V1,V2,Target
    data=generateData(2,3) #Important: A=2 and B=3 are the true coefficients the fit should recover
    data={"Value 1":data[0], "Value 2":data[1], "Target":data[2]}
    df=pd.DataFrame(data) #Similar structure as given in Table
    

    df.head() looks like this:

        Value 1 Value 2 Target
    0   1292.0525763109854  3662.162080896163   13570.276523473405
    1   1155.0421489258965  4907.133274663096   17033.392287295104
    2   1430.7172112685223  4844.422515098364   17395.412651006143
    3   1396.0480757043242  4076.5845114488666  15022.720636830541
    4   1346.2120476329646  3570.9567326419674  13406.565815022896
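
    With the real data, the DataFrame would come from the CSV file instead of being generated. As a minimal sketch, assuming the file is comma-separated and has a header row with the column names from the question (both are assumptions), the sample rows from the question stand in for the file here:

```python
import io
import pandas as pd

# Stand-in for the real file; with an actual CSV, pass its path to pd.read_csv
csv_text = """Value 1,Value 2,Target
1345,4590,2.45
1278,3567,2.48
1378,4890,2.46
1589,4987,2.50
"""

# df_real is an illustrative name; it has the same columns as df above
df_real = pd.read_csv(io.StringIO(csv_text))
print(df_real.dtypes)
```

    A frame loaded this way can be used in place of the generated df in the code that follows, as long as the column names match.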
    

    The following code answers your question:

    ## Plot data to check whether a linear function is plausible
    
    fig=plt.figure()
    ax1=fig.add_subplot(211)
    ax2=fig.add_subplot(212)
    ax1.scatter(df["Value 1"], df["Target"])
    ax2.scatter(df["Value 2"], df["Target"])
    plt.show()
    
    
    
    def fmin(x, df): #Returns Error at given parameters
        def RMSE(y,y_target): #Definition for error term 
            return np.sqrt(np.mean((y-y_target)**2))
        A,B=x
        V1,V2,y_target=df["Value 1"], df["Value 2"], df["Target"]
        y=f(V1,V2,A,B) #Calculate target value with given parameter set
        return RMSE(y,y_target)
    
    res=minimize(fmin,x0=[1,1],args=(df,), options={"disp":True})
    print(res.x)
    

    I prefer scipy.optimize.minimize() over curve_fit because you can define the error function yourself. The details are in the scipy.optimize.minimize documentation. You need:

    • a function fun that returns the error for a given parameter set x (here fmin, which computes the RMSE)
    • an initial guess x0 (here [1,1]); if your guess is far off, the optimizer may not find a solution or (for more complex problems) may find only a local minimum
    • additional arguments args that are passed on to fun (here the data df; this is also useful for fixed parameters)
    • options={"disp":True}, which prints additional information about the run
    • the returned variable res, which holds the fitted parameters (res.x) along with further information

    For this case the result is:

    [1.9987209 3.0004212]
    

    This is close to the parameters (A=2, B=3) used to generate the data.
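
    Because the model is linear in A and B, the same coefficients can also be obtained without a hand-written error function. As a minimal sketch (not part of the answer above), assuming the same synthetic data, np.linalg.lstsq solves the least-squares problem directly, and curve_fit also works once both columns are passed together as xdata. The helper name model is illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

# Same synthetic data as in the answer: true A=2, B=3, unit-variance noise
np.random.seed(0)
V1 = np.random.uniform(low=1000, high=1500, size=100)
V2 = np.random.uniform(low=3500, high=5000, size=100)
target = V1 * 2 + V2 * 3 + np.random.normal(0, 1, 100)

# Direct linear least squares: find [A, B] minimizing ||X @ [A, B] - target||
X = np.column_stack([V1, V2])
coef_lstsq, *_ = np.linalg.lstsq(X, target, rcond=None)

# curve_fit accepts a 2-row xdata array, so every row of the table
# contributes its own (Value 1, Value 2) pair to the same model
def model(x, A, B):  # hypothetical helper name
    v1, v2 = x
    return v1 * A + v2 * B

coef_cf, _ = curve_fit(model, np.vstack([V1, V2]), target, p0=[1, 1])

print(coef_lstsq)
print(coef_cf)
```

    Both approaches minimize the sum of squared errors, which has the same minimizer as the RMSE used in fmin, so they should agree with the minimize result up to numerical tolerance.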