Search code examples
pythonscipy-optimizedata-fittingmodel-fittinglmfit

Fit a model with multiple inputs, multiple outputs, multiple parameters, and covariance matrices for each data point using Python


Note: A similar question has been asked at the Cross Validated site for the mathematical/theoretical part, as suggested by the community. The question here is mostly about the implementation of the fitting procedure in Python.

I want to fit a model that has

  1. Multiple inputs
  2. Multiple outputs
  3. Multiple parameters

Let me do a toy example to make it easier. Let's assume, for example, that I have a model for a 2D vector field which I want to fit with some data. This vector field is therefore a model with 2 inputs (the x,y positions) and 2 outputs (the u,v components of the vectors at those positions). The model can be tuned using 3 parameters (a, b and c).

The code below defines the model at hand and creates some noisy data that would constitute our set of measurements:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def model(XY, a, b, c):
    x, y = XY
    u = a*x+b*y
    v = a*x*y-c
    return u, v

# TRUE MODEL FIELD
a, b, c = -1, 7, 11 # True model parameters
x, y = np.meshgrid(np.linspace(-5, 5, 10), np.linspace(-5, 5, 10))
u, v = model((x, y), a, b, c)

# PLOT MODEL
plt.style.use('dark_background')
plt.quiver(x, y, u, v, color = 'white')
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('Model_Field.png', format = 'png', dpi = 400)
plt.close('all')

# CREATE DATA
data_size = 100
np.random.seed(0)
x_data, y_data = np.random.uniform(-5, 5, data_size), np.random.uniform(-5, 5, data_size)
u_data_expected, v_data_expected = model((x_data, y_data), a, b, c)
noise_field = [np.random.normal(0, 5, data_size), np.random.normal(0, 5, data_size)]
u_data_real, v_data_real = u_data_expected + noise_field[0], v_data_expected + noise_field[1]

# PLOT DATA
plt.quiver(x_data, y_data, u_data_real, v_data_real, color = 'crimson')
plt.quiver(x_data, y_data, u_data_expected, v_data_expected, color = 'aqua', alpha = 0.6)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('Data_Field.png', format = 'png', dpi = 400)
plt.close('all')

We can have an idea of what the model of the field looks like for a = -1, b = 7 and c = 11 with the following plot:

enter image description here

The second plot shows the fabricated (x,y,u,v) data on which we will fit the model. In blue the expected vector field at the "measured" positions, and in red the "measured" data (I added some noise):

enter image description here

So, to keep it simple, I want perform a fit to this data (red vectors in the previous figure) and obtain the best fit parameters and their covariances (this las thing is important to me). I don't mind using Scipy, lmfit or any other libraries that might be appropriate for this task.

If this can be done, I additionally want to do the fitting considering the fact that my data has:

  1. Uncertainties in the outputs (u_err,v_err)
  2. Uncertainties in the inputs (x_err,y_err)
  3. Correlations between the input data (x_y_corr)
  4. Correlations between the output data (u_v_corr)

In short, there is a covariance matrix for each of the input coordinates and a covariance matrix for each of the output/measured vectors, and all of that needs to be taken into account while fitting the model parameters.

Here I give some example covariance matrices for the data:

# CREATE DATA UNCERTAINTIES AND CORRELATIONS
# Input data (positions) uncertainties and correlations:
x_err, y_err = np.random.normal(0, 0.1, data_size), np.random.normal(0, 0.1, data_size)
x_y_corr = np.random.uniform(-1, 1, data_size)
cov_matrices_xy = np.array([[[x_err[i]**2, x_err[i]*y_err[i]*x_y_corr[i]],
                             [x_err[i]*y_err[i]*x_y_corr[i], y_err[i]**2]] for i in range(data_size)])

# Output data (vectors) uncertainties and correlations:
u_err, v_err = np.random.normal(0, 0.1, data_size), np.random.normal(0, 0.1, data_size)
u_v_corr = np.random.uniform(-1, 1, data_size)
cov_matrices_uv = np.array([[[u_err[i]**2, u_err[i]*v_err[i]*u_v_corr[i]],
                             [u_err[i]*v_err[i]*u_v_corr[i], v_err[i]**2]] for i in range(data_size)])

So, in short, my question is how can I fit a model in Python taking the 1-7 properties listed into account?


What have I tried?

Well, I've been able to satisfy points 1, 3 and 4 (so, fitting the model with all the input variables, but only on one of the output variables, and everything while considering the uncertainty in that output variable).

from scipy.optimize import curve_fit

# Redefine the model function so that it only yields one of the vector components:
def model(XY, a, b, c):
    x, y = XY
    u = a*x+b*y
    v = a*x*y-c
    return u

# FITTING THE MODEL
init_parameters = [1, 3, 5] # a, b, c
params_fit, cov_params_fit = curve_fit(f = model,
                                       xdata = (x_data, y_data),
                                       ydata = u_data_real,
                                       sigma = u_err,
                                       p0 = init_parameters)

The resulting parameters are a = -0.78162045 (quite close to the expected a = -1), b = 6.41216698 (quite close to the expected b = 7), and c = 5 (which has nothing to do with the expected c = 11, because the fitting is been performed only on the u component and the c parameter doesn't play any role on that, so it remains as it was initially). For some reason I can't get the covariance matrix for the parameters, cov_params_fit, but that might be unrelated to this question.

So, I still don't know how to do everything considering also points 2, 5, 6 and 7 (so, considering the uncertainties of the inputs and the correlations between input variables and those between output variables, and also while fitting 2D vector data not just one of the components).

To solve point 2, I could split the model function into two model functions (one per component) and fit each one separately, but then I would have two sets of a,b,c parameters and the fit would not be as good as if it was done simultaneously on both components.

I've also tried to solve 2 and 4 with this code:

# Redefine the model function so that it yields both components again:
def model(XY, a, b, c):
    x, y = XY
    u = a*x+b*y
    v = a*x*y-c
    return u, v

# FITTING THE MODEL
init_parameters = [1, 3, 5] # a, b, c
params_fit, cov_params_fit = curve_fit(f = model,
                                       xdata = (x_data, y_data),
                                       ydata = (u_data_real, v_data_real),
                                       sigma = (u_err, v_err),
                                       p0 = init_parameters)

But this gives an error. It seems that curve_fit() might not be able to handle such an advanced fitting, but perhaps someone know how to do it.

Another idea to solve 2 (the multiple outputs for the model thing) could go on the line of converting the model function to one that returns only one complex valued output, u+vj, and do something similar to the data vectors. Perhaps then the data could be fitted with a procedure that works with one output value only.


Solution

  • Almost complete Solution with ODR

    You are looking for Total Least Square which is implemented in ODR. It allows multi dimensional input/output and variance propagation.

    First, we reshape your data:

    X = np.array([x_data.ravel(), y_data.ravel()])
    U = np.array([u_data_real.ravel(), v_data_real.ravel()])
    

    And compute covariance matrices:

    Cx = (X.T - X.mean(axis=1)).T @ (X.T - X.mean(axis=1))
    Cu = (U.T - U.mean(axis=1)).T @ (U.T - U.mean(axis=1))
    

    Then we build the model to comply with ODR signature:

    def fitter(p, x):
        u = p[0] * x[0] + p[1] * x[1]
        v = p[0] * x[0] * x[1] - p[2]
        return np.array([u, v])
    

    Now we bind everything together and fit:

    model = odr.Model(fitter)
    data = odr.RealData(X, U, covx=Cx, covy=Cu)
    fit = odr.ODR(data, model, beta0=[1., 1., 1.])
    result = fit.run()
    

    It gives the following result:

    result.pprint()
    # Beta: [-1.2483959   7.10308325 11.59454377]
    # Beta Std Error: [0.10658793 0.42941014 0.58865156]
    # Beta Covariance: [[  6.49905209  -8.76721063   3.66030481]
    #  [ -8.76721063 105.48205575 -17.93062359]
    #  [  3.66030481 -17.93062359 198.22141981]]
    # Residual Variance: 0.001748098943636363
    # Inverse Condition #: 0.18760356570269432
    # Reason(s) for Halting:
    #  Sum of squares convergence
    

    Which seems compliant with your initial model parameters up to errors and fulfills all you requirements, mainly:

    • Multi dimensional input/output;
    • Covariance on input/output;
    • Covariance of the regressed parameters.

    As stated in documentation, covariance matrix is not normalized by the residual variance. You can reproduce the result as follow:

    np.sqrt(np.diag(result.cov_beta * result.res_var))
    # array([0.10658793, 0.42941014, 0.58865156])
    

    Quick partial fix with curve_fit

    If you want to stick with curve_fit, you can modify the signature of your model to stack field dimension. That would solve your field fitting problem as stated in your OP.

    def fitter(XY, a, b, c):
        x, y = XY
        u = a * x + b * y
        v = a * x * y - c
        return np.hstack([np.ravel(u), np.ravel(v)])
    

    Then you can regress your dataset:

    target = np.hstack([np.ravel(u_data_real), np.ravel(v_data_real)])
    optimize.curve_fit(fitter, (x_data, y_data), target)
    
    # (array([-1.05860421,  6.86975359, 11.70709473]),
    #  array([[ 3.45604885e-03,  2.69747434e-04, -2.08667038e-03],
    #         [ 2.69747434e-04,  3.03704636e-02, -1.62866327e-04],
    #         [-2.08667038e-03, -1.62866327e-04,  2.36032115e-01]]))
    

    Which seems to agree with your model up to errors, but does not take into account covariance on inputs as ODR can.