Search code examples
pythonscikit-learnregressionfeature-selectionsupervised-learning

Reduce Number of Measurements in Calibration


For calibration purposes I am making N measurements of water flow, each of which is time-intensive. I want to reduce the number of measurements. It sounds like this is part of feature selection as I am reducing the number of columns I have. BUT - I need to predict the measurements I will be dropping.

Here is a sample of the data:

    SerialNumber  val      speed
0   193604048   1.350254    105.0
1   193604048   1.507517    3125.0
2   193604048   1.455142    525.0
6   193604048   1.211184    12.8
7   193604048   1.238835    20.0

For each serial number I have a complete set of speed-val measurements. Ideally I would like a model whose output is the vector of all N val measurements, but it seems the options are all neural networks, which I am trying to avoid for now. Are there are any other options?

If I feed this data into a regression model, how do I differentiate between each serialNumber dataset?

To make sure my goal is clear - I want to learn the historical measurements I have of N measurements and find which speed-val I can drop to still accurately predict all N output values.

Thank you!


Solution

  • I tried to find the simplest equation that would give a good fit to the example data you posted, and from my equation search the Harris Yield Density equation, "y = 1.0 / (a + b * pow(x, c))", is an good candidate. Here is a graphical Python fitter using that equation and your data, with initial parameter estimates for the non-linear fitter calculated directly from the data max and min values. Note that SerialNumber itself is unrelated to the data and would not be used in regressions.

    My hope is that you might find this equation generally useful in your work, and it might be possible that after performing similar regressions on several different data sets that parameters a, b, and c are very similar in all cases - that is the best outcome. If your measurement accuracy is high, I personally would expect that with this three-parameter equation it should be possible to use a minimum of four data points per calibration, with max, min and two other well-spaced points along the expected calibration curve.

    Note that here the fitted parameters a = -1.91719091e-03. b = 1.11357103e+00, and c = -1.51294798e+01 yield RMSE = 3.191 and R-squared = 0.9999

    plot

    import numpy, scipy, matplotlib
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    
    xData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
    yData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
    
    
    def func(x, a, b, c): # Harris yield density equation
        return 1.0 / (a + b*numpy.power(x, c))
    
    
    initialParameters = numpy.array([0.0, min(xData), -10.0 * max(xData)])
    
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    
    modelPredictions = func(xData, *fittedParameters) 
    
    absError = modelPredictions - yData
    
    SE = numpy.square(absError) # squared errors
    MSE = numpy.mean(SE) # mean squared errors
    RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
    Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
    
    print('Parameters:', fittedParameters)
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)
    
    print()
    
    
    ##########################################################
    # graphics output section
    def ModelAndScatterPlot(graphWidth, graphHeight):
        f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
        axes = f.add_subplot(111)
    
        # first the raw data as a scatter plot
        axes.plot(xData, yData,  'D')
    
        # create data for the fitted equation plot
        xModel = numpy.linspace(min(xData), max(xData))
        yModel = func(xModel, *fittedParameters)
    
        # now the model as a line plot
        axes.plot(xModel, yModel)
    
        axes.set_title('Harris Yield Density Equation') # title
        axes.set_xlabel('Val') # X axis data label
        axes.set_ylabel('Speed') # Y axis data label
    
        plt.show()
        plt.close('all') # clean up after using pyplot
    
    graphWidth = 800
    graphHeight = 600
    ModelAndScatterPlot(graphWidth, graphHeight)
    

    UPDATE using reversed X and Y

    Per the comments, here is a three-parameter equation Mixed Power and Eponential "a * pow(x, b) * exp(c * x)" graphical fitter with X and Y reversed from the previous code. Here the fitted parameters a = 1.05910664e+00, b = 5.26304345e-02, and -2.25604946e-05 yield RMSE = 0.0003602 and R-squared= 0.9999

    plot2

    import numpy, scipy, matplotlib
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    
    xData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
    yData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
    
    
    def func(x, a, b, c): # mixed power and exponential equation
        return a * numpy.power(x, b) * numpy.exp(c * x)
    
    
    initialParameters = [1.0, 0.01, -0.01]
    
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    
    modelPredictions = func(xData, *fittedParameters) 
    
    absError = modelPredictions - yData
    
    SE = numpy.square(absError) # squared errors
    MSE = numpy.mean(SE) # mean squared errors
    RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
    Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
    
    print('Parameters:', fittedParameters)
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)
    
    print()
    
    
    ##########################################################
    # graphics output section
    def ModelAndScatterPlot(graphWidth, graphHeight):
        f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
        axes = f.add_subplot(111)
    
        # first the raw data as a scatter plot
        axes.plot(xData, yData,  'D')
    
        # create data for the fitted equation plot
        xModel = numpy.linspace(min(xData), max(xData))
        yModel = func(xModel, *fittedParameters)
    
        # now the model as a line plot
        axes.plot(xModel, yModel)
    
        axes.set_title('Mixed Power and Exponential Equation') # title
        axes.set_xlabel('Speed') # X axis data label
        axes.set_ylabel('Val') # Y axis data label
    
        plt.show()
        plt.close('all') # clean up after using pyplot
    
    graphWidth = 800
    graphHeight = 600
    ModelAndScatterPlot(graphWidth, graphHeight)