For calibration purposes I am making N measurements of water flow, each of which is time-intensive. I want to reduce the number of measurements. It sounds like this is part of feature selection as I am reducing the number of columns I have. BUT - I need to predict the measurements I will be dropping.
Here is a sample of the data:
   SerialNumber       val   speed
0     193604048  1.350254   105.0
1     193604048  1.507517  3125.0
2     193604048  1.455142   525.0
6     193604048  1.211184    12.8
7     193604048  1.238835    20.0
For each serial number I have a complete set of speed-val measurements. Ideally I would like a model whose output is the vector of all N val measurements, but it seems the options are all neural networks, which I am trying to avoid for now. Are there any other options?
If I feed this data into a regression model, how do I differentiate between each serialNumber dataset?
To make sure my goal is clear - I want to learn from the historical sets of N measurements I have and find which speed-val points I can drop while still accurately predicting all N output values.
Thank you!
I tried to find the simplest equation that would give a good fit to the example data you posted, and from my equation search the Harris Yield Density equation, "y = 1.0 / (a + b * pow(x, c))", is a good candidate. Here is a graphical Python fitter using that equation and your data, with initial parameter estimates for the non-linear fitter calculated directly from the data max and min values. Note that SerialNumber itself is unrelated to the data and would not be used in regressions.
My hope is that you might find this equation generally useful in your work. After performing similar regressions on several different data sets, you may find that parameters a, b, and c are very similar in all cases - that is the best outcome. If your measurement accuracy is high, I personally would expect that with this three-parameter equation it should be possible to use a minimum of four data points per calibration: the max, the min, and two other well-spaced points along the expected calibration curve.
Note that here the fitted parameters a = -1.91719091e-03, b = 1.11357103e+00, and c = -1.51294798e+01 yield RMSE = 3.191 and R-squared = 0.9999.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
yData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
def func(x, a, b, c): # Harris yield density equation
    return 1.0 / (a + b*numpy.power(x, c))
initialParameters = numpy.array([0.0, min(xData), -10.0 * max(xData)])
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_title('Harris Yield Density Equation') # title
    axes.set_xlabel('Val') # X axis data label
    axes.set_ylabel('Speed') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
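Once a curve has been fitted, a dropped measurement is simply read off the equation. The following is only a minimal sketch of that step: it assumes the parameter values quoted in the text above (copied by hand, not re-fitted), and the names `harris`, `val`, `speed` and `pole` are my own for illustration. It also computes where the denominator of the equation reaches zero, which matters here because the fitted a is negative.

```python
import numpy

# Parameters copied from the text above (assumed, not re-fitted here).
a, b, c = -1.91719091e-03, 1.11357103e+00, -1.51294798e+01

def harris(val):
    """Harris yield density equation: predicted speed for a given val."""
    return 1.0 / (a + b * numpy.power(val, c))

# The sample data from the question.
val = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
speed = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])

# Predicted speed at each measured val, and the resulting RMSE.
predicted = harris(val)
rmse = numpy.sqrt(numpy.mean(numpy.square(predicted - speed)))

# With a < 0 the denominator a + b*val**c is zero at this val,
# so the curve must not be extrapolated up to or past it.
pole = numpy.power(-a / b, 1.0 / c)

print('Predicted speeds:', predicted)
print('RMSE:', rmse)
print('Denominator is zero at val =', pole)
```

With these parameters the zero of the denominator lies slightly above the largest measured val, so predictions near the top of the range are very sensitive to small changes in val. That is worth checking before relying on this form to replace the highest-speed measurement.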
UPDATE using reversed X and Y
Per the comments, here is a graphical fitter for the three-parameter Mixed Power and Exponential equation, "a * pow(x, b) * exp(c * x)", with X and Y reversed from the previous code. Here the fitted parameters a = 1.05910664e+00, b = 5.26304345e-02, and c = -2.25604946e-05 yield RMSE = 0.0003602 and R-squared = 0.9999.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
yData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
def func(x, a, b, c): # mixed power and exponential equation
    return a * numpy.power(x, b) * numpy.exp(c * x)
initialParameters = [1.0, 0.01, -0.01]
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_title('Mixed Power and Exponential Equation') # title
    axes.set_xlabel('Speed') # X axis data label
    axes.set_ylabel('Val') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
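To see which speed points could be dropped, and to keep each serial number's data separate, one option is to fit every serial number on its own and refit with one point left out at a time. The sketch below is only an illustration under stated assumptions. It does not use curve_fit; it takes logs of the mixed power and exponential equation so that ln(val) = ln(a) + b*ln(speed) + c*speed becomes an ordinary linear least-squares problem, which minimizes error in log space and so gives slightly different parameters from the fit above. The helper names `fit_log_linear`, `predict` and `drop_one_errors` and the `datasets` dictionary are hypothetical, and only the one serial number from the question is included.

```python
import numpy

def fit_log_linear(speed, val):
    """Fit ln(val) = ln(a) + b*ln(speed) + c*speed by linear least squares."""
    speed = numpy.asarray(speed, dtype=float)
    val = numpy.asarray(val, dtype=float)
    design = numpy.column_stack([numpy.ones_like(speed), numpy.log(speed), speed])
    coeffs, *_ = numpy.linalg.lstsq(design, numpy.log(val), rcond=None)
    return numpy.exp(coeffs[0]), coeffs[1], coeffs[2]

def predict(speed, a, b, c):
    """Mixed power and exponential equation: val for a given speed."""
    speed = numpy.asarray(speed, dtype=float)
    return a * numpy.power(speed, b) * numpy.exp(c * speed)

def drop_one_errors(speed, val):
    """Refit without each interior speed and return its prediction error."""
    speed = numpy.asarray(speed, dtype=float)
    val = numpy.asarray(val, dtype=float)
    errors = {}
    for i in range(len(speed)):
        if speed[i] in (speed.min(), speed.max()):
            continue  # always keep the min and max points
        keep = numpy.arange(len(speed)) != i
        params = fit_log_linear(speed[keep], val[keep])
        errors[float(speed[i])] = float(predict(speed[i], *params) - val[i])
    return errors

# One entry per serial number: (speed values, val values).
datasets = {
    193604048: ([105.0, 3125.0, 525.0, 12.8, 20.0],
                [1.350254, 1.507517, 1.455142, 1.211184, 1.238835]),
}

results = {}
for serial, (speed, val) in datasets.items():
    results[serial] = drop_one_errors(speed, val)
    for dropped, err in sorted(results[serial].items()):
        print(serial, 'dropped speed', dropped, 'prediction error', err)
```

Running this over all historical serial numbers would show which speeds are consistently predicted well from the remaining points. Those are the candidates for dropping, subject to whatever accuracy your calibration requires.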