Tags: python, scikit-learn, regression, non-linear-regression

Scikit-learn regression on two variables given a 2D matrix of reference values


I have a matrix of reference values and would like to learn how Scikit-learn can be used to generate a regression model for it. I have done several types of univariate regressions in the past but it's not clear to me how to use two variables in sklearn.

I have two features (A and B) and a table of output values for certain input A/B values. See the table and 3D surface below. I'd like to translate this into a two-variable equation that relates the A/B inputs to the single output value, as shown in the table. The relationship looks nonlinear; it could be quadratic, logarithmic, etc.

How do I use sklearn to perform a nonlinear regression on this tabular data?

A/B 1000    1100    1200    1300    1400    1500    1600    1700    1800    1900    2000
0   8.78    8.21    7.64    7.07    6.50    5.92    5.35    4.78    4.21    3.63    3.06
5   8.06    7.56    7.07    6.58    6.08    5.59    5.10    4.60    4.11    3.62    3.12
10  7.33    6.91    6.50    6.09    5.67    5.26    4.84    4.43    4.01    3.60    3.19
15  6.60    6.27    5.93    5.59    5.26    4.92    4.59    4.25    3.92    3.58    3.25
20  5.87    5.62    5.36    5.10    4.85    4.59    4.33    4.08    3.82    3.57    3.31
25  5.14    4.97    4.79    4.61    4.44    4.26    4.08    3.90    3.73    3.55    3.37
30  4.42    4.32    4.22    4.12    4.02    3.93    3.83    3.73    3.63    3.53    3.43
35  3.80    3.78    3.75    3.72    3.70    3.67    3.64    3.62    3.59    3.56    3.54
40  2.86    2.93    2.99    3.05    3.12    3.18    3.24    3.31    3.37    3.43    3.50
45  2.08    2.24    2.39    2.54    2.70    2.85    3.00    3.16    3.31    3.46    3.62
50  1.64    1.84    2.05    2.26    2.46    2.67    2.88    3.08    3.29    3.50    3.70
55  1.55    1.77    1.98    2.19    2.41    2.62    2.83    3.05    3.26    3.47    3.69
60  2.09    2.22    2.35    2.48    2.61    2.74    2.87    3.00    3.13    3.26    3.39
65  3.12    3.08    3.05    3.02    2.98    2.95    2.92    2.88    2.85    2.82    2.78
70  3.50    3.39    3.28    3.17    3.06    2.95    2.84    2.73    2.62    2.51    2.40
75  3.42    3.32    3.21    3.10    3.00    2.89    2.78    2.68    2.57    2.46    2.36
80  3.68    3.55    3.43    3.31    3.18    3.06    2.94    2.81    2.69    2.57    2.44
85  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
90  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
95  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
100 3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69

[Plot: 3D surface of the table values]


Solution

  • There is probably a succinct nonlinear relationship between your A, B, and table values, but without knowledge of the underlying system or any sophisticated nonlinear modeling, here's a ridiculous model with a decent score.

    the_table = """1000    1100    1200    1300    1400    1500    1600    1700    1800    1900    2000
    0   8.78    8.21    7.64    7.07    6.50    5.92    5.35    4.78    4.21    3.63    3.06
    5   8.06    7.56    7.07    6.58    6.08    5.59    5.10    4.60    4.11    3.62    3.12
    10  7.33    6.91    6.50    6.09    5.67    5.26    4.84    4.43    4.01    3.60    3.19
    15  6.60    6.27    5.93    5.59    5.26    4.92    4.59    4.25    3.92    3.58    3.25
    20  5.87    5.62    5.36    5.10    4.85    4.59    4.33    4.08    3.82    3.57    3.31
    25  5.14    4.97    4.79    4.61    4.44    4.26    4.08    3.90    3.73    3.55    3.37
    30  4.42    4.32    4.22    4.12    4.02    3.93    3.83    3.73    3.63    3.53    3.43
    35  3.80    3.78    3.75    3.72    3.70    3.67    3.64    3.62    3.59    3.56    3.54
    40  2.86    2.93    2.99    3.05    3.12    3.18    3.24    3.31    3.37    3.43    3.50
    45  2.08    2.24    2.39    2.54    2.70    2.85    3.00    3.16    3.31    3.46    3.62
    50  1.64    1.84    2.05    2.26    2.46    2.67    2.88    3.08    3.29    3.50    3.70
    55  1.55    1.77    1.98    2.19    2.41    2.62    2.83    3.05    3.26    3.47    3.69
    60  2.09    2.22    2.35    2.48    2.61    2.74    2.87    3.00    3.13    3.26    3.39
    65  3.12    3.08    3.05    3.02    2.98    2.95    2.92    2.88    2.85    2.82    2.78
    70  3.50    3.39    3.28    3.17    3.06    2.95    2.84    2.73    2.62    2.51    2.40
    75  3.42    3.32    3.21    3.10    3.00    2.89    2.78    2.68    2.57    2.46    2.36
    80  3.68    3.55    3.43    3.31    3.18    3.06    2.94    2.81    2.69    2.57    2.44
    85  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
    90  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
    95  3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69
    100 3.43    3.35    3.28    3.21    3.13    3.06    2.99    2.91    2.84    2.77    2.69"""
    
    import io

    import numpy as np
    import pandas as pd

    # Parse the whitespace-separated table. The header row has one fewer
    # field than the data rows, so pandas uses the first column (the A
    # values) as the index; the B values become the columns.
    df = pd.read_csv(io.StringIO(the_table), sep=r"\s+")
    df.columns = df.columns.astype(np.uint64)

    # Unstack to long format: a Series of table values indexed by (B, A).
    df_unstacked = df.unstack()

    X = df_unstacked.index.tolist()  # list of (B, A) input pairs
    y = df_unstacked.to_list()       # the corresponding table values
    
    from sklearn.preprocessing import PolynomialFeatures

    # Expand each (B, A) pair into all polynomial terms up to degree 6.
    poly = PolynomialFeatures(6)
    poly_X = poly.fit_transform(X)
    
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(poly_X, y)
    print(model.score(poly_X, y))  # R^2 on the training data
    # 0.9762180339233807
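
    A caveat: that R^2 is computed on the very points the model was fit to, so it says little about how the model behaves between grid points. A minimal cross-validation sketch, reusing the poly_X and y from above, is one way to sanity-check it:

    from sklearn.model_selection import KFold, cross_val_score

    # Shuffle before splitting: the unstacked grid is ordered by column, so
    # contiguous folds would hold out whole regions of the surface at once.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), poly_X, y, cv=cv)
    print(scores.mean())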
    

    To predict a value for a given A and B with this model, you need to transform the input the same way the training data was transformed. Note that the input order follows the unstacked index, i.e. (B, A). So, for example,

    model.predict(poly.transform([(1534, 56)]))
    # array([2.75275659])
    

    Even more outrageously ridiculous ...

    # Hand-crafted extra features. Note that in each (a, b) pair, a is the
    # column value (1000-2000) and b the row value (0-100).
    more_X = [(a, b, np.log1p(a), np.log1p(b), np.cos(np.pi * b / 100)) for a, b in X]
    poly = PolynomialFeatures(5)
    poly_X = poly.fit_transform(more_X)
    model.fit(poly_X, y)
    print(model.score(poly_X, y))
    # 0.9982994398684035
    

    ... and to predict:

    # Build the same hand-crafted features for the new (B, A) point.
    more_X = [(a, b, np.log1p(a), np.log1p(b), np.cos(np.pi * b / 100)) for a, b in [(1534, 56)]]
    model.predict(poly.transform(more_X))
    # array([2.74577017])
    

    N.B.: There are probably more Pythonic ways to program these ridiculous models; one option is sketched below.
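
    For instance, sklearn's Pipeline can bundle the feature expansion and the regression into a single estimator, so predicting no longer requires transforming inputs by hand. A minimal sketch of the first model in that style (the hand-crafted features of the second model could likewise be wrapped in a FunctionTransformer):

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # The same degree-6 polynomial model as a single estimator; the
    # polynomial expansion is applied automatically on fit and predict.
    pipe = make_pipeline(PolynomialFeatures(6), LinearRegression())
    pipe.fit(X, y)
    print(pipe.score(X, y))            # same training-set R^2 as before
    print(pipe.predict([(1534, 56)]))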