Tags: python, machine-learning, scikit-learn, regression, overfitting-underfitting

PolynomialFeatures and LinearRegression return undesirable coefficients


import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

graph = pd.read_csv('graph.csv')

# Separate the target (y) from the single feature (x).
y = graph['y'].copy()
x = graph.drop('y', axis=1)

# Degree-2 polynomial features followed by ordinary least squares.
pipeline = Pipeline([('pf', PolynomialFeatures(2)), ('clf', LinearRegression())])
pipeline.fit(x, y)

# Points beyond the training range (x = 1..11) to extrapolate to.
predict = [[16], [20], [30]]

plt.plot(x, y, '.', color='blue')                               # training data
plt.plot(x, pipeline.predict(x), '-', color='black')            # fitted curve
plt.plot(predict, pipeline.predict(predict), 'o', color='red')  # extrapolated points
plt.show()

My graph.csv:

x,y
1,1
2,2
3,3
4,4
5,5
6,5.5
7,6
8,6.25
9,6.4
10,6.6
11,6.8

The result produced:

[plot: the fitted quadratic bends downward past the training data, so the red extrapolated points decrease as x grows]

It is clearly producing wrong predictions: y should increase as x increases.

What am I missing? I tried changing the degree, but it doesn't get much better. With degree 4, for example, y increases very rapidly.
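
For reference, a quick way to see what the model actually learned is to print the fitted coefficients (a minimal diagnostic sketch reusing the `pipeline` from the code above):

lr = pipeline.named_steps['clf']
# PolynomialFeatures(2) expands x into the columns [1, x, x^2].
# A negative coefficient on the x^2 term would explain why the
# curve bends downward past the training range.
print('intercept:', lr.intercept_)
print('coefficients for [1, x, x^2]:', lr.coef_)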


Solution

  • @iacob provided a very good answer, which I will only extend.

    If you are certain that y should increase with each x, then perhaps your data points follow a logarithmic pattern. Adapting your code accordingly yields this curve:

    [plot: log-scale fitting, with the fitted curve now increasing monotonically through the data points]

    Here is the code snippet if that corresponds to what you are looking for:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    
    graph = pd.read_csv('graph.csv')
    
    y = graph['y'].copy()
    x = graph.drop('y', axis=1)
    
    # Work in log space: a logarithmic trend becomes a straight line.
    x_log = np.log(x)
    
    # Degree 1 suffices, since the relationship is linear in log(x).
    pipeline = Pipeline([('pf', PolynomialFeatures(1)), ('clf', LinearRegression())])
    pipeline.fit(x_log, y)
    
    # New points must be log-transformed the same way before predicting.
    predict = np.log([[16], [20], [30]])
    
    # Exponentiate back so the plot uses the original x scale.
    plt.plot(np.exp(x_log), y, '.', color='blue')                           # training data
    plt.plot(np.exp(x_log), pipeline.predict(x_log), '-', color='black')    # fitted curve
    plt.plot(np.exp(predict), pipeline.predict(predict), 'o', color='red')  # predictions
    plt.show()
    

    Notice that we are merely doing polynomial regression (here degree 1, i.e. plain linear regression, is sufficient) on the logarithm of the x data points (x_log).
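
    As a further convenience, the log transform can live inside the pipeline itself, so that predict() accepts raw x values directly. Here is a minimal alternative sketch (assuming the same graph.csv) that uses sklearn's FunctionTransformer in place of the manual np.log calls:

    import numpy as np
    import pandas as pd
    
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LinearRegression
    
    graph = pd.read_csv('graph.csv')
    y = graph['y'].copy()
    x = graph.drop('y', axis=1)
    
    # FunctionTransformer applies np.log to the input before the
    # linear regression step, so no manual transform is needed.
    log_pipeline = Pipeline([
        ('log', FunctionTransformer(np.log)),
        ('clf', LinearRegression()),
    ])
    log_pipeline.fit(x, y)
    
    # Raw x values can now be passed straight to predict().
    print(log_pipeline.predict([[16], [20], [30]]))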