Tags: python, scikit-learn, sklearn-pandas

Sklearn - price prediction using cross validation


This is my code:

from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split

%matplotlib inline

boston_properties = load_boston()

l_distance = boston_properties['data'][:, np.newaxis, 7]
linreg = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(l_distance, boston_properties['target'], test_size = 0.3)
y_pred = cross_val_predict(linreg, l_distance, boston_properties.target, cv=5)

plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=2)
plt.show()
print(y_pred.shape)

The error which I'm receiving is the following:

ValueError: x and y must have same first dimension, but have shapes (152, 1) and (506,)

How can I make this work?


Solution

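  • You call train_test_split, but the split is never used: cross_val_predict runs on the full dataset, so y_pred has 506 entries (one per sample), while X_test has only 152 rows. Plotting X_test against y_pred therefore fails with the shape error. Predict on the same subset that you plot, for example the training split: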

    l_distance = boston_properties['data'][:, np.newaxis, 7]
    linreg = LinearRegression()
    
    X_train, X_test, y_train, y_test = train_test_split(l_distance,
        boston_properties['target'], test_size=0.3)  # now you have a train/test set
    # out-of-fold predictions: one per row of X_train, so the shapes match
    y_pred = cross_val_predict(linreg, X_train, y_train, cv=5)
    
    plt.scatter(X_train, y_train, color='black')
    plt.plot(X_train, y_pred, color='blue', linewidth=2)
    plt.show()
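
The shape mismatch can be checked without plotting anything. This is only a minimal sketch: load_boston was removed in scikit-learn 1.2, so it uses random stand-in data with the same number of rows (506) and a single feature. The data, the seed, and the names pred_full and pred_train are assumptions for illustration, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Stand-in data (an assumption): 506 rows and one feature, like the Boston column
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 12.0, size=(506, 1))
y = 18.0 + 1.1 * X[:, 0] + rng.normal(0.0, 8.0, size=506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

linreg = LinearRegression()
# What the question does: predictions for all 506 rows
pred_full = cross_val_predict(linreg, X, y, cv=5)
# What the fix does: predictions only for the training rows
pred_train = cross_val_predict(linreg, X_train, y_train, cv=5)

print(X_test.shape, pred_full.shape)    # (152, 1) (506,)
print(X_train.shape, pred_train.shape)  # (354, 1) (354,)
```

The first pair of shapes reproduces the mismatch from the error message; the second pair lines up, which is why the corrected plot works.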
    


    Edit: To draw a single straight line instead, connect the predictions at the smallest and largest X values:

    plt.scatter(X_train, y_train, color='black')
    plt.plot([X_train[np.argmin(X_train)], X_train[np.argmax(X_train)]],
             [y_pred[np.argmin(X_train)], y_pred[np.argmax(X_train)]],
             color='blue')
    plt.show()
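
One caveat: cross_val_predict returns out-of-fold predictions, so the two endpoints above may come from models fitted on different folds. If you want the line of one fitted model, a possible alternative (not from the original answer) is to fit on the training split and predict at the two extremes. This sketch again uses random stand-in data, and cv_r2, x_line and y_line are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data (an assumption): 506 rows, one feature
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 12.0, size=(506, 1))
y = 18.0 + 1.1 * X[:, 0] + rng.normal(0.0, 8.0, size=506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Cross-validation on the training split: one R^2 score per fold
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=5)

# A single model fitted on the whole training split defines one straight line
linreg = LinearRegression().fit(X_train, y_train)
x_line = np.array([[X_train.min()], [X_train.max()]])  # shape (2, 1)
y_line = linreg.predict(x_line)                        # shape (2,)

print("R^2 per fold:", cv_r2)
print("held-out R^2:", linreg.score(X_test, y_test))
```

To draw it, call plt.plot(x_line[:, 0], y_line, color='blue') after the scatter call. The test split stays untouched until the final score, which is the usual reason for making the split in the first place.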
    
