This is my code:
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
%matplotlib inline
boston_properties = load_boston()
l_distance = boston_properties['data'][:, np.newaxis, 7]
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(l_distance, boston_properties['target'], test_size = 0.3)
y_pred = cross_val_predict(linreg, l_distance, boston_properties.target, cv=5)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=2)
plt.show()
print(y_pred.shape)
The error which I'm receiving is the following:
ValueError: x and y must have same first dimension, but have shapes (152, 1) and (506,)
How can I make this work?
You made a train_test_split
, but you're not using it to train the model. Then you predict on your entire training data, and compare it with y_test
. This makes no sense. Use these lines instead:
l_distance = boston_properties['data'][:, np.newaxis, 7]
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(l_distance,
boston_properties['target'], test_size = 0.3) # now you have a train/test set
y_pred = cross_val_predict(linreg, X_train, y_train, cv=5)
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, y_pred, color='blue', linewidth=2)
plt.show()
Edit: You can also use this line to make a straight line through your points:
plt.scatter(X_train, y_train, color='black')
plt.plot([X_train[np.argmin(X_train)], X_train[np.argmax(X_train)]],
[y_pred[np.argmin(X_train)], y_pred[np.argmax(X_train)]],
color='blue')
plt.show()