python, scikit-learn, pca

Plotting multidimensional data on a graph


I have a dataset with 1700 rows, each holding 9 features of a house, and an array with the price corresponding to each row. I have built a linear regression model on this data and would like to visualize it. Can I somehow reduce these 9 features to a single feature using PCA? I have tried this, but I still get an error saying that x and y need to be the same size.

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.decomposition import PCA

clf = linear_model.LinearRegression()
pca = PCA(n_components = 9)
# features contains 1700 rows of 9 feature data, y contains 1700 price values
x_train, x_test, y_train, y_test = train_test_split(features, y)
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
plt.scatter(x_train, y_train)
plt.show()

The error occurs at the plt.scatter() call; features.shape = (17000, 9), y.shape = (17000, 1)


Solution

  • The error is caused by the mismatched sizes of the inputs to plt.scatter, which expects x and y to contain the same number of values. After the PCA transform with n_components = 9, x_train still has 9 columns per row, whereas y_train has a single column.

    You can fix the error by indexing x_train so that only a single column of it is plotted.

    x_train[:, 0] selects the first column of x_train, which after the transform is the first principal component.

    An alternative approach is to set n_components to 1. The n_components parameter determines how many components (features) PCA returns: setting it to 1 returns one column, setting it to 9 returns nine.
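
Both fixes can be sketched as follows. This is a minimal illustration rather than the asker's exact setup: the housing data is not available here, so `features` and `y` are random stand-ins with the shapes described in the question (1700 rows, 9 features), and the names `first_component`, `x_train_1d` and `x_test_1d` are made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption, not the real dataset):
# 1700 rows of 9 features and one price per row.
rng = np.random.default_rng(0)
features = rng.normal(size=(1700, 9))
y = features @ rng.normal(size=(9, 1)) + rng.normal(scale=0.1, size=(1700, 1))

x_train, x_test, y_train, y_test = train_test_split(features, y, random_state=0)

# Option 1: keep all 9 components and pick out the first column for plotting.
pca_full = PCA(n_components=9).fit(x_train)
first_component = pca_full.transform(x_train)[:, 0]

# Option 2: ask PCA for a single component, so the output has one column.
pca_one = PCA(n_components=1).fit(x_train)
x_train_1d = pca_one.transform(x_train)
x_test_1d = pca_one.transform(x_test)

# Fit the regression on the single component and predict on the test split.
clf = LinearRegression().fit(x_train_1d, y_train)
pred = clf.predict(x_test_1d)

# x and y now hold the same number of values, so scatter accepts them.
plt.scatter(x_train_1d, y_train, s=5, label="training data")
order = np.argsort(x_test_1d[:, 0])  # sort so the fitted line is drawn left to right
plt.plot(x_test_1d[order, 0], pred[order, 0], color="red", label="linear fit")
plt.xlabel("first principal component")
plt.ylabel("price")
plt.legend()
```

Call plt.show() afterwards to display the figure. Note that the fitted line here belongs to a model trained on the single component only, so it is not the same model as one trained on all nine features.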