I have the MovieLens dataset and want to apply PCA to it, but the sklearn PCA function does not seem to do it correctly.
I have a 718*8913 matrix whose rows are users and whose columns are movies.
Here is my Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)

def replace_name(x):
    # Look up the movie title for a given movieId
    return movies[movies['movieId']==x].title.values[0]

ratings.movieId = ratings.movieId.map(replace_name)
# Build the user x movie rating matrix; unrated movies become 0
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)
X_std = StandardScaler().fit_transform(df1)
pca = PCA()
result = pca.fit_transform(X_std)
print(result.shape)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
I didn't set the number of components, so I expected PCA to return a 718*8913 matrix in the new space. Instead the PCA result is 718*718, pca.explained_variance_ratio_ has 718 entries, and those entries sum to 1. How is this possible?
I have 8913 features, yet only 718 components are returned and their explained variance ratios sum to 1. Can anyone explain what is wrong here?
My resulting plot (cumulative explained variance against number of components):
As the plot shows, it contains only 718 components and their ratios sum to 1, but I have 8913 features. Where have they gone?
I even tried the scikit-learn PCA example from the PCA documentation page. I changed the example only by increasing the number of features:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1, 3, 4, -1, -1, 3, 4], [-2, -1, 5, -1, -1, 3, 4, 2], [-3, -2, 1, -1, -1, 3, 4, 1],
              [1, 1, 4, -1, -1, 3, 4, 2], [2, 1, 0, -1, -1, 3, 4, 2], [3, 2, 10, -1, -1, 3, 4, 10]])
ipca = PCA(n_components=7)
print(X.shape)
ipca.fit(X)
result = ipca.transform(X)
print(result.shape)
In this example we have 6 samples and 8 features. I set n_components to 7, but the result size is 6*6.
I think that when the number of features is bigger than the number of samples, the maximum number of components scikit-learn PCA will return is equal to the number of samples.
See the documentation on PCA.
Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components. You have fewer samples than features (718 against 8913), which is why you get a reduced feature set of size n_samples.
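A minimal sketch of that default, assuming a small random matrix (6 samples, 8 features) in place of the MovieLens data; the variable names here are only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: fewer samples (6) than features (8)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))

pca = PCA()  # n_components left unset
result = pca.fit_transform(X)

# The number of components falls back to min(n_samples, n_features)
print(X.shape)
print(result.shape)
print(pca.n_components_)
```

Here the transformed data has 6 columns, not 8. As far as I know, recent scikit-learn releases raise a ValueError if you explicitly ask for more components than min(n_samples, n_features), rather than silently capping the value as in the question's second example.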
I believe your explained variance ratios sum to 1 because you didn't set n_components. From the documentation:
If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.
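To check this yourself, here is a small sketch on made-up random data (not the MovieLens matrix) that compares the sum of explained_variance_ratio_ with and without n_components set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 6 samples, 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))

# All components kept: the ratios account for the whole variance
full = PCA().fit(X)
full_sum = full.explained_variance_ratio_.sum()

# Only 2 components kept: the ratios cover just part of the variance
partial = PCA(n_components=2).fit(X)
partial_sum = partial.explained_variance_ratio_.sum()

print(full_sum)     # approximately 1.0
print(partial_sum)  # strictly less than 1.0 for this data
```

So a sum of 1 does not mean features were lost. It only says that the components kept together capture all of the variance present in the data.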