Search code examples
pythonscikit-learnpca

apply sklearn PCA on movielens dataset


I have movielens dataset which I want to apply PCA on it, but sklearn PCA function dose not seems to do it correctly.
I have 718*8913 matrix which rows indicate the users and columns indicate movies here is my python code :

Load movie names and movie ratings

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)
def replace_name(x):
    return movies[movies['movieId']==x].title.values[0]
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
df1 = M.replace(np.nan, 0, regex=True)

Standardizing

X_std = StandardScaler().fit_transform(df1)

Apply PCA

pca = PCA()
result = pca.fit_transform(X_std)
print result.shape
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

I did't set any component number so I expect that PCA return 718*8913 matrix in new dimension but pca result size is 718*718 and pca.explained_variance_ratio_ size is 718, and sum of all members of it is 1, but how this is possible!!!
I have 8913 features and it return only 718 and sum of variance of them is equal to 1 can any one explain what is wrong here ?
my plot picture result: enter image description here As you can see in the above picture it just contain 718 component and sum of it is 1 but I have 8913 features where they gone?

Test with smaller example

I even try with scikit learn PCA example which can be found in documentation page of pca Here is the Link I change the example and just increase the number of features

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
X = np.array([[-1, -1,3,4,-1, -1,3,4], [-2, -1,5,-1, -1,3,4,2], [-3, -2,1,-1, -1,3,4,1],
[1, 1,4,-1, -1,3,4,2], [2, 1,0,-1, -1,3,4,2], [3, 2,10,-1, -1,3,4,10]])
ipca = PCA(n_components = 7)
print (X.shape)
ipca.fit(X)
result = ipca.transform(X)
print (result.shape);

and in this example we have 6 sample and 8 feauters I set the n_components to 7 but the result size is 6*6.
I think when the number of features is bigger than number of samples the maximum number of components scikit learn pca will return is equal to number of samples


Solution

  • See the documentation on PCA. Because you did not pass an n_components parameter to PCA(), sklearn uses min(n_samples, n_features) as the value of n_components, which is why you get a reduced feature set equal to n_samples.

    I believe your variance is equal to 1 because you didn't set the n_components, from the documentation:

    If n_components is not set then all components are stored and the sum of explained variances is equal to 1.0.