I am using scikit-learn's PCA to find the principal components of a dataset with about 20,000 features and 400+ samples.
However, when I compare with Orange3's PCA, which is supposed to use scikit-learn's PCA under the hood, I get different results. I also unchecked the normalization option offered by the Orange3 PCA widget.
With scikit-learn, the first principal component accounts for ~14% of the total variance, the second for ~13%, and so on.
With Orange3 I get a very different result (~65% of the variance for the first principal component, and so on):
My code using scikit-learn is the following:
import pandas as pd
from sklearn.decomposition import PCA

# matrix.csv has features on the rows and samples on the columns,
# so transpose to the samples-by-features layout scikit-learn expects
matrix = pd.read_table("matrix.csv", sep='\t', index_col=0)

sk_pca = PCA(n_components=None)
result = sk_pca.fit(matrix.T.values)
print(result.explained_variance_ratio_)
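As a quick sanity check on the orientation issue, here is a minimal sketch with a small random matrix standing in for matrix.csv (the shapes and data are made up for illustration; scikit-learn's PCA always treats rows as samples and columns as features):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for matrix.csv: 20 features (rows) x 5 samples (columns)
rng = np.random.default_rng(0)
matrix = pd.DataFrame(rng.normal(size=(20, 5)))

# After transposing, rows are samples and columns are features,
# which is the layout scikit-learn expects
print(matrix.T.shape)  # (5, 20)

sk_pca = PCA(n_components=None)
sk_pca.fit(matrix.T.values)

# With all components kept, the ratios sum to 1 (up to floating-point error)
print(sk_pca.explained_variance_ratio_.sum())
```

Printing the shape before fitting makes it obvious whether the samples ended up on the rows.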
With Orange3, I loaded the CSV using the File widget, then connected it to the PCA widget, in which I unchecked the normalization option.
Where does the difference between the two methods come from?
Thanks to K3---rnc's answer, I inspected how the data were loaded.
The data were loaded correctly and there were no missing values. The problem was that Orange3 puts the features on the columns and the samples on the rows, which is the opposite of what I was expecting.
So I transposed the data, and the result now matches the one given by the scikit-learn module:
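To illustrate why the orientation matters so much, here is a small sketch on synthetic data (not the asker's matrix): fitting PCA on X and on X.T answers two different questions, because scikit-learn always interprets rows as samples, and the explained variance ratios come out differently.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 10 samples x 50 features
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 50))

# Fit once with samples on the rows (correct for this layout),
# and once on the transpose, where rows and columns are swapped
ratio_samples_in_rows = PCA().fit(X).explained_variance_ratio_
ratio_features_in_rows = PCA().fit(X.T).explained_variance_ratio_

print(ratio_samples_in_rows[0])   # share of variance on PC1, samples in rows
print(ratio_features_in_rows[0])  # a different value once rows/columns swap
```

The first value changes simply because transposing redefines what counts as an observation, which is exactly the discrepancy seen between the two tools.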
Thanks