
Why does sklearn's PCA not return reproducible results?


I am using scikit-learn's PCA and have noticed some really strange behavior. Essentially, when fitting on more than 500 samples, the result is not reproducible. This example shows what's happening:

import numpy as np
from sklearn.decomposition import PCA

Ncomp = 15
Nsamp = 501
Nfeat = 30

PCAnalyzer = PCA(n_components = Ncomp)
ManySamples = np.random.rand(Nsamp, Nfeat)
TestSample = np.ones((1, Nfeat))

print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))

It outputs:

>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.25641111  0.42327221  0.4616427  -0.72047479 -0.12386481  0.10608497
   0.28739712 -0.26003239  1.27305465  1.05307604 -0.53915119 -0.07127874
   0.25312454 -0.12052255 -0.06738885]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.26656397  0.42293446  0.45487161 -0.7339531  -0.16134778  0.15389179
   0.27052166 -0.33565591  1.26289845  0.96118269  0.5362569  -0.54688338
   0.08329318 -0.08423136 -0.00253318]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.21899525  0.38527988  0.45101669 -0.73443888 -0.20501978  0.09640448
   0.17826649 -0.37653009  1.04856884  1.10948052  0.60700417 -0.39864793
   0.18020651  0.08061955  0.05383696]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.27070256  0.41532602  0.45936926 -0.73820121 -0.18160026 -0.13139435
   0.28015907 -0.28144421  1.16554587  1.00472104  0.16983399 -0.67157762
  -0.3005816   0.54645421  0.09807374]]

Reducing the number of samples (Nsamp) to 500 or fewer, or increasing the number of components (Ncomp) to 20 or more, fixes the issue, but neither is practical for me.


Solution

  • This is because of the default svd_solver='auto' policy used by sklearn's PCA. From the docs:

    the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    With 501 samples and 15 components you fall into the 'randomized' branch, so each fit uses a different random initialization. If you need reproducible results, either force the exact solver with svd_solver='full', or keep the randomized solver and fix its random_state parameter, as in the sketch below.
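
    A minimal sketch of both fixes, reusing Ncomp, Nsamp and Nfeat from the question (the seeded data generator is an assumption added here so the snippet is self-contained):

    import numpy as np
    from sklearn.decomposition import PCA

    Ncomp = 15
    Nsamp = 501
    Nfeat = 30

    rng = np.random.RandomState(0)          # assumed seed, just to make the demo data fixed
    ManySamples = rng.rand(Nsamp, Nfeat)
    TestSample = np.ones((1, Nfeat))

    # Option 1: force the exact full SVD, which is deterministic
    pca_full = PCA(n_components=Ncomp, svd_solver='full')
    print(pca_full.fit(ManySamples).transform(TestSample))
    print(pca_full.fit(ManySamples).transform(TestSample))   # identical to the line above

    # Option 2: keep the faster randomized solver, but seed it
    pca_seeded = PCA(n_components=Ncomp, svd_solver='randomized', random_state=42)
    print(pca_seeded.fit(ManySamples).transform(TestSample))
    print(pca_seeded.fit(ManySamples).transform(TestSample))  # identical to the line above

    Option 1 trades some speed for exactness; option 2 keeps the randomized algorithm but makes its output repeatable across fits on the same data.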