scikit-learn, pca

What does `fit_transform` do in the context of Scikit Learn PCA?


I don't understand what fit_transform does compared to fit in the context of Scikit Learn and PCA.

PCA takes some data and computes a set of eigenvectors, where each vector is orthogonal to all the others and points in the direction of maximum remaining variance.

Put another way, the first eigenvector found is oriented along the axis of maximal data variance.

What transformation does fit_transform do, and what interpretation does it have in the context of PCA?

In other words, what transformation is being done by the transform step?


Solution

  • In simple terms:

    • fit(X): Conceptually, calculates the eigenvectors of the covariance matrix of X (an eigendecomposition) and stores the per-feature mean of X. You can retrieve the eigenvectors after you have fit the PCA (pca.fit(X)) via pca.components_, and the mean via pca.mean_.
    • transform(X): Converts the input data from the input vector space to the PCA vector space, that is, the vector space defined by the eigenvectors obtained from the PCA algorithm. Concretely, it subtracts the mean learned in fit and projects the centered data onto the eigenvectors. The transformed data are commonly referred to as the principal components (PCs), or more precisely the principal component scores.
    • fit_transform(X): Combines both steps: first finding the eigenvectors, then projecting the same data onto them.

    In practice, Scikit-learn's PCA implementation typically uses Singular Value Decomposition (SVD) of the mean-centered X rather than an explicit eigendecomposition. A single SVD yields both the eigenvectors and the projected data, so fit_transform(X) can return the projection directly. However, fit() only stores the eigenvectors and the mean, so if you have new data to project into the principal component space, you'll need the transform() method to do that projection.

    Note on Scikit-learn's terminology: eigenvectors = components_ (one eigenvector per row, so the shape is n_components x n_features).
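
To make the points above concrete, here is a minimal sketch. The toy data, the variable names, and the choice of svd_solver="full" are assumptions made for illustration and are not part of the original question. It checks that fit followed by transform gives the same result as fit_transform, that transform amounts to centering and projecting onto the rows of components_, and that those rows match the right singular vectors of the centered data up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples with 3 correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X_new = rng.normal(size=(5, 3)) @ mixing  # unseen samples to project later

# Two-step route: fit() learns the mean and eigenvectors, transform() projects.
# svd_solver="full" is chosen here only to pin the example to the exact SVD path.
pca = PCA(n_components=2, svd_solver="full").fit(X)
scores_two_step = pca.transform(X)

# One-step route: fit_transform() fits and projects the same data in one call.
scores_one_step = PCA(n_components=2, svd_solver="full").fit_transform(X)

# What transform() does by hand: subtract the learned mean, then project
# onto the eigenvectors stored as the rows of components_.
scores_manual = (X - pca.mean_) @ pca.components_.T

# The eigenvectors are the right singular vectors of the centered data;
# each vector is only defined up to sign, hence the comparison of absolute values.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

same_one_vs_two_step = np.allclose(scores_two_step, scores_one_step)
same_manual = np.allclose(scores_two_step, scores_manual)
same_axes = np.allclose(np.abs(Vt[:2]), np.abs(pca.components_))
print(same_one_vs_two_step, same_manual, same_axes)

# New data reuses the already fitted mean and eigenvectors: call transform(),
# not fit_transform(), which would refit the PCA on the new samples instead.
new_scores = pca.transform(X_new)
print(new_scores.shape)
```

The last step is the practical reason for keeping the two methods separate: calling fit_transform on new data would learn a different set of axes, so the new scores would no longer be comparable to the original ones.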