Issue finding the RMSE of a PCA reconstruction in Python


I am trying to find the root mean squared error (RMSE) between an original sample from Xdata and its reconstruction recon for different numbers of components. However, when I use the code below:

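import math
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

components = [2, 6, 10, 20]
for n in components:
    pca = PCA(n_components=n)
    # PCA is fitted on a single sample here
    recon = pca.inverse_transform(pca.fit_transform(Xdata[0].reshape(1, -1)))
    rmse = math.sqrt(mean_squared_error(Xdata[0].reshape(1, -1), recon))
    print("RMSE: {} with {} components".format(rmse, n))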

I always get an RMSE of 0.0, whatever the number of components. Why?

For reference, this is what Xdata[0] holds:

array([-8.47058824e-06, -6.12352941e-05, -3.18529412e-04, -1.09905882e-03, -2.64370588e-03, -4.39111765e-03, -8.70000000e-03, -2.35560000e-02, -6.03388235e-02, -1.52837471e-01, -3.48945353e-01, -4.86196588e-01, -5.51568706e-01, -5.38629706e-01, -5.34948000e-01, -5.70773824e-01, -5.45583000e-01, -4.30446353e-01, -2.76558000e-01, -1.10208882e-01, -4.35031765e-02, -2.09613529e-02, -1.25080588e-02, -9.00317647e-03, -5.04900000e-03, -2.75576471e-03, -1.03394118e-03, -1.78058824e-04, -7.53529412e-05, -2.54647059e-04])


Solution

  • PCA is a type of dimensionality reduction. To quote Wikipedia:

    It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.

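    Your data, Xdata[0], is a single sample. PCA centers the data it is fitted on, and centering one sample leaves only zeros, so there is no variance to capture and the reconstruction is just the sample itself. That is why the RMSE is always 0.0. How much more can you reduce it?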
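The zero error can be reproduced with plain NumPy, without scikit-learn. This is only a minimal sketch of the centering and projection steps, not scikit-learn's actual implementation, and the random `sample` is a hypothetical stand-in for Xdata[0]:

```python
import numpy as np

# Hypothetical single sample with 30 features (stand-in for Xdata[0]).
rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 30))

# PCA subtracts the per-feature mean of the data it is fitted on.
# With one sample, that mean is the sample itself.
mean = sample.mean(axis=0)
centered = sample - mean  # every entry is zero

# Projecting an all-zero matrix gives zeros, so the reconstruction
# (projection mapped back, plus the mean) is just the mean again.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T
recon = scores @ vt + mean

rmse = np.sqrt(np.mean((sample - recon) ** 2))
print(rmse)  # prints 0.0
```

Whatever values the sample holds, the centered matrix is all zeros, so the result does not depend on the number of components.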

    If the goal is to test the RMSE for the first entry, you still need to fit the PCA on the full dataset (to capture the variance) and then compute the RMSE for that one data point only (though it might not be very meaningful, because with a single sample the error is averaged only over that sample's features).

    You can see below:

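    import numpy as np
    from sklearn import datasets
    from sklearn.decomposition import PCA
    from sklearn.metrics import mean_squared_error

    iris = datasets.load_iris()
    Xdata = iris.data  # 150 samples, 4 features

    components = [2, 3]
    for n in components:
        # Fit on the full dataset so PCA has variance to capture.
        pca = PCA(n_components=n)
        recon = pca.inverse_transform(pca.fit_transform(Xdata))
        # RMSE for the first sample only. np.sqrt is used instead of the
        # `squared=False` argument, which newer scikit-learn releases deprecate.
        rmse = np.sqrt(mean_squared_error(Xdata[0], recon[0]))
        print("RMSE: {} with {} components".format(rmse, n))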
    

    The output:

    RMSE: 0.014003180182090432 with 2 components
    RMSE: 0.0011312185356586826 with 3 components
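
The iris data has only four features, so the component counts from the question (2, 6, 10, 20) cannot be tried on it. The sketch below applies the same approach to a 30-feature array. The random `Xdata` here is a hypothetical stand-in, not the asker's data, and `svd_solver="full"` is chosen only to keep the decomposition exact:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for Xdata: 200 samples, 30 features.
rng = np.random.default_rng(0)
Xdata = rng.normal(size=(200, 30))

components = [2, 6, 10, 20]
overall_rmse = {}
first_sample_rmse = {}
for n in components:
    # Fit on all samples, then reconstruct all samples.
    pca = PCA(n_components=n, svd_solver="full")
    recon = pca.inverse_transform(pca.fit_transform(Xdata))
    residual = Xdata - recon
    # RMSE over the whole dataset and over the first sample's features.
    overall_rmse[n] = float(np.sqrt(np.mean(residual ** 2)))
    first_sample_rmse[n] = float(np.sqrt(np.mean(residual[0] ** 2)))
    print("{} components: overall RMSE {:.4f}, sample 0 RMSE {:.4f}".format(
        n, overall_rmse[n], first_sample_rmse[n]))
```

Both errors shrink as components are added, because each larger set of principal components contains the smaller one. Reporting the overall figure next to the single-sample figure makes it easier to judge whether the first sample is typical.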