Tags: python, numpy, indexing, scikit-learn, pca

Preserve index after PCA in scikit-learn


I would like to preserve the Unique_id column of my dataframe my_df so that I can use it in further analysis after applying PCA. However, the column is lost during dimensionality reduction. Is there a way to keep it?

my_df looks like this:


    Unique_id EndValue Peak_val Score       f1          f2          f3          f4          f5          f6
0   15         44.5   46.5     377.17   38.366667   17.757471   -0.610802   -1.028477   45.372727   0.150168
1   15         45.0   47.0     1268.37  46.909091   0.090909    -2.846050   6.100000    46.909091   0.090909
2   18         45.0   47.0     373.16   45.030303   0.030303    5.480078    28.031250   45.651685   0.252298
3   18         45.0   47.0     369.68   45.000000   0.000000    0.000000    -3.000000   46.052632   0.052632
4   19         45.0   47.0     1414.97  46.000000   0.000000    0.000000    -3.000000   46.000000   0.000000

I followed the method suggested in this thread to preserve the index of the original data and attach it to my_df after applying PCA.

During PCA, I removed the Unique_id column as it is not a feature. I then tried:

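import pandas as pd
from sklearn.decomposition import PCA

def pca2(data, n, pc_count=None):
    # pc_count is unused; fit PCA and return the n projected components
    return PCA(n_components=n).fit_transform(data)

results_2d = pca2(my_df, 2)

# Reuse the original row index for the PCA output
df_temp = pd.DataFrame(results_2d, index=my_df.index)
df_temp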

Output looked like this:

     0              1
0   -1.863714e+03   -14.793301
1   -2.754914e+03   -10.330997
2   -1.859704e+03   23.473387
3   -1.856224e+03   5.703049
4   -2.901514e+03   -19.540132
5   -1.786054e+03   17.621220
6   -2.555565e+03   38.636828
7   -1.667134e+03   11.647753

I realise that it's hard to verify that the indexing is correct. Is there a way I can check my results? And is there a better way to preserve the actual Unique_id?


Solution

  • You can use ColumnTransformer to apply PCA to the feature columns only and pass the column you want to preserve through unchanged:

    >>> import pandas as pd
    >>> from sklearn.compose import ColumnTransformer
    >>> from sklearn.decomposition import PCA
    >>> df = pd.DataFrame({'c1': [1, 2, 3, 4],
    ...                    'c2': [3., 5.5, 8., 10.5],
    ...                    'c_to_preserve': [-5, -3, 6, 10]})
    >>> featurizer = ColumnTransformer([('pca', PCA(n_components=1), ['c1', 'c2']),
    ...                                 ('preserve', 'passthrough', ['c_to_preserve'])])
    >>> featurizer.fit_transform(df)
    array([[ 4.03887361, -5.        ],
           [ 1.3462912 , -3.        ],
           [-1.3462912 ,  6.        ],
           [-4.03887361, 10.        ]])
    

    See the documentation for sklearn.compose.ColumnTransformer for more information.
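
    On the question of checking the alignment, here is a minimal sketch, assuming it is acceptable to run PCA on the feature columns only and re-attach Unique_id afterwards. The toy values, the row labels 10 to 14, and the names features, result, pc1 and pc2 are made up for illustration and are not from the question. PCA's fit_transform returns one output row per input row, in the same order, so reusing my_df.index keeps the rows aligned. One way to spot-check this is to transform a single row selected by label and compare it with the stored row.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-in for my_df (illustrative values, non-default index on purpose).
my_df = pd.DataFrame(
    {
        "Unique_id": [15, 15, 18, 18, 19],
        "f1": [38.4, 46.9, 45.0, 45.0, 46.0],
        "f2": [17.8, 0.1, 0.0, 0.0, 0.0],
        "f3": [-0.6, -2.8, 5.5, 0.0, 0.0],
    },
    index=[10, 11, 12, 13, 14],
)

# Run PCA on the feature columns only; Unique_id is an identifier, not a feature.
features = my_df.drop(columns="Unique_id")
pca = PCA(n_components=2)
components = pca.fit_transform(features)

# Wrap the array in a DataFrame that reuses the original index,
# then put Unique_id back as the first column (aligned on that index).
result = pd.DataFrame(components, columns=["pc1", "pc2"], index=my_df.index)
result.insert(0, "Unique_id", my_df["Unique_id"])

# Spot check: projecting one row by label should reproduce the stored row.
row_label = 12
single = pca.transform(features.loc[[row_label]])
stored = result.loc[row_label, ["pc1", "pc2"]].to_numpy(dtype=float)
assert np.allclose(single[0], stored)

print(result)
```

    If the check passes for a few arbitrary labels, the Unique_id values in result belong to the same rows as in my_df.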