
How to preserve row headers after PCA from sklearn


I have an array like so:

sampleA 1 2 2 1 
sampleB 1 3 2 1
sampleC 2 3 1 2

My goal is to run PCA across the samples and see how they cluster. However, I need to preserve the sample names in the row header. Is there any way I can do this? The desired PCA result would include the row headers:

sampleA 0.13 0.1
sampleB 0.1 0.4
sampleC 0.1 0.1

Currently I am just running these two lines:

my_pca = PCA(n_components=8)
trans = my_pca.fit_transform(in_array)

Solution

  • According to the scikit-learn source, your input is converted with np.array() before PCA is computed, so the row index is lost during PCA.fit_transform(X), even if you pass a structured array or a pandas DataFrame. However, the row order of your data is preserved, which means you can attach the index back afterwards:

    import io
    
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    
    s = """sampleA 1 2 2 1
    sampleB 1 3 2 1
    sampleC 2 3 1 2"""
    in_array = pd.read_table(io.StringIO(s), sep=' ', header=None, index_col=0)
    my_pca = PCA(n_components=2)
    trans = my_pca.fit_transform(in_array)
    df = pd.DataFrame(trans, index=in_array.index)
    print(df)
    #                 0         1
    # 0                          
    # sampleA -0.773866 -0.422976
    # sampleB -0.424531  0.514022
    # sampleC  1.198397 -0.091046
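
If you need this in more than one place, the same idea can be wrapped in a small function. The following is only a sketch: `pca_with_labels` and the `PC1`, `PC2` column names are hypothetical, chosen for illustration, and are not part of scikit-learn. It uses n_components=2 because, with three samples and four features, PCA can return at most three components.

```python
import pandas as pd
from sklearn.decomposition import PCA


def pca_with_labels(frame, n_components=2):
    """Run PCA and return the scores as a DataFrame that keeps frame's row index.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(frame)  # plain ndarray, rows in input order
    # Assumed naming scheme for the component columns: PC1, PC2, ...
    columns = [f"PC{i + 1}" for i in range(scores.shape[1])]
    return pd.DataFrame(scores, index=frame.index, columns=columns)


# Same data as in the question, with the sample names as the row index.
in_array = pd.DataFrame(
    [[1, 2, 2, 1], [1, 3, 2, 1], [2, 3, 1, 2]],
    index=["sampleA", "sampleB", "sampleC"],
)
labelled = pca_with_labels(in_array, n_components=2)
print(labelled)
```

Because the result is indexed by sample name, a single sample's coordinates can be looked up with `labelled.loc["sampleB"]`.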