
PCA plot with large dataframe


I have a large data frame (~1400 rows) with the following columns:

    protein   IHD          CM         ARR         VD        CHD           CCD         VOO      
0   q9uku9  0.000000    0.039457    0.032901    0.014793    0.006614    0.006591    0.000000    
1   o75461  0.000000    0.005832    0.027698    0.000000    0.000000    0.006634    0.000000

etc.

I want to run a PCA and plot the result together with the vectors, but I'm not sure how to do so with such a large data set. Does anyone have any suggestions?


Solution

  • A 1400 x 8 dataframe is actually small for a modern computer, so it needs no special handling. You can use scikit-learn to perform PCA on your dataset. It is relatively simple:

    import pandas as pd
    import numpy as np
    from sklearn.decomposition import PCA
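    cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
    # Random placeholder data with the same shape as the numeric part of the real frame
    df = pd.DataFrame(np.random.random((1400, 7)), columns = cols)
    # Keep only the first two principal components
    pca = PCA(n_components=2)
    pca.fit(df)
    print(pca.components_)          # one row per component, one column per original variable
    print(pca.explained_variance_)  # variance captured by each component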
    
    # Example output (the data is random, so the values differ on every run):
    # [[-0.38406974  0.02775874 -0.59754361 -0.55464116 -0.03878488
    #   -0.41944628 0.09795539]
    #  [-0.03181143 -0.52699813  0.14325425  0.02742668 -0.48571934 
    #   -0.33915335 0.590795  ]]
    # [0.0913989  0.08975106]
    

    You cannot plot the principal components themselves, since they are vectors in the 7-dimensional space of the original variables. What you can do, as long as you keep fewer than three components, is plot the transformed dataset:

    df2 = pd.DataFrame(pca.transform(df), columns = ['first', 'second'])
    df2.plot.scatter(x = 'first', y = 'second')
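
For the "plot with the vectors" part of the question, one common workaround is a biplot: the scatter of the transformed rows, with one arrow per original column showing that column's weights on the first two components. The following is only a sketch under assumptions. It uses random placeholder data, and the arrow scaling factor is an arbitrary choice made for visibility, not part of PCA itself.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
# Random placeholder data standing in for the real measurements
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1400, 7)), columns=cols)

pca = PCA(n_components=2)
scores = pca.fit_transform(df[cols])  # shape (1400, 2): rows in component space
loadings = pca.components_.T          # shape (7, 2): one row per original column

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=5, alpha=0.4)

# Arbitrary scaling so the arrows are visible next to the scores (an assumption)
scale = np.abs(scores).max()
for name, (dx, dy) in zip(cols, loadings):
    ax.arrow(0, 0, dx * scale, dy * scale, color='red',
             head_width=0.02 * scale, length_includes_head=True)
    ax.text(dx * scale * 1.1, dy * scale * 1.1, name, color='red')

ax.set_xlabel('first')
ax.set_ylabel('second')
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the figure. Each arrow is only the 2-D projection of an original axis onto the plane of the first two components, so it does not show the full 7-dimensional picture.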
    


    As you may have noticed, I did not include the protein column when doing the PCA. The reason is that PCA works properly only with numerical columns. See this discussion for some hints on handling categorical columns.
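
With the real dataframe, one possible way to leave out protein without losing track of which row belongs to which protein is to move it into the index before fitting. This is a minimal sketch, assuming protein holds unique identifiers. The data and the `prot0`, `prot1`, ... names below are made-up placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
# Random placeholder data; the protein identifiers are hypothetical
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1400, 7)), columns=cols)
df.insert(0, 'protein', [f'prot{i}' for i in range(len(df))])

# Move the identifier into the index and keep only the numeric columns
numeric = df.set_index('protein').select_dtypes(include='number')

pca = PCA(n_components=2)
df2 = pd.DataFrame(pca.fit_transform(numeric),
                   index=numeric.index,
                   columns=['first', 'second'])
print(df2.head())
```

Because `df2` keeps the protein index, each point in the scatter plot can be traced back to its protein, for example with `df2.loc['prot0']`.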