I have a large data frame (~1400 rows) with the following columns:
protein IHD CM ARR VD CHD CCD VOO
0 q9uku9 0.000000 0.039457 0.032901 0.014793 0.006614 0.006591 0.000000
1 o75461 0.000000 0.005832 0.027698 0.000000 0.000000 0.006634 0.000000
etc.
I want to perform PCA and plot the result with the vectors, but I'm not sure how to do so with such a large data set. Does anyone have any suggestions?
Actually, a 1400 × 8 dataframe is not that big on modern computers. You can use scikit-learn to perform PCA on your dataset; it is relatively simple:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

# Random data standing in for your 1400-row dataframe of numeric columns
cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
df = pd.DataFrame(np.random.random((1400, 7)), columns=cols)

# Fit a PCA that keeps the first two principal components
pca = PCA(n_components=2)
pca.fit(df)

print(pca.components_)          # one row per component: its loadings on the 7 columns
print(pca.explained_variance_)  # variance captured by each component
# [[-0.38406974 0.02775874 -0.59754361 -0.55464116 -0.03878488
# -0.41944628 0.09795539]
# [-0.03181143 -0.52699813 0.14325425 0.02742668 -0.48571934
# -0.33915335 0.590795 ]]
# [0.0913989 0.08975106]
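If you would rather see these as fractions of the total variance, the same fitted PCA object also exposes explained_variance_ratio_ (not shown in the snippet above):
print(pca.explained_variance_ratio_)  # fraction of total variance captured by each component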
You cannot plot the principal components themselves, since they live in a 7-dimensional space. What you can do, as long as you keep the number of components at two (or at most three, for a 3D plot), is plot the transformed data:
import matplotlib.pyplot as plt
df2 = pd.DataFrame(pca.transform(df), columns=['first', 'second'])
df2.plot.scatter(x='first', y='second')
plt.show()
As you may have noticed, I did not consider the protein column when doing PCA. The reason is that PCA works properly only with numerical columns. See this discussion for some hints on handling categorical columns.
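As a minimal sketch (assuming your actual dataframe is named df and contains the protein column shown in the question), one simple option is to move that column into the index so that only the numeric disease columns are passed to PCA:
df = df.set_index('protein')    # keep protein as row labels rather than a feature
pca = PCA(n_components=2)
scores = pca.fit_transform(df)  # PCA now sees only the numeric columns
scores then has one row per protein, and you can scatter-plot it exactly as above.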