After performing a PCA analysis in R we can do:
ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))
That plots the data in the space of the first two PCs, and draws each variable as a vector in that space, so its length and direction show its weight and orientation.
In Python I can plot the data in the 2 PC space, and I can get the weights of the variables, but how do I find their direction?
In other words, how can I plot each variable's contribution to both PCs (weight and direction) in Python?
I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.
This generates some example data. It is reused from this answer.
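import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# User input
n_samples = 100
n_features = 5
# Prep
data = np.empty((n_samples, n_features))
np.random.seed(42)
# Generate: each sample gets a randomly chosen mean shared by all its features
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)
# Fit the PCA
pca = PCA().fit(data)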
Here we go:
# Get the PCA components (loadings)
PCs = pca.components_
# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5,5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
           PCs[0,:], PCs[1,:],
           angles='xy', scale_units='xy', scale=1)
# Add labels based on feature names (here just numbers)
feature_names = np.arange(PCs.shape[1])
for x, y, name in zip(PCs[0,:]+0.02, PCs[1,:]+0.02, feature_names):
    plt.text(x, y, str(name), ha='center', va='center')
# Add unit circle
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0,1.0])
plt.ylim([-1.0,1.0])
# Label axes
plt.xlabel('PC 0')
plt.ylabel('PC 1')
# Done
plt.show()
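Since the original question asks for the samples and the variable arrows in one figure, the two can be combined along these lines. This is only a rough sketch, not an equivalent of ggbiplot: the function name biplot, its parameters, and the arrow_scale heuristic are my own assumptions. The arrows are stretched by a single common factor so they are visible next to the scores, which changes their displayed length but not their direction or relative size.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def biplot(data, feature_names=None, pcs=(0, 1)):
    """Sketch of a biplot: PCA scores as points, loadings as arrows.

    Hypothetical helper, not part of matplotlib or scikit-learn.
    """
    pca = PCA().fit(data)
    scores = pca.transform(data)[:, list(pcs)]   # shape (n_samples, 2)
    loadings = pca.components_[list(pcs), :]     # shape (2, n_features)

    if feature_names is None:
        feature_names = [str(k) for k in range(data.shape[1])]

    # Assumption: one common factor so the longest arrow reaches the edge
    # of the score cloud. Directions and relative lengths are unchanged.
    arrow_scale = np.abs(scores).max() / np.linalg.norm(loadings, axis=0).max()

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
    for k, name in enumerate(feature_names):
        x, y = loadings[:, k] * arrow_scale
        ax.arrow(0, 0, x, y, color='r', head_width=0.03 * arrow_scale,
                 length_includes_head=True)
        ax.text(x * 1.1, y * 1.1, name, color='r', ha='center', va='center')
    ax.set_xlabel(f'PC {pcs[0]}')
    ax.set_ylabel(f'PC {pcs[1]}')
    return fig, ax, loadings
```

Calling biplot(data) followed by plt.show() should draw the figure. To colour the points by group, as ggbiplot does with groups=, a c= argument could be passed to ax.scatter.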
I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:
data_pca = pca.transform(data)
plt.scatter(data_pca[:,1], data[:,4])
plt.xlabel('PC 1')
plt.ylabel('feature 4')
plt.show()
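The same check can be done numerically rather than by eye. The sketch below computes the Pearson correlation between every original feature and every PC score using np.corrcoef. The helper name feature_pc_correlations is hypothetical, and it assumes no feature or component has zero variance (otherwise the result contains NaN).

```python
import numpy as np
from sklearn.decomposition import PCA


def feature_pc_correlations(data):
    """Pearson correlation between each feature and each PC score.

    Hypothetical helper. Returns an array of shape
    (n_features, n_components); entry [j, k] is the correlation
    between feature j and the scores on PC k.
    """
    scores = PCA().fit_transform(data)
    n_features = data.shape[1]
    # With rowvar=False, columns are variables; the result stacks the
    # features first and the PC scores second.
    full = np.corrcoef(data, scores, rowvar=False)
    # Keep only the feature-vs-PC block (top right).
    return full[:n_features, n_features:]
```

With the example data above, feature_pc_correlations(data)[4, 1] gives the correlation between feature 4 and PC 1. If you would rather have the arrows show correlations than raw loadings, columns 0 and 1 of this array could be passed to quiver in place of PCs[0,:] and PCs[1,:].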