Project variables in PCA plot in Python

After performing PCA in R, we can call:

ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))

That will plot the data in the 2 PC space, and the direction and weight of the variables in such space as vectors (with different length and direction).

In Python I can plot the data in the 2 PC space, and I can get the weights of the variables, but how do I find their direction?

In other words, how can I plot each variable's contribution to both PCs (weight and direction) in Python?

Solution

I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.

Example Data

This generates some example data. It is reused from this answer.

import numpy as np

# User input
n_samples  = 100
n_features =   5

# Prep
data  = np.empty((n_samples,n_features))
np.random.seed(42)

# Generate
for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
    data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)

PCA

from sklearn.decomposition import PCA

pca = PCA().fit(data)
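Before plotting, it can help to look at what the fitted object holds. The following is a minimal sketch that rebuilds the example data from above and inspects pca.components_; the variable name gram is only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rebuild the example data from above
n_samples, n_features = 100, 5
data = np.empty((n_samples, n_features))
np.random.seed(42)
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)

pca = PCA().fit(data)

# One row per principal component, one column per feature
print(pca.components_.shape)

# The rows are unit-length and mutually orthogonal
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(n_features)))

# Share of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

Because all components are kept here, the matrix is orthogonal, so each feature's column also has unit length. Its projection onto any two PCs therefore cannot leave the unit circle, which is why drawing that circle in the plot makes sense.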

Variables Factor Map

Here we go:

import matplotlib.pyplot as plt

# Get the PCA components (unit-length eigenvectors, often loosely called loadings)
PCs = pca.components_

# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5,5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
           PCs[0,:], PCs[1,:], 
           angles='xy', scale_units='xy', scale=1)

# Add labels based on feature names (here just numbers)
feature_names = np.arange(PCs.shape[1])
for x, y, name in zip(PCs[0,:]+0.02, PCs[1,:]+0.02, feature_names):
    plt.text(x, y, str(name), ha='center', va='center')

# Add unit circle
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)

# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0,1.0])
plt.ylim([-1.0,1.0])

# Label axes
plt.xlabel('PC 0')
plt.ylabel('PC 1')

# Done
plt.show()
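To get closer to what ggbiplot shows, the samples and the variable arrows can be drawn on the same axes. This is a rough sketch, not a port of ggbiplot: the helper name biplot and its parameters are made up for illustration, and the arrow scaling (stretching the longest arrow to the edge of the point cloud) is purely a display choice of mine.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def biplot(scores, components, feature_names, pcs=(0, 1), ax=None):
    """Scatter the samples and overlay one arrow per feature (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    a, b = pcs

    # Samples in the space of the two chosen PCs
    ax.scatter(scores[:, a], scores[:, b], s=10, alpha=0.5)

    # Assumption: rescale so the longest arrow reaches the edge of the cloud
    arrows = components[[a, b], :]
    scale = np.abs(scores[:, [a, b]]).max() / np.linalg.norm(arrows, axis=0).max()

    for k, name in enumerate(feature_names):
        x, y = scale * arrows[:, k]
        ax.annotate("", xy=(x, y), xytext=(0, 0),
                    arrowprops=dict(arrowstyle="->", color="r"))
        ax.text(1.08 * x, 1.08 * y, str(name), color="r",
                ha="center", va="center")

    ax.set_xlabel(f"PC {a}")
    ax.set_ylabel(f"PC {b}")
    return ax


# Rebuild the example data from above
n_samples, n_features = 100, 5
data = np.empty((n_samples, n_features))
np.random.seed(42)
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)

pca = PCA().fit(data)
ax = biplot(pca.transform(data), pca.components_, np.arange(n_features))
# plt.show()  # uncomment to display the figure interactively
```

Only the relative lengths and directions of the arrows carry meaning in this version; the absolute length depends on the scaling factor chosen above.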

Being Uncertain

I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:

data_pca = pca.transform(data)
plt.scatter(data_pca[:,1], data[:,4])
plt.xlabel('PC 1')
plt.ylabel('feature 4')
plt.show()
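The same check can be done numerically for all features at once. The sketch below, under the assumption that "loading" is meant as the correlation between a feature and a PC, computes those correlations directly from the scores and compares them with the value derived from the fitted model (eigenvector times the square root of the eigenvalue, divided by the feature's standard deviation). The names corr, loadings and corr_from_model are mine.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rebuild the example data from above
n_samples, n_features = 100, 5
data = np.empty((n_samples, n_features))
np.random.seed(42)
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)

pca = PCA().fit(data)
scores = pca.transform(data)

# Pearson correlation of every feature (rows) with every PC (columns)
corr = np.corrcoef(data.T, scores.T)[:n_features, n_features:]

# Same quantity from the model: eigenvector * sqrt(eigenvalue) / feature std
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
corr_from_model = loadings / data.std(axis=0, ddof=1)[:, None]

print(np.allclose(corr, corr_from_model))

# Correlations with the first two PCs, one row per feature
print(np.round(corr[:, :2], 2))
```

If you prefer a correlation circle in the strict sense, passing corr[:, 0] and corr[:, 1] to quiver instead of the raw rows of pca.components_ should give arrows whose length is directly interpretable; they also stay inside the unit circle.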