
Project variables in PCA plot in Python


After performing a PCA analysis in R we can do:

ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))

That will plot the data in the 2 PC space, and the direction and weight of the variables in such space as vectors (with different length and direction).

In Python I can plot the data in the 2 PC space, and I can get the weights of the variables, but how do I find their direction?

In other words, how can I plot each variable's contribution to both PCs (weight and direction) in Python?


Solution

  • I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.


    Example Data

    This generates some example data. It is reused from this answer.

    import numpy as np
    
    # User input
    n_samples  = 100
    n_features =   5
    
    # Prep
    data  = np.empty((n_samples,n_features))
    np.random.seed(42)
    
    # Generate
    for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
        data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)
    

    PCA

    from sklearn.decomposition import PCA
    
    pca = PCA().fit(data)
    

    Variables Factor Map

    Here we go:

    # Get the PCA components (loadings)
    PCs = pca.components_
    
    import matplotlib.pyplot as plt
    
    # Use quiver to generate the basic plot
    fig = plt.figure(figsize=(5,5))
    plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
               PCs[0,:], PCs[1,:], 
               angles='xy', scale_units='xy', scale=1)
    
    # Add labels based on feature names (here just column indices),
    # offset slightly so the text does not sit on the arrow tip
    feature_names = np.arange(PCs.shape[1])
    for x, y, name in zip(PCs[0,:]+0.02, PCs[1,:]+0.02, feature_names):
        plt.text(x, y, str(name), ha='center', va='center')
    
    # Add unit circle
    circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
    plt.gca().add_artist(circle)
    
    # Ensure correct aspect ratio and axis limits
    plt.axis('equal')
    plt.xlim([-1.0,1.0])
    plt.ylim([-1.0,1.0])
    
    # Label axes
    plt.xlabel('PC 0')
    plt.ylabel('PC 1')
    
    # Done
    plt.show()
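
The question asks for a ggbiplot-style figure, which also shows the samples. A minimal sketch of that, assuming matplotlib and NumPy only: the helper name `biplot` and the `arrow_scale` heuristic are illustrative choices made here, not part of any library, and the demo data is random. With scikit-learn you would pass `pca.transform(data)` and `pca.components_` instead of the SVD results.

```python
import matplotlib.pyplot as plt
import numpy as np


def biplot(scores, components, feature_names=None, arrow_scale=None, ax=None):
    """Scatter the samples on PC 0 / PC 1 and overlay one arrow per variable.

    scores:     (n_samples, n_components) array, e.g. pca.transform(data)
    components: (n_components, n_features) array, e.g. pca.components_
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    n_features = components.shape[1]
    if feature_names is None:
        feature_names = [str(j) for j in range(n_features)]
    if arrow_scale is None:
        # Cosmetic assumption: stretch the arrows so the longest one
        # reaches the edge of the point cloud. Only relative lengths
        # and directions carry meaning after this rescaling.
        longest = np.hypot(components[0], components[1]).max()
        arrow_scale = np.abs(scores[:, :2]).max() / longest

    # Samples in the plane of the first two components
    ax.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)

    # One arrow per variable, starting at the origin
    dx = arrow_scale * components[0]
    dy = arrow_scale * components[1]
    zeros = np.zeros(n_features)
    ax.quiver(zeros, zeros, dx, dy,
              angles='xy', scale_units='xy', scale=1, color='r')
    for x, y, name in zip(dx, dy, feature_names):
        ax.text(1.08 * x, 1.08 * y, name, ha='center', va='center')

    ax.set_xlabel('PC 0')
    ax.set_ylabel('PC 1')
    return ax, arrow_scale


# Demo on random data; PCA is done with a plain SVD of the centered data
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 5))
centered = demo - demo.mean(axis=0)
_, _, demo_components = np.linalg.svd(centered, full_matrices=False)
demo_scores = centered @ demo_components.T
ax, arrow_scale = biplot(demo_scores, demo_components)
```

Call `plt.show()` afterwards to display the figure. Because the arrows are rescaled to fit the scatter, their absolute length is not a loading value; pass `arrow_scale=1` to draw the raw entries.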
    



    Being Uncertain

    I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:

    data_pca = pca.transform(data)
    plt.scatter(data_pca[:,1], data[:,4])
    plt.xlabel('PC 1')
    plt.ylabel('feature 4')
    plt.show()
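
The visual check can be backed by numbers. The rows of `pca.components_` are unit-length directions, so an arrow's length is not itself a correlation. The correlation between feature j and the scores on component k is `components[k, j] * sqrt(explained_variance[k]) / std(feature j)`. Below is a rough sketch of that relation using a plain NumPy SVD on random demo data; the function name `feature_pc_correlations` is made up for illustration. Assuming scikit-learn's conventions, `pca.components_` and `pca.explained_variance_` should play the same roles, possibly with flipped signs.

```python
import numpy as np


def feature_pc_correlations(data):
    """Return (components, corr), where corr[k, j] is the Pearson
    correlation between feature j and the scores on component k."""
    centered = data - data.mean(axis=0)
    # PCA via SVD: rows of `components` are unit-length principal axes
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    # Variance of the scores along each component (n - 1 denominator)
    explained_variance = singular_values**2 / (data.shape[0] - 1)
    feature_std = centered.std(axis=0, ddof=1)
    corr = components * np.sqrt(explained_variance)[:, None] / feature_std[None, :]
    return components, corr


# Demo data: five features with deliberately different spreads
rng = np.random.default_rng(42)
demo = rng.normal(size=(100, 5)) * np.array([1.0, 2.0, 0.5, 3.0, 1.5])
components, corr = feature_pc_correlations(demo)

# Correlations of every feature with PC 0 and PC 1
print(np.round(corr[:2], 3))
```

Plotting `corr[0]` against `corr[1]` instead of the raw components gives arrows that always lie inside the unit circle, which is the usual meaning of a correlation circle.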
    
