
Exploratory PCA in Julia


I am trying to understand how to perform a simple, exploratory PCA in Julia using the MultivariateStats.jl package.

For instance, in R, one may do the following:

library(FactoMineR)
data(iris)

## PCA model:
res_pca <- PCA(iris, quali.sup = 5, graph = FALSE)
## Retrieve coordinates of individuals on all 4 PCs:
res_pca$ind$coord
## Simple plots:
plot(res_pca, choix = "ind", habillage = 5)
plot(res_pca, choix = "var")

I can't find an equivalent of these very basic operations in Julia (using "native" functions from MultivariateStats.jl, I mean). Let's begin with:

using MultivariateStats
using RDatasets

iris = dataset("datasets", "iris")

## PCA model:
M = fit(PCA, Array(iris[:, 1:4]); pratio=1, maxoutdim=4)

Once I get the PCA "model" M, how can I easily retrieve the coordinates of the individuals? How can I easily display the correlation circle, etc.?

As I understand it, PCA is implemented in Julia as a "model" from a quasi-machine-learning perspective, and offers basically no way to easily display all the usual graphs, statistics and insights one typically needs in exploratory PCA (I mean cos², contributions, etc.)?

Is there any other Julia package more oriented towards exploratory multivariate analysis which could offer a rough equivalent of famous R packages such as FactoMineR or ade4?

Thanks!


Solution

  • Ok, so firstly you're putting the data in with the wrong orientation. As you can see from the docs

    fit(PCA, X; ...)
    Perform PCA over the data given in a matrix X. Each column of X is an observation.

    you need observations as columns and variables as rows. This may seem confusing given that we normally think of variables as columns, but it makes a bit more sense in the context of the underlying linear algebra. So to get that right, let's start with:

    using MultivariateStats, Statistics
    using RDatasets: dataset
    
    iris = dataset("datasets", "iris")
    iris_matrix = Array(iris[:, 1:4])'
    
    ## PCA model:
    M = fit(PCA, iris_matrix; pratio=1, maxoutdim=4)
    

    As you've noticed, this returns an object of type PCA. As it turns out, the available methods for this type are well described in the documentation:

    Properties

    Let M be an instance of PCA, d be the dimension of observations, and p be the output dimension (i.e the dimension of the principal subspace)

    indim(M)
    Get the input dimension d, i.e the dimension of the observation space.

    outdim(M)
    Get the output dimension p, i.e the dimension of the principal subspace.

    mean(M)
    Get the mean vector (of length d).

    projection(M)
    Get the projection matrix (of size (d, p)). Each column of the projection matrix corresponds to a principal component.

    The principal components are arranged in descending order of the corresponding variances.

    principalvars(M)
    The variances of principal components.

    tprincipalvar(M)
    The total variance of principal components, which is equal to sum(principalvars(M)).

    tresidualvar(M)
    The total residual variance.

    tvar(M)
    The total observation variance, which is equal to tprincipalvar(M) + tresidualvar(M).

    principalratio(M)
    The ratio of variance preserved in the principal subspace, which is equal to tprincipalvar(M) / tvar(M).

    but if you didn't know that or had trouble finding the docs, you could get a full list of all the available methods (as for any type in Julia) by using the methodswith function -- in this case specifically methodswith(PCA).
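    For example, a quick sketch (methodswith lives in InteractiveUtils, which the REPL loads for you automatically; indim and outdim are from the property list quoted above):

    using InteractiveUtils

    methodswith(PCA)      # every function that has a method specialized for PCA
    indim(M), outdim(M)   # e.g. two of the properties listed above; (4, 4) for this fit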

    Key among these is projection, which gives us the actual projection matrix:

    julia> proj = projection(M)
    4×4 Matrix{Float64}:
     -0.361387    0.656589  -0.58203     0.315487
      0.0845225   0.730161   0.597911   -0.319723
     -0.856671   -0.173373   0.0762361  -0.479839
     -0.358289   -0.075481   0.545831    0.753657
    

    As the documentation told us

    each column of the projection matrix corresponds to a principal component

    so proj[:,1] contains the weights / "loadings" for PC1, proj[:,2] contains the loadings for PC2, etc.
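    If it helps to see those loadings lined up with the variable names, here is a small sketch (reusing the iris DataFrame and proj from above):

    ## Pair each original variable with its loading on PC1 and PC2
    for (j, var) in enumerate(names(iris)[1:4])
        println(rpad(var, 14), round(proj[j, 1], digits=3), "   ", round(proj[j, 2], digits=3))
    end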

    Since you asked about contributions, by which I am guessing you mean the contributions of each principal component towards explaining the total variance, the docs tell us you can get that with principalvars:

    julia> principalvars(M)
    4-element Vector{Float64}:
     4.2282417060348605
     0.24267074792863352
     0.07820950004291898
     0.023835092973449976
    

    so PC1 is really doing most of the heavy lifting here. Or if you prefer that in terms of percent of the total variance explained by each component, then:

    julia> principalvars(M) ./ tvar(M) * 100
    4-element Vector{Float64}:
     92.46187232017269
      5.306648311706788
      1.7102609807929683
      0.5212183873275495
    
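    and if you would rather see that cumulatively, the same quantities give you a one-liner:

    ## Cumulative percentage of total variance explained by the first k components
    cumsum(principalvars(M)) ./ tvar(M) * 100   # ≈ [92.5, 97.8, 99.5, 100.0]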

    Looking ahead to the "transformation" section of the same docs

    Transformation and Construction

    Given a PCA model M, one can use it to transform observations into principal components, as

    𝐲 = 𝐏ᵀ (𝐱 - 𝛍)

    or use it to reconstruct (approximately) the observations from principal components, as

    𝐱̃ = 𝐏 𝐲 + 𝛍

    Here, 𝐏 is the projection matrix.

    The package provides methods to do so:

    transform(M, x)
    Transform observations x into principal components.

    Here, x can be either a vector of length d or a matrix where each column is an observation.

    reconstruct(M, y)
    Approximately reconstruct observations from the principal components given in y.

    Here, y can be either a vector of length p or a matrix where each column gives the principal components for an observation.

    we can see that the way to apply this transformation to our data is

    iris_transformed = transform(M, iris_matrix)
    

    or if you want to do the linear algebra yourself

    iris_transformed = projection(M)' * (iris_matrix .- mean(M))
    
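    As a quick sanity check (reusing only objects defined above), the two routes give the same matrix up to floating-point error:

    transform(M, iris_matrix) ≈ projection(M)' * (iris_matrix .- mean(M))   # true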

    Then once we have the transformed data, we could, say, plot the first two principal components against each other with

    using Plots
    h = plot(iris_transformed[1,:], iris_transformed[2,:], seriestype=:scatter, label="")
    plot!(xlabel="PC1", ylabel="PC2", framestyle=:box) # A few formatting options
    

    PCA of the Iris dataset, plotting PC1 vs PC2
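    If you also want the individuals coloured by species, like habillage = 5 in FactoMineR, one way (a sketch using the standard group keyword of Plots.jl and the Species column we left out of the fit) is:

    scatter(iris_transformed[1, :], iris_transformed[2, :];
            group = iris.Species, xlabel = "PC1", ylabel = "PC2", framestyle = :box)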

    Finally, if you want to add arrows for the different variables, it turns out we already have everything we need in projection(M), which we have already stored as proj

    for i=1:4; plot!([0,proj[i,1]], [0,proj[i,2]], arrow=true, label=names(iris)[i], legend=:bottomleft); end
    display(h)
    

    PCA plot showing vectors for each variable

    Edit: I should probably also mention that while there isn't (yet) a StatsPlots recipe for the PCA type, there is one for MDS, so in this case you can obtain a plot equivalent to the one above by simply writing

    M = fit(MDS, iris_matrix; distances=false)
    using StatsPlots
    plot(M)
    

    (PCA and MDS being equivalent in the case, as here, where the distance metric is Euclidean distance in the original (i.e., high-dimensional) vector space.)
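
    If you want to convince yourself of that equivalence numerically, one small sketch (reusing iris_matrix and the full-rank scores iris_transformed computed above) is to check that the pairwise Euclidean distances -- the quantity classical MDS works from -- are untouched by the PCA transformation:

    using LinearAlgebra

    n = size(iris_matrix, 2)   # 150 observations
    d_orig = [norm(iris_matrix[:, i] .- iris_matrix[:, j]) for i in 1:n, j in 1:n]
    d_pca  = [norm(iris_transformed[:, i] .- iris_transformed[:, j]) for i in 1:n, j in 1:n]
    d_orig ≈ d_pca   # true: with all components kept, PCA is just centering plus a rotation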