
R equivalent to the SAS "BY" statement in PRINCOMP Procedure


I am using R's princomp for PCA. However, my dataset has a factor variable, and I would like to run princomp separately on each level of that factor.

This can be done in SAS with the "BY" statement, which "performs BY group processing, which enables you to obtain separate analyses on grouped observations" (from https://support.sas.com/rnd/app/stat/procedures/princomp.html).

Can this be done by princomp in R or do I have to split my data into several datasets and run princomp on each?

All the best,


Solution

  • This is simple in R once you understand a bit about how lists work, so it is worth spending some time with an R tutorial that covers lists. Using a data set that ships with R:

    data(iris)
    str(iris)
    # 'data.frame': 150 obs. of  5 variables:
    #  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    

    First split the data frame into three separate data frames, one for each Species, and store them in a list. We leave out the Species column, since it is not used in the principal components, and then run the analysis on each group:

    iris.spl <- split(iris[, 1:4], iris$Species)
    iris.spl.pca <- lapply(iris.spl, prcomp, scale.=TRUE)
    

    To keep the Species column in each data frame in the list, split the full data frame and select the numeric columns inside the call instead:

    iris.spl <- split(iris, iris$Species)
    iris.spl.pca <- lapply(iris.spl, function(x) prcomp(x[, 1:4], scale.=TRUE))
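
    As a possible alternative that reads a little closer to the SAS "BY" statement, base R's by() applies a function to the rows of a data frame grouped by a factor. The sketch below is only an illustration; the name iris.by.pca is made up for this example, and the result is an object of class "by" rather than a plain list:

    ```r
    data(iris)

    # Run prcomp on the four numeric columns, once per level of Species.
    # Extra arguments (here scale. = TRUE) are passed on to prcomp.
    iris.by.pca <- by(iris[, 1:4], iris$Species, prcomp, scale. = TRUE)

    # Each element is an ordinary prcomp object and can be pulled out by name
    setosa.pca <- iris.by.pca[["setosa"]]
    summary(setosa.pca)
    ```

    The split() and lapply() approach returns a plain named list, which is usually easier to pass on to sapply() or further processing.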
    

    To get the basic results:

    iris.spl.pca
    

    To get the result for a particular group, index the list by position or by name:

    iris.spl.pca[[1]] # or iris.spl.pca[["setosa"]]
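
    Because the results live in a named list, the same quantity can be collected from every group in one call. A minimal sketch, assuming the list is built as above (var.explained is a name introduced here for illustration):

    ```r
    data(iris)
    iris.spl <- split(iris[, 1:4], iris$Species)
    iris.spl.pca <- lapply(iris.spl, prcomp, scale. = TRUE)

    # Proportion of variance explained by each component:
    # one row per component, one column per species
    var.explained <- sapply(iris.spl.pca, function(p) p$sdev^2 / sum(p$sdev^2))
    rownames(var.explained) <- paste0("PC", seq_len(nrow(var.explained)))
    round(var.explained, 3)

    # Loadings (rotation matrix) for a single group
    iris.spl.pca[["versicolor"]]$rotation
    ```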
    

    I used prcomp rather than princomp based on the advice in the Details section of the manual page for princomp, which notes that the SVD-based calculation used by prcomp is the preferred method. Using scale.=TRUE analyzes the correlation matrix; removing it would analyze the covariance matrix.
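
    The relationship between scale. and the correlation or covariance matrix can be checked numerically on one group. This is only a sketch of that check; the variable names are illustrative. It also shows that princomp itself can be used per group in the same way, with cor = TRUE playing the role of scale. = TRUE:

    ```r
    data(iris)
    setosa <- iris[iris$Species == "setosa", 1:4]

    # With scaling, the component variances match the eigenvalues
    # of the correlation matrix
    pc.scaled <- prcomp(setosa, scale. = TRUE)
    all.equal(pc.scaled$sdev^2, eigen(cor(setosa))$values)

    # Without scaling, they match the eigenvalues of the covariance matrix
    pc.unscaled <- prcomp(setosa)
    all.equal(pc.unscaled$sdev^2, eigen(cov(setosa))$values)

    # The same per-group pattern with princomp on the correlation matrix
    iris.spl.princomp <- lapply(split(iris[, 1:4], iris$Species),
                                princomp, cor = TRUE)
    all.equal(unname(iris.spl.princomp[["setosa"]]$sdev), pc.scaled$sdev)
    ```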