Search code examples
rdplyrpca

Remove columns of equal variance after dplyr::group_by and before prcomp


I am performing pcas for multiple groups in a large data set using dplyr. When the data is split into groups using the group_by function, some variables have equal variance, so the pca can't run. How can I remove any columns of equal variance and then perform prcomp on what remains? Dummy data below. Thank you.

Add equal variance for setosa - Sepal.Length.

library(dplyr)

iris[1:50,1]<-0

Attempt to run pcas with equal variance

> iris%>%
+   group_by(Species)%>%
+   group_map(~prcomp(.[,1:4], scale.=T))
Error in prcomp.default(.[, 1:4], scale. = T) : 
  cannot rescale a constant/zero column to unit variance

Check for equal variance

> iris%>%
+   group_by(Species)%>%
+   group_map(~names(.[,1:4][, sapply(.[,1:4], function(v) var(v, na.rm=TRUE)==0)]))
[[1]]
[1] "Sepal.Length"

[[2]]
character(0)

[[3]]
character(0)

Attempt to exclude equal variance column and run pcas

> iris%>%
+   group_by(Species)%>%
+   group_map(~sapply(.[,1:4], function(v) var(v, na.rm=TRUE)>0))%>%
+   group_map(~prcomp(.[,1:4], scale.=T))
Error in UseMethod("group_split") : 
  no applicable method for 'group_split' applied to an object of class "list"

Solution

  • We can use map_if which checks for a condition and then applies the function.

    library(tidyverse)
    
    iris %>%
       group_split(Species, keep = FALSE) %>%
       map_if(~all(map_dbl(.x, var) != 0), ~prcomp(.x, scale. = TRUE),
              .else = function(x) return(NULL))
    
    #[[1]]
    #NULL
    
    #[[2]]
    #Standard deviations (1, .., p=4):
    #[1] 1.7106550 0.7391040 0.6284883 0.3638504
    
    #Rotation (n x k) = (4 x 4):
    #                    PC1        PC2        PC3        PC4
    #Sepal.Length -0.4823284 -0.6107980  0.4906296  0.3918772
    #Sepal.Width  -0.4648460  0.6727830  0.5399025 -0.1994658
    #Petal.Length -0.5345136 -0.3068495 -0.3402185 -0.7102042
    #Petal.Width  -0.5153375  0.2830765 -0.5933290  0.5497778
    
    #[[3]]
    #Standard deviations (1, .., p=4):
    #[1] 1.5667601 0.9821979 0.6725116 0.3581596
    
    #Rotation (n x k) = (4 x 4):
    #                   PC1        PC2         PC3         PC4
    #Sepal.Length 0.5544765 -0.4324382  0.01239569  0.71091442
    #Sepal.Width  0.4755317  0.4401787  0.75272551 -0.11626101
    #Petal.Length 0.5501112 -0.4296642 -0.20236407 -0.68688796
    #Petal.Width  0.4047258  0.6592637 -0.62633812  0.09627561
    

    If we want to remove only the columns which have 0 variance and not the entire group, we can use select_if to select columns

    iris %>%
      group_split(Species, keep = FALSE) %>%
      map(~.x %>% select_if(~var(.) != 0) %>% prcomp(scale. = TRUE))