r dplyr tidyverse purrr forcats

Drop variables with one factor level excluding NAs


I need to drop factor variables that have only one level (excluding NAs) within each group of a nested dataset. The function 'drop_fixed_factors' below counts NA as a level when it evaluates the number of factor levels. How can I fix it so that for A == "Y", B is counted as having one level ("A") rather than two ("A", NA)?

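library(dplyr)  # %>%, group_by(), mutate()
library(tidyr)  # nest()
library(purrr)  # map(), discard()

df <- tibble::tribble(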
  ~A,  ~B,
  "X", "A",
  "X", "B",
  "Y", "A",
  "Y", NA_character_,
  "Z", "A",
  "Z", "B",
  "Z", NA_character_,
  "K", "A",
  "K", "A",
  "L", NA_character_,
  "L", NA_character_,
  )

df$B <- as.factor(df$B)

dfgrp <- df %>% 
  group_by(A) %>% 
  nest() 

drop_fixed_factors <- function(x) {
  x %>% discard(~ is.factor(.x) && length(unique(.x)) < 2)
}

dfgrp1 <- dfgrp %>% 
  mutate(data_1 = map(data, ~drop_fixed_factors(.x)))

dfgrp1

dfgrp1$data_1[[2]]

The desired output should not have variable B for the group A == "Y".
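
The miscount can be reproduced on a single vector, without the nesting. This is only a minimal sketch: `y_group` is a hypothetical name standing in for column B of group "Y", and it assumes the column keeps the levels of the full factor after nesting.

```r
# Hypothetical stand-in for column B in group "Y": one real value plus an NA.
# The levels are those of the full column ("A" and "B").
y_group <- factor(c("A", NA), levels = c("A", "B"))

# unique() keeps NA as a value of its own, so NA is included in the count
n_with_na <- length(unique(y_group))

# levels() still lists "B", although "B" never occurs in this group
inherited_levels <- levels(y_group)

print(n_with_na)
print(inherited_levels)
```

So neither `unique()` nor `levels()` on its own gives the number of non-NA levels actually present in a group.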


Solution

  • You could manually remove the NA values within unique:

    drop_fixed_factors <- function(x) {
      # na.omit() removes the NAs before the distinct values are counted
      x %>% discard(~ is.factor(.x) && length(unique(na.omit(.x))) < 2)
    }
    

    Alternatively you could use dplyr::n_distinct and use the na.rm argument:

    drop_fixed_factors <- function(x) {
      # na.rm = TRUE excludes NA from the distinct count
      x %>% discard(~ is.factor(.x) && n_distinct(.x, na.rm = TRUE) < 2)
    }
    

    Both options return a tibble with zero columns for group "Y" (and likewise for "K" and "L", where B has at most one distinct non-NA value):

    dfgrp1
    # A tibble: 5 x 3
      A     data             data_1          
      <chr> <list>           <list>          
    1 X     <tibble [2 x 1]> <tibble [2 x 1]>
    2 Y     <tibble [2 x 1]> <tibble [2 x 0]>
    3 Z     <tibble [3 x 1]> <tibble [3 x 1]>
    4 K     <tibble [2 x 1]> <tibble [2 x 0]>
    5 L     <tibble [2 x 1]> <tibble [2 x 0]>
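
If a version without purrr or dplyr is preferred, the same idea can probably be expressed in base R by counting the levels that remain after `droplevels()`, since NA is not a factor level by default. The following is only a sketch under that assumption; `drop_fixed_factors_base`, `y_data` and `x_data` are hypothetical names that do not appear in the question.

```r
# Sketch of a base-R variant; drop_fixed_factors_base is a hypothetical name
drop_fixed_factors_base <- function(x) {
  # TRUE for factor columns with fewer than two non-NA levels actually present
  is_fixed <- vapply(
    x,
    function(col) is.factor(col) && nlevels(droplevels(col)) < 2,
    logical(1)
  )
  # Keep only the columns that are not flagged
  x[, !is_fixed, drop = FALSE]
}

# Mimics group "Y" from the question: B holds "A" and NA
y_data <- data.frame(B = factor(c("A", NA), levels = c("A", "B")))
print(ncol(drop_fixed_factors_base(y_data)))

# Mimics group "X": B holds "A" and "B"
x_data <- data.frame(B = factor(c("A", "B")))
print(ncol(drop_fixed_factors_base(x_data)))
```

The `droplevels()` call matters here because each nested group keeps the levels of the full column, so `nlevels()` alone would report two levels for every group.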