
Convert numerical variables into factors when the number of levels is lower than a given threshold with dplyr


I want to convert numerical variables into factors when the number of levels is lower than a given threshold with dplyr.

This would be most useful with binary variables coded as numerical '0/1'.

example data:

threshold<-5

data<-data.frame(binary1=rep(c(0,1), 5), binary_2=sample(c(0,1), 10, replace = TRUE), multilevel=sample(c(1:4), 10, replace=TRUE), numerical=1:10)

> data
   binary1 binary_2 multilevel numerical
1        0        1          2         1
2        1        0          3         2
3        0        1          2         3
4        1        0          1         4
5        0        1          2         5
6        1        1          4         6
7        0        1          1         7
8        1        1          3         8
9        0        1          1         9
10       1        0          4        10

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "numeric"  "numeric"  "integer"  "integer" 

I could easily transform all variables into factors with mutate(), across() and where(), like this:

data<-data%>%mutate(across(where(is.numeric), as.factor))

> sapply(data, class)
   binary1   binary_2 multilevel  numerical 
  "factor"   "factor"   "factor"   "factor"

However, I can't find a way to pass multiple conditions, including my threshold, to the where() function. I want this output:

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "factor"  "factor"  "factor"  "integer"

I tried the following, but it failed:

data%>%mutate(across(where(is.numeric & length(unique(.x))<threshold), as.factor))

error message:

Error: Problem with `mutate()` input `..1`.
x object '.x' not found
ℹ Input `..1` is `across(where(!is.factor & length(unique(.x)) < threshold), as.factor)`.
Run `rlang::last_error()` to see where the error occurred.

Maybe I don't understand across() and where() well enough. Suggestions are welcome.

Additional question: why does adding a negation operator (!) before is.factor give me an error, when the version without it works fine?

data<-data%>%mutate(across(where(!is.factor), as.factor))

Error: Problem with mutate() input ..1.
x invalid argument type
ℹ Input ..1 is across(where(!is.factor), as.factor).
Run rlang::last_error() to see where the error occurred.


Solution

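  • Use an anonymous (lambda) function in where(). where() expects a single predicate function that it calls once per column, and .x (or .) only exists inside a ~ lambda, which is why your attempt could not find it.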

    library(dplyr)
    
    data <- data %>% 
         mutate(across(where(~is.numeric(.) && n_distinct(.) < threshold), factor))
    
    sapply(data, class)
    
    #   binary1   binary_2 multilevel  numerical 
    #  "factor"   "factor"   "factor"  "integer" 
    

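    To answer your additional question: where() needs a function, and !is.factor is not one. It applies ! to the function object is.factor itself rather than to its result, which raises "invalid argument type". Wrap the negation in a lambda, in the same way as above.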

    data %>% mutate(across(where(~!is.factor(.)), factor))
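
    For reference, here is a self-contained sketch that puts both parts together. It is one possible way to write it, not the only one. The set.seed() call and the helper name is_low_cardinality are assumptions added for illustration (they are not in the question), and Negate() from base R is shown as an alternative to the ~!is.factor(.) lambda.

```r
library(dplyr)

set.seed(42)  # assumption: added only so the sampled columns are reproducible
threshold <- 5

data <- data.frame(
  binary1    = rep(c(0, 1), 5),
  binary_2   = sample(c(0, 1), 10, replace = TRUE),
  multilevel = sample(1:4, 10, replace = TRUE),
  numerical  = 1:10
)

# Hypothetical helper (not part of dplyr): TRUE for a numeric column
# with fewer than `max_levels` distinct values.
is_low_cardinality <- function(x, max_levels = threshold) {
  is.numeric(x) && n_distinct(x) < max_levels
}

# where() accepts any predicate function, so a named function works
# just like the ~ lambda used above.
converted <- data %>%
  mutate(across(where(is_low_cardinality), factor))

sapply(converted, class)
#   binary1   binary_2 multilevel  numerical
#  "factor"   "factor"   "factor"  "integer"

# Negate() turns a predicate into its logical complement, which gives
# where() an actual function instead of the invalid `!is.factor`.
all_factors <- data %>%
  mutate(across(where(Negate(is.factor)), factor))
```

    A named predicate is mainly useful when the same rule is reused in several places; for a one-off call, the ~ lambda is shorter.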