group_by is slow when filtering with any()


I want to group_by and keep the groups that contain any NA, or any given factor value, in a dataset. I would like to use the any function within dplyr, but it is slow for NAs and factors (though not when looking for a numeric value). Example data:

library(tidyverse)
set.seed(10)
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0.0,
# where data.frame() no longer converts strings to factors by default
df <- data.frame(group = rep(paste("g", seq(1, 50000, 1), sep = ""), each = 500, length.out = 2500000),
                 binary = rbinom(2500000, 1, 0.5),
                 narow = rep(letters[1:26], each = 2, length.out = 2500000),
                 stringsAsFactors = TRUE)
# Set two rows of narow to NA
df <- df %>%
  dplyr::mutate(narow = replace(narow, row_number() == 345 | row_number() == 77777, NA))

str(df)
# 'data.frame':  2500000 obs. of  3 variables:
# $ group : Factor w/ 5000 levels "g1","g10","g100",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ binary: int  1 0 0 1 0 0 0 0 1 0 ...
# $ narow : Factor w/ 26 levels "a","b","c","d",..: 1 1 2 2 3 3 4 4 5 5 ...

Now let's group_by and keep the groups with any binary == 1:

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(binary == 1))
)
# user  system elapsed 
# 0.1     0.0     0.1

This runs quickly, but doing the same thing to find any NAs is very slow (and my real dataset is much bigger):

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(is.na(narow)))
  )
# user  system elapsed 
# 5.25    8.49   13.75 

This seems extremely slow considering how quick the previous, very similar code is (0.1 s vs 13.75 s). Is this to be expected, or am I doing something wrong? I would like to continue to use the any function, as I find it intuitive.

EDIT

It seems to go beyond just NAs. If I filter on any factor variable, it is slow too:

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(narow == "a"))
)
# user  system elapsed 
# 5.32    7.45   12.83 

Solution

  • As @NelsonGon mentioned, anyNA is the function to use in your case.

    The problem has already been mentioned here: https://stackoverflow.com/a/35713234/10580543

    For the binary example, any is satisfied at the first occurrence of binary == 1, while is.na goes through the entire vector, here of length 2500000.

    filter(anyNA(narow)) should be much faster than filter(any(is.na(narow)))

    Edit: in practice, the time gain is very limited for factors (about 4% faster).

    However, converting the factor to character makes the filtering very fast (about 100 times faster). The explanation of why is here, if you are interested: https://stackoverflow.com/a/34865113/10580543

    If you are not interested in ordering levels, using characters instead of factors for categorical variables is usually preferred, and far more efficient.
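
To make the factor-to-character suggestion concrete, here is a minimal sketch on a smaller stand-in dataset. The names df_small, df_chr, res_fct and res_chr are made up for illustration, and the data is only shaped like the question's, not the original. It checks that both versions keep the same groups; it does not reproduce the timings quoted above.

```r
library(dplyr)

# Hypothetical, smaller stand-in for the question's data: 200 groups of 500 rows
df_small <- data.frame(
  group = rep(paste0("g", 1:200), each = 500),
  narow = rep(letters, each = 2, length.out = 100000),
  stringsAsFactors = TRUE
)
# Put an NA in two different groups (rows 345 and 77777)
df_small$narow[c(345, 77777)] <- NA

# Convert the factor column to character once, before grouping
df_chr <- df_small %>%
  mutate(narow = as.character(narow))

# Same filter on the factor version and on the character version
res_fct <- df_small %>%
  group_by(group) %>%
  filter(anyNA(narow)) %>%
  ungroup()

res_chr <- df_chr %>%
  group_by(group) %>%
  filter(anyNA(narow)) %>%
  ungroup()
```

Wrapping each pipeline in system.time(), as in the question, should show how much difference the conversion makes on your own data and dplyr version.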