Search code examples
rdplyrtidyeval

Filter Dataframe with Tidy Evaluation


I am handling a large dataset. First, for certain columns (X1, X2, ...), I am trying to identify a range of value (a, b) consists of repeated value (a > n, b > n). Next, I wish to filter row based on the condition which matches respective columns to result given in the previous step.

Here is a reproducible example simulating the scenario I am facing,

library(tidyverse)

set.seed(1122)

vecs <- lapply(X = 1:2, function(x) rep(c(1, 2, 3), times = 10) %>% sample() %>% head(10))
names(vecs) <- paste0("col_", 1:2)
dat <- vecs %>% as.data.frame()
dat

   col_1 col_2
1      3     2
2      1     1
3      1     1
4      1     2
5      1     2
6      3     3
7      3     3
8      2     1
9      1     3
10     2     2

I am able to identify the range by the following method,

# Which col has repeated value more than 3 appearances?
more_than_3 <- function(df, var){
    var <- rlang::sym(var)

    df %>% 
        group_by(!!var) %>% 
        summarise(n = n()) %>% 
        filter(n > 3) %>% 
        pull(!!var) %>% 
        range()
}
cols_name <- c("col_1", "col_2")
some_range <- purrr::map(cols_name, more_than_3, df = dat)
names(some_range) <- cols_name
some_range

$col_1
[1] 1 1

$col_2
[1] 2 2

However, to filter out values that fall outside the upper limit, this is what I do.

dat %>% 
    filter(col_1 <= some_range[["col_1"]][2], 
           col_2 <= some_range[["col_2"]][2])

  col_1 col_2
1     1     1
2     1     1
3     1     2
4     1     2

I believe there must be a more efficient and elegant way of filtering the result based on tidy evaluation. Can someone point me to the right direction?

Many thanks in advance.


Solution

  • First let's try to create a small function that creates a single filter expression for one column. This function will take a symbol and then transform to string but it could be the other way around:

    new_my_filter_call_upper <- function(sym, range) {
      col_name <- as.character(sym)
    
      col_range <- range[[col_name]]
      if (is.null(col_range)) {
        stop(sprintf("Can't find column `%s` to compute range", col_name), call. = FALSE)
      }
    
      expr(!!sym < !!col_range[[2]])
    }
    

    Let's try it:

    new_my_filter_call_upper(quote(foobar), some_range)
    #> Error: Can't find column `foobar` to compute range
    
    # It works!
    new_my_filter_call_upper(quote(col_1), some_range)
    #> col_1 < 3
    

    Now we're ready to create a pipeline verbs that will take a data frame and bare column names.

    # Probably cleaner to pass range as argument. Prefix with dot to allow
    # columns named `range`.
    my_filter <- function(.data, ..., .range) {
      # ensyms() guarantees there won't be complex expressions
      syms <- rlang::ensyms(...)
    
      # Now let's map the function to create many calls:
      calls <- purrr::map(syms, new_my_filter_call_upper, range = .range)
    
      # And we're ready to filter with those expressions:
      dplyr::filter(.data, !!!calls)
    }
    

    Let's try it:

    dat %>% my_filter(col_1, col_2, .range = some_range)
    #>   col_1 col_2 NA.
    #> 1     2     1   1
    #> 2     2     2   1