Execute dplyr operation only if column exists


Drawing on the discussion on conditional dplyr evaluation, I would like to conditionally execute a step in a pipeline depending on whether the referenced column exists in the passed data frame.

Example

The results generated by 1) and 2) should be identical.

Existing column

# 1)
mtcars %>% 
  filter(am == 1) %>%
  filter(cyl == 4)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(cyl == 4) else .
  }

Unavailable column

# 1)
mtcars %>% 
  filter(am == 1)

# 2)    
mtcars %>%
  filter(am == 1) %>%
  {
    if("absent_column" %in% names(.)) filter(absent_column == 4) else .
  }

Problem

For the existing column, version 2) fails because the object passed into the braces does not reach the filter call as the initial data frame. The code returns the error message:

Error in filter(cyl == 4) : object 'cyl' not found

I have tried alternative syntax (with no luck):

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(.$cyl == 4) else .
  }

 Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "logical" 

Follow-up

I wanted to expand this question to account for conditional evaluation on the right-hand side of the == in the filter call. For instance, the syntax below attempts to filter on the first available value.

mtcars %>%
  filter({
    if ("does_not_ex" %in% names(.))
      does_not_ex
    else
      NULL
  } == {
    if ("does_not_ex" %in% names(.))
      unique(.[['does_not_ex']])
    else
      NULL
  })

As expected, the call fails with an error message:

Error in filter_impl(.data, quo) : Result must have length 32, not 0

When applied to existing column:

mtcars %>%
  filter({
    if ("mpg" %in% names(.))
      mpg
    else
      NULL
  } == {
    if ("mpg" %in% names(.))
      unique(.[['mpg']])
    else
      NULL
  })

It works with a warning message:

  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Warning message: In { : longer object length is not a multiple of shorter object length

Follow-up question

Is there a neat way of extending the existing syntax to get conditional evaluation on the right-hand side of the filter call, ideally staying within the dplyr workflow?


Solution

  • For dplyr version >= 1.0.4

    With if_any(), available from dplyr 1.0.4 onwards, you can achieve this:

    mtcars %>% 
      select(!cyl) %>% 
      filter(am == 1) %>% 
      filter(if_any(matches("cyl"),  \(cl) cl == 4))
    

    Or, if you're using R < 4.1, you can use the older purrr-style anonymous function ~.x == 4:

    mtcars %>% 
      select(!cyl) %>% 
      filter(am == 1) %>% 
      filter(if_any(matches("cyl"),  ~.x == 4))
    

    See the tidyverse blog for more details.

    Old answer, for dplyr 1.0.0-1.0.7

    This gives a deprecation warning in dplyr > 1.0.7

    With across() in dplyr >= 1.0.0 you can use any_of() when filtering. Compare with the original, which has all columns:

    mtcars %>% 
      filter(am == 1) %>% 
      filter(cyl == 4)
    

    With cyl removed, it throws an error:

    mtcars %>% 
      select(!cyl) %>% 
      filter(am == 1) %>% 
      filter(cyl == 4)
    

    Using any_of (note you have to write "cyl" and not cyl):

    mtcars %>% 
      select(!cyl) %>% 
      filter(am == 1) %>% 
      filter(across(any_of("cyl"), ~.x == 4))
    #N.B. this is equivalent to just filtering by `am == 1`.
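
The brace-based attempt from the question can also be made to work. Inside `{ }` the pipe does not insert the left-hand side as the first argument, so the data frame has to be passed to filter() explicitly as the dot. Below is a minimal sketch of that fix, together with two small wrappers. The names `filter_if_exists` and `filter_first_value_if_exists` are hypothetical helpers written for this illustration and are not dplyr functions. The second one is one possible reading of the follow-up ("filter on the first available value") and simply takes the first value of the column.

```r
library(dplyr)

# Inline version: pass the dot explicitly to filter() inside the braces.
res_inline <- mtcars %>%
  filter(am == 1) %>%
  {
    if ("cyl" %in% names(.)) filter(., cyl == 4) else .
  }

# Hypothetical helper (not part of dplyr): keep rows where `col == value`
# only if `col` exists in `data`; otherwise return `data` unchanged.
filter_if_exists <- function(data, col, value) {
  if (col %in% names(data)) {
    # .data[[col]] looks the column up by name; !!value injects the
    # function argument so it cannot be confused with a column.
    filter(data, .data[[col]] == !!value)
  } else {
    data
  }
}

# Hypothetical helper for the follow-up: compare the column against its
# own first value, again only if the column exists.
filter_first_value_if_exists <- function(data, col) {
  if (!col %in% names(data)) return(data)
  first_value <- data[[col]][1]
  filter(data, .data[[col]] == !!first_value)
}

# Existing column: same rows as filter(am == 1) %>% filter(cyl == 4).
res_helper <- mtcars %>%
  filter(am == 1) %>%
  filter_if_exists("cyl", 4)

# Absent column: the step is skipped and only am == 1 is applied.
res_absent <- mtcars %>%
  filter(am == 1) %>%
  filter_if_exists("absent_column", 4)

# Follow-up: existing and absent column.
res_first <- mtcars %>% filter_first_value_if_exists("mpg")
res_first_absent <- mtcars %>% filter_first_value_if_exists("does_not_ex")
```

Because the column check happens in plain R before filter() is called, this sketch does not depend on how a particular dplyr version treats an empty column selection inside if_any() or across().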