Filtering by multiple columns at once in `dplyr`


Here is some sample data:

library(tidyverse)

data <- matrix(runif(20), ncol = 4) 
colnames(data) <- c("mt100", "cp001", "cp002", "cp003")
data <- as_tibble(data)

The real data set has many more columns, and many of them start with "cp". In dplyr I can select all of these columns:

data %>%
  select(starts_with("cp"))

Is there a way to use starts_with (or a similar function) to filter by multiple columns without having to write them all out explicitly? I'm thinking of something like this:

data %>%
  filter(starts_with("cp") > 0.2)

Solution

  • We could use if_all or if_any, as Anil points out in the comments. The tidyverse blog post below introduces them, and the code for your case follows:

    https://www.tidyverse.org/blog/2021/02/dplyr-1-0-4-if-any/

    if_any() and if_all()

    "across() is very useful within summarise() and mutate(), but it’s hard to use it with filter() because it is not clear how the results would be combined into one logical vector. So to fill the gap, we’re introducing two new functions if_all() and if_any()."

    if_all (keeps rows where every "cp" column is above 0.2):

    data %>% 
      filter(if_all(starts_with("cp"), ~ . > 0.2))
    
      mt100 cp001 cp002 cp003
      <dbl> <dbl> <dbl> <dbl>
    1 0.688 0.402 0.467 0.646
    2 0.663 0.757 0.728 0.335
    3 0.472 0.533 0.717 0.638
    

    if_any (keeps rows where at least one "cp" column is above 0.2):

    data %>% 
      filter(if_any(starts_with("cp"), ~ . > 0.2))
    
      mt100 cp001   cp002 cp003
      <dbl> <dbl>   <dbl> <dbl>
    1 0.554 0.970 0.874   0.187
    2 0.688 0.402 0.467   0.646
    3 0.658 0.850 0.00813 0.542
    4 0.663 0.757 0.728   0.335
    5 0.472 0.533 0.717   0.638
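
Because the sample data comes from runif() without a seed, the rows printed above will differ from run to run. As a minimal sketch of the same two calls on fixed input, the example below uses a small hand-written tibble; `toy`, `all_rows`, `any_rows`, and the values in them are made up for illustration and are not part of the question's data. It assumes dplyr 1.0.4 or later, where if_all() and if_any() are available.

```r
library(dplyr)

# Hypothetical hand-written data (not the question's random sample),
# so the result is the same on every run.
toy <- tibble(
  mt100 = c(0.9, 0.9, 0.9),
  cp001 = c(0.5, 0.5, 0.1),
  cp002 = c(0.6, 0.1, 0.1),
  cp003 = c(0.7, 0.7, 0.1)
)

# Keep rows where every "cp" column exceeds 0.2.
all_rows <- toy %>%
  filter(if_all(starts_with("cp"), ~ .x > 0.2))

# Keep rows where at least one "cp" column exceeds 0.2.
any_rows <- toy %>%
  filter(if_any(starts_with("cp"), ~ .x > 0.2))

nrow(all_rows)  # 1: only the first row passes on all three columns
nrow(any_rows)  # 2: the third row fails on every column and is dropped
```

The second row is the one that separates the two functions: it passes on cp001 and cp003 but fails on cp002, so if_any keeps it and if_all drops it.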