Search code examples
rfiltertidyverse

Filtering a dataframe for specific combinations of variables using a named list


I want to filter a nested dataframe for specific combinations of variables using a named list and I can't manage to exclude some unwanted combinations. Here's an example :

library(tidyverse)

# Create fake data
set.seed(1234)
data <- tibble(
    c1 = rep(letters[1:3], each = 10),
    c2 = sample(letters[4:6], size = 30, replace = T),
    var1 = rnorm(30),
    var2 = rnorm(30)
)
nested_data <- data %>% 
    nest(.by = c(c1, c2))

# Create list of the specific combinations I want
criteria <- list(a = c("d", "e"), b = "d")

I tried to use the functions names() and unique() to do it but the result does not exclude unwanted combinations with overlapping criterias.

# Filter for the specific combinations
c1_criteria <- names(criteria)
c2_criteria <- unique(unlist(criteria))
nested_data %>% 
    filter(c1 %in% c1_criteria,
           c2 %in% c2_criteria) %>% 
    unnest(data)

Here's the output

# A tibble: 4 × 3
  c1    c2    data            
  <chr> <chr> <list>          
1 a     e     <tibble [5 × 2]>
2 a     d     <tibble [3 × 2]>
3 b     e     <tibble [6 × 2]>
4 b     d     <tibble [1 × 2]>

I intended to have only the following combinations :
c1 == "a" & c2 == "d", c1 == "a" & c2 == "e", c1 == "b" & c2 == "d"

However, the output also includes the combination, c1 == "b" & c2 == "e". So the intended output is the following :

# A tibble: 3 × 3
  c1    c2    data            
  <chr> <chr> <list>          
1 a     e     <tibble [5 × 2]>
2 a     d     <tibble [3 × 2]>
3 b     d     <tibble [1 × 2]>

I think there might be a way to generate a list of specific logical conditions from the named list criterias and supply it as argument to the filter function but I'm not sure how to do it.


Solution

  • We could turn that list into a 2-column filtering table that holds desired combinations of c1 & c2, then it can be used with semi_join():

    library(tidyverse)
    
    set.seed(1234)
    data <- tibble(
      c1 = rep(letters[1:3], each = 10),
      c2 = sample(letters[4:6], size = 30, replace = T),
      var1 = rnorm(30),
      var2 = rnorm(30)
    )
    nested_data <- data %>% 
      nest(.by = c(c1, c2))
    
    criteria <- list(a = c("d", "e"), b = "d")
    
    enframe(criteria, "c1", "c2") %>% 
      unnest(c2) %>% 
      semi_join(nested_data, .)
    #> Joining with `by = join_by(c1, c2)`
    #> # A tibble: 3 × 3
    #>   c1    c2    data            
    #>   <chr> <chr> <list>          
    #> 1 a     e     <tibble [5 × 2]>
    #> 2 a     d     <tibble [3 × 2]>
    #> 3 b     d     <tibble [1 × 2]>
    

    List transformation steps:

    enframe(criteria, name = "c1", value = "c2")
    #> # A tibble: 2 × 2
    #>   c1    c2       
    #>   <chr> <list>   
    #> 1 a     <chr [2]>
    #> 2 b     <chr [1]>
    
    enframe(criteria, "c1", "c2") %>% unnest(c2)
    #> # A tibble: 3 × 2
    #>   c1    c2   
    #>   <chr> <chr>
    #> 1 a     d    
    #> 2 a     e    
    #> 3 b     d
    

    Created on 2023-10-02 with reprex v2.0.2