Search code examples
rsqlitedbplyr

Filter data in one SQLite database table by ids from another table in the same database


I have a local SQLite database with two tables. I want to make a query from one table filtering rows by ids from another table in the same database using dbplyr.

As an example, here is what I would do without a database involved:

# install.packages("tidyverse")
library(tidyverse)

data <-
  tibble(id = seq(1, 8),
         data = LETTERS[seq(1, 8)])

condition <- 
  tibble(id = seq(1, 8),
         group = c(rep("low", 4), rep("high", 4)))

data %>% 
  filter(id %in% pull(condition %>% 
                        filter(group == "high") %>% 
                        select(id)))

which gives me

# A tibble: 4 × 2
     id data 
  <int> <chr>
1     5 E    
2     6 F    
3     7 G    
4     8 H

Using the same code with tables in the database instead of tibbles in the memory

# install.packages("RSQLite")
library(RSQLite)

test_db <- 
  dbConnect(RSQLite::SQLite(), "test_db.sqlite")

dbWriteTable(test_db, "data_tbl", data)
dbWriteTable(test_db, "condition_tbl", condition)

dbListTables(test_db)

data_db <- 
  tbl(test_db, "data_tbl")

condition_db <- 
  tbl(test_db, "condition_tbl")      

data_db %>% 
  filter(id %in% 
           pull(condition_db %>% 
                  filter(group == "high") %>% 
                  select(id)))

produces the following error:

Error in UseMethod("escape") : 
 not applicable method for 'escape' applied on object of class "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

When I load the ids in the memory first the query works without a problem.

detour <- 
  pull(condition_db %>% 
         filter(group == "high") %>% 
         select(id) %>% 
         collect())

data_db %>% 
  filter(id %in% detour)

# A tibble: 4 × 2
     id data 
  <int> <chr>
1     5 E    
2     6 F    
3     7 G    
4     8 H

Since pull() seems to be the problem here - how can I make such queries without the "detour" provided above?


Solution

  • I use a semi-join for this problem when working with dbplyr. A semi-join between two tables returns every record from the first table where there is at least one match with a record in the second table. (An anti-join is similar, returning where there are no matches in the second table.)

    This would look like:

    prepared_condition = condition %>%
      filter(group == "high")
    
    output = data %>%
      semi_join(prepared_condition, by = "id")