Search code examples
rdplyrevalrlangdbplyr

How to build a wrapper function for querying database using dbplyr and dplyr, having the query vary


I'm trying to build a wrapper function for querying an SQL database using {dplyr} and {dbplyr}. It is always the same database, accessed via the same connection. The only thing that varies is the query.

Let's use an example based on code from Hadley's book here:

library(DBI)
library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
mtcars_db <- dplyr::copy_to(con, mtcars)

mtcars_db %>%
  filter(cyl > 2) %>%
  select(mpg:hp) %>%
  head(10) %>%
  collect()
#> # A tibble: 10 x 4
#>      mpg   cyl  disp    hp
#>    <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110
#>  2  21       6  160    110
#>  3  22.8     4  108     93
#>  4  21.4     6  258    110
#>  5  18.7     8  360    175
#>  6  18.1     6  225    105
#>  7  14.3     8  360    245
#>  8  24.4     4  147.    62
#>  9  22.8     4  141.    95
#> 10  19.2     6  168.   123

Created on 2021-09-13 by the reprex package (v2.0.0)

Alternatively, we might have wanted a different query, such as to get the min() value for several columns (e.g., mpg, disp, and drat).

library(DBI)
library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
mtcars_db <- dplyr::copy_to(con, mtcars)

mtcars_db %>%
  summarise(min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat)) %>%
  collect()
#> # A tibble: 1 x 3
#>   min_mpg min_disp min_drat
#>     <dbl>    <dbl>    <dbl>
#> 1    10.4     71.1     2.76

Created on 2021-09-13 by the reprex package (v2.0.0)


So given the structure above (mtcars_db -> "query" -> collect()) I want to build a wrapper function get_data_from_db() that could flexibly accept different queries.

My unsuccessful attempt

library(dplyr, warn.conflicts = FALSE)

get_data_from_db <- function(kind_of_query) {
  
  con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
  mtcars_db <- dplyr::copy_to(con, mtcars)
  
  if (kind_of_query == "from_hadley_book") {
    my_query <-
      rlang::expr(
        filter(cyl > 2) %>%
          select(mpg:hp) %>%
          head(10)
      )
  }
  
  if (kind_of_query == "mins_for_mpg_disp_drat") {
    my_query <- 
      rlang::expr(
      summarise(min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat))
      )
  }
  
  mtcars_db %>%
    eval(my_query) %>%
    collect()
}

get_data_from_db("from_hadley_book")
#> Error in eval(., my_query): invalid 'envir' argument of type 'language'
get_data_from_db("mins_for_mpg_disp_drat")
#> Error in eval(., my_query): invalid 'envir' argument of type 'language'

Created on 2021-09-13 by the reprex package (v2.0.0)


I just gave it a shot using rlang::expr() and then eval(), but this strategy might be incorrect in general for solving this problem. Would be happy to learn how to fix get_data_from_db() using any relevant approach.


EDIT


I'd like to ask about another scenario in the same context of this question.

Let's take get_data_from_db() and its argument kind_of_query. What if I wanted greater flexibility in what I pass to kind_of_query, such that I could pass the chain of dplyr verbs to the argument?

That is, instead of get_data_from_db("from_hadley_book") how could I do get_data_from_db(kind_of_query = filter(cyl > 2) %>% select(mpg:hp) %>% head(10))?

Basically it means that get_data_from_db() is just a wrapper that "sandwiches" mtcars_db and collect() around the query passed via kind_of_query argument.

So this "flexible" version of get_data_from_db() would look like:

get_data_from_db <- function(kind_of_query) {
  
  con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
  mtcars_db <- dplyr::copy_to(con, mtcars)
  
  mtcars_db %>%
    eval(kind_of_query) %>%
    collect()
}

## calling the function
get_data_from_db(kind_of_query = 
                   filter(cyl > 2) %>% 
                   select(mpg:hp) %>% 
                   head(10)
                 )

Any idea how to achieve this?


Solution

  • @Waldi hits on the crux of the problem, which is the pipe expects a function not an expression as the rhs. In the specific/choose from a list case, you control the expression building so this is manageable. You can use magrittr semantics and the dot placeholder to build from kind_of_query. This in turn can be used to create the complete expression (query) with rlang::quo and the !! operator.

    get_data_from_db <- function(kind_of_query) {
      
      con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
      on.exit(DBI::dbDisconnect(con))
      mtcars_db <- dplyr::copy_to(con, mtcars)
      
      if (kind_of_query == "from_hadley_book") {
        my_query <-
          rlang::expr(
            {
                filter(., cyl > 2) %>%
                select(mpg:hp) %>%
                head(10)
            }
          )
      }
      
      if (kind_of_query == "mins_for_mpg_disp_drat") {
        my_query <- 
          rlang::expr(
            {summarise(., min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat))}
          )
      }
      
      query <- quo(
        mtcars_db %>%  
          !!my_query %>%
          collect()
      )
      
      eval_tidy(query)
      
    }
    

    This is actually an overly sophisticated approach. If you're writing the expression for the kind_of_query, you might as well just simplify it by writing a function.

    get_data_from_db2 <- function(kind_of_query) {
      
      con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
      on.exit(DBI::dbDisconnect(con))
      mtcars_db <- dplyr::copy_to(con, mtcars)
      
      if (kind_of_query == "from_hadley_book") {
        my_fx <- function(x){
          x %>% 
            filter(cyl > 2) %>%
            select(mpg:hp) %>%
            head(10)
        }
      }
      
      if (kind_of_query == "mins_for_mpg_disp_drat") {
        my_fx <- function(x){
          summarise(x, min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat))
        }
      }
      
        mtcars_db %>%  
          my_fx %>%
          collect()
        
    }
    

    The problem comes with the general case. In the current proposed interface, you are trying to inject an argument value into a user-defined expression. The !! operator forces evaluation so when building the new expression, the user expression is inserted within () to force its evaluation before anything is passed from the lhs of the pipe. Manipulating the expression then likely requires deparse as suggested by @Waldi or some low level manipulation of the abstract syntax tree.

    The simpler solution, if possible, would be to have your users pass in a function, similar to purrr::map or lapply. This would drastically simplify the function implementation

    get_data_from_db_general <- function(kind_of_query) {
      
      con <- DBI::dbConnect(RSQLite::SQLite(), filename = ":memory:")
      on.exit(DBI::dbDisconnect(con))
      mtcars_db <- dplyr::copy_to(con, mtcars)
    
      mtcars_db %>%
        kind_of_query %>%
        collect()
    }
    
    get_data_from_db_general(
      kind_of_query = function(x){
        x %>%
          filter(cyl > 2) %>%
          select(mpg:hp) %>%
          head(10)
      }
    )
    
    # A tibble: 10 x 4
         mpg   cyl  disp    hp
       <dbl> <dbl> <dbl> <dbl>
     1  21       6  160    110
     2  21       6  160    110
     3  22.8     4  108     93
     4  21.4     6  258    110
     5  18.7     8  360    175
     6  18.1     6  225    105
     7  14.3     8  360    245
     8  24.4     4  147.    62
     9  22.8     4  141.    95
    10  19.2     6  168.   123