Search code examples
rdatabasedplyrlapplydbplyr

Is there a way to append tables within a do.call() when using dplyr verbs within a database?


Say there's a list of dataframes that is generated following a lapply call on some vector and the aim is to append the list of dataframes to one another to form a single dataframe. In R, one approach is do.call(rbind, data.ls), where data.ls is a list object composed of a list of dataframes. This approach also works with dplyr's bind_rows in the exact same way but it does NOT work with the SQL's eponymous union_all when one is binding more than two dataframes. In most cases that's fine but what does one do when working in a database enviornment and one's data is too heavy to collect before the end of the lapply call (and there are more than two dataframes to append)? This phenomena is not unique to dbplyr either. The authors of the sparklyr package recognized the limitations of union_all and implemented a sparklyr verb that accomplished what rbind/bind_rows does in a local environment -sdf_bind_rows. I'm curious if there's an analogous dplyr/dbplyr verb to accomplish or some other approach that can accomplish this that still relies on the results of the lapply call -i.e. the list of dataframes.

From sparklyr documentation:

Description

sdf_bind_rows() and sdf_bind_cols() are implementation of the common pattern of do.call(rbind, sdfs) or do.call(cbind, sdfs) for binding many Spark DataFrames into one.


Solution

  • Perhaps something like the following for rbind:

    dbplyr_union_all <- function(...){
      tbl_list <- list(...)
      db_con <- list[[1]]$src$con # extract database connection
    
      # construct query that defines unioned table
      query_text = sapply(tbl_list, dbplyr::sql_render)
      sql_query = paste0(query_text, collapse = "\nUNION ALL\n")
    
      sql_query <- dbplyr::build_sql(sql_query, con = db_con)
      return(dplyr::tbl(db_con, dbplyr::sql(sql_query)))
    }