Say there's a list of dataframes generated by an `lapply()` call over some vector, and the aim is to append the dataframes to one another to form a single dataframe. In R, one approach is `do.call(rbind, data.ls)`, where `data.ls` is a list of dataframes. This approach also works in exactly the same way with `dplyr`'s `bind_rows()`, but it does NOT work with `union_all()` (the verb that translates to SQL's `UNION ALL`) when binding more than two dataframes, because `union_all()` only accepts two tables at a time. In most cases that's fine, but what does one do when working in a database environment and the data is too heavy to `collect()` before the end of the `lapply()` call (and there are more than two dataframes to append)? This limitation is not unique to `dbplyr` either. The authors of the `sparklyr` package recognized the limitation of `union_all()` and implemented a `sparklyr` verb that accomplishes what `rbind()`/`bind_rows()` does in a local environment: `sdf_bind_rows()`. I'm curious whether there's an analogous `dplyr`/`dbplyr` verb, or some other approach, that accomplishes this while still relying on the result of the `lapply()` call, i.e. the list of dataframes.
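For concreteness, here is a minimal sketch of the pattern, assuming an in-memory SQLite connection via `DBI`/`RSQLite` (the table names `t1`–`t3` are made up for illustration):

```r
library(dplyr)
library(dbplyr)

# local case: do.call() flattens the list into the arguments of rbind()
data.ls <- lapply(1:3, function(i) data.frame(id = i, x = runif(2)))
do.call(rbind, data.ls)      # works
do.call(bind_rows, data.ls)  # works

# database case: union_all() is a two-table verb, so the same
# do.call() pattern breaks once the list holds more than two tables
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
tbl.ls <- lapply(seq_along(data.ls), function(i) {
  copy_to(con, data.ls[[i]], name = paste0("t", i))
})
do.call(union_all, tbl.ls)   # fails to append all three: union_all() takes exactly two tables
```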
From the `sparklyr` documentation:

> **Description**
>
> `sdf_bind_rows()` and `sdf_bind_cols()` are implementation of the common pattern of `do.call(rbind, sdfs)` or `do.call(cbind, sdfs)` for binding many Spark DataFrames into one.
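In that setting the pattern looks roughly like the following sketch, assuming a local Spark session (the object names `sc` and `sdfs` are illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# copy the list of local dataframes into Spark
sdfs <- lapply(seq_along(data.ls), function(i) {
  copy_to(sc, data.ls[[i]], name = paste0("sdf", i))
})

# bind all of them at once, mirroring do.call(rbind, data.ls)
all_rows <- do.call(sdf_bind_rows, sdfs)
```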
Perhaps something like the following for the `rbind()` case:
```r
dbplyr_union_all <- function(...) {
  tbl_list <- list(...)
  db_con <- tbl_list[[1]]$src$con  # extract the database connection from the first table

  # render each lazy table to its SQL text
  query_text <- sapply(tbl_list, dbplyr::sql_render)

  # stitch the subqueries together into one UNION ALL query
  sql_query <- paste0(query_text, collapse = "\nUNION ALL\n")

  dplyr::tbl(db_con, dbplyr::sql(sql_query))
}
```
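If that holds up, the function could be applied straight to the `lapply()` result, e.g. (using the hypothetical `tbl.ls` list from the sketch above):

```r
# bind the whole list of lazy tables without collecting
big_tbl <- do.call(dbplyr_union_all, tbl.ls)
```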