
arrow, duckdb and dbplyr --- memory error after creating several tbl_duckdb_connections


I am working with large datasets using {arrow, duckdb, dbplyr}.

I am running into issues after opening several tbl_duckdb_connections, which I need so I can manipulate and join the tables until I am left with a single tbl_duckdb_connection.

When I finally try to pull some data from the final duckdb object (like performing a slice and uploading it, or even just counting the number of rows), it takes forever and I get an "Out of Memory" error, which also says that "Database is launched in in-memory mode and no temporary directory is specified".

What am I doing wrong?

My code looks something like this, but there are more joins involved:

library(dplyr)   # for %>% and inner_join()

db1 <- arrow::open_dataset(source = paste0(path, "data1.csv"), format = "csv") %>%
  arrow::to_duckdb()

db2 <- arrow::open_dataset(source = paste0(path, "data2.csv"), format = "csv") %>%
  arrow::to_duckdb()

# db3 stays a lazy tbl_duckdb_connection; nothing is pulled into R yet
db3 <- inner_join(db1, db2, by = "id")

This is strange because even counting the number of rows in db3 using the following code takes a long time and gives the same error:

db3 %>%
  ungroup() %>%
  summarise(num = n())

Solution

  • arrow::to_duckdb() calls the private arrow_duck_connection() helper, which essentially calls duckdb::duckdb() with default arguments, creating an in-memory database.

    The "Out of Memory" error indicates that this in-memory DuckDB database has run out of RAM: as the message says, the database was launched in in-memory mode with no temporary directory specified, so it has nowhere to spill intermediate results to disk.

    You could instead create a DuckDB connection explicitly, backed by a file on disk, e.g. con <- DBI::dbConnect(duckdb::duckdb(dbdir = "mydb.duckdb")), and pass it to arrow::to_duckdb() through its con argument.
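
    A minimal sketch of that approach, adapted from the question's code. The file name "mydb.duckdb" is a placeholder, and path is assumed to be defined as in the question; auto_disconnect = FALSE is used because both tables share one connection (which also lets DuckDB run the join natively):

    library(dplyr)

    # File-backed database: DuckDB can spill intermediate results to disk
    # instead of aborting with "Out of Memory"
    con <- DBI::dbConnect(duckdb::duckdb(dbdir = "mydb.duckdb"))

    db1 <- arrow::open_dataset(source = paste0(path, "data1.csv"), format = "csv") %>%
      arrow::to_duckdb(con = con, auto_disconnect = FALSE)

    db2 <- arrow::open_dataset(source = paste0(path, "data2.csv"), format = "csv") %>%
      arrow::to_duckdb(con = con, auto_disconnect = FALSE)

    # The join and the count run inside DuckDB; only the result reaches R
    inner_join(db1, db2, by = "id") %>%
      summarise(num = n()) %>%
      collect()

    # Shut the database down when done
    DBI::dbDisconnect(con, shutdown = TRUE)

    Alternatively, if you prefer to keep the database in memory, DuckDB has a temp_directory setting you can point at a scratch directory for spilling, e.g. DBI::dbExecute(con, "SET temp_directory='/tmp/duckdb_spill'") (the path here is just an example).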