
Reading in many tables from SQLite databases and combining them in R


I'm working with a program that outputs a database of results. I have hundreds of these databases, all identical in structure, and I'd like to combine them into ONE big database. I'm mostly interested in one table from each database. I don't work with databases/SQL very much, but skipping the CSV export would simplify other steps in the process.

Previously I did this by exporting a CSV from each database and using these steps to combine all of the CSVs:

Vector of all CSVs & combine

library(DBI)
library(RSQLite)
library(dplyr)

csv_locs <- list.files(newdir, recursive = TRUE, pattern = "\\.csv$", full.names = TRUE)

pic_dat <- do.call("rbind", lapply(csv_locs,
  FUN = function(files) data.table::fread(files, data.table = FALSE)))

How can I do this with SQL-style database tables?

I'm basically pulling out the first table, then union-ing the rest onto it in a loop:

db_locs <- list.files(directory, recursive = TRUE, pattern = "\\.ddb$", full.names = TRUE)


# first table
con1 <- DBI::dbConnect(RSQLite::SQLite(), db_locs[1])
start <- tbl(con1, "DataTable")

# open connection to location [i], get table, union, disconnect; repeat
for (i in 2:length(db_locs)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_locs[i])
  y <- tbl(con, "DataTable")
  start <- union(start, y, copy = TRUE)
  dbDisconnect(con)
}

This is exceptionally slow! Well, to be fair, it's a lot of data, and the CSV version is also slow.

I think I honestly wrote the slowest possible way to do this :) I couldn't get the do.call/lapply option to work here, but maybe I'm missing something.


Solution

  • This looks similar to "iterative rbinding of frames": each time you union, the entire accumulated table is copied into a new object (unconfirmed, but that's my gut feeling). That might work for a few tables, but it scales very poorly. I suggest you collect all of the tables in a list and call data.table::rbindlist once at the end, then insert the result into a table; a quick sketch of why is just below.
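
    A minimal sketch of my own (not part of your code) showing why growing an object iteratively is slow: each rbind copies everything accumulated so far, so total work grows roughly quadratically with the number of pieces, whereas collect-then-bind copies each row once.

    piece <- data.frame(x = runif(1e4))
    # iterative growth: 'acc' is recopied on every pass through the loop
    system.time({
      acc <- piece[0, , drop = FALSE]
      for (i in 1:200) acc <- rbind(acc, piece)
    })
    # collect-then-bind: build a list, copy once at the end
    system.time(acc2 <- data.table::rbindlist(rep(list(piece), 200)))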

    Without your data, I'll contrive a situation. And because I'm not entirely certain whether you have just one table per sqlite3 file, I'll add two tables per database. If you only have one, the solution simplifies easily (a simplified sketch follows the output below).

    for (i in 1:3) {
      con <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars_%d.sqlite3", i))
      DBI::dbWriteTable(con, "mt1", mtcars[1:3,1:3])
      DBI::dbWriteTable(con, "mt2", mtcars[4:5,4:7])
      DBI::dbDisconnect(con)
    }
    (lof <- list.files(pattern = "\\.sqlite3$", full.names = TRUE))
    # [1] "./mtcars_1.sqlite3" "./mtcars_2.sqlite3" "./mtcars_3.sqlite3"
    

    Now I'll iterate over each of them and read the contents of both tables:

    allframes <- lapply(lof, function(fn) {
      con <- DBI::dbConnect(RSQLite::SQLite(), fn)
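      # if a table is missing from this file, record NULL rather than erroring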
      mt1 <- tryCatch(DBI::dbReadTable(con, "mt1"),
                      error = function(e) NULL)
      mt2 <- tryCatch(DBI::dbReadTable(con, "mt2"),
                      error = function(e) NULL)
      DBI::dbDisconnect(con)
      list(mt1 = mt1, mt2 = mt2)
    })
    allframes
    # [[1]]
    # [[1]]$mt1
    #    mpg cyl disp
    # 1 21.0   6  160
    # 2 21.0   6  160
    # 3 22.8   4  108
    # [[1]]$mt2
    #    hp drat    wt  qsec
    # 1 110 3.08 3.215 19.44
    # 2 175 3.15 3.440 17.02
    # [[2]]
    # [[2]]$mt1
    #    mpg cyl disp
    # 1 21.0   6  160
    # 2 21.0   6  160
    # 3 22.8   4  108
    ### ... repeated
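
    Since your real files apparently have a single "DataTable" each, the same idea simplifies to one lapply and one rbindlist (a sketch reusing db_locs and the table name from your question):

    allframes1 <- lapply(db_locs, function(fn) {
      con <- DBI::dbConnect(RSQLite::SQLite(), fn)
      on.exit(DBI::dbDisconnect(con), add = TRUE)  # always close the connection
      DBI::dbReadTable(con, "DataTable")
    })
    big <- data.table::rbindlist(allframes1)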
    

    From here, just combine them in R and write the result to a new database. While you could use do.call(rbind, ...) or dplyr::bind_rows, you already mentioned data.table, so I'll stick with that:

    con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all.sqlite3")
    DBI::dbWriteTable(con, "mt1", data.table::rbindlist(lapply(allframes, `[[`, 1)))
    DBI::dbWriteTable(con, "mt2", data.table::rbindlist(lapply(allframes, `[[`, 2)))
    DBI::dbGetQuery(con, "select count(*) as n from mt1")
    #   n
    # 1 9
    DBI::dbDisconnect(con)
    

    In the event that you can't load them all into R at one time, append each file's rows to the destination table as you go:

    con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all2.sqlite3")
    for (fn in lof) {
      con2 <- DBI::dbConnect(RSQLite::SQLite(), fn)
      mt1 <- tryCatch(DBI::dbReadTable(con2, "mt1"), error = function(e) NULL)
      if (!is.null(mt1)) DBI::dbWriteTable(con, "mt1", mt1, append = TRUE)
      mt2 <- tryCatch(DBI::dbReadTable(con2, "mt2"), error = function(e) NULL)
      if (!is.null(mt1)) DBI::dbWriteTable(con, "mt2", mt2, append = TRUE)
      DBI::dbDisconnect(con2)
    }
    DBI::dbGetQuery(con, "select count(*) as n from mt1")
    #   n
    # 1 9
    

    This approach doesn't suffer the iterative slowdown you're experiencing.
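
    A hedged aside of my own: SQLite runs each write in its own transaction by default, so if the per-file appends themselves feel slow, wrapping the loop in a single transaction via DBI::dbWithTransaction is a common speed-up. A sketch (the output filename "mtcars_all3.sqlite3" is just an example):

    con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all3.sqlite3")
    DBI::dbWithTransaction(con, {
      for (fn in lof) {
        con2 <- DBI::dbConnect(RSQLite::SQLite(), fn)
        mt1 <- tryCatch(DBI::dbReadTable(con2, "mt1"), error = function(e) NULL)
        # append = TRUE adds rows, creating the table on the first write
        if (!is.null(mt1)) DBI::dbWriteTable(con, "mt1", mt1, append = TRUE)
        DBI::dbDisconnect(con2)
      }
    })
    DBI::dbDisconnect(con)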