I'm working with a program that outputs a database of results. I have hundreds of these databases that are all identical in structure and I'd like to combine them into ONE big database. I'm mostly interested in 1 table from each database. I don't work with databases/sql very much, but it would simplify other steps in the process, to skip outputting a csv.
Previously I did this by exporting a csv and used these steps to combine all csvs:
library(DBI)
library(RSQLite)
library(dplyr)
csv_locs<- list.files(newdir, recursive = TRUE, pattern="*.csv", full.names = TRUE)
pic_dat <- do.call("rbind", lapply(csv_locs,
FUN=function(files){data.table::fread(files, data.table = FALSE)}))
I'm basically pulling out the first table, then joining on the rest with a loop.
db_locs <- list.files(directory, recursive = TRUE, pattern="*.ddb", full.names = TRUE)
# first table
con1<- DBI::dbConnect(RSQLite::SQLite(), db_locs [1])
start <- tbl(con1, "DataTable")
# open connection to location[i], get table, union, disconnect; repeat.
for(i in 2:length(db_locs )){
con <- DBI::dbConnect(RSQLite::SQLite(), db_locs[i])
y <- tbl(con, "DataTable")
start <- union(start, y, copy=TRUE)
dbDisconnect(con)
}
I think I honestly wrote the slowest possible way to do this :) I could not get the do.call/lapply option to work here, but maybe I'm missing something.
This looks similar to "iterative rbind
ing of frames", in that each time you do this union
, it will copy the entire table into a new object (unconfirmed, but that's my gut feeling). This might work well for a few but scales very poorly. I suggest you collect all tables in a list and call data.table::rbindlist
once at the end, then insert into a table.
Without your data, I'll contrive a situation. And because I'm not entirely certain if you have just one table per sqlite3 file, I'll add two tables per database. If you only have one, the solution simplifies easily.
for (i in 1:3) {
con <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars_%d.sqlite3", i))
DBI::dbWriteTable(con, "mt1", mtcars[1:3,1:3])
DBI::dbWriteTable(con, "mt2", mtcars[4:5,4:7])
DBI::dbDisconnect(con)
}
(lof <- list.files(pattern = "*.sqlite3", full.names = TRUE))
# [1] "./mtcars_1.sqlite3" "./mtcars_2.sqlite3" "./mtcars_3.sqlite3"
Now I'll iterate over each them and read the contents of a table
allframes <- lapply(lof, function(fn) {
con <- DBI::dbConnect(RSQLite::SQLite(), fn)
mt1 <- tryCatch(DBI::dbReadTable(con, "mt1"),
error = function(e) NULL)
mt2 <- tryCatch(DBI::dbReadTable(con, "mt2"),
error = function(e) NULL)
DBI::dbDisconnect(con)
list(mt1 = mt1, mt2 = mt2)
})
allframes
# [[1]]
# [[1]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
# [[1]]$mt2
# hp drat wt qsec
# 1 110 3.08 3.215 19.44
# 2 175 3.15 3.440 17.02
# [[2]]
# [[2]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
### ... repeated
From here, just combine them in R and write to a new database. While you can use do.call(rbind,...)
or dplyr::bind_rows
, you already mentioned data.table
so I'll stick with that:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all.sqlite3")
DBI::dbWriteTable(con, "mt1", data.table::rbindlist(lapply(allframes, `[[`, 1)))
DBI::dbWriteTable(con, "mt2", data.table::rbindlist(lapply(allframes, `[[`, 2)))
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
DBI::dbDisconnect(con)
In the event that you can't load them all into R at one time, then append them to the table in real-time:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all2.sqlite3")
for (fn in lof) {
con2 <- DBI::dbConnect(RSQLite::SQLite(), fn)
mt1 <- tryCatch(DBI::dbReadTable(con2, "mt1"), error = function(e) NULL)
if (!is.null(mt1)) DBI::dbWriteTable(con, "mt1", mt1, append = TRUE)
mt2 <- tryCatch(DBI::dbReadTable(con2, "mt2"), error = function(e) NULL)
if (!is.null(mt1)) DBI::dbWriteTable(con, "mt2", mt2, append = TRUE)
DBI::dbDisconnect(con2)
}
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
This doesn't suffer the iterative-slowdown that you're experiencing.