duckdb
comes with a IMPORT DATABASE 'folder'
feature.
First, a schema.sql
is executed. This file contains a lot of CREATE TABLE
statements.
Then, a load.sql
file is executed. This file contains a lot of COPY
statements that import, e.g. CSV files.
I wonder if I can verify that the headers of the CSV file are the same as the field names in the CREATE TABLE
statements. I am using R, so a minimal example looks like this:
# write example schema.sql, load.sql, and iris.csv
folder <- tempdir()
curr_wd <- getwd()
#on.exit(setwd(curr_wd))
setwd(folder)
dir.create("dbdump") |> suppressWarnings()
write.csv(iris, "dbdump/iris.csv", row.names=FALSE)
writeLines(paste(
"CREATE TABLE iris(",
" sepal_length DOUBLE,",
" sepal_width DOUBLE,",
" petal_length DOUBLE,",
" petal_width DOUBLE,",
" species VARCHAR);", sep="\n"), "dbdump/schema.sql"
)
writeLines(
"COPY iris FROM 'iris.csv' (DELIMITER ',', HEADER);",
"dbdump/load.sql"
)
Now I can import the data by
library(duckdb)
con <- DBI::dbConnect(drv=duckdb::duckdb(), dbdir=":memory:")
#on.exit(DBI::dbDisconnect(con, shutdown=TRUE))
DBI::dbExecute(con, "IMPORT DATABASE 'dbdump';")
If I check the column names,
DBI::dbReadTable(con, "iris") |> str()
#'data.frame': 150 obs. of 5 variables:
# $ sepal_length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ sepal_width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ petal_length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ petal_width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ species : chr "setosa" "setosa" "setosa" "setosa" ...
The field names from the CREATE TABLE
statement are used. However, the iris.csv
has a different header:
colnames(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
# [5] "Species"
I wish I could get a warning in this case. Especially, I worry that the column order of the CSV is different from the CREATE TABLE
statement and goes undetected if the type is the same.
Do I have to program this check manually or is there a more convenient way?
Just for reference, I ended up coding my own test. In the example of the question:
writeLines(paste(
"stopifnot(",
" all(colnames(read.csv('dbdump/iris.csv', nrows=1)) ==",
" c('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'))",
")", sep="\n"), "dbdump/check_headers.R"
)
and then, before calling IMPORT DATABASE
:
source("dbdump/check_headers.R")
I learned from the comments that SQL offers explicit column naming, so my load.sql
now looks like this:
writeLines(paste(
"COPY iris(sepal_length, sepal_width, petal_length, petal_width, species)",
"FROM 'iris.csv' (DELIMITER ',', HEADER);"), "dbdump/load.sql"
)