Search code examples
sqlrduckdb

Check CSV headers in IMPORT DATABASE


duckdb comes with a IMPORT DATABASE 'folder' feature.

First, a schema.sql is executed. This file contains a lot of CREATE TABLE statements.

Then, a load.sql file is executed. This file contains a lot of COPY statements that import, e.g. CSV files.

I wonder if I can verify that the headers of the CSV file are the same as the field names in the CREATE TABLE statements. I am using R, so a minimal example looks like this:

    # write example schema.sql, load.sql, and iris.csv
    folder <- tempdir()
    curr_wd <- getwd()
    #on.exit(setwd(curr_wd))
    setwd(folder)
    
    dir.create("dbdump") |> suppressWarnings()
    write.csv(iris, "dbdump/iris.csv", row.names=FALSE)
    
    writeLines(paste(
        "CREATE TABLE iris(",
        "  sepal_length DOUBLE,",
        "  sepal_width DOUBLE,",
        "  petal_length DOUBLE,", 
        "  petal_width DOUBLE,", 
        "  species VARCHAR);", sep="\n"), "dbdump/schema.sql"
    )
    
    writeLines(
        "COPY iris FROM 'iris.csv' (DELIMITER ',', HEADER);", 
        "dbdump/load.sql"
    )

Now I can import the data by

    library(duckdb)
    con <- DBI::dbConnect(drv=duckdb::duckdb(), dbdir=":memory:")
    #on.exit(DBI::dbDisconnect(con, shutdown=TRUE))
    DBI::dbExecute(con, "IMPORT DATABASE 'dbdump';")

If I check the column names,

    DBI::dbReadTable(con, "iris") |> str()
    #'data.frame':   150 obs. of  5 variables:
    # $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    # $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    # $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    # $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    # $ species     : chr  "setosa" "setosa" "setosa" "setosa" ...

The field names from the CREATE TABLE statement are used. However, the iris.csv has a different header:

    colnames(iris)
    # [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
    # [5] "Species"  

I wish I could get a warning in this case. Especially, I worry that the column order of the CSV is different from the CREATE TABLE statement and goes undetected if the type is the same.

Do I have to program this check manually or is there a more convenient way?


Solution

  • Just for reference, I ended up coding my own test. In the example of the question:

    writeLines(paste(
        "stopifnot(",
        " all(colnames(read.csv('dbdump/iris.csv', nrows=1)) ==",
        " c('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'))",
        ")", sep="\n"), "dbdump/check_headers.R"
    )
    

    and then, before calling IMPORT DATABASE:

    source("dbdump/check_headers.R")
    

    I learned from the comments that SQL offers explicit column naming, so my load.sql now looks like this:

    writeLines(paste(
        "COPY iris(sepal_length, sepal_width, petal_length, petal_width, species)",
        "FROM 'iris.csv' (DELIMITER ',', HEADER);"), "dbdump/load.sql"
    )