I have been converting CSV files to parquet files using the duckdb::dbExecute()
R function, but am running into a type issue with one of my files. I have a column called code that I want to be VARCHAR, not BIGINT. I cannot seem to override the BIGINT conversion and it throws an error.
dbExecute(con, sprintf("CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s')", csv_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy')", parquet_file))
The error is thrown by the second statement:
Error when converting column "code". Could not convert string "FELH234713" to 'BIGINT'
Column code is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'code': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
I have tried setting the type explicitly, but that gives me a syntax error. I think part of the issue is that the code column contains both values that look like plain integers and values that mix characters and digits. The CSV file is too big to load into memory (~500 GB).
My attempts at setting the type in the COPY statement all throw the same syntax error:
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', types={'code': 'VARCHAR'})", parquet_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', TYPES ('code' 'VARCHAR')))", parquet_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', TYPES (code VARCHAR))", parquet_file))
error:
Error in `dbSendQuery()`:
! rapi_prepare: Failed to extract statements:
Parser Error: syntax error at or near "'VARCHAR'"
LINE 1: ...arquet', CODEC 'snappy', TYPES ('code' 'VARCHAR'))
DuckDB supports a couple of ways to work with this.
Setup:
writeLines(c("a,b","1,A"), "~/quux.csv")
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbGetQuery(conn, "select * from read_csv('~/quux.csv')") |>
str()
# 'data.frame': 1 obs. of 2 variables:
# $ a: num 1
# $ b: chr "A"
You can override individual columns with types=
(ref: https://duckdb.org/docs/stable/data/csv/overview.html):
DBI::dbGetQuery(conn, "select * from read_csv('~/quux.csv', types={'A':'VARCHAR'})") |>
str()
# 'data.frame': 1 obs. of 2 variables:
# $ a: chr "1"
# $ b: chr "A"
The types= option belongs to read_csv/read_csv_auto, not to the COPY ... TO option list, which is why your attempts raise a parser error. This means your view can be defined as
dbExecute(con, sprintf("CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s', types={'code':'VARCHAR'})", csv_file))
If you don't know the column name beforehand, then you can adjust sample_size. It defaults to 20,480 rows (ref: https://duckdb.org/docs/stable/data/csv/auto_detection.html); you can set it to a large enough positive value if you know roughly where the string values appear. You can also disable the limit by setting it to -1, which means the sniffer reads all data before determining column types, at the risk of a performance hit.
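For example, a sketch of the view definition with the sample limit disabled (same csv_file as in your question):
dbExecute(con, sprintf("CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s', sample_size = -1)", csv_file))
With a ~500 GB file that full scan will be slow, so a large positive sample_size that reaches the first non-integer code value may be the better trade-off.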