arrow::to_duckdb() converts int64 columns to double in the DuckDB table. This happens whether the .data being converted is an R data frame or a parquet file. How can I maintain the int64 data type?
Example
library(arrow, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(vroom, warn.conflicts = FALSE)
# tibble with an int64 column
dd <- vroom(I("id\n9007199254740993\n"), col_types = "I", delim = ",")
dd
#> # A tibble: 1 × 1
#> id
#> <int64>
#> 1 9e15
# it's coerced to a double
to_duckdb(dd)
#> # Source: table<arrow_001> [1 x 1]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.5.0:R 4.3.1/:memory:]
#> id
#> <dbl>
#> 1 9.01e15
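The same coercion shows up when the data comes from a parquet file; a minimal sketch of that case (the write_parquet()/open_dataset() round trip below is just my own illustration):
# write the tibble to parquet and open it as an Arrow dataset
tf <- tempfile(fileext = ".parquet")
arrow::write_parquet(dd, tf)
ds <- arrow::open_dataset(tf)
# with the default connection, id again prints as <dbl>
to_duckdb(ds)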
If you look at ?to_duckdb, its con parameter defaults to arrow_duck_connection(), which in turn creates a DuckDB DBI connection with
con <- DBI::dbConnect(duckdb::duckdb())
If you look at ?duckdb::duckdb(), it has a bigint parameter which defaults to "numeric" and is documented as
How 64-bit integers should be returned, default is double/numeric. Set to integer64 for bit64 encoding.
So we can set the con parameter of to_duckdb() to our own DBI connection created with bigint = "integer64":
da <- arrow::arrow_table(id = bit64::as.integer64("9007199254740993"))
da
#> Table
#> 1 rows x 1 columns
#> $id <int64>
# default for comparison
con1 <- DBI::dbConnect(duckdb::duckdb())
# how we want it
con2 <- DBI::dbConnect(duckdb::duckdb(bigint = "integer64"))
# using default connection
arrow::to_duckdb(da)
#> # Source: table<arrow_001> [1 x 1]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.6.0:R 4.3.1/:memory:]
#> id
#> <dbl>
#> 1 9.01e15
# comparison
arrow::to_duckdb(da, con = con1)
#> # Source: table<arrow_002> [1 x 1]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.6.0:R 4.3.1/:memory:]
#> id
#> <dbl>
#> 1 9.01e15
# how we want it
arrow::to_duckdb(da, con = con2)
#> # Source: table<arrow_003> [1 x 1]
#> # Database: DuckDB 0.8.1 [root@Darwin 22.6.0:R 4.3.1/:memory:]
#> id
#> <int64>
#> 1 9e15
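If we then pull the result back into R, the bigint = "integer64" connection should preserve the exact value; a quick check using dplyr::collect(), which works on the lazy dbplyr table that to_duckdb() returns:
# collect the lazy duckdb table into R; with con2 the id column should come
# back as a bit64::integer64 rather than a double, so 9007199254740993 is exact
dplyr::collect(arrow::to_duckdb(da, con = con2))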