{arrow}'s auto-detection of column types is causing me some trouble when opening a large CSV file. In particular, it drops leading zeroes from some identifiers and does some other unfortunate things. As the dataset is quite wide (a few hundred columns) and I don't want to set all schema values manually, I would like to set them programmatically.
A good start would be to convert all columns to character when opening the dataset with arrow::open_dataset, or to correct the existing dataset_connection$schema object for particular columns.
However, I was not able to find out how to do so.
When you use arrow::open_dataset() you can manually define a schema, which determines the column names and types. I've pasted an example below. It first shows the default behaviour of auto-detecting column names and types, and then uses a schema to override this and specify your own column names and types. The example builds the schema programmatically, as requested, but you can define a schema by hand too.
library(arrow)
write_dataset(mtcars, "mtcars")
# opens the dataset with column detection
dataset <- open_dataset("mtcars")
dataset
#> FileSystemDataset with 1 Parquet file
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#>
#> See $metadata for additional Schema metadata
# define new schema automatically
chosen_schema <- schema(
purrr::map(names(dataset), ~Field$create(name = .x, type = string()))
)
# now opens the dataset with the chosen schema
open_dataset("mtcars", schema = chosen_schema)
#> FileSystemDataset with 1 Parquet file
#> mpg: string
#> cyl: string
#> disp: string
#> hp: string
#> drat: string
#> wt: string
#> qsec: string
#> vs: string
#> am: string
#> gear: string
#> carb: string
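The example above uses a Parquet dataset, whereas the question is about CSV and about changing only particular columns. Here is a minimal sketch of that case, under a few assumptions: the sample file, the `id` column and the `force_string` vector are hypothetical stand-ins for your data, and dplyr is assumed to be installed for `collect()`. When a schema is supplied for a CSV source, the header row is no longer used for column names, so `skip = 1` is passed to avoid reading it as data (older arrow releases may expect `skip_rows = 1` instead).

```r
library(arrow)

# Hypothetical sample data: `id` has leading zeroes that type detection would drop
csv_dir <- file.path(tempdir(), "ids_csv")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(
  c("id,value", "00123,1.5", "00045,2.5"),
  file.path(csv_dir, "part-0.csv")
)

# First pass: let arrow detect the column names and types
detected <- open_dataset(csv_dir, format = "csv")

# Hypothetical list of columns that should be read as strings
force_string <- c("id")

# Rebuild the schema field by field, swapping the type only where requested
fields <- lapply(detected$schema$fields, function(f) {
  if (f$name %in% force_string) field(f$name, string()) else f
})
fixed_schema <- do.call(schema, fields)

# Second pass: reopen with the explicit schema; skip = 1 skips the header row
ds <- open_dataset(csv_dir, format = "csv", schema = fixed_schema, skip = 1)
result <- dplyr::collect(ds)
```

To force every column to string instead, set `force_string <- names(detected)`. Only the first pass relies on type detection, and it is used just to obtain the column names, so nothing is lost from the identifiers by the time the data is actually read.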