r, apache-arrow

r arrow set column type/schema to char for all columns


{arrow}'s auto-detection of column types is causing me some trouble when opening a large CSV file. In particular, it drops leading zeroes from some identifiers and makes other unwanted type conversions. As the dataset is quite wide (a few hundred columns) and I don't want to set every schema value manually, I would like to set the schema programmatically.

A good start would be to convert all columns to character when opening the dataset with arrow::open_dataset, or to correct the existing dataset_connection$schema object for particular columns.

However, I was not able to find out how to do so.


Solution

  • When you use arrow::open_dataset() you can supply a schema that determines the column names and types. The example below first shows the default behaviour of auto-detecting column names and types, and then uses a schema to override the detected types with your own. The schema here is built programmatically, as requested, but you can also define one by hand.

    library(arrow)
    
    write_dataset(mtcars, "mtcars")
    
    # opens the dataset with column detection
    dataset <- open_dataset("mtcars")
    dataset
    #> FileSystemDataset with 1 Parquet file
    #> mpg: double
    #> cyl: double
    #> disp: double
    #> hp: double
    #> drat: double
    #> wt: double
    #> qsec: double
    #> vs: double
    #> am: double
    #> gear: double
    #> carb: double
    #> 
    #> See $metadata for additional Schema metadata
    
    # define new schema automatically
    chosen_schema <- schema(
      purrr::map(names(dataset), ~Field$create(name = .x, type = string()))
    )
    
    # now opens the dataset with the chosen schema
    open_dataset("mtcars", schema = chosen_schema) 
    #> FileSystemDataset with 1 Parquet file
    #> mpg: string
    #> cyl: string
    #> disp: string
    #> hp: string
    #> drat: string
    #> wt: string
    #> qsec: string
    #> vs: string
    #> am: string
    #> gear: string
    #> carb: string
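
The example above uses a Parquet dataset, while the question is about a CSV file. Below is a minimal sketch of the same idea for CSV. The sample file and the names `csv_dir`, `detected` and `all_strings` are made up for illustration. It assumes a reasonably recent {arrow} release in which open_dataset() accepts the readr-style `skip` argument for CSV, and it assumes {dplyr} is installed for collect(). When a schema is supplied, the header row is no longer used for column names, so it is skipped explicitly.

```r
library(arrow)

# Hypothetical sample data: an identifier column with leading zeroes
csv_dir <- tempfile()
dir.create(csv_dir)
writeLines(
  c("id,value", "00123,1.5", "04567,2.5"),
  file.path(csv_dir, "part-0.csv")
)

# First pass: let arrow infer the types, only to collect the column names
detected <- open_dataset(csv_dir, format = "csv")

# Build a schema that maps every detected column name to string
all_strings <- schema(
  lapply(names(detected), function(nm) field(nm, string()))
)

# Second pass: supply the schema and skip the header row,
# since the column names now come from the schema
ds <- open_dataset(csv_dir, format = "csv", schema = all_strings, skip = 1)

# Pull the data into R to check that the identifiers kept their zeroes
# (assumes dplyr is installed)
result <- dplyr::collect(ds)
result$id
```

The first pass only reads enough of the file to infer a schema, so it should stay cheap even for a wide dataset. To keep most of the detected types and change only a few, one option is to start from the fields in `detected$schema` and replace just the problem columns before reopening.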