
R arrow: how to update a schema


I have multiple .csv files that I am trying to read with arrow::open_dataset() but it is throwing an error due to column type inconsistency.

I found a question that is closely related to my problem, but I am trying a slightly different approach.

  1. I want to use arrow's type autodetection on one sample CSV file, because working out the type of every column by hand is time-consuming.

  2. Then I take the inferred schema and correct the columns that cause problems.

  3. Finally, I use the updated schema to read all the files.

Below is my approach:

    data = read_csv_arrow('data.csv.gz', as_data_frame = FALSE) # has more than 30 columns
    sch = data$schema
    print(sch)
    #> Schema
    #> trade_id: int64
    #> secid: int64
    #> side: int64
    #> ...
    #> nonstd: int64
    #> flags: string

I would like to change the type of the 'trade_id' column from int64 to string and leave the other columns unchanged.

How can I update the schema?

I'm using R arrow, but I guess answers related to pyarrow could also be applicable.


Solution

  • There are a couple of ways to do this: you could print the code that defines the schema and edit it by hand, or you could keep the schema in a variable and update it programmatically.

    library(arrow)
    
    
    # set up an arrow table
    cars_table <- arrow_table(mtcars)
    
    # extract the schema
    sch <- cars_table$schema
    
    # print the code that makes up the schema - you could now copy this and edit it
    sch$code()
    #> schema(mpg = float64(), cyl = float64(), disp = float64(), hp = float64(), 
    #>     drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
    #>     am = float64(), gear = float64(), carb = float64())
    
    # look at an individual element in the schema
    sch[[2]]
    #> Field
    #> cyl: double
    
    # replace this element (here both the field name and its type change)
    sch[[2]] <- Field$create("cylinders", int32())
    sch[[2]]
    #> Field
    #> cylinders: int32
    
    sch$code()
    #> schema(mpg = float64(), cylinders = int32(), disp = float64(), hp = float64(), 
    #>     drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
    #>     am = float64(), gear = float64(), carb = float64())
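
Applied to the case in the question, the same replacement can be done by field name instead of by position, and the edited schema can then be passed to `open_dataset()`. What follows is a minimal sketch, not something run against the original data: the two small CSV files and the names `dir`, `sample_tbl`, `ds` and `result` are made up for illustration. It also assumes a recent arrow release in which `open_dataset()` accepts the readr-style `skip` argument for CSV files; older releases may need a different spelling of that option. The header row has to be skipped when an explicit schema is supplied, otherwise it is read as data.

```r
library(arrow)

# Hypothetical sample data: trade_id looks numeric in one file and textual in the other
dir <- tempfile("trades_")
dir.create(dir)
write.csv(data.frame(trade_id = 1:2, secid = c(10L, 11L)),
          file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(trade_id = c("A3", "B4"), secid = c(12L, 13L)),
          file.path(dir, "b.csv"), row.names = FALSE)

# 1. Let arrow infer the schema from one sample file
sample_tbl <- read_csv_arrow(file.path(dir, "a.csv"), as_data_frame = FALSE)
sch <- sample_tbl$schema

# 2. Override only the problematic field, keeping its name
sch[["trade_id"]] <- Field$create("trade_id", string())

# 3. Read every file with the corrected schema.
#    skip = 1 drops the header row of each file (assumed to be supported
#    by the installed arrow version).
ds <- open_dataset(dir, format = "csv", schema = sch, skip = 1)

# dplyr must be installed for collect() to work on a Dataset
result <- dplyr::collect(ds)
```

Only `trade_id` is overridden here; every other field keeps the type that was inferred from the sample file.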