
How to correct mixed column types in CSV files when using pyarrow write_dataset to write Parquet?


I am currently using pyarrow to read a bunch of .csv files from a directory into a dataset like so:

import pyarrow.dataset as ds

# create dataset from csv files
dataset = ds.dataset(input_path,
                     format="csv",
                     exclude_invalid_files=True)

After creating the dataset I write it to parquet format like so:

ds.write_dataset(dataset,
                 format="parquet",
                 base_dir=output_path,
                 basename_template="name_data_{i}.parquet",
                 existing_data_behavior="overwrite_or_ignore")
 

I use this for two datasets. It works perfectly well for the first one, but for the second one I receive an error:

ArrowInvalid: In CSV column #14: Row #111060: CSV conversion error to null: invalid value '0'  

As I understand it, PyArrow does not like integer values ("0") in my string columns. If this is the only violation, is there a way to correct it explicitly when creating the dataset? For example, I would like to replace "0" with "unknown" at read time.

This would be very nice as I do not want to correct the mistakes in an additional function beforehand. The data can be found here. For the yellow taxis there are no problems. The problem occurs when reading csv files for the green taxis.

If I define the schema, will the error be solved? Will it understand that it should treat "0" as a string?


Solution

  • My understanding is that in most files the 14th column (ehail_fee) is empty.

    When loading the CSV dataset, Arrow guesses the type of each column from the first file it opens. At that point it assumes that the 14th column is of type pyarrow.null(). When it later finds a file that contains a non-empty value for that column, it throws an error.

    If I define the schema, will the error be solved? Will it understand that it should treat "0" as a string?

    That should work (but I think the column should be pyarrow.float64(), not pyarrow.string(); note that there is no pyarrow.float()).