
Read multiple CSV files with PyArrow


I'm trying to read multiple CSV files from Hadoop with PyArrow, and I can't find out how to do it.

To give you more context, I have folders that each contain multiple CSV files:

folder:

  • file0
  • file1
  • file2
  • etc.

And I want to read them all, in order, so that I can then produce Parquet files smaller than 150 MB (for example). If all these files together would produce a 300 MB Parquet file, I want to split it into two files of 150 MB each.

Right now I have the code below, but it produces one Parquet file per CSV, which is not what I expect.

I didn't find anything on the internet. The only things I saw were:

  • reading with a pyarrow dataset, but I can't, because I need read_options, parse_options and convert_options when reading a CSV;
  • using open_input_stream rather than open_input_file, but I don't know what the difference is.

And I didn't find anything about exporting to multiple Parquet files either.

from pyarrow import csv
from pyarrow import parquet
from pyarrow import fs

hdfs = fs.HadoopFileSystem("default")

# Read a single CSV file from HDFS into an Arrow table...
with hdfs.open_input_file("path") as f:
    csv_file = csv.read_csv(f)

# ...then write that table out as a single Parquet file.
parquet.write_table(csv_file, "newpath")

Many thanks in advance!


Solution

  • but I can't, because I need read_options, parse_options and convert_options when reading a CSV

    You can pass a format argument to pyarrow.dataset.dataset, of type CsvFileFormat, which accepts read_options, parse_options and convert_options.
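
    A minimal sketch of what that can look like, assuming a recent PyArrow (>= 7); the path "/data/folder" and the option values below are placeholders I've made up, not something from the question:

import pyarrow.dataset as ds
from pyarrow import csv
from pyarrow import fs

hdfs = fs.HadoopFileSystem("default")

# Bundle the CSV reading options into the file format object.
csv_format = ds.CsvFileFormat(
    read_options=csv.ReadOptions(block_size=1 << 20),
    parse_options=csv.ParseOptions(delimiter=","),
    convert_options=csv.ConvertOptions(strings_can_be_null=True),
)

# The dataset discovers every file under the directory, so all the
# CSV files in the folder are read together as one logical table.
dataset = ds.dataset("/data/folder", format=csv_format, filesystem=hdfs)
table = dataset.to_table()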
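
    For the other half of the question, producing Parquet files of at most roughly 150 MB, one possible approach (a sketch assuming PyArrow >= 8, not the only way) is pyarrow.dataset.write_dataset, which can cap the number of rows per output file. It limits output in rows rather than bytes, so you would first estimate how many rows of your data fit in about 150 MB:

# max_rows is a placeholder: tune it so one output file stays under
# ~150 MB for your data; max_rows_per_group must not exceed
# max_rows_per_file.
max_rows = 1_000_000

ds.write_dataset(
    dataset,            # the CSV dataset from above
    "/data/output",     # hypothetical output directory on HDFS
    format="parquet",
    filesystem=hdfs,
    max_rows_per_file=max_rows,
    max_rows_per_group=max_rows,
)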