I'm trying to read multiple CSV files from Hadoop with PyArrow and I can't find how to do it.
To give you more context, I have some folders containing multiple CSV files
folder:
I want to read them all, in order, so that I can then produce Parquet files smaller than 150 MB (for example). If all these CSV files together would produce a single 300 MB Parquet file, I want to split it into two files of 150 MB each.
Right now, I only have the code below, but it produces one Parquet file per CSV file, which is not what I'm expecting.
I didn't find anything about this on the internet... The only things I saw:
And I didn't find anything about exporting to multiple Parquet files either...
from pyarrow import csv
from pyarrow import parquet
from pyarrow import fs

# connect to HDFS using the default configuration
hdfs = fs.HadoopFileSystem("default")

# read one CSV file from HDFS into an Arrow table
with hdfs.open_input_file("path") as f:
    table = csv.read_csv(f)

# write the whole table as a single Parquet file
parquet.write_table(table, "newpath")
Many thanks in advance!
But I can't do that, because I need read_options, parse_options and convert_options when reading a CSV.
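For a single file, something like this (the option values here are only examples, not my real settings):

from pyarrow import csv

# example options only -- adjust to the real layout of the CSV files
read_opts = csv.ReadOptions(skip_rows=1)
parse_opts = csv.ParseOptions(delimiter=";")
convert_opts = csv.ConvertOptions(strings_can_be_null=True)

table = csv.read_csv(
    "some_file.csv",
    read_options=read_opts,
    parse_options=parse_opts,
    convert_options=convert_opts,
)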
You can pass a file_format argument to pyarrow.dataset.dataset, of type CsvFileFormat, which can accept parse_options and convert_options.
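For example, here is a minimal sketch assuming a recent PyArrow; the HDFS paths, delimiter and option values are placeholders, and the row limits are only a rough stand-in for the 150 MB target, since write_dataset limits output files by row count rather than by byte size:

import pyarrow.dataset as ds
from pyarrow import csv, fs

hdfs = fs.HadoopFileSystem("default")

# CSV reading options go through the file format (these values are placeholders)
csv_format = ds.CsvFileFormat(
    read_options=csv.ReadOptions(block_size=1 << 20),
    parse_options=csv.ParseOptions(delimiter=";"),
    convert_options=csv.ConvertOptions(strings_can_be_null=True),
)

# read every CSV file under the folder as one logical dataset
dataset = ds.dataset("path/to/csv_folder", format=csv_format, filesystem=hdfs)

# write Parquet, capping the rows per output file so each file stays small enough
ds.write_dataset(
    dataset,
    "path/to/parquet_folder",
    format="parquet",
    filesystem=hdfs,
    max_rows_per_file=1_000_000,
    max_rows_per_group=1_000_000,
)

You would have to estimate how many rows of your data fit in roughly 150 MB of Parquet and set max_rows_per_file accordingly.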