Tags: python, csv, parquet

Splitting a large CSV file and converting into multiple Parquet files - Safe?


I learned that the Parquet file format stores a bunch of metadata and uses various compression techniques to store data efficiently, both in terms of size and query speed.

And it can generate multiple files from a single input, for example from a Pandas dataframe.

Now, I have a large CSV file and I want to convert it into the Parquet format. Naively, I would remove the header (storing it elsewhere for later), chunk the file into blocks of n lines, and then turn each chunk into Parquet (here in Python):

import pyarrow.csv
import pyarrow.parquet

table = pyarrow.csv.read_csv(fileName)
pyarrow.parquet.write_table(table, fileName.replace('.csv', '.parquet'))
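For illustration, here is a rough sketch of what I mean, this time using pandas with a chunk size of one million rows (the file names and the chunk size are just placeholders); pandas keeps the header's column names on every chunk, so each Parquet file would end up with the same columns:

import pandas as pd

# Read the CSV in row chunks; every chunk is a DataFrame that carries
# the column names from the header line.
for i, chunk in enumerate(pd.read_csv('large.csv', chunksize=1_000_000)):
    chunk.to_parquet(f'part_{i:04d}.parquet', index=False)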

I guess the exact method doesn't matter much. From what I see, at least with a small test data set and no extra options, I get one Parquet file per CSV file (1:1).

For now that is all I need, as I am not running queries against the whole logical data set. I use the raw files as input to a further cleaning step that is convenient to do with the CSV format. And I haven't yet tried reading the files back...

Do I at least have to re-add the header to each CSV chunk?

Is this as straightforward as I think, or am I missing something?


Solution

  • When creating a Parquet dataset with multiple files, all the files should have a matching schema. In your case, when you split the CSV file into multiple Parquet files, you will have to include the CSV header in each chunk so that every chunk produces a valid Parquet file with the same schema (see the first sketch below).

    Note that Parquet is a compressed format (with a high compression ratio), so the Parquet data will be much smaller than the CSV data. On top of that, applications that read Parquet usually prefer a few large Parquet files over many small ones, so consider streaming everything into a single file instead (see the second sketch below).
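A minimal sketch of the header point, assuming the chunks were written without a header line and that the original header was saved somewhere as a list of column names (the file and column names below are placeholders): instead of re-adding the header text to each chunk, you can pass the saved names to pyarrow, which then parses every chunk with the same columns.

import pyarrow.csv
import pyarrow.parquet

# The header you stored elsewhere, supplied explicitly so every chunk
# is parsed with the same column names (and therefore the same schema).
column_names = ['id', 'name', 'value']
read_options = pyarrow.csv.ReadOptions(column_names=column_names)

table = pyarrow.csv.read_csv('chunk_0001.csv', read_options=read_options)
pyarrow.parquet.write_table(table, 'chunk_0001.parquet')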
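And if a single large Parquet file works for your downstream readers, you can stream the CSV into one file with pyarrow's batch reader and ParquetWriter; again a hedged sketch with placeholder file names:

import pyarrow
import pyarrow.csv
import pyarrow.parquet

# Stream the CSV in record batches and append them all to one Parquet file.
reader = pyarrow.csv.open_csv('large.csv')
writer = None
try:
    for batch in reader:
        if writer is None:
            # The schema comes from the first batch, i.e. from the CSV header.
            writer = pyarrow.parquet.ParquetWriter('large.parquet', batch.schema)
        writer.write_table(pyarrow.Table.from_batches([batch]))
finally:
    if writer is not None:
        writer.close()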