I have a large number of CSV files (~100,000), some of which are themselves very large (i.e., >128 GB), and I am trying to convert them to Parquet files. The files contain a mix of character, numeric, and date data stored in CSV format.
I am having trouble converting them for two reasons: 1) when the scan/sink call works, the resulting files are ~10x the size of their CSV versions; 2) the call frequently fails because schema inference cannot settle on consistent data types.
So, my question is: how can I force the scan/sink call to read every column as a character (string) type?
My code is as follows:
import os
import polars as pl

dir_list = os.listdir()
for filename in dir_list:
    if ".txt" in filename:
        pl.scan_csv(filename, separator="|").sink_parquet(
            filename.replace(".txt", ".parquet"),
            type_coercion=False,
            compression="zstd",
            compression_level=11,
        )
When this runs, a column that looks like one data type in the first batch gets that type inferred. If a later batch then contains something inconsistent with it, an error is thrown.
Given the file sizes, I cannot guarantee that any column will always contain the same data type. So I want to force the data type of every column to be character, and then deal with the problem columns and switch them to numeric/date types later. How do I do that?
Thanks for any help.
Regards, James
You can pass infer_schema=False to pl.scan_csv to read all columns in as String.
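For example, here is a minimal sketch of your loop with schema inference disabled (assuming a reasonably recent Polars version, where scan_csv accepts the infer_schema argument):

import os
import polars as pl

for filename in os.listdir():
    if ".txt" in filename:
        # infer_schema=False reads every column as String, so a later batch
        # can never contradict an earlier inferred dtype.
        (
            pl.scan_csv(filename, separator="|", infer_schema=False)
            .sink_parquet(
                filename.replace(".txt", ".parquet"),
                compression="zstd",
                compression_level=11,
            )
        )

Everything lands in the Parquet files as strings, and you can cast the problem columns to numeric/date types in a later pass, as you planned.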