I have a large number of CSV files (~100,000), some of which are themselves very large (i.e., >128 GB), and I am trying to convert them to Parquet files. The files contain a mix of character, numeric, and date data stored in CSV format.
I am having trouble converting them for two reasons: 1) when the scan/sink call works, the resulting files are ~10x the size of their CSV versions; 2) the call frequently fails because schema inference cannot settle on consistent data types.
So, my question is: how can I force the scan/sink call to read every column as a character (string) type?
My code is as follows:
import os
import polars as pl

dir_list = os.listdir()
for filename in dir_list:
    if ".txt" in filename:
        pl.scan_csv(filename, separator="|").sink_parquet(
            filename.replace(".txt", ".parquet"),
            type_coercion=False,
            compression="zstd",
            compression_level=11,
        )
When this runs, a column that looks like one data type in the first batch gets that type inferred. If a later batch then contains something inconsistent with it, an error is thrown.
Given the file sizes, I cannot guarantee that any column will always contain the same data type. So I want to force the data type of every column to be character, and then deal with the problem columns and switch them to numeric/date types later. How do I do that?
Thanks for any help.
Regards, James
You can pass infer_schema=False to pl.scan_csv to read all columns in as String.
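For example, here is a minimal sketch of your loop with schema inference disabled (assuming a reasonably recent Polars version, where scan_csv accepts the infer_schema argument):

import os
import polars as pl

for filename in os.listdir():
    if ".txt" in filename:
        # infer_schema=False reads every column as String, so a later batch
        # can never contradict an earlier inferred dtype.
        (
            pl.scan_csv(filename, separator="|", infer_schema=False)
            .sink_parquet(
                filename.replace(".txt", ".parquet"),
                compression="zstd",
                compression_level=11,
            )
        )

Everything lands in the Parquet files as strings, and you can cast the problem columns to numeric/date types in a later pass, as you planned.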