I am working with a legacy system that generates delimited files (e.g., CSV), but the number of columns in these files is inconsistent across rows (ragged rows). I am reading the file from ADLS with Polars, and whether the read succeeds turns out to depend on the structure of the second row of the file.
pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, infer_schema_length=None, ignore_errors=True)
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
Is there a way to handle such cases dynamically in Polars, or do I need to preprocess the file to fix these inconsistencies before reading? Any alternative approaches or solutions to this problem would be appreciated!
Example Data - Failure
ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne
Example Data - Success
ID,Name,Age
2,Jane,35,California,USA
1,John,28
3,Emily,22
4,Michael,40,Australia,Melbourne
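Note that the truncate_ragged_lines=True suggested by the error message isn't a fix if you need the extra fields, since it silently drops them. A minimal illustration using the failure data above:

pl.read_csv("sample.csv", truncate_ragged_lines=True)
# shape: (4, 3) - "California", "USA", "Australia", "Melbourne" are discarded

To keep every field instead: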
Read it in as a single column by setting the separator to a (hopefully) unused UTF-8 character, with no header, then use .str.split followed by .list.to_struct and unnest to allow a dynamic number of columns. Finally, rename the columns from the first row and slice it off.
import polars as pl
import io
from warnings import catch_warnings, filterwarnings

input_file = io.StringIO("""ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne""")
input_file.seek(0)

with catch_warnings():
    # suppress the warning from `to_struct`, which wants explicit field names
    filterwarnings("ignore")
    df = (
        pl.read_csv(input_file, separator="\x00", has_header=False)
        .with_columns(
            pl.col("column_1")
            .str.split(",")
            .list.to_struct(n_field_strategy="max_width")
        )
        .unnest("column_1")
    )

# the first row holds the real header: use it to rename, then drop it
df = df.rename({x: y for x, y in zip(df.columns, df.row(0)) if y is not None})
df = df.slice(1)
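At this point df should look something like the following (columns beyond the header keep the generated to_struct field names, since the first row has null there):

shape: (4, 5)
┌─────┬─────────┬─────┬────────────┬───────────┐
│ ID  ┆ Name    ┆ Age ┆ field_3    ┆ field_4   │
│ --- ┆ ---     ┆ --- ┆ ---        ┆ ---       │
│ str ┆ str     ┆ str ┆ str        ┆ str       │
╞═════╪═════════╪═════╪════════════╪═══════════╡
│ 1   ┆ John    ┆ 28  ┆ null       ┆ null      │
│ 2   ┆ Jane    ┆ 35  ┆ California ┆ USA       │
│ 3   ┆ Emily   ┆ 22  ┆ null       ┆ null      │
│ 4   ┆ Michael ┆ 40  ┆ Australia  ┆ Melbourne │
└─────┴─────────┴─────┴────────────┴───────────┘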
Now you've got a df of all strings. You could loop over the columns and try casting each one, but it turns out that's slower (at least in a few tests I did) than writing the existing df to a CSV and re-reading it to trigger Polars's schema inference.
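For reference, that cast loop would look roughly like this (a sketch: the exact exception class raised by a failed strict cast has varied across Polars versions, so this catches broadly):

# slower alternative: probe each column with progressively looser dtypes
for col in df.columns:
    for dtype in (pl.Int64, pl.Float64):
        try:
            df = df.with_columns(pl.col(col).cast(dtype))
            break  # first dtype that casts cleanly wins; otherwise stay str
        except Exception:  # cast failure; exception class varies by version
            pass

The write-and-reread approach: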
from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as ff:
    df.write_csv(ff)
    ff.seek(0)
    df = pl.read_csv(ff)
If you've got enough memory, then replacing the tempfile with an io.BytesIO() will be even faster.
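That in-memory variant would look roughly like:

import io

buf = io.BytesIO()
df.write_csv(buf)
buf.seek(0)  # rewind so read_csv starts from the beginning
df = pl.read_csv(buf)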