python, python-polars

How to handle inconsistent columns (ragged rows) in a delimited file using Polars?


I am working with a legacy system that generates delimited files (e.g., CSV), but the number of columns in these files is inconsistent across rows (ragged rows). I am reading the file from ADLS with Polars, but I'm encountering an issue that depends on the structure of the second row of the file (with skip_rows=1 and has_header=False, the second row is the first row Polars actually reads, so it determines the expected number of columns).

pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, infer_schema_length=None, ignore_errors=True)
  • If the second row has more columns than subsequent rows, Polars reads the file successfully and fills the missing values in subsequent rows with null.
  • However, if any subsequent row has more columns than the second row, I get the following exception:

ComputeError: found more fields than defined in 'Schema' Consider setting 'truncate_ragged_lines=True'.

Is there a way to handle such cases dynamically in Polars, or do I need to preprocess the file to fix these inconsistencies before reading? Any alternative approaches or solutions to this problem would be appreciated!

Example Data - Failure

ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne

Example Data - Success

ID,Name,Age
2,Jane,35,California,USA
1,John,28
3,Emily,22
4,Michael,40,Australia,Melbourne
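
Note: the truncate_ragged_lines=True setting suggested by the error message does make the failure example load, but it silently drops the extra fields on the wider rows, which is not what I want. A minimal sketch of that behavior:

    import io
    import polars as pl

    data = """ID,Name,Age
    1,John,28
    2,Jane,35,California,USA
    3,Emily,22
    4,Michael,40,Australia,Melbourne"""

    # The wider rows load, but their extra fields (California, USA, ...)
    # are silently truncated away.
    df = pl.read_csv(io.StringIO(data), truncate_ragged_lines=True)
    print(df.shape)  # (4, 3) - only the ID, Name and Age columns survive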

Solution

  • Read the file in as a single column by setting the separator to a (hopefully) unused UTF-8 character with no header, then use .str.split followed by .list.to_struct and unnest to allow a dynamic number of columns. Finally, rename the columns using the values from the first row and slice that row off.

    import polars as pl
    import io
    from warnings import catch_warnings, filterwarnings

    input_file = io.StringIO("""ID,Name,Age
    1,John,28
    2,Jane,35,California,USA
    3,Emily,22
    4,Michael,40,Australia,Melbourne""")

    with catch_warnings():
        # Suppress the warning from `to_struct`, which wants explicit field names.
        filterwarnings("ignore")
        df = (
            # "\x00" should (hopefully) never occur in the data, so each line
            # is read whole into a single string column named "column_1".
            pl.read_csv(input_file, separator="\x00", has_header=False)
            .with_columns(
                pl.col("column_1")
                .str.split(",")
                # "max_width" sizes the struct to the widest row in the file.
                .list.to_struct(n_field_strategy="max_width")
            )
            .unnest("column_1")
        )

    # Use the header row's values as column names where they exist, then
    # drop the header row from the data.
    df = df.rename({x: y for x, y in zip(df.columns, df.row(0)) if y is not None})
    df = df.slice(1)
    
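    At this point the column names are the header values where they existed, and the overflow columns keep their auto-generated struct field names (a quick check, assuming the default to_struct naming):

    print(df.columns)
    # e.g. ['ID', 'Name', 'Age', 'field_3', 'field_4']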

    Now you've got a df of all strings. You could loop over all the columns and try to cast each one (see the sketch after the next block), but it turns out that is slower (at least in a few tests that I did) than writing the existing df to a CSV and then rereading it to force Polars's auto-infer mechanism.

    from tempfile import NamedTemporaryFile

    with NamedTemporaryFile() as ff:
        df.write_csv(ff)
        ff.seek(0)
        # Rereading lets Polars infer proper dtypes for each column.
        df = pl.read_csv(ff)
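
    For comparison, the per-column cast loop mentioned above might look like the following rough sketch. Int64 here is just an illustrative target dtype, and since the exact exception type raised on a failed cast varies across Polars versions, this catches broadly:

    # Try to cast each column; keep it as strings when the cast fails.
    for name in df.columns:
        try:
            df = df.with_columns(pl.col(name).cast(pl.Int64))
        except Exception:
            pass  # non-numeric values - leave the column as-is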
    

    If you've got enough memory, replacing the tempfile with an io.BytesIO() will be even faster, as in the sketch below.
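
    A minimal in-memory version of the same round trip:

    import io

    buf = io.BytesIO()
    df.write_csv(buf)
    buf.seek(0)
    df = pl.read_csv(buf)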