
How to re-infer data types on an existing Polars DataFrame?


I have the following problem:

I have a CSV file with faulty values (strings instead of integers) in some rows. To remedy that, I read it into Polars and filter the DataFrame.

To be able to read it at all, I have to set infer_schema_length=0, since otherwise the read fails. This reads every column as a string, though. How can I re-infer the data types/schema of the corrected DataFrame? I'd like to avoid setting the type of every column individually, as there are a lot of them.

I unfortunately can't edit the csv itself.

import polars as pl

# dataset_path is the path to the CSV file
ids_df = pl.read_csv(dataset_path, infer_schema_length=0)

filtered_df = ids_df.filter(~(pl.col("Label") == "Label"))

filtered_df.dtypes

[Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 Utf8,
 ...

Thanks for your help.


Solution

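  • I don't think Polars has this functionality built in yet, but I think I found a workable way to solve your problem: write the filtered DataFrame to an in-memory CSV and read it back, so the schema is inferred again.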

    from io import BytesIO

    import polars as pl

    dataset_path = "./test_data.csv"

    # Read every column as a string so the faulty rows do not break parsing
    ids_df = pl.read_csv(dataset_path, infer_schema_length=0)
    print("ids_df", ids_df)

    # Drop the rows whose "Label" value is the literal string "Label"
    filtered_df = ids_df.filter(~(pl.col("Label") == "Label"))
    print("filtered_df", filtered_df)

    # Save the data to memory as an IO stream
    bytes_io = BytesIO()
    filtered_df.write_csv(bytes_io)
    bytes_io.seek(0)  # rewind the buffer before reading it back

    # Read from the IO stream with infer_schema_length != 0 (the default)
    new_df = pl.read_csv(bytes_io)
    print("new_df", new_df)
    bytes_io.close()
    
