I have the following problem:
I have a CSV file that contains faulty values (strings instead of integers) in some rows. To remedy that, I read it into Polars and filter the DataFrame.
To be able to read it at all, I have to set infer_schema_length=0, since the read would otherwise fail. That reads every column as a string, though. How would I re-infer the data types/schema of the corrected DataFrame? I'd like to avoid setting every column individually, as there are a lot of them.
I unfortunately can't edit the csv itself.
ids_df = pl.read_csv(dataset_path, infer_schema_length=0)
filtered_df = ids_df.filter(~(pl.col("Label") == "Label"))
filtered_df.dtypes
[Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
Utf8,
...
Thanks for your help.
I don't think Polars has this functionality built in yet, but I think I found a valid way to solve your problem: write the filtered DataFrame to an in-memory CSV and read it back, letting read_csv infer the schema again.
from io import BytesIO
import polars as pl
dataset_path = "./test_data.csv"
ids_df = pl.read_csv(dataset_path, infer_schema_length=0)
print("ids_df",ids_df)
filtered_df = ids_df.filter(~(pl.col("Label") == "Label"))
print("filtered_df", filtered_df)
# Write the filtered data to an in-memory buffer as CSV
bytes_io = BytesIO()
filtered_df.write_csv(bytes_io)
bytes_io.seek(0)  # rewind so the subsequent read starts at the beginning
# Read back from the IO stream with the default infer_schema_length,
# so the column dtypes are re-inferred
new_df = pl.read_csv(bytes_io)
print("new_df", new_df)
bytes_io.close()