In Polars, how can one specify a single dtype for all columns in read_csv?
According to the docs, the schema_overrides argument to read_csv can take either a mapping (dict) in the form of {'column_name': dtype}, or a list of dtypes, one for each column.
However, it is not clear how to specify "I want all columns to be a single dtype".
If you wanted all columns to be String, for example, and you knew the total number of columns, you could do:
pl.read_csv('sample.csv', schema_overrides=[pl.String]*number_of_columns)
However, this doesn't work if you don't know the total number of columns. In Pandas, you could do something like:
pd.read_csv('sample.csv', dtype=str)
But this doesn't work in Polars.
Reading all the data in a CSV as any dtype other than pl.String will likely fail with a lot of null values. We can use expressions to declare how we want to deal with those null values (see the sketch after the cast example below).
If you read a CSV with infer_schema_length=0, Polars does not infer the schema and reads all columns as pl.String, since that is a supertype of all Polars types.
Once everything is read as String, we can use expressions to cast all columns to the type we actually want.
(
    pl.read_csv("test.csv", infer_schema_length=0)
    .with_columns(pl.all().cast(pl.Int32, strict=False))
)
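As a minimal sketch of the follow-up step mentioned above, the null values left behind by the non-strict cast can be handled with another expression, for example fill_null (the file name and fill value here are placeholders, not part of the original question):

import polars as pl

df = (
    pl.read_csv("test.csv", infer_schema_length=0)          # everything read as pl.String
    .with_columns(pl.all().cast(pl.Int32, strict=False))    # values that fail to parse become null
    .with_columns(pl.all().fill_null(0))                    # e.g. replace those nulls with 0
)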
Update: infer_schema=False was added in 1.2.0 as a more user-friendly name for this feature.
pl.read_csv("test.csv", infer_schema=False) # read all as pl.String
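The earlier cast example could then be written with the newer argument name instead (same placeholder file name and assumed Int32 target as above):

(
    pl.read_csv("test.csv", infer_schema=False)              # read all columns as pl.String
    .with_columns(pl.all().cast(pl.Int32, strict=False))     # then cast to the desired dtype
)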