
Polars: Specify dtypes for all columns at once in read_csv


In Polars, how can one specify a single dtype for all columns in read_csv?

According to the docs, the schema_overrides argument to read_csv can take either a mapping (dict) in the form of {'column_name': dtype}, or a list of dtypes, one for each column. However, it is not clear how to specify "I want all columns to be a single dtype".
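
For reference, the mapping form lets you override specific columns by name (the column names below are just placeholders):

pl.read_csv('sample.csv', schema_overrides={'col_a': pl.String, 'col_b': pl.Int64})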

If you wanted all columns to be String, for example, and you knew the total number of columns, you could do:

pl.read_csv('sample.csv', schema_overrides=[pl.String]*number_of_columns)

However, this doesn't work if you don't know the total number of columns. In Pandas, you could do something like:

pd.read_csv('sample.csv', dtype=str)

But this doesn't work in Polars.


Solution

  • Reading all the data in a CSV as any type other than pl.String is likely to fail, producing a lot of null values. We can use expressions to declare how we want to deal with those null values, as sketched below.

    If you read a CSV with infer_schema_length=0, Polars does not infer the schema and reads all columns as pl.String, since that is a supertype of all Polars types.

    Once everything is read as String, we can use expressions to cast all columns:

    (pl.read_csv("test.csv", infer_schema_length=0)
       .with_columns(pl.all().cast(pl.Int32, strict=False))
    )
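
    From there, a hypothetical cleanup step can fill the nulls produced by failed casts, for example with a default value:

    (pl.read_csv("test.csv", infer_schema_length=0)
       .with_columns(pl.all().cast(pl.Int32, strict=False))
       .with_columns(pl.all().fill_null(0))  # example: replace failed parses with 0
    )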
    

    Update: infer_schema=False was added in Polars 1.2.0 as a more user-friendly name for this feature.

    pl.read_csv("test.csv", infer_schema=False) # read all as pl.String
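
    Putting it together, a minimal self-contained sketch (the file contents and the Int32 target dtype are illustrative):

    import io

    import polars as pl

    # A hypothetical CSV: column "a" is numeric, column "b" is not
    data = io.StringIO("a,b\n1,x\n2,y\n")

    # With inference disabled, every column comes back as pl.String
    df = pl.read_csv(data, infer_schema=False)
    print(df.dtypes)  # [String, String]

    # Cast everything, turning unparseable values into nulls
    print(df.with_columns(pl.all().cast(pl.Int32, strict=False)))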