Error when reading a parquet file with polars which was saved with pandas

I'd like to read a parquet file with polars (0.19.19) that was saved using pandas (2.1.3).

import pandas as pd
import polars as pl

test_df = pd.DataFrame({"a": [10, 10, 0, 100, 0]})
test_df["b"] = test_df.a.astype("category")
test_df.to_parquet("test_df.parquet")

test_pl_df = pl.read_parquet("test_df.parquet")

I get this error:

polars.exceptions.ComputeError: only string-like values are supported in dictionaries

How can I read the parquet file with polars?

Reading with pandas first works, but seems rather ugly and does not allow lazy methods such as scan_parquet.

test_pa_pl_df = pl.from_pandas(pd.read_parquet("test_df.parquet", dtype_backend="pyarrow"))

Solution

Strictly speaking, you can't read the whole file with polars' native reader, because polars only supports categorical columns whose categories are strings. Here the categories of b are integers, so the dictionary-encoded column raises the error above.

There is a shorter route than round-tripping through pandas (which itself uses pyarrow to read the file). To read it eagerly, let polars use pyarrow directly:

test_pl_df = pl.read_parquet("test_df.parquet", use_pyarrow=True)

and it will just turn b into a regular integer column.

If you want a lazy version then you can use a pyarrow dataset like this:

import pyarrow.dataset as ds
test_pl_lf = pl.scan_pyarrow_dataset(ds.dataset("test_df.parquet"))

Alternatively, if you don't need b, you can scan the file lazily with polars' native reader and select only column a, which leaves b out:

test_pl_lf = pl.scan_parquet("test_df.parquet").select('a')