I'd like to read a parquet file with polars (0.19.19) that was saved using pandas (2.1.3).
import pandas as pd
import polars as pl

test_df = pd.DataFrame({"a": [10, 10, 0, 100, 0]})
test_df["b"] = test_df.a.astype("category")
test_df.to_parquet("test_df.parquet")
test_pl_df = pl.read_parquet("test_df.parquet")
I get this error:
polars.exceptions.ComputeError: only string-like values are supported in dictionaries
How can I read the parquet file with polars?
Reading with pandas first works, but seems rather ugly and does not allow lazy methods such as scan_parquet.
test_pa_pl_df = pl.from_pandas(pd.read_parquet("test_df.parquet", dtype_backend="pyarrow"))
Strictly speaking, you can't read it with polars (at least not in its entirety), because polars only supports categorical columns whose underlying dtype is a string. Here the categories of b are integers, which is what the error is complaining about.
There is a better shortcut than round tripping through pandas (which is itself using pyarrow). To read it eagerly you can just do:
test_pl_df = pl.read_parquet("test_df.parquet", use_pyarrow=True)
and it will just turn b into a regular integer column.
If you want a lazy version then you can use a pyarrow dataset like this:
import pyarrow.dataset as ds
test_pl_lf = pl.scan_pyarrow_dataset(ds.dataset("test_df.parquet"))
Alternatively, you can lazy-load it with polars and then drop the b column:
test_pl_lf = pl.scan_parquet("test_df.parquet").select('a')