pandas, python-polars

Error when reading a parquet file with polars which was saved with pandas


I'd like to read a parquet file with polars (0.19.19) that was saved using pandas (2.1.3).

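import pandas as pd
import polars as pl

test_df = pd.DataFrame({"a": [10, 10, 0, 100, 0]})
test_df["b"] = test_df.a.astype("category")  # categorical with integer categories
test_df.to_parquet("test_df.parquet")

test_pl_df = pl.read_parquet("test_df.parquet")  # raises the ComputeError below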

I get this error:

polars.exceptions.ComputeError: only string-like values are supported in dictionaries

How can I read the parquet file with polars?

Reading with pandas first works, but it is clumsy and rules out lazy methods such as scan_parquet.

test_pa_pl_df = pl.from_pandas(pd.read_parquet("test_df.parquet", dtype_backend="pyarrow"))

Solution

  • Strictly speaking, polars cannot read the file in its entirety, because it only supports categorical columns whose underlying values are strings, and column b is a categorical whose categories are integers.

    There is a shorter route than round-tripping through pandas (which itself uses pyarrow). To read the file eagerly, pass use_pyarrow=True:

    test_pl_df = pl.read_parquet("test_df.parquet", use_pyarrow=True)
    

    This simply turns b into a regular integer column.

    If you want a lazy version then you can use a pyarrow dataset like this:

    import pyarrow.dataset as ds
    test_pl_lf = pl.scan_pyarrow_dataset(ds.dataset("test_df.parquet"))
    

    Alternatively, you can scan the file lazily with polars and select only column a, which leaves the unsupported column b out of the query.

    test_pl_lf = pl.scan_parquet("test_df.parquet").select('a')