I'm trying to save a dataframe with a 2D list in each cell to a parquet file. As an example, I created a polars dataframe with a 2D list. As can be seen in the table, the dtype of both columns is list[list[i64]].
│ a ┆ b │
│ --- ┆ --- │
│ list[list[i64]] ┆ list[list[i64]] │
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
In the code below I saved and read the dataframe to check whether it is indeed possible to write and read this dataframe to and from a parquet file.
After this step I created a numpy array from the dataframe. This is where the problem starts. Converting back to a polars dataframe is still possible, even though the dtype of both columns is now object.
│ a ┆ b │
│ --- ┆ --- │
│ object ┆ object │
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
Now, when I try to write this dataframe to a parquet file, the following error pops up: Exception has occurred: PanicException cannot convert object to arrow. This makes sense, because the dtypes are now objects.
I tried using pl.from_numpy(), but it complains about reading 2D arrays. I also tried casting, but casting from an object dtype does not seem to be possible. Creating the dataframe with the previous dtype does not seem to work either.
How can I still write this dataframe to a parquet file, preferably with dtype list[list[i64]]? I need to keep the 2D array structure.
If I build the desired result as a plain list, I am able to write and read it, but not when it is a numpy array.
Proof code:
import polars as pl
import numpy as np
data = {
    "a": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]],
    "b": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]]
}
df = pl.DataFrame(data)
# write the frame and read it back to check the round trip
df.write_parquet('test.parquet')
read_df = pl.read_parquet('test.parquet')
Proof result:
│ a ┆ b │
│ --- ┆ --- │
│ list[list[list[i64]]] ┆ list[list[list[i64]]] │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
Sample code:
import polars as pl
import numpy as np
data = {
    "a": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
    "b": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]
}
df = pl.DataFrame(data)
# write the frame and read it back to check the round trip
df.write_parquet('test.parquet')
read_df = pl.read_parquet('test.parquet')
# stack both frames depth-wise; the cells become numpy object arrays
arr = np.dstack([read_df, df])
# schema={'a': list[list[pl.Int32]], 'b': list[list[pl.Int32]]}
combined = pl.DataFrame(arr.tolist(), schema=df.columns)
# combined.with_column(pl.col('a').cast(pl.List, strict=False).alias('a_list'))
Perhaps there is a simpler approach - but you could do the "stacking" with explode/groupby:
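frames = df, read_df
frames = (
    frame.with_row_count("row").with_columns(pl.lit(n).alias("col"))  # tag row index and source frame
    .explode(pl.exclude("row", "col"))
    for n, frame in enumerate(frames)
)
combined = (
    pl.concat(frames)
    .groupby("row", "col", maintain_order=True)
    .agg(pl.all())  # rebuild each frame's 2D cell
    .groupby("row", maintain_order=True)
    .agg(pl.exclude("col"))  # stack the cells of both frames per row
    .drop("row")
)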
shape: (2, 2)
│ a | b │
│ --- | --- │
│ list[list[list[i64]]] | list[list[list[i64]]] │
│ [[[1], [2], ... [4]], [[1], [2],... | [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... | [[[1], [2], ... [4]], [[1], [2],... │
I thought .concat_list may be of use - but it merges the lists instead of nesting them:
.select(pl.concat_list(["a", "a_right"]))
shape: (8,)
Series: 'a' [list[i64]]
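Another route, offered as a rough sketch and not as a tested recipe: the object dtype appears because arr.tolist() only unwraps the outer object array and leaves the inner cells as NumPy arrays. A small recursive helper could convert every level to plain Python lists before the data reaches pl.DataFrame. The name to_nested_list is made up for this example (it is not a polars or NumPy function), and the helper only assumes that NumPy arrays and scalars expose .tolist():

```python
def to_nested_list(value):
    """Recursively convert array-likes and sequences into plain Python lists.

    Hypothetical helper: anything exposing .tolist() (NumPy arrays and
    NumPy scalars do) is unwrapped first, then lists/tuples are walked so
    that arrays hidden inside object arrays get converted as well.
    """
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, (list, tuple)):
        return [to_nested_list(item) for item in value]
    # plain scalars (int, float, str, None, ...) are returned unchanged
    return value
```

With the arr from the question, something like pl.DataFrame({name: to_nested_list(arr[:, i]) for i, name in enumerate(df.columns)}) should then hand polars plain nested lists from which it can infer a list dtype. That usage line is untested across polars versions.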