I'm trying to save a dataframe with a 2D list in each cell to a parquet file. As an example, I created a polars dataframe with a 2D list in each cell. As can be seen in the table, the dtype of both columns is list[list[i64]].
┌─────────────────────┬─────────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[list[i64]] ┆ list[list[i64]] │
╞═════════════════════╪═════════════════════╡
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
└─────────────────────┴─────────────────────┘
In the code below I saved and read back the dataframe to check that it is indeed possible to write and read this dataframe to and from a parquet file.
After this step I created a numpy array from the dataframe. This is where the problem starts: converting back to a polars dataframe is still possible, but the dtype of both columns is now object.
┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ object ┆ object │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
└─────────────────────────────────────┴─────────────────────────────────────┘
Now, when I try to write this dataframe to a parquet file, the following error pops up: Exception has occurred: PanicException cannot convert object to arrow, which is indeed true because the dtypes are now objects.
I tried using pl.from_numpy(), but this complains about the 2D arrays. I also tried casting, but casting from an object dtype does not seem possible. Recreating the dataframe with the previous dtype does not seem to work either.
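For reference, this is roughly what those attempts looked like (a sketch based on the sample code below; the cast target pl.List(pl.List(pl.Int64)) and the schema argument are my reconstruction of what was tried):
import numpy as np
import polars as pl

# `df` and `read_df` as built in the sample code at the bottom
arr = np.dstack([read_df, df])  # numpy object array of shape (2, 2, 2)

# attempt 1: pl.from_numpy() - complains about the extra dimensions / object cells
# pl.from_numpy(arr, schema=["a", "b"])

# attempt 2: build the frame and cast the object columns back - the cast fails as well
combined = pl.DataFrame(arr.tolist(), schema=["a", "b"])
# combined.with_columns(pl.col("a").cast(pl.List(pl.List(pl.Int64))))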
Question:
How can I still write this dataframe to a parquet file? Preferably with dtype list[list[i64]]. I need to keep the 2D array structure.
By just creating the desired result as a list I'm able to write and read it, but not when it is a numpy array.
Proof code:
import polars as pl
import numpy as np

data = {
    "a": [[[[1], [2], [3], [4]], [[1], [2], [3], [4]]],
          [[[1], [2], [3], [4]], [[1], [2], [3], [4]]]],
    "b": [[[[1], [2], [3], [4]], [[1], [2], [3], [4]]],
          [[[1], [2], [3], [4]], [[1], [2], [3], [4]]]],
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')
read_df = pl.read_parquet('test.parquet')
print(read_df)
Proof result:
┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[list[list[i64]]] ┆ list[list[list[i64]]] │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
└─────────────────────────────────────┴─────────────────────────────────────┘
Sample code:
import polars as pl
import numpy as np

data = {
    "a": [[[1], [2], [3], [4]], [[1], [2], [3], [4]]],
    "b": [[[1], [2], [3], [4]], [[1], [2], [3], [4]]],
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')
read_df = pl.read_parquet('test.parquet')
print(read_df)  # both columns have dtype list[list[i64]]

arr = np.dstack([read_df, df])  # numpy object array, shape (2, 2, 2)
# schema={'a': list[list[pl.Int32]], 'b': list[list[pl.Int32]]}
combined = pl.DataFrame(arr.tolist(), schema=df.columns)
print(combined)  # both columns now have dtype object
# combined.with_column(pl.col('a').cast(pl.List, strict=False).alias('a_list'))
combined.write_parquet('test_result.parquet')  # PanicException: cannot convert object to arrow
Perhaps there is a simpler approach - but you could do the "stacking" with explode/groupby:
frames = df, read_df

frames = (
    frame.with_columns(col=n)                # tag each frame with its position in the "stack"
         .with_row_count("row")              # remember the original row number
         .explode(pl.exclude("row", "col"))  # one row per inner list
    for n, frame in enumerate(frames)
)

combined = (
    pl.concat(frames)
    .groupby("row", "col", maintain_order=True)  # rebuild list[list[i64]] per frame
    .agg(pl.all())
    .groupby("row", maintain_order=True)         # nest the frames per original row
    .agg(pl.exclude("col"))
    .drop("row")
)
shape: (2, 2)
┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[list[list[i64]]] ┆ list[list[list[i64]]] │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
└─────────────────────────────────────┴─────────────────────────────────────┘
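With the nested dtype restored, writing to parquet should work again, just like the list-built frame in your proof code:
combined.write_parquet('test_result.parquet')
print(pl.read_parquet('test_result.parquet'))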
I thought .concat_list may be of use - but the lists are merged rather than nested:
(df.hstack(read_df.select(pl.all().suffix("_right")))
   .select(pl.concat_list(["a", "a_right"]))
   .limit(1)
   .item())
shape: (8,)
Series: 'a' [list[i64]]
[
[1]
[2]
[3]
[4]
[1]
[2]
[3]
[4]
]