Tags: python, dataframe, numpy, parquet, python-polars

Numpy array to list of lists in polars dataframe


I'm trying to save a dataframe with a 2D list in each cell to a parquet file. As an example I created a polars dataframe with a 2D list. As can be seen in the table, the dtype of both columns is list[list[i64]].

┌─────────────────────┬─────────────────────┐
│ a                   ┆ b                   │
│ ---                 ┆ ---                 │
│ list[list[i64]]     ┆ list[list[i64]]     │
╞═════════════════════╪═════════════════════╡
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
│ [[1], [2], ... [4]] ┆ [[1], [2], ... [4]] │
└─────────────────────┴─────────────────────┘

In the code below I wrote and read back the dataframe to check that it is indeed possible to write this dataframe to a parquet file and read it from there again.

After this step I created a numpy array from the dataframe. This is where the problem starts. Converting back to a polars dataframe is still possible, even though the dtype of both columns is now object.

┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a                                   ┆ b                                   │
│ ---                                 ┆ ---                                 │
│ object                              ┆ object                              │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
│ [array([array([1], dtype=int64),... ┆ [array([array([1], dtype=int64),... │
└─────────────────────────────────────┴─────────────────────────────────────┘

Now, when I try to write this dataframe to a parquet file, the following error pops up: Exception has occurred: PanicException: cannot convert object to arrow. That makes sense, because the dtypes are now object.

I tried using pl.from_numpy(), but it complains about the 2D arrays. I also tried casting, but casting from an object column does not seem to be possible. Creating the dataframe with the previous dtype does not seem to work either.
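
For reference, a minimal sketch of those attempts (a reconstruction, not from the original post; it mirrors the np.dstack step from the sample code below, and the exact error messages may differ between polars versions):

import numpy as np
import polars as pl

data = {
    "a": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
    "b": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]
}
df = pl.DataFrame(data)

arr = np.dstack([df, df])                                  # dtype=object ndarray
combined = pl.DataFrame(arr.tolist(), schema=df.columns)   # columns end up as object

# Attempt 1: pl.from_numpy() complains about the nested object array.
try:
    pl.from_numpy(arr, schema=df.columns)
except Exception as exc:
    print("from_numpy:", exc)

# Attempt 2: casting away from an object column is not supported.
try:
    combined.with_columns(pl.col("a").cast(pl.List(pl.List(pl.Int64))))
except BaseException as exc:   # may surface as a panic rather than a regular error
    print("cast:", exc)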

Question: How can I still write this dataframe to a parquet file? Preferably with dtype list[list[i64]]. I need to keep the 2D array structure.

By just creating the desired result as a plain list I'm able to write and read it, but not when it is a numpy array.
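
For illustration, a hedged sketch (not from the original post) of one way to get back to that plain-list form from the numpy array - the to_pylist helper and the column slicing are hypothetical, not part of the question:

import numpy as np
import polars as pl

def to_pylist(x):
    # Hypothetical helper: recursively turn numpy arrays back into plain Python lists.
    if isinstance(x, (np.ndarray, list)):
        return [to_pylist(v) for v in x]
    return int(x) if isinstance(x, np.integer) else x

data = {
    "a": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
    "b": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]
}
df = pl.DataFrame(data)

arr = np.dstack([df, df])   # shape (rows, columns, frames), dtype=object

# Slice out each column, convert the numpy leaves to lists, and rebuild the frame.
combined = pl.DataFrame({c: to_pylist(arr[:, j, :]) for j, c in enumerate(df.columns)})
print(combined.dtypes)      # expected: list[list[list[i64]]] for both columns
combined.write_parquet('test_result.parquet')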

Proof code:

import polars as pl
import numpy as np

data = {
    "a": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]], 
    "b": [[[[1],[2],[3],[4]], [[1],[2],[3],[4]]],
          [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]]
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')

read_df = pl.read_parquet('test.parquet')
print(read_df)

Proof result:

┌─────────────────────────────────────┬─────────────────────────────────────┐
│ a                                   ┆ b                                   │
│ ---                                 ┆ ---                                 │
│ list[list[list[i64]]]               ┆ list[list[list[i64]]]               │
╞═════════════════════════════════════╪═════════════════════════════════════╡
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
│ [[[1], [2], ... [4]], [[1], [2],... ┆ [[[1], [2], ... [4]], [[1], [2],... │
└─────────────────────────────────────┴─────────────────────────────────────┘

Sample code:

import polars as pl
import numpy as np

data = {
    "a": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]], 
    "b": [[[1],[2],[3],[4]], [[1],[2],[3],[4]]]
}

df = pl.DataFrame(data)
df.write_parquet('test.parquet')

read_df = pl.read_parquet('test.parquet')
print(read_df)

arr = np.dstack([read_df, df])

# schema={'a': list[list[pl.Int32]], 'b': list[list[pl.Int32]]}
combined = pl.DataFrame(arr.tolist(), schema=df.columns)
print(combined)

# combined.with_column(pl.col('a').cast(pl.List, strict=False).alias('a_list'))

combined.write_parquet('test_result.parquet')

Solution

  • Perhaps there is a simpler approach - but you could do the "stacking" with explode/groupby:

    frames = df, read_df
    
    frames = (
       frame.with_columns(col=n)
            .with_row_count("row")
            .explode(pl.exclude("row", "col"))
       for n, frame in enumerate(frames)
    )
    
    combined = (
       pl.concat(frames)
         .groupby("row", "col", maintain_order=True)
         .agg(pl.all())
         .groupby("row", maintain_order=True)
         .agg(pl.exclude("col"))
         .drop("row")
    )
    
    shape: (2, 2)
    ┌─────────────────────────────────────┬─────────────────────────────────────┐
    │ a                                   | b                                   │
    │ ---                                 | ---                                 │
    │ list[list[list[i64]]]               | list[list[list[i64]]]               │
    ╞═════════════════════════════════════╪═════════════════════════════════════╡
    │ [[[1], [2], ... [4]], [[1], [2],... | [[[1], [2], ... [4]], [[1], [2],... │
    │ [[[1], [2], ... [4]], [[1], [2],... | [[[1], [2], ... [4]], [[1], [2],... │
    └─────────────────────────────────────┴─────────────────────────────────────┘
    

    I thought .concat_list might be of use - but the lists are merged together:

    (df.hstack(read_df.select(pl.all().suffix("_right")))
       .select(pl.concat_list(["a", "a_right"]))
       .limit(1).item())
    
    shape: (8,)
    Series: 'a' [list[i64]]
    [
        [1]
        [2]
        [3]
        [4]
        [1]
        [2]
        [3]
        [4]
    ]
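
    As a closing note (a hedged continuation of the explode/groupby snippet above, not part of the original answer): with the columns back at list[list[list[i64]]], the parquet write from the question should succeed again:

    combined.write_parquet('test_result.parquet')
    print(pl.read_parquet('test_result.parquet').dtypes)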