Convert column containing single element arrays into column of floats with Python polars


I've started using Polars recently (https://docs.pola.rs/api/python/stable/reference/index.html).

I have a column in my data frame that contains single-element arrays (the output of a Keras model.predict):

X
object
[0.49981183]
[0.49974033]
[0.4997973]
[0.49973667]
[0.49978396]

I want to convert this into a column of floats:

0.49981183
0.49974033
0.4997973
0.49973667
0.49978396

I've tried:

data = data.with_columns((pl.col("X")[0]).alias("Y"))

but it gives me this error:

TypeError: 'Expr' object is not subscriptable

What's the right way to do this? There are around 67 million rows, so the faster the better.

Cheers


Solution

  • Unfortunately, columns of type Object are often a dead-end. From the Data Types section of the Polars User Guide:

    Object: A limited supported data type that can be any value.

    Since support is limited, operations on columns of type Object often throw exceptions.

    However, there may be a way to retrieve the values in this particular situation. As an example, let's purposely create a column of type Object.

    import polars as pl
    data_as_list = [[0.49981183], [0.49974033],
                    [0.4997973], [0.49973667], [0.49978396]]
    
    df = pl.DataFrame(
        pl.Series("X", values=data_as_list, dtype=pl.Object)
    )
    print(df)
    
    shape: (5, 1)
    ┌──────────────┐
    │ X            │
    │ ---          │
    │ object       │
    ╞══════════════╡
    │ [0.49981183] │
    │ [0.49974033] │
    │ [0.4997973]  │
    │ [0.49973667] │
    │ [0.49978396] │
    └──────────────┘
    

    This approach may work...

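    def attempt_recover(series: pl.Series) -> pl.Series:
        # Loop over the Object column in Python, taking the first element of each array.
        return pl.Series(values=[val[0] for val in series])
    
    # map_batches passes the whole column to the function as a single Series.
    df.with_columns(pl.col("X").map_batches(attempt_recover).alias("X_recovered"))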
    
    shape: (5, 2)
    ┌──────────────┬─────────────┐
    │ X            ┆ X_recovered │
    │ ---          ┆ ---         │
    │ object       ┆ f64         │
    ╞══════════════╪═════════════╡
    │ [0.49981183] ┆ 0.499812    │
    │ [0.49974033] ┆ 0.49974     │
    │ [0.4997973]  ┆ 0.4997973   │
    │ [0.49973667] ┆ 0.499737    │
    │ [0.49978396] ┆ 0.499784    │
    └──────────────┴─────────────┘
    

    Try this first on a tiny subset of your data. This may not work. (And because it loops over every row in Python, it will not be fast.)

    The better fix is to change how the Keras prediction results are loaded into Polars, so that the column never ends up as type Object. (Often this means indexing the array/list output to extract the number before loading it into Polars.)