
How to ensure that polars creates a column of type list rather than type object


The code below creates a column called paid that looks like a list but has dtype object, which makes it practically useless as a column. How can I ensure that the created column is a list column rather than an object column, given that .cast() cannot be applied to the object column after it has been created?

import numpy as np
import polars as pl
import scipy.stats as stats

CLUSTERS = 200 
MEAN_TRIALS = 20
MU = 0.5
SIGMA = 0.1

df_cluster = pl.DataFrame({'cluster_id': range(1, CLUSTERS+1)}) 

df_cluster = df_cluster.with_columns(
    mu = stats.truncnorm(a=0, b=1, loc=MU, scale=SIGMA).rvs(size=CLUSTERS),
    trials = np.random.poisson(lam=MEAN_TRIALS, size=CLUSTERS)
)

df_cluster = df_cluster.with_columns(
    pl.struct(["mu", "trials"])
    .map_elements(lambda x: np.random.binomial(n=1, p=x['mu'], size=x['trials']))
    .alias('paid')
)

df_cluster.head()



Solution

  • Convert the ndarray to a Python list before returning it from the lambda, so that polars produces a list column instead of an object column:

    ...
    df_cluster = df_cluster.with_columns(
        pl.struct(["mu", "trials"])
        .map_elements(
            lambda x: list(np.random.binomial(n=1, p=x['mu'], size=x['trials']))
        )
        .alias('paid')
    )
    
    df_cluster.head()
    
    ┌────────────┬──────────┬────────┬─────────────┐
    │ cluster_id ┆ mu       ┆ trials ┆ paid        │
    │ ---        ┆ ---      ┆ ---    ┆ ---         │
    │ i64        ┆ f64      ┆ i32    ┆ list[i32]   │
    ╞════════════╪══════════╪════════╪═════════════╡
    │ 1          ┆ 0.508726 ┆ 25     ┆ [1, 0, … 1] │
    │ 2          ┆ 0.513275 ┆ 26     ┆ [1, 1, … 0] │
    │ 3          ┆ 0.57244  ┆ 22     ┆ [1, 0, … 1] │
    │ 4          ┆ 0.556384 ┆ 15     ┆ [0, 0, … 0] │
    │ 5          ┆ 0.51955  ┆ 15     ┆ [1, 1, … 1] │
    └────────────┴──────────┴────────┴─────────────┘